Large language models (LLMs) have shown impressive performance on various reasoning and problem-solving tasks. However, open questions remain about how these reasoning abilities work and where their limits lie.
In a new study, researchers at the University of California, Los Angeles, and Amazon comprehensively examined the deductive and inductive reasoning capabilities of LLMs. Their findings show that while LLMs can be very good at inferring the rules of a task from solved examples, they are limited in following explicit instructions. The findings could have important implications for how we use LLMs in applications that require reasoning.
Inductive vs. deductive reasoning
Reasoning can be broadly categorized into two distinct types: deductive and inductive. Deductive reasoning, often described as “top-down” logic, starts with a general principle or rule and applies it to infer specific conclusions. For example, when given the formula for converting Celsius temperature to Fahrenheit, you can use it to calculate new measurements.
Inductive reasoning, on the other hand, takes a “bottom-up” approach. It involves observing specific instances or examples and drawing general conclusions or patterns from them. For example, you can observe several Celsius and Fahrenheit measurements on a thermometer and try to infer the formula that converts one to the other.
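The thermometer example can be made concrete with a short Python sketch. It is only an illustration of the two directions of reasoning, not anything taken from the study: the function names and the sample readings are assumptions, and ordinary least squares is just one convenient way to recover a linear rule from observations.

```python
def celsius_to_fahrenheit(celsius):
    """Deductive direction: apply a known general rule to a new case."""
    return celsius * 9 / 5 + 32


def infer_linear_rule(pairs):
    """Inductive direction: recover slope and intercept from observed
    (celsius, fahrenheit) pairs using ordinary least squares."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Illustrative thermometer readings (assumed data, not from the study).
observations = [(0, 32), (10, 50), (25, 77), (100, 212)]
slope, intercept = infer_linear_rule(observations)

print(f"Deduced: 37 C = {celsius_to_fahrenheit(37):.1f} F")
print(f"Induced rule: F = {slope:.2f} * C + {intercept:.2f}")
```

The first function starts from the rule and produces a measurement, while the second starts from measurements and produces the rule.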
Both types of reasoning are essential for intelligence but involve different cognitive processes. And while LLMs are often evaluated on their reasoning abilities, most research doesn’t make a clear distinction between their inductive and deductive capabilities.
A new framework for testing LLM reasoning
The researchers at Amazon and UCLA designed a series of experiments to evaluate the inductive and deductive reasoning capabilities of LLMs. To ensure a fair and consistent comparison, the experiments used a similar task structure across different contexts, with each context specifically emphasizing either deductive or inductive reasoning.
For instance, in an arithmetic task, the researchers tested the LLMs’ ability to apply a given mathematical function to solve problems (deductive reasoning) and their ability to infer the underlying mathematical function from a set of input-output examples (inductive reasoning).
To further disentangle inductive reasoning from deductive reasoning, the researchers developed SolverLearner, a two-step framework that isolates and evaluates the inductive reasoning process in LLMs.
SolverLearner first prompts the LLM to generate a function that maps input data points to their corresponding output values based solely on a set of input-output examples. This step focuses on the LLM’s ability to learn the underlying pattern or rule from the data.
In the second step, SolverLearner uses an external code interpreter to execute the proposed function on new test data. This separation ensures that the LLM is not involved in applying the function, preventing its deductive reasoning abilities from influencing the evaluation of its inductive reasoning.
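The two steps described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the researchers' implementation: `propose_function` is a hypothetical stand-in for the LLM call and simply returns a hard-coded function, and the example data and the name `solve` are invented for the sketch.

```python
def propose_function(examples):
    """Step one (hypothetical stand-in for the LLM call).

    A real system would send the input-output examples to a model and
    receive source code back. Here the 'model output' is hard-coded so
    that the sketch runs without any external service.
    """
    return "def solve(x):\n    return 2 * x + 3\n"


def run_in_interpreter(source, inputs):
    """Step two: execute the proposed function with the Python
    interpreter, so the model plays no part in applying the rule.

    Note: exec() on untrusted model output is unsafe; a real setup
    would run this in a sandbox.
    """
    namespace = {}
    exec(source, namespace)
    solve = namespace["solve"]
    return [solve(x) for x in inputs]


def accuracy(predictions, targets):
    """Fraction of test cases where the executed function is correct."""
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)


# Assumed toy data: examples shown to the model, and held-out test cases.
train_examples = [(1, 5), (2, 7), (4, 11)]
test_examples = [(10, 23), (0, 3), (-2, -1)]

source = propose_function(train_examples)
predictions = run_in_interpreter(source, [x for x, _ in test_examples])
score = accuracy(predictions, [y for _, y in test_examples])
print(f"Accuracy on held-out examples: {score:.2f}")
```

Because the interpreter, not the model, applies the function to the test inputs, any error in the score can be traced to the rule that was inferred rather than to how it was applied.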
“By focusing on inductive reasoning and setting aside LLM-based deductive reasoning, we can isolate and investigate inductive reasoning of LLMs in its pure form via SolverLearner,” the researchers write.
LLMs show contrasting strengths in inductive and deductive reasoning
The researchers evaluated the inductive and deductive reasoning capabilities of GPT-3.5 and GPT-4 across various tasks, including syntactic reasoning, arithmetic operations, and spatial reasoning, using SolverLearner to isolate the inductive side.
The results showed that both LLMs consistently exhibited remarkable inductive reasoning capabilities, achieving near-perfect accuracy on tasks that required them to learn from examples and infer the underlying mapping function.
However, the LLMs struggled when tasked with applying specific rules or instructions, especially when those instructions involved scenarios not commonly encountered during their training. The gap was most pronounced on “counterfactual” reasoning tasks that deviate from conventional cases. For example, the LLMs perform well on deductive reasoning involving base 10 arithmetic but perform very poorly on unconventional numerical bases, such as 11 and 9.
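To see why the base matters, consider how the same digit strings add up under different bases. The short Python sketch below is purely illustrative: the helper names and the operands are assumptions, and the article does not specify how the study's arithmetic prompts were worded.

```python
DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def to_base(value, base):
    """Render a non-negative integer as a digit string in the given base."""
    if value == 0:
        return "0"
    digits = []
    while value > 0:
        value, remainder = divmod(value, base)
        digits.append(DIGITS[remainder])
    return "".join(reversed(digits))


def add_in_base(a, b, base):
    """Add two digit strings, reading and writing them in the same base."""
    return to_base(int(a, base) + int(b, base), base)


# The same written operands give different answers depending on the base.
for base in (10, 9, 11):
    print(f"base {base}: 76 + 14 = {add_in_base('76', '14', base)}")
```

The rule for carrying is identical in every base, so a system that truly applies the rule should handle base 9 or 11 as reliably as base 10, even though the written answers differ ("90", "101" and "8A" for the operands above).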
The findings suggest that LLMs might be better at learning by example and discovering patterns in data than at following explicit instructions. This has important implications for the use of LLMs in real-world scenarios. While on the surface LLMs might show impressive abilities to follow logical instructions, there is a good chance that they are just reproducing patterns they observed during training, which means their performance will degrade as soon as the examples they see deviate from their training distribution.
On the other hand, SolverLearner provides a framework that ensures the model learns the correct rules that map the inputs to the outputs. However, SolverLearner is only applicable in settings where a verification mechanism such as a code interpreter is available.
This study is a sobering reminder that we still have a lot to learn about the abilities of these black boxes, which are becoming part of a growing number of applications.