Jake Browning

But Can ChatGPT Reason?


As happens every few months, a new update comes out, folks get a "feel the AGI" buzz, and then the usual debate about LLM reasoning breaks out. But people seem especially bitter at the recent Apple paper—which is a bit of a surprise, since nothing really changed. The chief grouse is that reasoning doesn't have a precise meaning, so "LLMs can't reason" is just the newest way to say "computers can't really think"—which is often more knee-jerk reaction than argument.


The criticism is fair. But reasoning also does mean something—or, rather, it means a few different things. And it is worth spelling the different meanings out to better appreciate how LLMs are doing—and where they are still lacking.


While AI hypers might be disgruntled by academics and researchers continuing to churn out papers about the failures of LLMs, it is worth noting many of the papers also highlight progress. For example, the Apple paper shows that OpenAI’s o1 model is doing better on some reasoning tasks. Figuring out what reasoning means is essential for appreciating what makes o1 cool—but also why it is no closer to being a reliable system. It also helps clarify how o3 did so well on ARC-AGI while still leaving many AI researchers frustrated that the whole approach is missing the point.


This post will sketch out three different things we might mean by reasoning. At bottom, they all have the same fundamental concern: getting it right. A reasoning system should arrive at the right answer when posed with a puzzle, whether it is a clever riddle, 2+2, or Raven's matrices. If an LLM does get the right answer, it isn't clear it is reasoning—and if it fails, it isn't clear it can't reason. So we need to clarify some points before any of this makes sense.


To start, here are the three kinds of reasoners we're interested in: pattern-matchers, rule-followers, and rule-seekers. Appreciating the quirks of each makes clear why we are curious about whether LLMs can reason—and the immense payoff of a system that can reliably reason.


Finding a Pattern of Reasons


A standard trope in teen TV shows is stealing the answer key to the test. Some of the students then memorize the answers and, as a result, ace the test. This is a clever way to get the right answer, but the students clearly haven’t mastered the subject matter. As a result, if the teacher were to allow a “dupe” answer key to be stolen and then provide an entirely different test when the time came, the cheaters would flop. 


The recent paper from Apple has a bit of this character: the math tests broadly didn't change, but some incidental details—like the names used—did. The performance of the LLMs tested suffered, suggesting that, like the cheating students, the models had been memorizing the answer key rather than learning the material.


But the models didn't completely flounder: their success rate didn't fall to chance. So they hadn't just memorized which bubble to fill in on the Scantron; they had discerned some relevant underlying patterns. Before we call it a failure, then, we need to figure out whether the models failed for a good reason, or whether the machine is just dumb.


There are lots of reasons a failure can happen, for humans and machines alike. Many tests of academic skill, such as the SATs, simply eliminate questions if too many students get them wrong, on the assumption that there is something weird about the question (e.g., requiring culturally-specific knowledge, misleading wording, or even a mistake in how answers are printed). Similarly, if we modify a human test when we give it to machines—as with switching up the names—we often run it by humans to make sure the modifications don't result in substantially worse scores. We typically assume they won't, but we are often surprised; humans often succumb to dumb mistakes we don't see coming.


But the Apple paper is about relatively straightforward math. So if the models are failing in this case, we seem inclined to say they didn't learn the right underlying patterns—or, at least, didn't figure out when to follow them. This takes us to the more standard definition of reasoning: rule-following.


The Right Steps at the Right Time


Here's a pretty standard definition of reasoning, taken from Nicholas Shea's new open-access book, Concepts at the Interface:

"[reasoning] is standardly thought as a step-by-step mental process, each step moving from one or two premises to a conclusion, which then forms the basis of the next step."

Thus, when many people say “LLMs can’t reason,” they mean the systems struggle with the process of going through the right steps in the right order.


Let's focus on mathematics as a clear example of a step-by-step process. The strategy for training LLMs on mathematics was originally to hope they'd learn the proper reasoning steps just from pretraining—a dead end. Even simple mathematics—like multidigit addition—tends to involve lots of different steps, for a human or a calculator. So if you just ask "what is 132x58," the model will flop because it needs to compress all the steps into the next couple of tokens. Practically impossible.
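
To make concrete how much hidden work that single prompt demands, here is the standard long-multiplication breakdown (my own worked example, not one from the paper):

$$ 132 \times 58 = 132 \times 50 + 132 \times 8 = 6600 + 1056 = 7656 $$

Each sub-step is trivial on its own; the trouble is squeezing the whole chain into the next few tokens of output.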


But another thing we learned early on is that LLMs were picking up the simple operations. They could perform basic addition, multiplication, and so on—meaning they could perform each of the steps in a chain of reasoning. So, if you laid the problem out as a series of steps, they could do all the steps. The issue wasn't competency but performance: they could do the right steps at the right time—though they made mistakes, needed some hand-holding, etc.


AlphaGo, Rollouts, and Test-Time Inference

This gave us a clear way to improve the models: break the problem down and keep the model on task. This is where "chain-of-thought" prompting came in: it encouraged the model to spell out explicitly the steps it would take to solve the math problem before beginning, and to reiterate the steps at each move. Since these steps would be part of the model's input as it went about solving the problem, the hope was that it would use them as a checklist: check back, cross each item off the list, and keep going until it got to the right answer.
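
To give a flavor of what this looks like in practice, here is a minimal sketch of a chain-of-thought style prompt (the wording is my own illustration, not any lab's actual recipe):

```python
# A minimal, illustrative chain-of-thought prompt. The model is asked to list
# its steps up front and then work through them one at a time, so the steps
# stay in its input and can serve as a checklist.
problem = "What is 132 x 58?"

cot_prompt = f"""Solve the following problem step by step.
First list the steps you will take, then carry them out one at a time,
checking each intermediate result before moving on.

Problem: {problem}

Steps:
1. Split 58 into 50 + 8.
2. Compute 132 x 50.
3. Compute 132 x 8.
4. Add the two partial products and state the final answer."""

print(cot_prompt)  # this text is what would be sent to the model as its input
```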


This proved a bit overly hopeful, for two reasons. First, LLMs tend to make lots of mistakes, and the mistakes quickly ramify: a bad answer on step three makes the final answer nonsense. But a clever follow-up is to try a strategy that worked well with AlphaGo: rollouts, or "test-time inference." With AlphaGo, the base neural network would sample promising moves based on its pretraining on countless good games, and then, at test time, the model would "roll out" what games would look like if it took specific moves. This allowed it to see which strategies were likely to work, and which ones ended in defeat.


The benefit of this approach for a well-defined game like Go is obvious: the longer it is allowed to "think" (that is, compute possible solutions), the more games it can try out, and the more confident it can be that the eventual move it makes is a good one. Effectively, the longer compute time allows it to explore more of the space of possible games. If the space of possible games is relatively determinate, this works well.
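
As a rough illustration of the rollout idea (my own toy example, not AlphaGo's actual architecture), here is the bare pattern: from the current position, try each legal move, play many random games to the end, and keep the move whose rollouts win most often. The game is a stand-in, the counting game where players alternately add 1 to 3 and whoever reaches 21 loses:

```python
import random

def legal_moves(total):
    # Players may add 1, 2, or 3, without going past 21.
    return [m for m in (1, 2, 3) if total + m <= 21]

def rollout(total, my_turn):
    """Play random moves to the end; return True if 'we' win."""
    while True:
        if total == 21:
            return my_turn  # the previous mover said 21 and lost
        total += random.choice(legal_moves(total))
        my_turn = not my_turn

def choose_move(total, n_rollouts=2000):
    # More rollouts (more "thinking time") -> more confidence in the choice.
    best_move, best_rate = None, -1.0
    for move in legal_moves(total):
        wins = sum(rollout(total + move, my_turn=False) for _ in range(n_rollouts))
        if wins / n_rollouts > best_rate:
            best_move, best_rate = move, wins / n_rollouts
    return best_move

# From a total of 15, adding 1 (leaving the opponent at 16) is the
# theoretically best move; with enough rollouts the sampler usually finds it.
print(choose_move(total=15))
```

The point is only the shape of the strategy: spend compute sampling possible continuations, then pick the one that tends to end well.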


The Challenge of Chain-of-Thought

The rollout strategy also works well in mathematics, but it requires some tweaking. The chain-of-thought process involves spelling out, in natural language, the steps that need to be taken to solve some problem. But people don't typically do this, much less at a fine degree of detail: usually you can just tell a student, "Solve this part," assuming they know how to do it.


But, if the problem is that the LLM doesn't do the right thing at the right time, you need to spell out the steps each time. And, second, you also need to come up with the right steps, which means either having humans create chain-of-thought examples for the model to learn from, or rewarding the model when it produces the right chains of thought. That involves more training. So test-time inference is not self-running; it instead depends on more training.


And the problem becomes more complex. OpenAI's new model, o1, also allows the model to keep rolling out answers: laying out good chain-of-thought explanations for what to do next and then taking those steps. But the same model is also trained to say something like "check your work," so it then critiques its own answers to determine whether it is right. This requires more training at self-critique (or, if the anthropomorphic language is too weird, at simply talking about its own answers more, which hopefully allows it to discuss possible mistakes in prior steps).


o1 excels at this, typically sampling a bunch of different approaches at test time, playing them out, talking itself through each one, and hopefully settling on the best answer (determined by the self-critique winning out, or different systems "voting" on the best answer, or whatever; OpenAI is a bit secretive on these points).
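
OpenAI hasn't published the details, but the general shape of the sample-then-select strategy can be sketched. What follows is my own illustration using a self-consistency-style majority vote over sampled answers; o1's actual selection mechanism may well differ:

```python
import random
from collections import Counter

def sample_chain_of_thought(problem: str) -> str:
    """Stand-in for one sampled reasoning chain. A real system would call an
    LLM here and parse the final answer out of its chain of thought; this toy
    version just returns a noisy candidate answer."""
    return random.choice(["7656", "7656", "7656", "7646"])

def solve_with_voting(problem: str, n_samples: int = 16) -> str:
    # Spend more test-time compute by drawing more chains, then let the
    # sampled answers "vote" on the final output.
    answers = [sample_chain_of_thought(problem) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

print(solve_with_voting("What is 132 x 58?"))  # usually "7656"
```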


And this all looks like reasoning, and it looks decidedly like human reasoning—at least in one sense. Aside from savants, few of us can perform complex math in our heads. We instead write it down, go through a series of steps we learned in school, check our work at various points, erase the parts that go wrong, stumble around for a bit, doodle a little in the margin, then figure out the last chunk when we remember that one rule we learned in calculus. In short, humans solve math problems in messy ways.


But LLMs have been trained on substantially more material than any human ever sees and still struggle a great deal on these tasks. We also haven't eliminated tradeoffs: improving a model on some math tasks still tends to make it worse on other tasks. In fact, the problem seems to be getting worse. This is surprising because the recent o1 model showed LLMs can often perform tasks appropriate to a mathematics graduate student.

So what gives?


Stop it!


A big part of logic and mathematics isn't just recognizing what you should do. It is also inhibiting bad responses—basically saying "no!" to our initial intuitions. Simple example: a bat and a ball cost $1.10 together. If the bat costs $1 more than the ball, then how much does the ball cost? People often have a knee-jerk response that the ball costs a dime—which a moment's reflection shows is wrong. We do this a lot, especially if we're not careful. That's why teachers often tell us, "read over the questions carefully," or "be sure to think this through." The goal is for students to inhibit their initial response, think more slowly, and then answer.
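
Writing the constraint out (my own check, not something from the post's sources) shows why the intuition fails: if the ball costs b dollars, the bat costs b + 1.00, so

$$ b + (b + 1.00) = 1.10 \quad\Rightarrow\quad 2b = 0.10 \quad\Rightarrow\quad b = 0.05. $$

The ball costs five cents and the bat $1.05; the knee-jerk dime would make the bat $1.10 and the total $1.20.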


Why inhibition? And what is happening when we slow down? If you think of the mind as consisting of a bunch of different functions, you can think of inhibition as basically shutting down the knee-jerk response and demanding that a different function take over. It is saying: that pattern isn't the one that matters; it's this other pattern. The other function then goes through a different, potentially non-linguistic process: mathematical reasoning, physical modeling, spatial planning, or whatever. In short, the inhibition and slowing down are meant to remove all the noise and ensure the mind picks out what really matters.


The lack of inhibition really limits LLMs and effectively ensures their performance is worse than we’d like. At heart, even though they are competent at the steps of reasoning, they are still pattern-recognition systems. What we are trying to do with the fine-tuning and retraining is to ensure they focus on the relevant pattern in reasoning cases—the mathematical operations, or modus ponens, or whatever. But, without inhibition, the model often defaults to the apparent, knee-jerk pattern because it is more common.


Simple example: [an embedded example appears here in the original post: a slightly tweaked riddle that the model answers as if it were the standard version]


Well, phooey. This is a problem LLMs face with slightly tweaked riddles, where the most likely pattern (i.e., the riddle in its usual form) is also the wrong one. But it is also just a useful way of seeing a general problem: if the statistics point one way but reasoning points another, the model will follow the stats most of the time. For riddles and other tricks, this has been dubbed "the illusion-illusion": if it looks like a trick, the model will assume it is a trick, even if the only trick is that it isn't a trick at all but merely looks like one.


Which goes back to the tradeoffs problem: the way LLMs are fine-tuned at mathematics is by helping them recognize a certain kind of problem, one that looks a certain way in the training data. Think about word problems on the SAT. Fine-tuning is telling the model, "problems that look like this should be solved this way." But the same underlying math problem might look very different in a textbook, so the same fine-tuned strategy might misfire. Slight differences in the superficial statistics of how the word problems are phrased can help the model extract the right pattern—or, frustratingly, prevent it from extracting it.


Inhibition is a good technique in humans: intuitive responses are fine, so long as we recognize that they are knee-jerk and so slow down before acting on them. As Jonathan Haidt put it, “intuitive primacy, not dictatorship.” We have a reaction, but we can slow it down and not act on it. We can admit our intuition says one thing, but we feel uncomfortable or uncertain with the response.


But, for LLMs, there is intuitive primacy—and only intuition all the way down. The chain-of-thought prompting and rollouts don't change that; they just attempt to create different scenarios where a different knee-jerk response might kick in. Even when the model says it is re-thinking its answer, it is still relying on the same intuitive process for the "re-thinking." The whole thing is just having the model do more of the same thing, in hopes that doing it lots of times means the model will (1) stumble on the right answer at least once, and (2) recognize the right answer when it stumbles upon it. It sometimes works, but it is an inelegant approach.


Groping for Rules


Suppose we fixed the unreliability. Would LLMs count as reasoning then? It'd be hard to deny they can engage in rule-following. But there is a set of reasoning puzzles they would still struggle with: rule-seeking. Wilfrid Sellars (following Kant) thought this was the key feature of being human:

“When God created Adam, he whispered in his ear, ‘In all contexts of action you will recognize rules, if only the rule to grope for rules to recognize. When you cease to recognize rules, you will walk on four feet.’”

Isn't this just pattern recognition? Yes and no. Yes, because rules just are patterns. But a rule is a condensed pattern, one that captures the essential features of something and composes them into a specific format.


Consider "if a billiard ball hits another ball, the second moves." Although this rule describes the pattern of one ball hitting another, it also contains concepts, like "if" and "then," "ball," "hit," and "moves," that can be re-combined with other concepts in productive ways. So a mere visual pattern-recognition system, even if it can predict that the second ball will move if hit, won't have the condensed, discrete representations for "ball" and "hit" and—as such—will have more difficulty re-applying these in other cases, such as when the ball hits a brick. If you can generate a new concept for "brick," you can quickly generate a new rule: "if a ball hits a brick, the brick does not move." But a visual neural network simply needs to experience the same event numerous times to arrive at the same prediction, and it is not much closer to figuring out the general pattern of causality between objects.
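
A toy sketch (my own, not anything from Sellars or the vision literature) of why condensed, discrete rules recombine so cheaply: swap one concept and you get a new, usable rule without collecting any new experience:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    mover: str    # the thing doing the hitting
    target: str   # the thing being hit
    effect: str   # what happens to the target

    def describe(self) -> str:
        return f"if a {self.mover} hits a {self.target}, the {self.target} {self.effect}"

ball_rule = Rule(mover="ball", target="ball", effect="moves")
# Recombining concepts yields a new rule instantly, no retraining required:
brick_rule = Rule(mover="ball", target="brick", effect="does not move")

print(ball_rule.describe())   # if a ball hits a ball, the ball moves
print(brick_rule.describe())  # if a ball hits a brick, the brick does not move
```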


Language models are in a better position because they can come up with new rules by simply recombining different words. But it isn't their strong suit. It's easy to train LLMs on math because we already know all the right steps to take at the right time. Other formal domains, like symbolic logic or basic coding, have a similar structure. These all involve numerous rules, and rules about rules, and rules about those rules, and so on. The student doesn't need to go find the rules; they just need to learn them and then recognize them when they see them. For LLMs, this means just repeating prior chain-of-thought solutions to new problems, a relatively simple task.


Rule-seeking, however, depends on finding new rules. Humans do this pretty quickly and easily. It takes only a moment or two to figure out the rules to a video game. And, once you know them, you can scale them rapidly. Consider a game like Elden Ring: early enemies have a certain pattern of attack, like A-B-A. Players learn the pattern and how to respond to it, and then store it as a rule: if you encounter X, use rule R (i.e., the response to A-B-A). A later boss will have a different pattern, like C-D-A-B-A. Once a player has seen it a few times, they'll recognize that the latter part is just the old pattern, A-B-A, so they can apply the rule again: if you encounter Y, then make these two dodges and then use rule R. As the game progresses, the player will end up with more rules such that, when the enemy begins to attack, they can look for which rules—and combinations of rules—might apply. The rules for later bosses end up being really complicated, of course. But they also tend to be combinations of other rules, effectively rewarding you for learning to beat all the earlier enemies.
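
Here is a toy sketch of that kind of rule reuse (my own illustration, obviously not how a game-playing agent is actually built): store a learned response as a named rule, then recognize that rule inside a longer, novel attack pattern instead of learning the long pattern from scratch:

```python
# Responses learned on early enemies, stored as reusable rules.
known_rules = {
    ("A", "B", "A"): "rule R: dodge-dodge-strike",
}

def plan_response(attack_pattern):
    """Greedily reuse known rules where they match; improvise elsewhere."""
    plan, i = [], 0
    while i < len(attack_pattern):
        for pattern, response in known_rules.items():
            if tuple(attack_pattern[i:i + len(pattern)]) == pattern:
                plan.append(response)        # reuse the stored rule
                i += len(pattern)
                break
        else:
            plan.append(f"dodge {attack_pattern[i]}")  # new material, handled move by move
            i += 1
    return plan

# A later boss: two new moves, then the old A-B-A pattern we already know.
print(plan_response(["C", "D", "A", "B", "A"]))
# ['dodge C', 'dodge D', 'rule R: dodge-dodge-strike']
```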


Deep learning struggles here (though you can modify the system to help; see Voyager). LLMs might do OK at recognizing patterns, but we want them to recognize the pattern as a rule and then encode it in a re-usable way. In a high-profile case, AlphaStar could play extremely well as one species on one map of StarCraft but couldn't play other species or other maps. It had learned a good pattern, but it hadn't condensed any part of the pattern into a simple rule, like "take the high ground," "don't build a base out in the open," or "don't send an infantry unit to fight a tank." Without simple rules, none of its insights on one map transferred.


ARC-AGI and o3


This is why the ARC reasoning test is so valued. The test is simple in presentation: provide two examples of some rule, indicated by a few squares of different colors on a grid, and then require the model to apply the rule in a new case. In many ways, it is easier than older reasoning tasks, like the Bongard problems. But its simplicity makes the failures of current AI on the task all the more obvious. LLMs simply cannot find rules effectively, despite how simple the puzzles are.
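
For readers who haven't seen one, here is a sketch of the shape of such a task (not an actual ARC-AGI puzzle; the grids and the hidden rule are my own, with numbers standing in for colors):

```python
# Two demonstration input/output grids plus a test input. The hidden rule in
# this made-up task is simply "flip each grid left-to-right"; the solver has
# to find the rule from the demonstrations alone and apply it to the test.
demonstrations = [
    ([[1, 0, 0],
      [0, 2, 0]],
     [[0, 0, 1],
      [0, 2, 0]]),
    ([[3, 3, 0],
      [0, 0, 4]],
     [[0, 3, 3],
      [4, 0, 0]]),
]

test_input = [[5, 0, 0],
              [0, 0, 6]]

def apply_rule(grid):
    """The rule a human spots almost immediately: mirror each row."""
    return [list(reversed(row)) for row in grid]

assert all(apply_rule(inp) == out for inp, out in demonstrations)
print(apply_rule(test_input))  # [[0, 0, 5], [6, 0, 0]]
```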


It is worth belaboring the point: what we want is for the model to look at two examples of some phenomenon, grasp the rule explaining the examples, and then re-apply it. In short, we don't want it to spend forever searching through solution space until it stumbles on the response. Rather, the goal is for it to sample only plausible hypotheses, making a very narrow search of solution space. The reason we care about the latter but not the former is that we are hoping to learn something about "general intelligence": the weird capacity of humans, seen especially impressively in children, to learn new things rapidly.


This is why people are grumbling about the rapid success on ARC-AGI seen with o3. First, and most obviously, we don't know how o3 did it, which makes it unclear what, if anything, we should infer from its success. Second, o3 also depended on a different test: a modified version of ARC-AGI. Rather than two examples for each problem, researchers created numerous other examples obeying the same rule. So it isn't the "one-shot" learning we are looking for in humans. Third, even with all the examples, the model still required a lot of computing time to solve problems. This means it isn't efficiently rule-seeking. It is sampling huge chunks of solution space, rather than quickly zeroing in on the relevant pattern and only sampling plausible rules.


To get a sense of the difference, you can imagine a human solving numerous problems with a single piece of scrap paper containing just a couple of crossed-out answers, and doing so just by looking at two examples. With o3, it is tons upon tons of examples of each problem, and tons upon tons of scrap paper needed to solve it. Importantly, both approaches get the right answer, but they are doing so in different ways.


What does it mean?


The take-away is that LLMs can reason, if by reasoning we mean rule-following. But it isn't reliable, it has some real weak spots, and the progress on the challenge set by ARC-AGI is impressive without being informative about the weird kind of reasoning many people are interested in. So it is a mixed result. It doesn't justify the humanist skepticism: LLMs can think and reason, at least in some senses. But it also doesn't justify the boosters: LLMs aren't reliable reasoners.


This all points to a second issue, however. When someone—like a Google AI user or a company CEO—asks, "can it reason?," it is wrong to say yes, even with o3. When academics ask, our concerns are kind of big picture, with interests in what is working and what isn't; "kind of" is the appropriate response there. When users or CEOs ask, they don't know or care about the nuances of reasoning. They are asking, "if I give this a reasoning problem, can I be confident it'll handle it right?" And the answer is no: if the prompt is funny, it might fail. If it follows the wrong statistics, it might fail. If it can't figure out what rule to follow, it will fail. And, frustratingly, the model won't say "I don't know." It fails confidently—and it is getting more confident, leading to stupider failures.


So LLMs have gotten a lot smarter and a lot more competent. But the performance issues are still present and, as long as that is the case, their successes at reasoning are fun for academics—but they don’t change the overall frustrations many of us experience when dealing with these systems.
