On November 22, a few days after OpenAI fired (and then re-hired) CEO Sam Altman,The Information reportedthat OpenAI had made a technical breakthrough that would allow it to “develop far more powerful artificial intelligence models.” Dubbed Q* (and pronounced “Q star”) the new model was “able to solve math problems that it hadn’t seen before.”

Reuterspublished a similar story, but details were vague.

Both outlets linked this supposed breakthrough to the board’s decision to fire Altman. Reuters reported that several OpenAI staffers sent the board a letter “warning of a powerful artificial intelligence discovery that they said could threaten humanity.” However, “Reuters was unable to review a copy of the letter,” and subsequent reporting hasn’t connected Altman’s firing to concerns over Q*.

The Information reported that earlier this year, OpenAI built “systems that could solve basic math problems, a difficult task for existing AI models.” Reuters described Q* as “performing math on the level of grade-school students.”

Instead of immediately leaping in with speculation, I decided to take a few days to do some reading. OpenAI hasn’t published details on its supposed Q* breakthrough, but it*has*published two papers about its efforts to solve grade-school math problems. And a number of researchers outside of OpenAI—including at Google’s DeepMind—have been doing important work in this area.

I’m skeptical that Q*—whatever it is—isthecrucial breakthrough that will lead to artificial general intelligence. I certainly don’t think it’s a threat to humanity. But it might be an important step toward an AI with general reasoning abilities.

In this piece, I’ll offer a guided tour of this important area of AI research and explain why step-by-step reasoning techniques designed for math problems could have much broader applications.

## The power of reasoning step by step

Consider the following math problem:

John gave Susan five apples and then gave her six more. Susan then ate three apples and gave three to Charlie. She gave her remaining apples to Bob, who ate one. Bob then gave half his apples to Charlie. John gave seven apples to Charlie, who gave Susan two-thirds of his apples. Susan then gave four apples to Charlie. How many apples does Charlie have now?

Before you continue reading, see if you can solve the problem yourself. I’ll wait.

Most of us memorized basic math facts like 5+6=11 in grade school. So if the problem just said, “John gave Susan five apples and then gave her six more,” we’d be able to tell at a glance that Susan had 11 apples.

But for more complicated problems, most of us need to keep a running tally—either on paper or in our heads—as we work through it. So first we add up 5+6=11. Then we take 11-3=8. Then 8-3=5, and so forth. By thinking step-by-step, we’ll eventually get to the correct answer: 8.

The same trick works for large language models. In afamous January 2022 paper, Google researchers pointed out that large language models produce better results if they are prompted to reason one step at a time. Here’s a key graphic from their paper:

This paper was published before “zero-shot” prompting was common, so they prompted the model by giving an example solution. In the left-hand column, the model is prompted to jump straight to the final answer—and gets it wrong. On the right, the model is prompted to reason one step at a time and gets the right answer. The Google researchers dubbed this technique chain-of-thought prompting; it is still widely used today.

If you read ourJuly articleexplaining large language models, you might be able to guess why this happens.

To a large language model, numbers like “five” and “six” are tokens—no different from “the” or “cat.” An LLM learns that 5+6=11 because this sequence of tokens (and variations like “five and six make eleven”) appears thousands of times in its training data. But an LLM’s training data probably doesn’t include any examples of a long calculation like ((5+6-3-3-1)/2+3+7)/3+4=8. So if a language model is asked to do this calculation in a single step, it’s more likely to get confused and produce the wrong answer.

Another way to think about it is that large language models don’t have any external “scratch space” to store intermediate results like 5+6=11. Chain-of-thought reasoning enables an LLM to effectively use its own output as scratch space. This allows it to break a complicated problem down into bite-sized steps—each of which is likely to match examples in the model’s training data.