One Word at a Time
Two statements. Both are true.
One: a large language model generates text one token at a time, each chosen from a probability distribution over what comes next. Mechanically, it is autocomplete.
Two: in July 2025, systems built exactly this way produced gold-medal solutions at the International Mathematical Olympiad: five of six problems solved, 35/42 points, with Google DeepMind's entry graded by official IMO coordinators. Only 67 of the 630 contestants, themselves among the world's best students, earned gold that year.
Autocomplete does not prove theorems. Autocomplete proved theorems. Hold both.
One of those statements is misleading you. Before reading on — which one, and what's the misleading word in it?
Where does the contradiction actually break?
Full disclosure: the 2023 edition of the book this lesson grew from described LLMs as using “N-grams and statistical analysis” — counting which words tend to follow which. That was the honest textbook answer for half a century of language modelling.
It is also, for modern LLMs, wrong — and wrong in exactly the way that makes the contradiction feel unresolvable. If you think the machine is counting word pairs, a gold medal is impossible. So what replaced the counting?
The objective really is that simple: given everything so far, predict the next token; append it; repeat. But how the prediction is computed changed species. An n-gram model looks up how often “I love” was followed by “ice” in its corpus — a memory trick. A transformer passes the entire context through dozens to hundreds of layers, each refining a picture of what's being said: which pronoun binds to which noun, whose alibi contradicts whose, what the variable x currently holds. It doesn't retrieve the next word — it computes it from a model of the situation, which is why it completes sentences that have never existed anywhere.
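To make the "memory trick" concrete, here is a minimal sketch of the older approach: a bigram model that counts which token followed which, then generates by looking up those counts in a loop. The toy corpus and the function names (train_bigram, next_token_distribution, generate) are illustrative assumptions, not anything taken from the book or from a real library. The outer loop (predict, append, repeat) is the part a transformer shares; the lookup table is the part it replaces.

```python
import random
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count how often each token is followed by each other token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def next_token_distribution(counts, prev):
    """Turn raw follow-counts for `prev` into a probability distribution."""
    followers = counts.get(prev)
    if not followers:
        return {}  # this token was never seen with a follower
    total = sum(followers.values())
    return {tok: n / total for tok, n in followers.items()}


def generate(counts, start, max_new_tokens, rng):
    """Predict the next token, append it, repeat."""
    out = [start]
    for _ in range(max_new_tokens):
        dist = next_token_distribution(counts, out[-1])
        if not dist:
            break  # pure lookup has nothing to say about unseen context
        choices, weights = zip(*dist.items())
        out.append(rng.choices(choices, weights=weights, k=1)[0])
    return out


# Toy corpus (an assumption for illustration only).
corpus = "i love ice cream . i love ice skating . i love you .".split()
counts = train_bigram(corpus)

# "love" was followed by "ice" twice and "you" once in this corpus.
print(next_token_distribution(counts, "love"))
print(" ".join(generate(counts, "i", 6, random.Random(0))))
```

Note the limitation the sketch exposes: the prediction depends only on the single previous token, and a token the model never saw with a follower yields an empty distribution. A model of this kind can only replay word pairs it has already counted.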
The second act, and the reason 2025 differed from 2023, is reasoning models. Instead of answering immediately, models like OpenAI's o-series, Claude with extended thinking, Gemini's Deep Think and DeepSeek's R1 first generate thousands of tokens of working, often hidden from the user: trying approaches, catching their own errors, backtracking. It is the same next-token mechanism, spent on thinking before speaking. This is “test-time compute”: you can now buy better answers with thinking time, not just bigger models. That's what turned fluent prediction into medal-grade problem solving.
So the contradiction was never between the two facts; it was inside the word “just”. “Just autocomplete” is true the way “the brain is just neurons firing” is true: correct mechanism, smuggled conclusion. The opening wasn't asking you to doubt the medal or the mechanism. It was asking you to notice that a humble objective, scaled and given time to think, buys an unhumble capability.
Your rule: never infer what a model can or can't do from a description of its mechanism, in either direction. “It just predicts text” underestimates; “it reasons like a person” overestimates. Capability claims are settled one way: test the model on your actual task, and count.
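"Test and count" can be as small as the sketch below. It assumes the model is reachable as a plain function from prompt to answer; toy_model is a hypothetical stand-in for whatever system you are actually evaluating, and the three cases are placeholders for examples drawn from your real task. Exact string match is also an assumption: many tasks need a looser or task-specific check.

```python
from typing import Callable, Iterable, Tuple


def pass_rate(
    model: Callable[[str], str],
    cases: Iterable[Tuple[str, str]],
) -> Tuple[int, int]:
    """Run every (prompt, expected) case and count exact-match passes."""
    passed = 0
    total = 0
    for prompt, expected in cases:
        total += 1
        if model(prompt).strip() == expected:
            passed += 1
    return passed, total


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in: replace with a call to the real model."""
    answers = {"2 + 2 =": "4", "3 * 3 =": "9"}
    return answers.get(prompt, "unknown")


# Placeholder cases; use examples from your own task instead.
cases = [("2 + 2 =", "4"), ("3 * 3 =", "9"), ("10 / 4 =", "2.5")]

passed, total = pass_rate(toy_model, cases)
print(f"{passed}/{total} passed")  # prints "2/3 passed"
```

The count is the point. A pass rate measured on your own cases says more about what the model can do for you than any description of how it works.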