NEXT-TOKEN PREDICTION Contradiction 7 min

One Word at a Time

01 · THE SETUP

Two statements. Both are true.

One: a large language model generates text one token at a time, each chosen from a probability distribution over what comes next. Mechanically, it is autocomplete.

Two: in July 2025, systems built exactly this way produced gold-medal solutions at the International Mathematical Olympiad: five of six problems solved, 35 of 42 points, graded by official IMO coordinators. Of the 630 contestants that year, all among the world's best students, only 67 earned gold.

Autocomplete does not prove theorems. Autocomplete proved theorems. Hold both.

One of those statements is misleading you. Before reading on — which one, and what's the misleading word in it?

02 · YOUR CALL

Where does the contradiction actually break?

If you pick A

Healthy skepticism, and worth checking — which is why it matters that the IMO's own coordinators graded the proofs under competition rules and standards. The result stood. If the facts hold on both sides, the flaw must be in how we're reading one of them.

If you pick B

A reasonable escape hatch, but no: token-by-token prediction is still exactly the mechanism, in Claude, GPT and Gemini alike. The mechanism isn't wrong; what we assume it implies is.

If you pick C — the mechanism

That's the release valve. 'Autocomplete' smuggles in 'shallow'. But there's no ceiling written into the objective: to predict the next token of a proof, you must model the mathematics; to predict a story's last word, you must track the plot. Prediction is the task — competence is whatever the task demands.

If you pick D

True of 2024 — DeepMind's silver used specialised proof-search systems. That's what makes 2025 the interesting year: the gold-medal runs worked end-to-end in natural language, ordinary LLM-style generation with extended thinking time. The autocomplete did it unassisted.

Pick one — committing first is what makes the answer stick.


03 · NOT SO FAST

Full disclosure: the 2023 edition of the book this lesson grew from described LLMs as using “N-grams and statistical analysis” — counting which words tend to follow which. That was the honest textbook answer for half a century of language modelling.

It is also, for modern LLMs, wrong — and wrong in exactly the way that makes the contradiction feel unresolvable. If you think the machine is counting word pairs, a gold medal is impossible. So what replaced the counting?

04 · THE MECHANISM

The loop: read everything so far → compute a distribution over the next token → pick → append → repeat.
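Here is a minimal sketch of that loop in Python. It is an illustration, not any vendor's implementation: `toy_scores` and `VOCAB` are hypothetical stand-ins for a real model and its vocabulary, and a real system would score tens of thousands of tokens with a neural network rather than a five-word rule.

```python
import math
import random

VOCAB = ["I", "love", "ice", "cream", "<end>"]  # hypothetical toy vocabulary

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores.values())
    exps = {tok: math.exp(s - peak) for tok, s in scores.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def toy_scores(tokens):
    """Hypothetical stand-in for a model: favors the vocabulary item after the last token."""
    favored = VOCAB[(VOCAB.index(tokens[-1]) + 1) % len(VOCAB)]
    return {tok: (5.0 if tok == favored else 0.0) for tok in VOCAB}

def generate(score_next, prompt, max_new_tokens, rng, stop="<end>"):
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        # Read everything so far and compute a distribution over the next token.
        probs = softmax(score_next(tokens))
        choices, weights = zip(*probs.items())
        # Pick one token in proportion to its probability.
        token = rng.choices(choices, weights=weights, k=1)[0]
        if token == stop:
            break
        # Append, then repeat with the longer context.
        tokens.append(token)
    return tokens

print(" ".join(generate(toy_scores, ["I"], 10, random.Random(0))))
```

Nothing in `generate` knows or cares how `score_next` arrives at its scores. The loop stays the same whether the scorer is a lookup table or a transformer.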

The objective really is that simple: given everything so far, predict the next token; append it; repeat. But how the prediction is computed changed species. An n-gram model looks up how often “I love” was followed by “ice” in its corpus — a memory trick. A transformer passes the entire context through dozens to hundreds of layers, each refining a picture of what's being said: which pronoun binds to which noun, whose alibi contradicts whose, what the variable x currently holds. It doesn't retrieve the next word — it computes it from a model of the situation, which is why it completes sentences that have never existed anywhere.
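To make the "memory trick" concrete, here is a small sketch of an n-gram model that conditions on the previous two words. The corpus and the function names (`train_ngram`, `next_token_distribution`) are made up for illustration. Notice what happens with a context the corpus never contained.

```python
from collections import Counter, defaultdict

def train_ngram(corpus_tokens, context_size=2):
    """Count which token followed each context of `context_size` tokens."""
    counts = defaultdict(Counter)
    for i in range(len(corpus_tokens) - context_size):
        context = tuple(corpus_tokens[i:i + context_size])
        counts[context][corpus_tokens[i + context_size]] += 1
    return counts

def next_token_distribution(counts, context):
    """Look up the context and normalize its follower counts into probabilities."""
    followers = counts.get(tuple(context))
    if not followers:
        return {}  # unseen context: there is nothing to look up
    total = sum(followers.values())
    return {tok: n / total for tok, n in followers.items()}

# Hypothetical toy corpus.
corpus = "I love ice cream . I love ice skating . I love you .".split()
counts = train_ngram(corpus)

seen = next_token_distribution(counts, ["I", "love"])
unseen = next_token_distribution(counts, ["We", "love"])
print(seen)    # "ice" gets about two thirds, "you" about one third
print(unseen)  # empty: the model has no memory of this context
```

The lookup model has nothing to say about "We love" because it never saw those two words together. That gap is exactly what computing the prediction from a model of the situation, instead of retrieving it, is meant to close.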

THE PRINCIPLE

Prediction has no ceiling

It means: Next-token prediction is an objective, not a capability level. The better you must predict, the more of the world's structure you're forced to model.
It works through: Predicting a murder mystery's final page requires modelling the plot → predicting working code requires modelling program behaviour → predicting a valid proof requires modelling the mathematics. Scale the model and data, and the objective quietly demands deeper competence.
Spot it when: someone reasons 'it just predicts text, therefore it can't do X'. They're inferring a ceiling from the mechanism. Ceilings only show up in evaluations, never in the objective.

The second act — and the reason 2025 differed from 2023 — is reasoning models. Instead of answering immediately, models like OpenAI's o-series, Claude with extended thinking, Gemini's Deep Think and DeepSeek's R1 first generate thousands of hidden tokens of working: trying approaches, catching their own errors, backtracking. Same next-token mechanism — spent on thinking before speaking. This is “test-time compute”: you can now buy better answers with thinking time, not just bigger models. That's what turned fluent prediction into medal-grade problem solving.

05 · BACK TO THE OPENING

So the contradiction was never between the two facts; it was inside the word “just”. “Just autocomplete” is true the way “the brain is just neurons firing” is true: correct mechanism, smuggled conclusion. The opening wasn't asking you to doubt the medal or the mechanism. It was asking you to notice that a humble objective, scaled and given time to think, buys an unhumble capability.

06 · TAKE THIS WITH YOU

Your rule: never infer what a model can or can't do from a description of its mechanism — in either direction. 'It just predicts text' underestimates; 'it reasons like a person' overestimates. Capability claims are settled one way: test it on your actual task, and count.
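"Test it and count" can be as small as the sketch below. Everything here is an assumption for illustration: `toy_model` is a hypothetical stand-in for a call to whichever model you are evaluating, and the three cases stand in for examples drawn from your own task.

```python
def evaluate(model_fn, cases):
    """Run model_fn on each (prompt, check) pair and count how many answers pass."""
    passed = 0
    failures = []
    for prompt, check in cases:
        answer = model_fn(prompt)
        if check(answer):
            passed += 1
        else:
            failures.append((prompt, answer))
    return passed, len(cases), failures

def toy_model(prompt):
    """Hypothetical stand-in for a real model call; swap in your own."""
    canned = {"2 + 2 =": "4", "Capital of France?": "Paris"}
    return canned.get(prompt, "I don't know")

# Each case pairs a prompt with a function that decides whether the answer is acceptable.
cases = [
    ("2 + 2 =", lambda a: a.strip() == "4"),
    ("Capital of France?", lambda a: "Paris" in a),
    ("17 * 23 =", lambda a: a.strip() == "391"),
]

passed, total, failures = evaluate(toy_model, cases)
print(f"{passed}/{total} passed")
for prompt, answer in failures:
    print(f"FAILED: {prompt!r} -> {answer!r}")
```

The design choice worth copying is that the verdict comes from the checks, not from a description of how the model works. The failure list tells you where to look next.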
