The Strawberry Problem
Through 2024, you could hand a frontier language model a task from a graduate coding interview and watch it pass. Then you could ask:
“How many r's are in the word strawberry?”
And it would answer, with complete confidence: two.
A system that had read more text than any person could get through in thousands of lifetimes, and that could write a working compiler, could not count three letters in a ten-letter word. The failure was so famous it got a name: the strawberry problem.
Hold the puzzle open for a second. What could possibly make a machine that has read billions of pages unable to see letters? Not unwilling — unable.
Why did models that write working code miscount the r's?
The comfortable conclusion — “so they're not actually intelligent” — feels like it settles something.
But it explains nothing: the same system passes harder tests. The real answer is stranger and more useful. The model wasn't failing to think about the word. It was never shown the word — at least, not the way you see it.
Before any text reaches a language model, a tokenizer chops it into chunks called tokens — common words become one token, rarer ones split into pieces, and each chunk is replaced by a number. “Strawberry” might arrive as st + raw + berry — three IDs from a vocabulary of a couple hundred thousand. The letters are gone. The model then turns each ID into an embedding — a long list of numbers encoding what the chunk means and how it relates to other chunks — and everything downstream operates on those.
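To make that concrete, here is a minimal sketch of the chunk-then-number step. The vocabulary, the IDs and the greedy longest-match rule are all assumptions made up for illustration. Real tokenizers learn their vocabularies from data and typically use merge rules (such as byte-pair encoding) rather than this simple lookup, so the actual split of “strawberry” depends on the tokenizer.

```python
# Toy vocabulary: the chunks and IDs are invented for illustration
# and do not come from any real tokenizer.
TOY_VOCAB = {"st": 101, "raw": 102, "berry": 103, "blue": 104, "r": 105}


def tokenize(text, vocab):
    """Split text into vocabulary chunks by greedy longest match."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest remaining slice first, then shrink it.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"no chunk in the vocabulary covers {text[i:]!r}")
    return tokens


def encode(text, vocab):
    """Replace each chunk with its integer ID; this is all the model receives."""
    return [vocab[chunk] for chunk in tokenize(text, vocab)]


print(tokenize("strawberry", TOY_VOCAB))  # ['st', 'raw', 'berry']
print(encode("strawberry", TOY_VOCAB))    # [101, 102, 103]
```

Notice that nothing in `[101, 102, 103]` says how many r's were in the original word. Even though this toy vocabulary has a separate entry for "r", the longer chunk "raw" wins, so the letter never shows up on its own.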
The boundary explains a family of oddities: why models flub arithmetic on long numbers (digits get grouped into arbitrary tokens), why rhyme and wordplay are shaky, and why non-English text often costs more tokens for the same meaning, and therefore more money and latency. Tokenizers are trained mostly on English text, so other languages and scripts fragment into more pieces.
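Two of those oddities can be shown in a few lines. This is a rough sketch under stated assumptions: the fixed left-to-right grouping of digits is a stand-in for whatever a given tokenizer actually does, and UTF-8 byte length is only a proxy for token cost (many tokenizers start from bytes, but the final token counts depend on the learned vocabulary). The function names are hypothetical.

```python
def chunk_left_to_right(digits, size=3):
    """Group digits from the left, ignoring place value (assumed tokenizer-like rule)."""
    return [digits[i:i + size] for i in range(0, len(digits), size)]


def chunk_by_place_value(digits, size=3):
    """Group digits from the right, the way a person reads 1,234,567."""
    first = len(digits) % size or size
    return [digits[:first]] + [digits[i:i + size] for i in range(first, len(digits), size)]


def utf8_length(text):
    """Number of bytes the text occupies in UTF-8."""
    return len(text.encode("utf-8"))


print(chunk_left_to_right("1234567"))   # ['123', '456', '7']
print(chunk_by_place_value("1234567"))  # ['1', '234', '567']

# The same greeting in three scripts: character count versus UTF-8 byte count.
GREETINGS = [("English", "hello"), ("Russian", "привет"), ("Japanese", "こんにちは")]
for language, greeting in GREETINGS:
    print(language, len(greeting), utf8_length(greeting))
```

The left-to-right chunks do not line up with thousands, so the model has to do carries across pieces that mean nothing arithmetically. And the non-Latin greetings take two to three times as many bytes per character before any chunking has happened.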
And the 2026 update: ask a current reasoning model about strawberry and it answers correctly, but not because it can suddenly see letters. It learned to work around its own senses: spell the word out step by step (turning a few multi-letter tokens into many single-character tokens it can count) or quietly call a tool. The boundary didn't move. The model learned it has one, which is more than most of its users know.
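The spell-it-out workaround looks roughly like the trace below. This is only an illustration of the idea in ordinary code, not a description of what happens inside any particular model; the function name `spell_and_count` is hypothetical.

```python
def spell_and_count(word, target):
    """Walk the word one character at a time, keeping a running count of target."""
    steps = []
    count = 0
    for position, ch in enumerate(word, start=1):
        if ch.lower() == target.lower():
            count += 1
        steps.append(f"{position}. {ch} (running count: {count})")
    return steps, count


steps, total = spell_and_count("strawberry", "r")
for line in steps:
    print(line)
print("total:", total)  # total: 3
```

Once every letter is its own item, counting is easy. The hard part was never the counting; it was getting the letters onto the page as separate pieces.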
So the strawberry problem was never a stupidity test; it was an anatomy lesson you could run from a chat box. The puzzle inverts: the surprise isn't that a brilliant model failed a trivial task. It's that the task and the model lived on opposite sides of a boundary (your world is made of letters, its world is made of tokens), and “trivial” doesn't cross that border.
Your rule: when a model fails somewhere bizarre, ask “does this task live below the token level?” If it involves exact characters, digits or positions, expect trouble — and route around it: ask the model to spell things out step by step, or give it a tool (a calculator, a script) instead of trusting its senses.
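Giving the model a tool can be as simple as exposing a few exact, character-level functions it can call instead of guessing. The sketch below is a hypothetical dispatcher: the tool names, the `TOOLS` registry and `call_tool` are invented for illustration and do not correspond to any specific vendor's tool-calling API.

```python
def count_letter(word, letter):
    """Exact character count, done on the raw string rather than on tokens."""
    return word.lower().count(letter.lower())


def add_integers(a, b):
    """Exact addition on digit strings; Python integers have arbitrary precision."""
    return str(int(a) + int(b))


# Hypothetical registry of tools a model could be allowed to call by name.
TOOLS = {
    "count_letter": count_letter,
    "add_integers": add_integers,
}


def call_tool(name, **arguments):
    """Look up a tool by name and run it with the supplied arguments."""
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](**arguments)


print(call_tool("count_letter", word="strawberry", letter="r"))  # 3
print(call_tool("add_integers", a="9" * 20, b="1"))
```

Both tools work below the token level, on characters and digits, which is exactly where the rule says to stop trusting the model's own senses.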