TOKENIZATION Puzzle 5 min

The Strawberry Problem

01 · THE SETUP

Through 2024, you could hand a frontier language model a task from a graduate coding interview and watch it pass. Then you could ask:

“How many r's are in the word strawberry?”

And it would answer, with complete confidence: two.

A system that had read more text than any person could get through in thousands of lifetimes, and that could write a working compiler, could not count three letters in a ten-letter word. The failure was so famous it got a name: the strawberry problem.

Hold the puzzle open for a second. What could possibly make a machine that has read billions of pages unable to see letters? Not unwilling — unable.

02 · YOUR CALL

Why did models that write working code miscount the r's?

If you pick A

The natural reading — but it doesn't survive contact with the evidence. The same model proved theorems and debugged code, tasks far 'harder' than counting. When ability is that lopsided, the cause is rarely a lack of intelligence. It's usually the senses.

If you pick B

Sensible instinct — rare inputs do cause failures. But 'strawberry' appears millions of times in any web-scale corpus, and the failure repeated across common words ('mayonnaise', 'raspberry'). Frequency isn't the issue. What the model actually sees when it reads a word is.

If you pick C — the mechanism

That's the mechanism. 'Strawberry' reaches the model as two or three opaque chunks — token IDs, not characters. Asking it to count r's is like asking you how many brushstrokes are in a photo of a painting: the information was compressed away before you ever saw it.

If you pick D

Fair suspicion — models do stumble on trick phrasing. But there's no trick here; a seven-year-old gets it right. When a simple, honest question fails while hard ones succeed, look for a difference in how the input arrives, not in the question.

03 · NOT SO FAST

The comfortable conclusion — “so they're not actually intelligent” — feels like it settles something.

But it explains nothing: the same system passes harder tests. The real answer is stranger and more useful. The model wasn't failing to think about the word. It was never shown the word — at least, not the way you see it.

04 · THE MECHANISM

Before any text reaches a language model, a tokenizer chops it into chunks called tokens: the most frequent words often become a single token, while longer or rarer ones split into pieces, and each chunk is replaced by a number. “Strawberry” might arrive as st + raw + berry, three IDs from a vocabulary that can run to a couple hundred thousand entries. The letters are gone. The model then turns each ID into an embedding, a long list of numbers encoding what the chunk means and how it relates to other chunks, and everything downstream operates on those.

What you see vs. what the model sees. The r's live below the token boundary.
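To make the boundary concrete, here is a minimal sketch of a toy tokenizer. It is not how any production tokenizer works: the three-piece vocabulary, the IDs, and the greedy longest-match rule are all assumptions chosen to mirror the st + raw + berry example above. Real tokenizers learn their vocabulary from data and may split the word differently.

```python
# Toy tokenizer: greedy longest-match against a tiny, made-up vocabulary.
# The pieces and IDs are hypothetical and exist only for illustration.
TOY_VOCAB = {"st": 101, "raw": 102, "berry": 103}

def toy_tokenize(text, vocab=TOY_VOCAB):
    """Split text into the longest matching vocabulary pieces, left to right."""
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until one matches.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                pieces.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"no vocabulary piece matches at position {i}")
    return pieces

pieces = toy_tokenize("strawberry")
ids = [TOY_VOCAB[p] for p in pieces]

print("what you see:       ", list("strawberry"))  # ten letters, three of them 'r'
print("what gets split off:", pieces)              # ['st', 'raw', 'berry']
print("what the model sees:", ids)                 # [101, 102, 103]
```

Nothing in the list of IDs says how many r's there are. That information sat inside the pieces, and the pieces were swapped for numbers before the model saw anything.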

The boundary explains a family of oddities: why models flub arithmetic on long numbers (digits get grouped into arbitrary tokens), why rhyme and wordplay are shaky, and why non-English text often costs more tokens, and therefore more money and latency, for the same meaning. Tokenizers are trained mostly on English, so other languages and scripts fragment into more pieces.
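The digit-grouping point can be sketched in a few lines. The fixed three-digit, left-to-right chunking below is an assumption for illustration only (`chunk_digits` is a hypothetical helper, and real tokenizers group digits in their own ways), but it shows why grouped digits make place value hard to track.

```python
def chunk_digits(number_text, size=3):
    """Group digits left to right in fixed-size chunks (a hypothetical scheme)."""
    return [number_text[i:i + size] for i in range(0, len(number_text), size)]

seven_digits = chunk_digits("1234567")
eight_digits = chunk_digits("12345678")

print(seven_digits)  # ['123', '456', '7']
print(eight_digits)  # ['123', '456', '78']
```

Both numbers start with the same chunk, 123, yet in the first it begins at the millions place and in the second at the ten-millions place. A model working on chunks has to infer place value from context instead of reading it off the digits.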

And the 2026 update: ask a current reasoning model about strawberry and it answers correctly, but not because it can suddenly see letters. It learned to work around its own senses: spell the word out step by step (turning a few multi-letter tokens into many single-character tokens it can count) or quietly call a tool. The boundary didn't move. The model learned it has one, which is more than most of its users know.
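The spell-it-out workaround looks roughly like this. The sketch assumes that letters separated by spaces each arrive as their own unit, which is a simplification and will not hold for every tokenizer.

```python
word = "strawberry"

# Spelling the word out puts every letter in its own slot.
spelled = " ".join(word)   # 's t r a w b e r r y'
units = spelled.split()    # one unit per letter (assumed: one token each)

# Counting is now a walk over visible units instead of a guess about hidden ones.
count = sum(1 for unit in units if unit == "r")

print(spelled)
print(count)  # 3
```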

05 · BACK TO THE OPENING

So the strawberry problem was never a stupidity test — it was an anatomy lesson you could run from a chat box. The puzzle inverts: the surprise isn't that a brilliant model failed a trivial task; it's that the task and the model lived on opposite sides of a boundary — your world is made of letters, its world is made of tokens — and 'trivial' doesn't cross that border.

06 · TAKE THIS WITH YOU

Your rule: when a model fails somewhere bizarre, ask “does this task live below the token level?” If it involves exact characters, digits or positions, expect trouble — and route around it: ask the model to spell things out step by step, or give it a tool (a calculator, a script) instead of trusting its senses.
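Giving the model a tool can be as small as the sketch below. The tool names, the `TOOLS` registry, and `call_tool` are hypothetical, and how a real system exposes tools to a model depends on the platform. The point is only that exact character and digit work moves into ordinary code.

```python
def count_letter(word: str, letter: str) -> int:
    """Count occurrences of one letter in a word, ignoring case."""
    return word.lower().count(letter.lower())

def add_exact(a: str, b: str) -> str:
    """Add two integers given as digit strings, with no rounding or grouping."""
    return str(int(a) + int(b))

# Hypothetical registry: the model names a tool and the host program runs it.
TOOLS = {"count_letter": count_letter, "add_exact": add_exact}

def call_tool(name, *args):
    """Dispatch a tool call by name."""
    return TOOLS[name](*args)

print(call_tool("count_letter", "strawberry", "r"))      # 3
print(call_tool("add_exact", "123456789", "987654321"))  # 1111111110
```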
