Think North Learning
thinknorth.consulting
TOKENIZATION Puzzle 5 min

The Strawberry Problem

01 · THE SETUP

Through 2024, you could hand a frontier language model a task from a graduate coding interview and watch it pass. Then you could ask:

“How many r's are in the word strawberry?”

And it would answer, with complete confidence: two.

A system that had read more text than any person could get through in a thousand lifetimes, that could write a working compiler, could not count three letters in a ten-letter word. The failure was so famous it got a name: the strawberry problem.

Hold the puzzle open for a second. What could possibly make a machine that has read billions of pages unable to see letters? Not unwilling — unable.

02 · YOUR CALL

Why did models that write working code miscount the r's?

If you pick A

The natural reading — but it doesn't survive contact with the evidence. The same model proved theorems and debugged code, tasks far 'harder' than counting. When ability is that lopsided, the cause is rarely a lack of intelligence. It's usually the senses.

If you pick B

Sensible instinct — rare inputs do cause failures. But 'strawberry' appears millions of times in any web-scale corpus, and the failure repeated across common words ('mayonnaise', 'raspberry'). Frequency isn't the issue. What the model actually sees when it reads a word is.

If you pick C — the mechanism

That's the mechanism. 'Strawberry' reaches the model as two or three opaque chunks — token IDs, not characters. Asking it to count r's is like asking you how many brushstrokes are in a photo of a painting: the information was compressed away before you ever saw it.

If you pick D

Fair suspicion — models do stumble on trick phrasing. But there's no trick here; a seven-year-old gets it right. When a simple, honest question fails while hard ones succeed, look for a difference in how the input arrives, not in the question.

Pick one — committing first is what makes the answer stick.

03 · NOT SO FAST

The comfortable conclusion — “so they're not actually intelligent” — feels like it settles something.

But it explains nothing: the same system passes harder tests. The real answer is stranger and more useful. The model wasn't failing to think about the word. It was never shown the word — at least, not the way you see it.

04 · THE MECHANISM

Before any text reaches a language model, a tokenizer chops it into chunks called tokens — common words become one token, rarer ones split into pieces, and each chunk is replaced by a number. “Strawberry” might arrive as st + raw + berry — three IDs from a vocabulary of a couple hundred thousand. The letters are gone. The model then turns each ID into an embedding — a long list of numbers encoding what the chunk means and how it relates to other chunks — and everything downstream operates on those.
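The chopping step can be sketched in a few lines of Python. This is a toy under stated assumptions, not how any production tokenizer is implemented: real tokenizers typically learn merge rules from data, whereas this sketch uses a greedy longest-match over a tiny hand-written vocabulary. The three chunk IDs are the ones shown in this lesson's figure; the single-letter fallback IDs and the names TOY_VOCAB, toy_tokenize and encode are invented for illustration.

```python
# Toy vocabulary. The three chunk IDs come from the lesson's figure;
# the single-letter fallback IDs are made up for this sketch.
TOY_VOCAB = {"st": 302, "raw": 1618, "berry": 19772}
for offset, ch in enumerate("abcdefghijklmnopqrstuvwxyz"):
    TOY_VOCAB.setdefault(ch, 1000 + offset)  # hypothetical fallback IDs


def toy_tokenize(text: str) -> list[str]:
    """Greedy longest-match split of text into chunks found in TOY_VOCAB."""
    chunks = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until something matches.
        for j in range(len(text), i, -1):
            if text[i:j] in TOY_VOCAB:
                chunks.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"no chunk in the toy vocabulary covers {text[i]!r}")
    return chunks


def encode(text: str) -> list[int]:
    """Replace each chunk with its integer ID; the IDs are all the model receives."""
    return [TOY_VOCAB[chunk] for chunk in toy_tokenize(text)]


chunks = toy_tokenize("strawberry")
ids = encode("strawberry")
print(chunks)         # ['st', 'raw', 'berry']
print(ids)            # [302, 1618, 19772]
print("r" in chunks)  # False: no chunk is the letter r on its own
```

Nothing in the list of IDs says how many r's the original word contained; that information has to be learned indirectly, if it is learned at all.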

[Figure]
WHAT YOU SEE: 10 LETTERS, 3 R'S
s t r a w b e r r y
TOKENIZER
WHAT THE MODEL SEES: 3 OPAQUE CHUNKS
st · raw · berry → IDs 302 · 1618 · 19772
No token is an “r”. The count you asked for was compressed away before the model ever “saw” the word.
What you see vs. what the model sees. The r's live below the token boundary.

The boundary explains a family of oddities: why models flub arithmetic on long numbers (digits get grouped into arbitrary tokens), why rhyme and wordplay are shaky, and why non-English text often costs more tokens, and therefore more money and latency, for the same meaning. Tokenizers are trained mostly on English text, so other languages and scripts fragment into more pieces.
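The digit-grouping point can be made concrete with a small sketch. It assumes a hypothetical tokenizer that cuts digit strings into chunks of three from the left; real tokenizers differ in how they group digits, so treat the rule (and the helper name group_digits) as illustrative only.

```python
def group_digits(number: str, size: int = 3) -> list[str]:
    """Split a digit string into left-to-right chunks of `size` digits.

    Hypothetical grouping rule, chosen only to illustrate the effect.
    """
    return [number[i:i + size] for i in range(0, len(number), size)]


print(group_digits("1234567"))  # ['123', '456', '7']
print(group_digits("234567"))   # ['234', '567']
```

Dropping one leading digit shifts every chunk boundary: the thousands digit 4 sits inside '456' in the first number and inside '234' in the second. Column-by-column arithmetic depends on place value, and under this grouping place value is not visible in the chunks.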

And the 2026 update: ask a current reasoning model about strawberry and it answers correctly, but not because it can suddenly see letters. It learned to work around its own senses: spell the word out step by step (turning a few opaque chunks into many single-character tokens it can count) or quietly call a tool. The boundary didn't move. The model learned it has one, which is more than most of its users know.
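The spell-it-out workaround amounts to writing each letter on its own line and keeping a running tally. The sketch below imitates that procedure in Python; it is an analogy for the shape of the model's written-out steps, not a description of what happens inside any model, and the function name spell_out_and_count is invented here.

```python
def spell_out_and_count(word: str, target: str) -> tuple[list[str], int]:
    """Write one line per letter with a running tally; return the lines and the total."""
    lines = []
    tally = 0
    for position, letter in enumerate(word, start=1):
        if letter == target:
            tally += 1
        lines.append(f"{position}. {letter} -> {target} count so far: {tally}")
    return lines, tally


steps, total = spell_out_and_count("strawberry", "r")
print("\n".join(steps))
print(total)  # 3
```

Once every letter is its own item, counting is a matter of reading down the list, which is why the step-by-step version succeeds where the one-shot answer failed.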

05 · BACK TO THE OPENING

So the strawberry problem was never a stupidity test — it was an anatomy lesson you could run from a chat box. The puzzle inverts: the surprise isn't that a brilliant model failed a trivial task; it's that the task and the model lived on opposite sides of a boundary — your world is made of letters, its world is made of tokens — and 'trivial' doesn't cross that border.

06 · TAKE THIS WITH YOU

Your rule: when a model fails somewhere bizarre, ask “does this task live below the token level?” If it involves exact characters, digits or positions, expect trouble — and route around it: ask the model to spell things out step by step, or give it a tool (a calculator, a script) instead of trusting its senses.
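The "give it a tool" half of the rule can be sketched as a small table of deterministic helpers that work on exact characters and digits. Everything here is hypothetical: the tool names, the TOOLS table and call_tool are invented for illustration and do not correspond to any particular vendor's tool-calling API.

```python
def count_letter(word: str, letter: str) -> int:
    """Count occurrences of a letter, ignoring case."""
    return word.lower().count(letter.lower())


def add_exact(a: str, b: str) -> str:
    """Add two integers given as digit strings, with no rounding or truncation."""
    return str(int(a) + int(b))


# Hypothetical tool table: name -> plain Python function.
TOOLS = {"count_letter": count_letter, "add_exact": add_exact}


def call_tool(name: str, *args: str):
    """Look up a tool by name and run it on the given arguments."""
    return TOOLS[name](*args)


print(call_tool("count_letter", "strawberry", "r"))  # 3
print(call_tool("add_exact", "123456789012345678", "987654321098765432"))
```

The helpers operate on characters and digits directly, so the token boundary never enters the picture; the model only has to decide when to hand the job over.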
