The Word 'It' Waited For
Read these two sentences slowly:
“The trophy doesn't fit in the suitcase because it is too big.”
“The trophy doesn't fit in the suitcase because it is too small.”
You resolved “it” differently each time — trophy, then suitcase — instantly, without noticing you'd done anything. Now the observation: this one-word flip, called a Winograd schema, was designed as an AI test precisely because for roughly fifty years no machine could do what you just did. Grammar can't answer it. Both readings parse perfectly.
Look closer at your own reading. To pick 'trophy' over 'suitcase', what did you have to connect? List what a machine would need.
What finally cracked it?
Pick one — committing first is what makes the answer stick.
the lesson continues after you choose
The intuitive fix is “give the machine more knowledge” — bigger dictionaries, more rules, more facts. Everyone's first instinct, including the field's, for half a century.
But knowledge wasn't the bottleneck. The bottleneck was routing: even a machine that knows trophies and suitcases needs a way for the word “big” to reach back and touch “trophy” — dynamically, differently in every sentence. What was missing was a mechanism for words to talk to each other.
The mechanism is attention: as the model processes each token, it computes a relevance score to every other token in the context, then blends information from them in proportion. Dozens of attention 'heads' run in parallel, each learning a different kind of relationship — one tracks who did what to whom, another binds pronouns, another matches brackets in code. In 2017, a Google paper made a then-radical claim in its title — “Attention Is All You Need” — and threw away everything else. The resulting architecture, the transformer (the T in GPT), processed all tokens in parallel instead of one-by-one, which meant it could finally be trained at internet scale.
Two consequences reach into your daily use. First, the older sequence models this replaced — RNNs and LSTMs, the state of the art most 2023-era explainers (this book's first edition included) described — are now legacy; transformers run essentially everything: text, images, audio, video, protein folding. Second, the context window — the amount a model can attend over — became the defining spec, growing from ~2,000 tokens (GPT-3) to a million-plus. But attention over huge contexts is expensive and imperfect: research shows models recall the start and end of long contexts better than the middle — the “lost in the middle” effect. A million-token window is not a million-token memory.
So the trophy and the suitcase were never a party trick. That one-word flip marks the exact spot where machine language understanding stood blocked for fifty years — and the mechanism that finally resolved “it” is the same one, scaled a billionfold, inside every model you now type at. You watched the hardest problem in NLP happen in your own head, in the opening two lines.
Your rule: when a model ignores something you told it thirty pages ago, you're not seeing forgetfulness — you're seeing an attention budget spread thin. Move the load-bearing facts next to the question, restate them, or start a fresh session. Placement inside the context window is a real variable; treat it like one.