Think North Learning
thinknorth.consulting
ATTENTION Close Observation 6 min

The Word 'It' Waited For

01 · THE SETUP

Read these two sentences slowly:

“The trophy doesn't fit in the suitcase because it is too big.”
“The trophy doesn't fit in the suitcase because it is too small.”

You resolved “it” differently each time (trophy, then suitcase) instantly, without noticing you'd done anything. Now the observation: this one-word flip, called a Winograd schema, was proposed as an AI test precisely because it is so hard for machines. For roughly fifty years after sentences like these were first posed, no machine could reliably do what you just did. Grammar can't answer it. Both readings parse perfectly.

Look closer at your own reading. To pick 'trophy' over 'suitcase', what did you have to connect? List what a machine would need.

02 · YOUR CALL

What finally cracked it?

If you pick A

The field's first fifty years agreed with you — and spent themselves proving it wrong. Both sentences are grammatically identical; no rule about words-in-general decides between them. The answer needs the specific words to talk to each other.

If you pick B — the mechanism

That's the mechanism — it's called attention. When the model processes 'it', it computes how much every other word matters right now: 'big' pulls the reference toward trophy, 'small' toward suitcase. Meaning becomes a set of learned, shifting connections rather than a fixed lookup.

If you pick C

Reasonable — brute force solves many things. But Winograd schemas were built to break it: researchers generated fresh pairs that had never appeared anywhere. Machines kept failing on new ones. Whatever cracked it had to compute the answer, not remember it.

If you pick D

This was seriously attempted — decades of building databases of common-sense facts (look up the Cyc project). It never scaled: the world has too many facts, and they combine in too many ways. The winning move turned out to be letting the knowledge emerge instead of typing it in.

Pick one — committing first is what makes the answer stick.

03 · NOT SO FAST

The intuitive fix is “give the machine more knowledge” — bigger dictionaries, more rules, more facts. Everyone's first instinct, including the field's, for half a century.

But knowledge wasn't the bottleneck. The bottleneck was routing: even a machine that knows trophies and suitcases needs a way for the word “big” to reach back and touch “trophy” — dynamically, differently in every sentence. What was missing was a mechanism for words to talk to each other.

04 · THE MECHANISM
[Figure: attention links from “it” in two versions of the sentence. “…because it is too big”: “it” = trophy (strong link wins). “…because it is too small”: “it” = suitcase (weights redistributed). One changed word re-routes the whole computation: meaning is routed, not stored.]
Attention while processing 'it': one changed word redistributes the weights, and the reference flips.

The mechanism is attention: as the model processes each token, it computes a relevance score for every other token in the context, then blends information from them in proportion to those scores. Dozens of attention 'heads' run in parallel, each learning a different kind of relationship: one tracks who did what to whom, another binds pronouns, another matches brackets in code. In 2017, a Google paper made a then-radical claim in its title, “Attention Is All You Need”, and threw away the recurrence and convolution that earlier sequence models were built on. The resulting architecture, the transformer (the T in GPT), processed all tokens in parallel instead of one by one, which meant it could finally be trained at internet scale.
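To make the score-then-blend step concrete, here is a minimal sketch of a single attention step in plain Python. It is an illustration under loud assumptions, not how any production model is built: the two-number vectors are hand-picked rather than learned, the names `softmax`, `attend`, `query_big` and `query_small` are invented for this example, and the query for “it” is assumed to already carry the influence of “big” or “small” from earlier processing.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """One scaled dot-product attention step for a single query."""
    dim = len(query)
    # Relevance score of the query against every key.
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
        for key in keys
    ]
    weights = softmax(scores)
    # Blend the value vectors in proportion to the weights.
    blended = [
        sum(w * value[i] for w, value in zip(weights, values))
        for i in range(len(values[0]))
    ]
    return weights, blended

# Toy, hand-picked vectors (ASSUMPTION: not learned from data).
tokens = ["trophy", "suitcase"]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = keys

# ASSUMPTION: the query for "it" already reflects the adjective in the sentence.
query_big = [2.0, 0.0]    # "...because it is too big"
query_small = [0.0, 2.0]  # "...because it is too small"

weights_big, _ = attend(query_big, keys, values)
weights_small, _ = attend(query_small, keys, values)

for label, weights in (("big", weights_big), ("small", weights_small)):
    winner = tokens[weights.index(max(weights))]
    print(f"too {label}: weights={[round(w, 2) for w in weights]} -> it = {winner}")
```

With these toy numbers, most of the weight lands on trophy in the first case and on suitcase in the second. In a real transformer the vectors are learned rather than typed in, and many heads run this same step side by side.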

Two consequences reach into your daily use. First, the older sequence models this replaced — RNNs and LSTMs, the state of the art most 2023-era explainers (this book's first edition included) described — are now legacy; transformers run essentially everything: text, images, audio, video, protein folding. Second, the context window — the amount a model can attend over — became the defining spec, growing from ~2,000 tokens (GPT-3) to a million-plus. But attention over huge contexts is expensive and imperfect: research shows models recall the start and end of long contexts better than the middle — the “lost in the middle” effect. A million-token window is not a million-token memory.
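One way to see why long contexts are expensive is to count the relevance scores. The sketch below assumes plain full attention, where every token is scored against every token, for a single head in a single layer, and it ignores masking and other optimizations that real systems use. Read the numbers as a rough sense of scale, not a measurement. The function name `pairwise_scores` is made up for this example.

```python
def pairwise_scores(context_tokens):
    """Scores needed if every token is compared with every token (ASSUMPTION: full attention)."""
    return context_tokens * context_tokens

small_window = 2_000      # roughly the GPT-3 window mentioned above
large_window = 1_000_000  # a million-token window

for n in (small_window, large_window):
    print(f"{n:>9,} tokens -> {pairwise_scores(n):,} scores per head, per layer")

growth_in_length = large_window // small_window
growth_in_scores = pairwise_scores(large_window) // pairwise_scores(small_window)
print(f"context grew {growth_in_length:,}x, scores grew {growth_in_scores:,}x")
```

Under this assumption a 500-fold longer context needs 250,000 times as many scores, which is why a bigger window does not come for free.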

05 · BACK TO THE OPENING

So the trophy and the suitcase were never a party trick. That one-word flip marks the exact spot where machine language understanding stood blocked for fifty years — and the mechanism that finally resolved “it” is the same one, scaled a billionfold, inside every model you now type at. You watched the hardest problem in NLP happen in your own head, in the opening two lines.

06 · TAKE THIS WITH YOU

Your rule: when a model ignores something you told it thirty pages ago, you're not seeing forgetfulness — you're seeing an attention budget spread thin. Move the load-bearing facts next to the question, restate them, or start a fresh session. Placement inside the context window is a real variable; treat it like one.
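The rule above can be sketched as a small prompt-assembly helper. Everything here is hypothetical: `build_prompt` and its parameters are names invented for illustration, not part of any model's API, and the layout is one reasonable choice rather than a proven recipe.

```python
def build_prompt(background, key_facts, question):
    """Assemble a prompt with the load-bearing facts restated right before the question."""
    parts = []
    if background:
        parts.append("Background:\n" + background.strip())
    if key_facts:
        # Restate the facts that matter most, next to the question they support.
        facts = "\n".join(f"- {fact}" for fact in key_facts)
        parts.append("Key facts (restated):\n" + facts)
    parts.append("Question:\n" + question.strip())
    return "\n\n".join(parts)

# Example values are placeholders, not real data.
prompt = build_prompt(
    background="(thirty pages of earlier material would go here)",
    key_facts=["The budget cap is 40,000 dollars.", "The deadline is 1 March."],
    question="Which of the three vendor proposals still qualifies?",
)
print(prompt)
```

The design choice is simply ordering: the long background goes first, and the facts the answer depends on sit immediately before the question instead of being buried in the middle.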

REFERENCES
  1. Vaswani et al. (2017). Attention Is All You Need.
  2. Levesque et al. (2012). The Winograd Schema Challenge.
  3. Liu et al. (2023). Lost in the Middle: How Language Models Use Long Contexts.