Think North Learning
thinknorth.consulting
LEARNING PARADIGMS Puzzle 6 min

Three Kinds of Teacher

01 · THE SETUP

Three systems, all routinely called “AI”:

  • A spam filter that catches a phishing email it has never seen before.
  • Discover Weekly, which groups you with strangers who share your taste — though no one ever labelled anyone's taste.
  • AlphaZero, which reached superhuman chess in hours — having studied zero human games.

Here's the puzzle: none of the three was taught the same way. One had an answer key. One had no answers at all. One had only a score.

Before reading on, try to match them. Which one had the answer key? Which had nothing? Which had only a score?

02 · YOUR CALL

AlphaZero saw zero human games and became superhuman anyway. What was its teacher?

If you pick A

That's how earlier chess programs were tuned, so it's a sensible guess. But it's exactly what AlphaZero famously did not use — no human games, no move labels. Its predecessor AlphaGo started from human games; AlphaZero threw even that away.

If you pick B

Plausible — finding structure in unlabelled data is a real learning paradigm (it's how your playlist works). But AlphaZero had no archive to mine. It started from random play. Something else had to tell it what 'better' meant.

If you pick C — the mechanism

Exactly. Its only teacher was a score: +1 for a win, 0 for a draw, −1 for a loss, delivered over millions of games as it played itself. No one showed it a single good move. It rediscovered openings that took humans centuries to develop, along with some ideas humans had not found at all.

If you pick D

Reasonable. That was roughly the recipe for Deep Blue in 1997: brute-force search guided by a hand-tuned evaluation function with thousands of features. AlphaZero got only the rules of the game, nothing about strategy. Which is the point: the knowledge had to come from somewhere else entirely.

03 · NOT SO FAST

The tempting mental model is that “training an AI” is one activity — feed in data, intelligence comes out.

It's reasonable: from the outside, all three systems just consumed data and got smart. But it hides the variable that determines everything about an AI project — what kind of feedback the machine gets. Change the teacher and you change what's learnable, what data you need, and what can go wrong.

04 · THE MECHANISM
[Figure: three panels. SUPERVISED, the answer key: each input arrives labelled “spam” or “not spam”; the model learns to reproduce human judgments (spam filter). UNSUPERVISED, no answers at all: finds structure already present in the data (Discover Weekly). REINFORCEMENT, only a score: act, observe, +1 win or −1 lose; invents strategies no human demonstrated (AlphaZero, RLHF). Bonus, 2026: LLM pretraining is self-supervised; the next word is its own free label.]
Same student, three teachers: an answer key, no answers, and a score.

Supervised learning — the answer key. Every example comes with the correct output (spam / not spam), and the model adjusts until its answers match. Most deployed business ML — fraud flags, medical image triage, demand forecasts — is this. Its bottleneck: someone must produce the labels.
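
A minimal sketch of the answer-key idea, assuming a toy naive Bayes word-count classifier. This is one of many supervised methods and not necessarily what any real spam filter uses. The training messages and the function names `train` and `classify` are hypothetical, invented for illustration.

```python
import math
from collections import Counter

# Hypothetical toy training set: every example comes with its label (the answer key).
TRAINING = [
    ("win a free prize now", "spam"),
    ("claim your free money", "spam"),
    ("urgent verify your account now", "spam"),
    ("lunch at noon tomorrow", "not spam"),
    ("meeting notes from monday", "not spam"),
    ("see you at the meeting tomorrow", "not spam"),
]


def train(examples):
    """Count how often each word appears under each label."""
    word_counts = {"spam": Counter(), "not spam": Counter()}
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.split())
    return word_counts, label_counts


def classify(text, word_counts, label_counts):
    """Return the label with the highest naive Bayes log-probability."""
    vocabulary = set()
    for counts in word_counts.values():
        vocabulary.update(counts)
    total_examples = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        score = math.log(label_counts[label] / total_examples)
        total_words = sum(counts.values())
        for word in text.split():
            # Add-one smoothing so an unseen word does not zero out the score.
            score += math.log((counts[word] + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


word_counts, label_counts = train(TRAINING)
# Neither message below appears in the training set.
print(classify("free prize money", word_counts, label_counts))
print(classify("notes from lunch", word_counts, label_counts))
```

The model never sees a rule such as "free means spam". It only sees which words co-occur with which human-supplied label, which is why the quality and quantity of labels is the limiting factor.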

Unsupervised learning — no answers at all. The model finds structure that was already in the data: clusters of listeners with similar taste, groups of transactions that look alike, the odd one out (that's anomaly detection). Nobody defines the categories; they emerge. Its bottleneck: the structure it finds may not be the structure you meant.
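
A minimal sketch of "structure emerging without labels", assuming k-means clustering on made-up listening data. The listeners, the two taste dimensions, and the choice of k-means are illustrative assumptions; real recommenders use richer methods.

```python
# Hypothetical listeners described by (hours of jazz, hours of metal) per week.
# There is no label column: nobody says which listener belongs to which group.
LISTENERS = [(9.0, 1.0), (8.0, 2.0), (10.0, 0.5), (1.0, 9.0), (0.5, 8.0), (2.0, 10.0)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda i: squared_distance(point, centroids[i]))


def k_means(points, k, iterations=10):
    # Assumption: seed the centroids with the first k points to stay deterministic.
    centroids = list(points[:k])
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for point in points:
            groups[nearest(point, centroids)].append(point)
        for i, group in enumerate(groups):
            if group:  # keep the old centroid if a group ends up empty
                centroids[i] = tuple(sum(axis) / len(group) for axis in zip(*group))
    return [nearest(point, centroids) for point in points]


assignments = k_means(LISTENERS, k=2)
print(assignments)  # [0, 0, 0, 1, 1, 1]
```

The algorithm separates the jazz-heavy listeners from the metal-heavy ones, but the group numbers carry no meaning of their own. Whether those groups match the categories you cared about is a separate question, which is the bottleneck described above.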

Reinforcement learning — the score. No examples, no labels; just actions, consequences, and a reward signal to maximise. It's how AlphaZero learned chess and how robots learn to walk. It is also, and this surprises people, the finishing school for most chatbots you use: after pretraining, models like Claude and GPT are tuned with reinforcement learning from human feedback (RLHF), where the 'score' is a human preferring one answer over another. Its bottleneck: you get exactly what you score, which deserves its own lesson.
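
A minimal sketch of learning from a score alone, assuming an epsilon-greedy multi-armed bandit. This is far simpler than AlphaZero, which combines self-play with a neural network and tree search. The three moves, their hidden win probabilities, and the function names are hypothetical.

```python
import random

# Hypothetical win probabilities for three candidate moves; the learner never sees these.
WIN_PROBABILITY = {"a": 0.2, "b": 0.8, "c": 0.5}


def play(move, rng):
    """Environment: returns the only feedback available, +1 for a win or -1 for a loss."""
    return 1 if rng.random() < WIN_PROBABILITY[move] else -1


def learn(episodes=5000, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    value = {move: 0.0 for move in WIN_PROBABILITY}  # running average reward per move
    tries = {move: 0 for move in WIN_PROBABILITY}
    for _ in range(episodes):
        if rng.random() < epsilon:
            move = rng.choice(sorted(value))  # explore: try a move at random
        else:
            move = max(value, key=value.get)  # exploit: pick the best score so far
        reward = play(move, rng)
        tries[move] += 1
        value[move] += (reward - value[move]) / tries[move]  # incremental mean
    return value


value = learn()
best_move = max(value, key=value.get)
print(best_move, value)
```

Nobody tells the learner that "b" is the strong move. It acts, observes +1 or −1, and shifts toward whatever scored well. Change what `play` rewards and the learned behaviour changes with it.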

One 2026 addendum the textbooks of 2023 underplayed: the biggest models of all use a fourth trick, self-supervised learning, which is supervised learning where the labels come free from the data itself. Take any sentence from the internet, hide the next word, and the hidden word is the label. That's how LLMs turn the whole internet into an answer key with zero human labelling. It's the reason they could scale to trillions of training words while CAPTCHA-style labelling never could.
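
A minimal sketch of how raw text becomes its own answer key, assuming a bigram counter as a stand-in for a language model. Real LLMs predict tokens with neural networks rather than word counts. The sample text and function names are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical stand-in for "any sentence from the internet".
RAW_TEXT = "the cat sat on the mat . the cat ate the fish . the cat sat on the sofa ."


def make_examples(text):
    """Turn raw text into (previous word, next word) pairs; the next word is the label."""
    words = text.split()
    return list(zip(words[:-1], words[1:]))


def train(examples):
    """Count which word follows which: a bigram model, far simpler than an LLM."""
    followers = defaultdict(Counter)
    for previous, following in examples:
        followers[previous][following] += 1
    return followers


def predict_next(followers, word):
    """Most frequent follower seen in training, or None for an unseen word."""
    counts = followers.get(word)
    return counts.most_common(1)[0][0] if counts else None


examples = make_examples(RAW_TEXT)
model = train(examples)
print(examples[:3])  # [('the', 'cat'), ('cat', 'sat'), ('sat', 'on')]
print(predict_next(model, "cat"))  # sat
```

Every pair in `examples` was produced by slicing the text, with no human annotation. That is the whole trick: more text means more labelled examples at no extra labelling cost.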

05 · BACK TO THE OPENING

So the opening puzzle resolves: the spam filter had the answer key (supervised), the playlist had no answers and found the structure itself (unsupervised), and the chess engine had only a score (reinforcement). They're not three products of one method. They're three different teachers, and each teacher leaves a different signature on what the student can and cannot do.

06 · TAKE THIS WITH YOU

Your rule: before any ML project — or any vendor pitch — ask “do we have answers, structure, or a score?” Labelled history → supervised. Piles of unlabelled data and a hunch → unsupervised. A measurable outcome you can score repeatedly → reinforcement. If you have none of the three, you don't have an AI project yet; you have a data-collection project.
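
The rule above can be sketched as a small checklist function. The function name, parameter names, and returned strings are hypothetical, and the sketch assumes a project may qualify for more than one paradigm at once.

```python
def suggest_paradigms(has_labelled_history, has_unlabelled_data, has_repeatable_score):
    """Map the three questions from the rule to candidate learning paradigms."""
    suggestions = []
    if has_labelled_history:
        suggestions.append("supervised")
    if has_unlabelled_data:
        suggestions.append("unsupervised")
    if has_repeatable_score:
        suggestions.append("reinforcement")
    # None of the three: per the rule, this is a data-collection project first.
    return suggestions or ["data collection first"]


print(suggest_paradigms(True, False, False))   # ['supervised']
print(suggest_paradigms(False, False, False))  # ['data collection first']
```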

REFERENCES
  1. DeepMind — AlphaZero: shedding new light on chess, shogi and Go
  2. Silver et al., Science (2018) — A general reinforcement learning algorithm that masters chess, shogi and Go through self-play
  3. Ouyang et al. (2022) — Training language models to follow instructions with human feedback (InstructGPT/RLHF)