The scariest code in your company runs four times a year
Ask an engineering leader about "bus factor" and they'll give you the standard definition: how many people can get hit by a bus (the industry's cheerfully morbid phrasing, softened these days to "win the lottery") before nobody understands some part of the system. It's a people question. Who knows what. Succession planning for nerds.
It's a good question. But there's a sharper version hiding underneath it, and the sharper version isn't about people at all:
Which FUNCTION, if it silently stopped working correctly, would take the longest to notice, the longest to fix, and do the most damage in between?
Because here's the thing the people-framing misses: the answer to that question is almost never your busiest code. It's your quietest.
The paradox of the rarely-called critical function
Think about the functions in your system that run four times a year or less. Not the checkout path — that runs every minute, and precisely BECAUSE it runs every minute, it's watched, exercised, and beaten into shape by sheer traffic. Frequent execution is a form of continuous testing that nobody has to schedule. Your hot path (Part 14 of this series) is, weirdly, often your most trustworthy code.
No — think about:
- The refund handler for the weird case. Not normal refunds — the partial-refund-after-plan-change-with-prorated-credit case. Written in 2022. Runs when it runs.
- The auth edge case. The SSO fallback for that one enterprise client's legacy identity provider. Executes when their session does something unusual, which is to say: during their quarterly board-prep week, exclusively.
- The year-end job. Billing close, tax export, the compliance report. Runs once per year, in the highest-stakes week of the year, written by someone who left.
- The disaster path. The failover routine, the backup restore, the queue-drain script. Runs only when things are ALREADY bad. This code's first real test is always a live performance.
Notice the cruel structure they share. These functions are anti-correlated with observation. They run rarely, so bugs accumulate unexercised. Nobody reads them, so the bugs stay. The environment drifts around them — schemas migrate, APIs version, currencies get added — and unlike hot-path code, NOTHING alerts when a dormant assumption quietly goes stale, because staleness only manifests at call time. And call time, by construction, is a moment of maximum consequence: a refund dispute, an enterprise login crisis, year-end close, an outage. The rarely-called critical function does not fail on a random Tuesday. It fails on the worst possible day, because the worst possible day is the only day it gets called.
It's a smoke detector, except the only scheduled test is an actual fire.
Quiet experts, and why every instrument you own ignores them
Internally we call this category of code "quiet experts" — functions that hold rare, high-stakes competence and never make noise about it. (It's literally a state in CodeNSM's function taxonomy, sitting alongside the flashier ones, and it exists because this exact category kept turning out to matter more than its call count suggested.) The human-org analogy is exact: every company has an employee who does one irreplaceable thing per year — the person who ACTUALLY understands the tax filing — and every company systematically under-tracks them, because all our attention instruments are keyed to activity.
Look at how thoroughly the quiet expert evades your tooling:
- Git can't see it. Version-control forensics, the approach that Adam Tornhill's "Your Code as a Crime Scene" built into a discipline, finds risk where change concentrates. Powerful! But the quiet expert doesn't change. It's had two commits since 2022. Tornhill's own knowledge-map work shows how code whose authors have left becomes terra incognita; the quiet expert is usually deep inside that territory, stable-looking precisely because nobody dares touch it.
- APM can't rank it. Request monitoring sorts by volume and latency. Four calls a year rounds to zero on every dashboard. The quiet expert is on page 40 of the sort order, below functions that format tooltips.
- Alerting has no baseline for it. Anomaly detection needs a normal. What's the "normal error rate" of something that runs quarterly? Its entire production history is eleven data points.
- Deletion review threatens it from the other side. Here's the nasty symmetry with Part 13: dead code and quiet experts LOOK IDENTICAL from the outside — both just sit there, uncalled, for months. A team that gets ambitious about deleting dead code, without call-frequency data over a long enough window, will eventually delete a smoke detector. (This is why every codebase cleanup initiative is haunted by one veteran saying "I wouldn't touch that" and nobody being able to prove them right or wrong.)
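To make the dead-code confusion concrete, here is a minimal Python sketch. It assumes you already have per-function call timestamps from some tracing or logging source, which this post doesn't prescribe. The names `classify`, `window`, and `quiet_max_per_year` are hypothetical, and the thresholds are illustrative rather than recommended. It shows the same year-end job reading as never-called over a 90-day window and as a quiet function over a three-year one.

```python
from datetime import datetime, timedelta


def classify(call_times: list[datetime], now: datetime,
             window: timedelta = timedelta(days=3 * 365),
             quiet_max_per_year: float = 10.0) -> str:
    """Label a function 'hot', 'quiet', or 'never-called' within a window.

    Hypothetical helper: the labels and thresholds are assumptions
    made for this illustration, not an established taxonomy.
    """
    recent = [t for t in call_times if now - window <= t <= now]
    if not recent:
        # No calls observed in the window. Only a long window makes
        # this meaningful evidence that the code is actually dead.
        return "never-called"
    calls_per_year = len(recent) / (window.days / 365)
    return "hot" if calls_per_year > quiet_max_per_year else "quiet"


now = datetime(2025, 6, 1)

# A year-end job that has run once every December.
year_end_job = [
    datetime(2022, 12, 30),
    datetime(2023, 12, 29),
    datetime(2024, 12, 30),
]

short_view = classify(year_end_job, now, window=timedelta(days=90))
long_view = classify(year_end_job, now)

print(short_view)  # never-called
print(long_view)   # quiet
```

A "quiet" label only says the function runs rarely. Whether it is also critical still takes a human judgment, which is exactly the part the veteran saying "I wouldn't touch that" is supplying.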
The two-list thought experiment
Try this with your team — it takes ten minutes and produces genuine dread, which is this series' love language:
- List every function you can think of that runs fewer than ~10 times a year but whose failure would trigger a CEO-level phone call. Refunds, auth fallbacks, compliance jobs, restores. Aim for ten entries.
- For each, answer three questions. When did it last run, and did it work? Who currently understands it, and is that person still here? If it failed silently on its next run, HOW would we find out? Notice whether your honest answer to that last one is "a customer tells us."
Most teams cannot complete step one — the list itself requires call-frequency data nobody has. That's worth sitting with: the category of code most likely to cause your worst day is a category your organization cannot currently ENUMERATE, let alone audit. Ward Cunningham's original debt metaphor said the interest comes due when you next touch the code; the quiet expert's interest comes due when the WORLD next touches it, and the world doesn't file tickets in advance.
Rewriting the bus factor
So: the bus factor isn't wrong, it's just aimed at the wrong noun. The people version asks "what does Dana uniquely know?" The function version asks "what does the CODEBASE uniquely know — and is anyone keeping that knowledge alive?" A refund handler that encodes four years of edge-case learning, whose author is gone, and whose logic exists nowhere else — no doc, no test that exercises the weird branch, no human theory in the Naur sense — is a single point of failure with no employee attached. It can't resign, but it can rot, and rot doesn't give notice.
The two versions also multiply, which is where the real nightmares live. Rank your risks on both axes at once: code that runs rarely AND is understood by one person AND that person is a flight risk. That triple intersection — call it the bus factor squared — is the single most dangerous coordinate in your entire company, and no instrument you currently run can even PLOT it, because plotting it needs call-frequency data (which nobody collects) joined to knowledge maps (which live in folklore). You can't manage a risk you can't locate. Right now, yours is unlocated.
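If you did have both data sets, the join itself would be small. Here is a rough Python sketch, assuming a call-frequency table and a knowledge map both keyed by function name. Every name in it (`rank_risks`, `knowledge_map`, `flight_risks`, the axis labels) and the threshold of 10 calls a year are hypothetical choices for illustration.

```python
def rank_risks(call_frequency: dict[str, float],
               knowledge_map: dict[str, set[str]],
               flight_risks: set[str],
               rare_threshold: float = 10.0) -> list[tuple[str, list[str]]]:
    """Join call frequency to a knowledge map and rank by risk axes hit.

    Hypothetical sketch: inputs are assumed to exist already, which is
    the hard part in practice.
    """
    ranked = []
    for name, calls_per_year in call_frequency.items():
        knowers = knowledge_map.get(name, set())
        axes = []
        if calls_per_year < rare_threshold:
            axes.append("rarely-called")
        if len(knowers) <= 1:
            axes.append("single-understander")
        # True when nobody understands it, or everyone who does may leave.
        if knowers <= flight_risks:
            axes.append("knowledge-at-risk")
        ranked.append((name, axes))
    # Most axes first; ties broken alphabetically for stable output.
    ranked.sort(key=lambda item: (-len(item[1]), item[0]))
    return ranked


# Illustrative data only; none of these numbers come from a real system.
call_frequency = {"checkout": 500000.0, "weird_refund": 4.0, "tooltip_format": 90000.0}
knowledge_map = {
    "checkout": {"ana", "ben", "chloe"},
    "weird_refund": {"dana"},
    "tooltip_format": {"ben"},
}
flight_risks = {"dana"}

ranked = rank_risks(call_frequency, knowledge_map, flight_risks)
for name, axes in ranked:
    print(name, axes)
# weird_refund ['rarely-called', 'single-understander', 'knowledge-at-risk']
# tooltip_format ['single-understander']
# checkout []
```

The code is trivial on purpose. The difficulty the paragraph above describes is entirely in producing the three inputs, not in combining them.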
The fix isn't heroics; it's bookkeeping. A standing register of rarely-called-but-critical functions — flagged by an instrument rather than by memory, each with a last-successful-run date and a designated understander — plus the one discipline that actually retires the risk: rehearse the smoke detectors. Run the restore. Fire the year-end job against staging in July. Execute the weird refund path with a test account quarterly. Every rehearsal converts a quiet expert's next performance from opening night into a rerun.
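As bookkeeping, the register can start very small. The following Python sketch assumes one record per function, carrying the two fields named above plus a rehearsal interval. `RegisterEntry`, `review_register`, and the intervals are hypothetical, and the entries are made-up examples.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class RegisterEntry:
    function: str
    understander: Optional[str]          # designated person, None if unassigned
    last_successful_run: Optional[date]  # real run or rehearsal, None if unknown
    rehearse_every: timedelta = timedelta(days=90)  # illustrative default


def review_register(entries: list[RegisterEntry], today: date) -> list[tuple[str, str]]:
    """Return (function, finding) pairs for entries that need attention."""
    findings = []
    for entry in entries:
        if entry.understander is None:
            findings.append((entry.function, "no designated understander"))
        if entry.last_successful_run is None:
            findings.append((entry.function, "no recorded successful run"))
        elif today - entry.last_successful_run > entry.rehearse_every:
            findings.append((entry.function, "rehearsal overdue"))
    return findings


# Made-up entries for illustration.
register = [
    RegisterEntry("backup_restore", "dana", date(2025, 1, 10)),
    RegisterEntry("year_end_close", None, date(2024, 12, 30), timedelta(days=365)),
    RegisterEntry("weird_refund", "sam", date(2025, 5, 15)),
]

findings = review_register(register, today=date(2025, 7, 1))
for function, finding in findings:
    print(f"{function}: {finding}")
# backup_restore: rehearsal overdue
# year_end_close: no designated understander
```

A successful rehearsal would simply update `last_successful_run`, so a staging run of the year-end job in July counts the same as a December run for the purposes of this check.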
Below, the audit. It's short. So is the amount of warning these functions give you.