You have never once believed, on reflection, that the number of times a paper is cited equals how good it is. And yet the number runs your professional life: it shapes which papers you read first, which journals you submit to, who gets the grant, who gets tenure. We use citation counts as a stand-in for quality not because anyone defends the equation but because the count is there — countable, comparable, already computed. It is the streetlight, and quality is somewhere out in the dark.
What the count is actually a measure of
Citations measure attention, and attention is correlated with quality only loosely and only on average. Dag Aksnes, Liv Langfeldt and Paul Wouters, reviewing decades of scientometric evidence, put it carefully: citations reflect aspects of scientific impact and visibility, but they are a limited and sometimes misleading proxy for the broader notion of quality, and their meaning varies wildly by field, by paper type, and by time. A methods paper accrues citations for being useful; a review for being convenient; a wrong-but-provocative paper for being argued with. The count cannot tell these apart, because it is a single scalar sitting on top of a dozen different human motives for typing a reference.
Per Seglen made the sharpest case against the journal-level form of this metric back in 1997, in the BMJ: the impact factor of the journal a paper appears in is a poor predictor of the citation rate of the individual paper, because citations within a journal are massively skewed. A few papers earn most of the citations and most earn almost none. Judging a paper by its journal's average is like judging a person by their city's average income. The San Francisco Declaration on Research Assessment, now signed by thousands of organisations, exists specifically to get journal impact factors out of the evaluation of individual papers and people.
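To see how far a journal's average can sit from a typical paper, here is a small sketch. The ten citation counts are invented for illustration, not taken from Seglen or any real journal; they are only shaped to be skewed in the way described above.

```python
from statistics import mean, median

# Hypothetical citation counts for ten papers in one journal.
# These numbers are made up to illustrate a skewed distribution.
citations = [0, 0, 1, 1, 2, 2, 3, 4, 12, 175]

# The impact-factor-style summary: one average for the whole journal.
journal_average = mean(citations)

# What a typical paper in that journal actually receives.
typical_paper = median(citations)

# Share of all citations earned by the two most-cited papers.
top_two_share = sum(sorted(citations)[-2:]) / sum(citations)

# How many papers fall below the average they are being judged by.
below_average = sum(c < journal_average for c in citations)

print(f"journal average:        {journal_average}")
print(f"median paper:           {typical_paper}")
print(f"share held by top two:  {top_two_share:.1%}")
print(f"papers below average:   {below_average} of {len(citations)}")
```

With these toy numbers, nine of the ten papers sit below the journal's average, which is the city-income problem in miniature.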
Goodhart's law, wearing a lab coat
There is a deeper reason the count fails, and it is structural rather than statistical. The anthropologist Marilyn Strathern gave the crispest formulation of what is usually called Goodhart's law: when a measure becomes a target, it ceases to be a good measure. The instant citations became the thing careers are optimised for, behaviour reorganised around producing citations rather than the quality citations were supposed to track: salami-slicing results into least-publishable units, citation cartels, and strategic self-citation, all of which inflate the numbers without improving the work.
A citation count was a fair weather-vane right up until we started paying people for which way it pointed. After that it measured the wind machine, not the wind.
This is not cynicism; it is the predictable result of attaching high stakes to a cheap proxy. The Leiden Manifesto, drafted by Diana Hicks, Paul Wouters and colleagues and published in Nature in 2015, laid down ten principles for research metrics precisely because the metrics had started governing the science instead of describing it. Principle one: quantitative evaluation should support, not supplant, expert assessment. The manifesto exists because we inverted that order.
Why we tolerate a proxy we all distrust
So why does the count survive every critique? Because the alternative it replaced, reading the paper and judging it on its merits, does not scale, and its scaled-up version, peer review, is under strain too. When Peter Rothwell and Charles Martyn measured the agreement between reviewers assessing the same neuroscience submissions, they found inter-reviewer agreement barely better than chance. If the human quality-judgment process is that noisy and that expensive, a cheap number that everyone can compute starts to look attractive, even though everyone knows what it isn't.
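"Barely better than chance" has a precise meaning: two reviewers who each accept half of what they see will agree about half the time even if they decide by coin flip. One standard way to correct for that is Cohen's kappa, sketched below. The function name and the accept/reject data are illustrative assumptions, and the example does not reproduce Rothwell and Martyn's data or claim to mirror their exact analysis.

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement between two raters on the same items.

    Returns 1.0 for perfect agreement and roughly 0.0 when the raters
    agree no more often than their label frequencies alone would predict.
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two equal-length, non-empty rating lists")
    n = len(ratings_a)

    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Expected agreement if each rater labelled independently at their
    # own base rates.
    freq_a = Counter(ratings_a)
    freq_b = Counter(ratings_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label; kappa is undefined.
        return float("nan")
    return (observed - expected) / (1.0 - expected)

# Hypothetical verdicts from two reviewers on the same ten manuscripts.
reviewer_1 = ["accept", "accept", "reject", "reject", "accept",
              "reject", "accept", "reject", "accept", "reject"]
reviewer_2 = ["accept", "reject", "reject", "accept", "accept",
              "reject", "reject", "accept", "accept", "reject"]

kappa = cohens_kappa(reviewer_1, reviewer_2)
print(f"kappa = {kappa:.2f}")
```

In this toy data the reviewers agree on six manuscripts out of ten, which sounds respectable until you subtract the five agreements chance alone would produce.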
That is the trap: the proxy is bad, and the honest alternative doesn't scale, so we keep the bad proxy and complain about it. Breaking the trap means making the honest alternative scale — building a read of a paper's intrinsic quality that does not route through the crowd's attention at all.
What an intrinsic-quality read would have to look at
What would you actually inspect if you were forbidden from looking at how many times a paper was cited? Concretely:
- The structure of its argument — does the conclusion follow from the premises, or is there a gap the fluent prose walks you over?
- The strength of each claim on its own terms — is this assertion backed by data in the paper, by a citation, or by nothing but confident phrasing?
- The integrity of its citations — do the papers it leans on actually say what it says they say, and do they reach evidence or just more pointers?
- Its methodological and statistical footing — is the design capable of supporting the claim, is the sample adequate, are the limitations stated or buried?
Notice that none of these require knowing whether anyone else has read the paper. They are properties of the work itself — a citation-free quality read, computable in principle from the paper and its reference graph alone. That is a very different object from a citation count, and building it is a very different project from counting. The next part looks at the tools that got closest to helping you here — the ones that map how papers relate to each other — and asks why relatedness, useful as it is, still leaves the argument unread.
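As a rough illustration of what "computable from the paper and its reference graph" could mean for the citation-integrity check, the sketch below follows a paper's references outward and asks whether the chain ever reaches a source that holds primary evidence, or only more pointers. Everything in it is hypothetical: the function name, the toy graph, and above all the set of sources labelled as evidence, which a real system would have to establish by reading them. It covers one of the four checks, not the quality read as a whole.

```python
def reaches_evidence(paper, references, has_primary_evidence, max_depth=3):
    """Return True if following citations from `paper` reaches a source
    with primary evidence within `max_depth` hops.

    references: dict mapping a paper id to the list of ids it cites.
    has_primary_evidence: set of ids assumed (hand-labelled here) to
        contain original data rather than further citations.
    """
    seen = {paper}
    frontier = [paper]
    for _ in range(max_depth):
        next_frontier = []
        for current in frontier:
            for cited in references.get(current, []):
                if cited in seen:
                    continue  # already visited; avoids looping on cycles
                if cited in has_primary_evidence:
                    return True
                seen.add(cited)
                next_frontier.append(cited)
        frontier = next_frontier
    return False

# Toy reference graph with made-up ids.
references = {
    "claim_paper": ["review_a"],
    "review_a": ["review_b"],
    "review_b": ["trial_x"],
    "other_claim": ["commentary"],
    "commentary": ["other_claim"],  # two papers citing each other
}
has_primary_evidence = {"trial_x"}

grounded = reaches_evidence("claim_paper", references, has_primary_evidence)
circular = reaches_evidence("other_claim", references, has_primary_evidence)
print(grounded, circular)
```

The traversal itself is the easy part. The hard part, deciding which sources count as evidence and whether each one actually says what the citing paper claims, is exactly the reading that a citation count skips.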