2026-06-16 · 6 min read

03 — Green Means Running, Not Understood

Every signal says the system is healthy. None of them is measuring what's actually failing.

The objection at the end of the last essay was the right one, and it deserves to be taken at full strength before it is taken apart. Surely we would notice. We are not flying blind. We have a wall of instruments — test suites, pipelines that go green or red, dashboards that track how fast the team is moving, even metrics that claim to tell us where knowledge sits and how exposed we'd be if someone left. If a system were drifting out of anyone's understanding, something on that wall would twitch.

It is a fair thing to believe. It is also the most dangerous thing to believe, because every instrument on that wall was built to measure something else, and the calm they report is not the calm you think it is.

Start with the tests, because they are the instrument people trust most. A passing test tells you that a particular path through the code ran and produced the answer someone expected. That is worth having. But it certifies only that the behaviors someone thought to check still hold, not that the right behaviors were checked, and not that anyone understands why they're the right ones. A suite can be large, green, and thorough about exactly the things its author already understood, while saying nothing at all about the things no one understood well enough to write a test for. The green is real. It is just answering a narrower question than the one you're asking it.
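A small sketch makes the narrowness concrete. Everything in it is invented for illustration: `apply_discount` is a hypothetical pricing helper, and `run_suite` stands in for whatever checks its author thought to write. The suite is honestly green, and it has nothing to say about an input nobody considered.

```python
def apply_discount(price: float, percent: float) -> float:
    """Hypothetical pricing helper, invented for this illustration."""
    return price * (1 - percent / 100)


def run_suite() -> bool:
    """The checks the author thought to write. Every one of them passes."""
    checks = [
        apply_discount(100.0, 0) == 100.0,
        apply_discount(100.0, 25) == 75.0,
        apply_discount(80.0, 50) == 40.0,
    ]
    return all(checks)


suite_is_green = run_suite()

# A behavior nobody thought to check: a discount above 100 percent
# yields a negative price, and the suite has no opinion about it.
unchecked_result = apply_discount(100.0, 150)
```

Nothing here is a bug in the tests. They answer the question they were written to answer, and the negative price sits outside it.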

The pipeline is the same move at a larger scale. A green build means the thing compiles, links, deploys, and passes its checks. It means the system runs. It was never designed to mean the system is understood, and it has no way to tell the difference between code a person reasoned through and code that was generated, merged, and never read. Both turn the light green. The light was only ever wired to whether the system runs.

Velocity is worse, because velocity is the symptom wearing the costume of the cure. The whole premise of this series is that output has outrun comprehension. A team shipping faster than ever is producing exactly the condition the debt grows in — and the chart that shows them shipping faster is read, in most rooms, as evidence that things are going well. The faster the line climbs, the more reassured everyone is, and the faster the gap behind it widens. The instrument that should be the alarm is being read as the all-clear.

Then there is review, which is supposed to be the human backstop for everything the machines miss. Review worked when the rate of code arriving for it matched the rate a person could actually think about. That coupling is gone. When generated changes arrive faster than anyone can study them, review does not stop happening; it stops meaning what it meant. It becomes approval at the speed of scrolling, a signature certifying that someone looked, which is not the same as certifying that someone understood. The ritual survives intact. The thing the ritual was for does not.

And then the instrument that doesn't just go quiet but actively misleads. There are well-known ways to measure where knowledge lives in a team — who would have to leave before a system became unmaintainable, how concentrated the understanding is, how exposed you are. Every one of them works by looking at who wrote the code. That was a sound proxy for fifty years, because writing code required understanding it, so authorship was a fingerprint of comprehension. It is not sound anymore. When a person can author a module they never understood — accept it, merge it, attach their name to the commit — authorship stops being evidence of anything. The metric still runs. It still returns a confident number. The number is now computed from a signal that came unhooked from the thing it stood in for, and a confident number measuring the wrong quantity is more dangerous than no number at all, because it ends the conversation. You asked who understands this system, a tool answered, the answer was authoritative, and the answer was about something else entirely.

And the instinct that comes next — that the metric can be repaired, that you could credit the reviewer instead of the author, or weight it by how much someone worked the code — is the dead end dressed as the fix. Each of those is still a record of who handled the thing, and handling it is now equally consistent with having understood it completely, in part, or not at all. No accounting of who touched the code recovers who understands it. The signal the metric needs is not miscalibrated. It is gone.
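The authorship problem can be shown in a few lines. What follows is a deliberately simplified metric in the style of what is often called a bus factor, not the algorithm of any particular tool. The `Commit` record, the names, and the `understood` field are all hypothetical; the field is there only to mark a ground truth that no real commit log records.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    author: str
    lines: int
    # Hypothetical ground truth that no commit log records: whether the
    # author actually understood what they committed.
    understood: bool


def bus_factor(commits: list[Commit], threshold: float = 0.5) -> int:
    """Smallest number of authors whose commits cover more than
    `threshold` of all authored lines. Reads only author and lines."""
    per_author = Counter()
    for c in commits:
        per_author[c.author] += c.lines
    total = sum(per_author.values())
    covered = 0
    for count, (_, lines) in enumerate(per_author.most_common(), start=1):
        covered += lines
        if covered > threshold * total:
            return count
    return len(per_author)


# Two histories with identical commit logs and opposite realities.
reasoned = [
    Commit("ana", 400, understood=True),
    Commit("ben", 350, understood=True),
    Commit("cy", 250, understood=True),
]
generated = [
    Commit("ana", 400, understood=False),
    Commit("ben", 350, understood=False),
    Commit("cy", 250, understood=True),
]

score_reasoned = bus_factor(reasoned)
score_generated = bus_factor(generated)
```

Both histories produce the same score, because the function never reads `understood` and has no way to. Swapping the author for the reviewer, or weighting by lines touched, changes which column is counted but not what kind of column it is.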

Notice what these instruments have in common. Tests watch the artifact. Pipelines watch the build. Velocity watches output. Authorship metrics watch the commit log. Every one of them is pointed at something an organization produces, and the thing that's failing is not in anything it produces. It's in whether anyone can still account for what was produced, and that lives in cognition, where none of these instruments is looking. They are not broken. They are aimed at the wrong place, and they are aimed there precisely because, in the world they were designed for, the wrong place and the right place were the same place. Output used to imply comprehension. The instruments still assume it does.

Every signal you have agrees the system is healthy. None of them is measuring the thing that's failing.

And it is not only the instruments you own that point the wrong way. The one serious attempt to measure even the easy version of this — whether the tools make people faster at all — fell apart before it could give an answer, because within a year the developers in it had grown so unwilling to work without the tools that there was no one left to form a control group. Sit with what that means. A field that cannot hold a control group for its simplest question is in no position to be quietly measuring its hardest one. The measurement is not lagging behind the problem. It has not started. Even the largest industry telemetry confesses it in its own pages: it can watch a codebase heave, work torn out and rewritten at a scale it has never seen, and admit in the same report that it cannot tell whether the churn is waste or repair.

Which leaves a harder question than the one we started with. It is not enough to say the existing instruments are blind — that only tells you where not to look. If none of the things you currently measure touches the quantity that actually matters, then the next move isn't to find a better dashboard. It's to ask what it would even mean to measure this — to take a thing that has lived entirely in people's heads and turn it into something you can read.