DORA Metrics Are Not Enough. Here Is What We Measure Instead.
Every engineering team now tracks DORA. Deployment frequency. Lead time for changes. Change failure rate. Mean time to recovery. Four numbers. Clean. Simple. Leadership loves them.
I loved them too. Then AI tooling arrived and the numbers stopped making sense.
What went wrong
We rolled out AI coding assistants across the org about eighteen months ago. The early results looked good. Code quality scores went up 7.5%. Reviews got faster. People were happy.
Six months later, delivery stability had dropped 7.2%. More production incidents. More rollbacks. Our DORA scores still looked fine on paper because we were deploying more often and recovering quickly. But the thing we were recovering from was our own output.
We dug into the data. Code churn had nearly doubled. Engineers were generating more code, reviewing it less carefully, and merging it faster. The AI tools made writing code easy. They did not make writing correct code easy.
DORA told us how fast we shipped. It did not tell us whether what we shipped was worth shipping.
The missing pieces
DORA measures the delivery machine. It says nothing about the people running the machine or the quality of what comes out the other end. We needed two more lenses.
SPACE for developer satisfaction
SPACE is a framework from Microsoft Research. It looks at five things: satisfaction and wellbeing, performance, activity, communication and collaboration, efficiency and flow. We run a short survey every two weeks. Ten questions. Takes three minutes.
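To make that concrete, here is a minimal sketch of how fortnightly pulse responses could be rolled up into per-dimension SPACE scores. The question IDs, the dimension mapping, and the 1-5 scale are illustrative assumptions, not our actual survey.

```python
# Hypothetical sketch: aggregating a fortnightly pulse survey into SPACE
# dimension scores. Question IDs, dimension mapping, and the 1-5 Likert
# scale are assumptions for illustration.
from collections import defaultdict
from statistics import mean

# Each question is tagged with the SPACE dimension it informs.
QUESTION_DIMENSIONS = {
    "q1_satisfaction": "satisfaction_wellbeing",
    "q2_burnout": "satisfaction_wellbeing",
    "q3_output_quality": "performance",
    "q4_review_load": "activity",
    "q5_handoffs": "communication_collaboration",
    "q6_interruptions": "efficiency_flow",
    # remaining questions omitted
}

def space_scores(responses: list[dict[str, int]]) -> dict[str, float]:
    """responses: one dict per engineer, question id -> 1-5 rating."""
    by_dimension = defaultdict(list)
    for response in responses:
        for question, rating in response.items():
            dimension = QUESTION_DIMENSIONS.get(question)
            if dimension:
                by_dimension[dimension].append(rating)
    # Average per dimension so a single noisy question cannot dominate.
    return {dim: round(mean(ratings), 2) for dim, ratings in by_dimension.items()}
```

The useful part is the trend per dimension over time, not any single fortnight's number.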
What surprised me: our deployment frequency was at an all time high, but developer satisfaction was falling. Engineers told us they felt like "merge button operators." The AI did the writing. They did the reviewing. Reviewing AI generated code all day is tiring in a way that writing your own code is not. DORA could not see this. SPACE could.
DevEx for cognitive load
The DevEx framework measures three things: feedback loops, cognitive load, and flow state. We found that AI tools had shortened feedback loops (code appeared faster) but increased cognitive load (more code to review per hour, less understanding of what the code actually does). Engineers were context switching more. Flow state time went down.
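We derive those three dimensions from proxies in tooling we already have. A minimal sketch of what that could look like, where the specific signals and thresholds are assumptions rather than anything standard:

```python
# Hypothetical sketch of proxies for the three DevEx dimensions, derived
# from existing tooling. Field names and thresholds are illustrative
# assumptions, not a standard instrumentation.
from dataclasses import dataclass

@dataclass
class DevExWeek:
    median_ci_feedback_minutes: float    # feedback loops: push -> CI verdict
    reviewed_loc_per_hour: float         # cognitive load proxy: review volume
    uninterrupted_blocks_per_day: float  # flow state: 2h+ blocks without meetings or pings

def devex_flags(week: DevExWeek) -> list[str]:
    """Raise a flag when a proxy crosses an (assumed) threshold."""
    flags = []
    if week.median_ci_feedback_minutes > 15:
        flags.append("slow feedback loops")
    if week.reviewed_loc_per_hour > 400:
        flags.append("review load above a sustainable pace")
    if week.uninterrupted_blocks_per_day < 1:
        flags.append("little or no flow time")
    return flags
```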
When cognitive load goes up and flow state goes down, bugs follow. That explained our stability drop better than any DORA metric could.
Measuring AI specifically
Most metrics frameworks were designed before AI coding tools existed. They assume a human wrote the code and another human reviewed it. That assumption is breaking down. We added three measures to deal with this.
First, tool call accuracy at depth. We track how often AI suggestions are correct not just at the surface level but three or four layers into the logic. Shallow suggestions are usually fine. Deep ones are where the trouble lives. An AI can write a perfectly valid function that quietly breaks an invariant two modules away because it never saw that module.
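In practice this is just a bucketed accuracy number. A rough sketch, assuming reviewers tag each accepted suggestion with a depth label (1 = surface edit, 4 = crosses module boundaries) and we later record whether it needed a follow-up fix; that labelling scheme is an assumption for illustration:

```python
# Hypothetical sketch: accuracy of accepted AI suggestions bucketed by how
# deep into the logic they reach. Depth labels and the follow-up-fix signal
# are assumptions about what your review tooling records.
from collections import defaultdict

def accuracy_by_depth(suggestions: list[dict]) -> dict[int, float]:
    """suggestions: {"depth": int, "needed_followup_fix": bool} per accepted suggestion."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for s in suggestions:
        totals[s["depth"]] += 1
        if not s["needed_followup_fix"]:
            correct[s["depth"]] += 1
    return {depth: correct[depth] / totals[depth] for depth in totals}

# Example: surface suggestions look fine, deeper ones are where accuracy drops.
sample = [
    {"depth": 1, "needed_followup_fix": False},
    {"depth": 1, "needed_followup_fix": False},
    {"depth": 3, "needed_followup_fix": True},
    {"depth": 4, "needed_followup_fix": True},
]
print(accuracy_by_depth(sample))  # {1: 1.0, 3: 0.0, 4: 0.0}
```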
Second, context utilisation. How much of the codebase context does the AI actually use when generating a suggestion? Most tools use a narrow window. They miss business rules defined in a different file. They miss that naming convention your team agreed on six months ago. This creates code that works in isolation but breaks things in integration. We track this and it tells us which parts of the codebase are too opaque for AI to work with safely.
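The score itself is simple. A minimal sketch, assuming your tooling logs which files the AI had in its window and which files the change actually touched or depended on:

```python
# Hypothetical sketch of a context utilisation score: of the files a change
# depends on, what fraction was visible to the AI when it generated the
# suggestion? Inputs are assumed to come from your own tooling logs.
def context_utilisation(relevant_files: set[str], files_in_context: set[str]) -> float:
    """1.0 means the AI saw everything the change depended on."""
    if not relevant_files:
        return 1.0
    return len(relevant_files & files_in_context) / len(relevant_files)

score = context_utilisation(
    relevant_files={"billing/rules.py", "billing/models.py", "api/handlers.py"},
    files_in_context={"api/handlers.py"},
)
print(round(score, 2))  # 0.33 -> the AI never saw the business rules it needed
```

Low scores cluster in the parts of the codebase that are too opaque for the tool to work with safely, which is exactly where we tell people to slow down.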
Third, per decision cost. Every AI suggestion that gets accepted is a decision somebody made. We estimate what it would cost to get that decision wrong. Anything touching payments, data pipelines, or authentication gets manual review regardless of how confident the AI seems. Formatting and boilerplate we let through faster. This sounds obvious. But without measuring it, we found engineers were applying the same level of scrutiny to everything, which meant the high risk code got less attention because they were tired from reviewing low risk code.
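The routing rule is the important part, and it fits in a few lines. A sketch of the idea, where the path prefixes and tier names are made-up examples rather than our real configuration:

```python
# Hypothetical sketch of risk-tiered review routing. Path prefixes and tier
# names are illustrative assumptions; the point is that review effort follows
# the estimated cost of a wrong decision, not a uniform level of scrutiny.
HIGH_RISK_PREFIXES = ("payments/", "pipelines/", "auth/")

def review_tier(changed_paths: list[str], is_boilerplate: bool) -> str:
    if any(path.startswith(HIGH_RISK_PREFIXES) for path in changed_paths):
        return "manual review required"   # regardless of how confident the AI seems
    if is_boilerplate:
        return "fast-track"               # formatting, generated stubs
    return "standard review"

print(review_tier(["auth/session.py"], is_boilerplate=False))  # manual review required
```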
One dashboard, three layers
We put all of this on a single page. It took three months to build and two more to get the thresholds right. I can walk the CTO through it in five minutes.
DORA metrics sit at the top. They tell you the speed of the machine. How fast are we deploying. How fast do we recover when something breaks.
SPACE and DevEx sit in the middle. They tell you the health of the people. Are engineers in flow or are they drowning in review work. Are they satisfied or are they quietly burning out.
AI specific metrics sit at the bottom. They tell you whether the new tools are genuinely helping or just making everyone feel productive while creating problems that show up next quarter.
The reading rule we use: if DORA looks good but SPACE is declining, something is wrong that DORA cannot see. If all three look good but AI accuracy at depth is low, you are building debt you will pay for later. Both of these happened to us before we had the full picture.
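The reading rule is simple enough to automate. A minimal sketch, where the trend convention (positive means improving) and the 0.7 accuracy threshold are assumptions you would tune for your own org:

```python
# Hypothetical sketch of the dashboard reading rule as an automated check.
# Metric names, the trend convention (positive = improving), and the 0.7
# threshold are illustrative assumptions.
def dashboard_warnings(dora_trend: float, space_trend: float,
                       devex_trend: float, ai_depth_accuracy: float) -> list[str]:
    warnings = []
    if dora_trend >= 0 and space_trend < 0:
        warnings.append("DORA looks fine but SPACE is declining: "
                        "something is wrong that DORA cannot see")
    if min(dora_trend, space_trend, devex_trend) >= 0 and ai_depth_accuracy < 0.7:
        warnings.append("all layers look healthy but AI accuracy at depth is low: "
                        "you are building debt you will pay for later")
    return warnings
```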
What changed
Over twelve months we hit 95% test automation. Not because we set that as a target. The dashboard made it obvious where manual testing was burning people out, so teams automated those areas first. Defects dropped 80%. QA cycle time came down 70%. Those numbers came from fixing real problems the dashboard surfaced, not from chasing a metric.
The number I watch most closely now is developer satisfaction. It went back up. Engineers feel like they are building things again, not just approving things an AI built. That matters more to me than deployment frequency because satisfied engineers write better code and stay longer.
What I would tell you if we were having coffee
DORA is necessary. It is not sufficient. If you are using AI tools in your engineering org and only tracking DORA, you are driving at night with half your headlights out. The road looks fine until it isn't.
Add SPACE. Add DevEx. Add AI specific quality measures. The whole setup costs you one short fortnightly survey and some instrumentation around your AI tooling. That is not a lot of effort for a view of engineering health that actually matches reality.
Your deployment frequency will tell you how fast you are going. The other metrics will tell you whether that speed is safe.