What if everything you're measuring about your engineering team is wrong?
That's the question a VP of Engineering faced after a major customer churned. His dashboard looked perfect: commits per day trending up, pull requests merged at record pace, story points completed exceeding forecasts, cycle time shortened by 20%. Every metric was green.
The customer churned anyway. Critical bugs had shipped with features. Support tickets went unanswered because everyone was focused on new development. The product had become unreliable despite the constant stream of new features.
"I don't understand," he said. "Our metrics are excellent."
They weren't excellent. They were just green. The metrics measured activity, not outcomes. They measured volume, not value. The team was optimizing for the metrics rather than for what the metrics were supposed to represent—and in doing so, they'd neglected the things that actually mattered.
This is the fundamental problem with engineering metrics: most organizations measure what's easy to count, not what's important to measure. Lines of code, commits, story points—these are trivially measurable and completely disconnected from business value. When you make these metrics visible and judge people by them, you get exactly what you measured: lots of code, lots of commits, lots of points—and potentially none of the outcomes you actually care about.
At SmithSpektrum, I've helped dozens of engineering organizations design their measurement systems[^1]. The ones that measure well focus on outcomes that matter to the business, track leading indicators that predict those outcomes, and resist the temptation to measure activity for its own sake. The ones that measure poorly create dashboards that look impressive while the organization rots.
The Goodhart's Law Problem
Goodhart's Law states that when a measure becomes a target, it ceases to be a good measure. Engineering is particularly susceptible to this.
If you measure lines of code, you get verbose code. Engineers write more code because that's what gets rewarded, even when less code would be better. Code becomes harder to maintain, but the metric is green.
If you measure story points completed, you get point inflation. Teams estimate higher so they can show more completion. A task that was 3 points becomes 5 points, and the velocity chart looks better while actual output stays the same.
If you measure number of pull requests merged, you get smaller pull requests. This sounds good—small PRs are generally better—until you realize that engineers are splitting work artificially to inflate their numbers, creating coordination overhead and deployment complexity.
If you measure cycle time, you get code rushed through review. The metric says "move fast," so thorough review becomes a bottleneck to eliminate rather than a quality gate to value.
The problem isn't measurement itself. It's measuring activities instead of outcomes, and then treating those measurements as targets. Activities are upstream of outcomes: they're the things you do to achieve outcomes, not the outcomes themselves. When you target activities, you lose sight of what the activities were supposed to accomplish.
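The story-point inflation described earlier in this section is easy to see with a few numbers. The sketch below is purely illustrative: the sprint data is invented, and `velocity` is a hypothetical helper, not something from any tracking tool.

```python
def velocity(task_points: list[int]) -> int:
    """Velocity as usually charted: the sum of points for completed tasks."""
    return sum(task_points)


# Invented example data: the same four tasks, estimated two different ways.
before = [3, 3, 3, 3]  # each task estimated at 3 points
after = [5, 5, 5, 5]   # the same tasks re-estimated at 5 points

tasks_before, tasks_after = len(before), len(after)
growth = (velocity(after) - velocity(before)) / velocity(before)

print(f"tasks completed: {tasks_before} -> {tasks_after}")
print(f"velocity: {velocity(before)} -> {velocity(after)} ({growth:.0%} growth)")
```

The chart shows velocity up by about two thirds while the number of tasks completed has not moved at all.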
What Actually Matters
The metrics that matter are the ones connected to business outcomes.
Customer value delivered is the ultimate measure. Are customers getting value from what you build? This shows up in customer satisfaction, retention, feature adoption, and revenue growth. These metrics aren't directly controllable by engineering, but engineering contributes to them—and that's the point.
| Metric Category | Good Metrics | Vanity Metrics | What to Watch For |
|---|---|---|---|
| Delivery | Lead time, deployment frequency | Lines of code | Gaming by splitting PRs |
| Quality | Change failure rate, MTTR | Bug count | Hiding bugs as features |
| Health | Team satisfaction, retention | Happiness scores | Survey fatigue |
| Impact | Customer metrics, business outcomes | Story points velocity | Correlation ≠ causation |
Product quality affects customer experience. Bug rates, particularly customer-facing bugs. Incident frequency and severity. Support ticket volume and resolution time. These measure whether what you're shipping actually works.
Team health affects sustainable delivery. Retention and attrition. Engagement scores. Burnout indicators. A team producing lots of code while burning out will eventually stop producing anything.
Capability development affects future delivery. Are you building skills, platforms, and systems that will make future work easier? Technical debt trajectory tells you whether you're building capability or consuming it.
These metrics are harder to measure than commits per day. They require connecting engineering work to business outcomes. They can't be gamed as easily because they measure real things in the real world. And they focus engineering attention on what actually matters: delivering value to customers in a sustainable way.
The DORA Metrics
The DORA metrics (deployment frequency, lead time for changes, change failure rate, and time to restore service) have become an industry standard for measuring engineering effectiveness, and for good reason: they're outcome-oriented and predictive.
Deployment frequency measures how often you can ship to production. More frequent deployment correlates with better outcomes because it means smaller batches, faster feedback, and the ability to iterate quickly. It's not directly about business value, but it's a leading indicator—teams that can deploy frequently can deliver value faster.
Lead time for changes measures how long it takes from code commit to production. Shorter lead time means faster feedback and faster delivery. Long lead times indicate bottlenecks—slow review, slow testing, slow deployment—that prevent value from reaching customers.
Change failure rate measures what percentage of changes cause failures requiring intervention. Lower is better. This measures quality in a way that connects to customer impact—failures are things customers experience.
Time to restore service measures how quickly you can recover from failures. Failures happen; the question is how long customers are affected. Faster recovery means more reliable service from the customer perspective.
These metrics are better than activity metrics because they measure capability—the ability to deliver value quickly and reliably—rather than just volume of activity. A team with excellent DORA metrics has the capability to be excellent; a team with excellent commit counts may or may not.
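To make the four definitions concrete, here is a minimal sketch of how they might be computed from deployment records. The `Deployment` record shape and the `dora_summary` function are assumptions for illustration, not part of any DORA tooling. The sketch also assumes a failure starts at deployment time, so time to restore is measured from `deployed_at`; teams with incident timestamps would use those instead.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    # Hypothetical record shape; adapt the fields to your own tooling.
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # caused a failure needing intervention
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def hours(delta: timedelta) -> float:
    return delta.total_seconds() / 3600


def dora_summary(deployments: list[Deployment], period_days: int) -> dict[str, float]:
    """Summarise the four DORA metrics for one reporting period."""
    if not deployments:
        raise ValueError("need at least one deployment")
    lead_times = [hours(d.deployed_at - d.committed_at) for d in deployments]
    failures = [d for d in deployments if d.failed]
    # Assumption: the failure began at deployment time.
    restores = [
        hours(d.restored_at - d.deployed_at)
        for d in failures
        if d.restored_at is not None
    ]
    return {
        "deploys_per_week": len(deployments) / (period_days / 7),
        "median_lead_time_hours": median(lead_times),
        "change_failure_rate": len(failures) / len(deployments),
        "median_time_to_restore_hours": median(restores) if restores else 0.0,
    }


# Invented data: four deployments in one week, one of which failed.
start = datetime(2024, 1, 1, 9, 0)
deploys = [
    Deployment(start, start + timedelta(hours=4)),
    Deployment(start, start + timedelta(hours=10), failed=True,
               restored_at=start + timedelta(hours=11)),
    Deployment(start + timedelta(days=3), start + timedelta(days=3, hours=6)),
    Deployment(start + timedelta(days=5), start + timedelta(days=5, hours=8)),
]
summary = dora_summary(deploys, period_days=7)
# summary: 4.0 deploys per week, 7.0 h median lead time,
# 0.25 change failure rate, 1.0 h median time to restore
```

Medians are used here rather than means so that a single unusually slow change does not dominate the summary; that is a design choice of this sketch, not a requirement of the metrics.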
Leading Versus Lagging Indicators
Understanding the relationship between leading and lagging indicators helps design measurement systems.
Lagging indicators tell you what happened. Customer retention is a lagging indicator—by the time it drops, the problems that caused the drop already happened. Revenue is lagging. NPS is lagging. These metrics are important but not actionable in real-time.
Leading indicators predict what will happen. Code review cycle time is a leading indicator for lead time. Test coverage trends predict change failure rate. Team engagement scores predict retention. These are actionable—if you see a leading indicator deteriorating, you can intervene before the lagging indicator reflects the damage.
Good measurement systems include both. Lagging indicators tell you whether you're achieving outcomes. Leading indicators help you predict and prevent problems. If you only measure lagging indicators, you'll always be reacting to problems that have already occurred.
For engineering teams, useful leading indicators include developer experience survey results (predicting retention and productivity), deployment pipeline health (predicting lead time), test coverage and test quality (predicting change failure rate), on-call burden and toil (predicting burnout), and codebase health metrics such as complexity and dependency trends (predicting future velocity).
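One simple way to notice a leading indicator deteriorating is to fit a trend line to its recent values and flag a sustained slope in the wrong direction. The sketch below uses an ordinary least-squares slope. The function names, the sample data, and the thresholds are all assumptions chosen for illustration; a real threshold would come from knowing how noisy the metric normally is.

```python
def trend_slope(values: list[float]) -> float:
    """Least-squares slope of the values against their index (change per period)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


def is_deteriorating(values: list[float], higher_is_worse: bool = True,
                     threshold: float = 0.0) -> bool:
    """True when the trend moves in the bad direction faster than `threshold`."""
    slope = trend_slope(values)
    return slope > threshold if higher_is_worse else slope < -threshold


# Invented weekly data: code review cycle time in hours (higher is worse).
review_hours = [10, 12, 14, 16]
# Invented engagement scores on a 5-point scale (lower is worse).
engagement = [4.2, 4.1, 4.2, 4.1]

review_alert = is_deteriorating(review_hours, threshold=0.5)
engagement_alert = is_deteriorating(engagement, higher_is_worse=False, threshold=0.1)
# review_alert is True (slope of 2.0 hours per week);
# engagement_alert is False (the wobble is within the threshold)
```

A flag like this is a prompt for a conversation, not a verdict: the slope says something changed, not why.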
Measuring Without Gaming
Designing metrics that resist gaming requires thought.
Combine metrics that trade off against each other. If you measure both velocity and quality, optimizing one at the expense of the other doesn't help your overall score. If you measure both delivery speed and customer satisfaction, shipping fast but broken doesn't look good.
Measure outcomes rather than outputs where possible. "Customer retention" is harder to game than "features shipped" because retention depends on the features actually being valuable. "Revenue per engineer" is harder to game than "commits per engineer" because revenue requires real value creation.
Use multiple signals for important concepts. Quality isn't just bug count—it's also support tickets, customer feedback, incident frequency, technical debt trends. Multiple signals make gaming harder because you'd have to game all of them.
Interpret metrics in context. A metric that improves while other metrics deteriorate isn't actually good. A metric that deteriorates for known reasons (a major refactor, a team transition) shouldn't trigger alarm. Human judgment must contextualize numbers.
Avoid individual metrics for activities. Measuring individual engineers on commits or PRs creates gaming and competition. Team-level outcome metrics align incentives better.
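The idea of combining metrics that trade off against each other can be sketched as a small check: an improvement in one metric only counts if its counterweight did not get worse. `MetricChange` and `assess_pair` are hypothetical names, and the numbers are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricChange:
    name: str
    before: float
    after: float
    higher_is_better: bool

    def improved(self) -> bool:
        if self.higher_is_better:
            return self.after > self.before
        return self.after < self.before

    def worsened(self) -> bool:
        if self.higher_is_better:
            return self.after < self.before
        return self.after > self.before


def assess_pair(primary: MetricChange, counterweight: MetricChange) -> str:
    """Classify a change in `primary` in light of its paired counterweight."""
    if primary.improved() and counterweight.worsened():
        return "suspect"  # faster but worse: look closer before celebrating
    if primary.improved():
        return "genuine"
    return "no improvement"


# Invented example: lead time halves, but change failure rate more than doubles.
lead_time = MetricChange("lead time (hours)", before=48, after=24, higher_is_better=False)
failures_up = MetricChange("change failure rate", before=0.10, after=0.25, higher_is_better=False)
failures_down = MetricChange("change failure rate", before=0.10, after=0.08, higher_is_better=False)

rushed = assess_pair(lead_time, failures_up)     # "suspect"
healthy = assess_pair(lead_time, failures_down)  # "genuine"
```

The "suspect" label is deliberately not "bad". It marks the cases where context and judgment are needed, which is where the rushed-review pattern tends to hide.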
What to Stop Measuring
Some metrics do more harm than good.
Lines of code should not be measured. More code is not better code; often it's worse. Measuring lines of code encourages verbosity and penalizes elegant, concise solutions. If you're measuring lines of code, stop.
Individual developers' velocity should not be compared. Comparing story points or throughput between individual developers creates competition where you want collaboration. It penalizes engineers who help others, who do code review, who mentor: all activities that don't generate individual output but make the team better.
Time should not be tracked at the task level (for most teams). Tracking how long each task took encourages sandbagging estimates, discourages taking on hard problems, and creates overhead that doesn't improve outcomes. Unless you have a specific reason to need this data, it's not worth collecting.
Merge commit counts should not be measured. This metric is trivially gamed and doesn't correlate with value. A single well-designed feature contributes more than ten trivial commits, but the metric doesn't know that.
Meeting counts and utilization should not be measured. Measuring "percentage of time in meetings" or "percentage of time coding" doesn't tell you anything useful. The right amount of meetings depends on the work; an architect might need many meetings while a deep technical engineer might need few. Neither is better.
Implementing Measurement Effectively
Designing good metrics is the beginning; implementing them well is equally important.
Communicate the intent behind metrics. If you measure deployment frequency, explain that you want the capability to ship quickly, not that you expect deployments every day regardless of whether there's something to deploy. Intent prevents misinterpretation.
Review metrics as a team and discuss what they mean. Numbers without conversation are easily misunderstood. Regular discussions about what the metrics show—and don't show—build shared understanding.
Change metrics when they stop being useful. A metric that was valuable six months ago might not be valuable now. If a metric isn't informing decisions, stop collecting it. The goal is insight, not comprehensive tracking.
Don't let metrics replace judgment. Metrics inform decisions; they don't make decisions. A manager who says "the metric says X therefore we do Y" is abdicating judgment. Metrics are inputs to thinking, not substitutes for it.
Avoid metric overload. Too many metrics is worse than too few. You can't pay attention to twenty dashboards. Identify the few metrics that actually matter and focus on those.
Example Measurement System
A reasonable measurement system for an engineering team might include the following.
Business outcome metrics (reviewed monthly): customer satisfaction, feature adoption, revenue metrics if relevant. These connect engineering to business value.
Product quality metrics (reviewed weekly): customer-facing bug rate, incident count and severity, support ticket volume. These measure whether what you ship works.
Delivery capability metrics (reviewed weekly): DORA metrics—deployment frequency, lead time, change failure rate, recovery time. These measure the team's ability to deliver.
Team health metrics (reviewed monthly): engagement survey results, retention, on-call burden. These measure sustainability.
Leading indicators (reviewed continuously): deployment pipeline health, code review cycle time, test suite performance, codebase complexity trends. These predict future problems.
This isn't a comprehensive list—different teams need different metrics—but it illustrates the structure: outcome metrics that matter, capability metrics that predict, leading indicators that warn, and a manageable number overall.
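The structure above can be written down as plain data, which makes the review cadences explicit and the total count easy to check. This is only a sketch: the dictionary layout, the helper names, and the `MAX_METRICS` ceiling of 20 are my assumptions for illustration, and the metric lists simply mirror the example in this section.

```python
# Hypothetical layout: category -> review cadence and the metrics in it.
MEASUREMENT_SYSTEM = {
    "business_outcomes": {
        "cadence": "monthly",
        "metrics": ["customer satisfaction", "feature adoption", "revenue"],
    },
    "product_quality": {
        "cadence": "weekly",
        "metrics": ["customer-facing bug rate", "incident count and severity",
                    "support ticket volume"],
    },
    "delivery_capability": {
        "cadence": "weekly",
        "metrics": ["deployment frequency", "lead time",
                    "change failure rate", "recovery time"],
    },
    "team_health": {
        "cadence": "monthly",
        "metrics": ["engagement survey results", "retention", "on-call burden"],
    },
    "leading_indicators": {
        "cadence": "continuous",
        "metrics": ["deployment pipeline health", "code review cycle time",
                    "test suite performance", "codebase complexity trends"],
    },
}

MAX_METRICS = 20  # assumed ceiling; pick whatever your team can actually attend to


def metrics_for(cadence: str) -> list[str]:
    """All metrics reviewed at the given cadence, across categories."""
    return [
        metric
        for category in MEASUREMENT_SYSTEM.values()
        if category["cadence"] == cadence
        for metric in category["metrics"]
    ]


def total_metrics() -> int:
    return sum(len(category["metrics"]) for category in MEASUREMENT_SYSTEM.values())


weekly_agenda = metrics_for("weekly")          # 7 metrics for the weekly review
overloaded = total_metrics() > MAX_METRICS     # False: 17 metrics in total
```

Keeping the definition in one place also makes it obvious when a metric is added without a cadence or a category, which is usually a sign nobody plans to look at it.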
The VP whose metrics were all green while customers churned? He rebuilt his measurement system.
He kept some activity metrics—they were useful for spotting anomalies—but he stopped treating them as targets. He added customer-facing quality metrics that he reviewed weekly. He started looking at support tickets and customer feedback, not just feature velocity.
Six months later, some of his activity metrics had actually gone down—fewer features shipped, lower "velocity"—but customer satisfaction had improved. The team was shipping less but shipping better. Bugs were caught before production. Support tickets were declining. The metrics that mattered were moving in the right direction.
"I used to think measurement was about proving we were productive," he said. "Now I think it's about knowing whether we're creating value. Those are different things."
References
[^1]: SmithSpektrum engineering leadership advisory, metrics design, 2019-2026.
[^2]: Forsgren, Humble, Kim. "Accelerate: The Science of Lean Software and DevOps," 2018.
[^3]: DORA, "State of DevOps Report," annual editions 2018-2025.
[^4]: Goodhart, Charles. "Problems of Monetary Management: The UK Experience," 1975.
Author: Irvan Smith, Founder & Managing Director at SmithSpektrum