The scene is familiar. Production is broken, and a handful of engineers are on a call. Someone is checking logs, someone is querying data, someone else is digging through traces, and every few minutes a voice says, "We think we're close." Everyone is busy. Everyone is trying. Nobody is being lazy.
But the customer is still broken.
That is the part teams quietly normalize. The incident bridge feels like the center of the story, full of urgency and smart people working hard, but it is not. The customer impact is the center of the story, and once a defect is in production, the customer should not become part of your debugging environment.
Debugging Is Not Mitigation
It is easy to confuse motion with mitigation. A team can be investigating, adding logs, trying to reproduce the issue, checking dashboards, digging through traces, and preparing another release, and all of that can be useful work. None of it necessarily reduces customer pain. Debugging is not mitigation. Investigation is not recovery. Activity on the bridge is not the same thing as protecting the customer.
When the product is broken in production, the first question should not be "Are we close to understanding it?" The first question should be "Can we safely stop the bleeding?" Those are very different questions, and the order you ask them in tells you who the team is actually optimizing for.
Why It Feels Reasonable on the Bridge
None of this is obvious in the moment, and it is worth being honest about why. The team is under pressure. They may genuinely believe the fix is close. They may worry that rolling back will make things worse. They may not trust their deployment pipeline, or know whether the previous version is safe, or have a clean way to roll back the database. They may have feature flags, but not the right ones, and no easy way to isolate the broken behavior.
So they say things that sound responsible and engineering-minded: that they prefer to fail forward, that they can't reproduce it in lower environments, that the next build will probably fix it, that they need more instrumentation, that they do not want to roll back until they know exactly what happened. Sometimes those statements are even true. But they can also quietly hide the real issue, which is that the system was never built to recover cleanly.
Fail Forward Is Not a Permission Slip
Fail-forward is not automatically wrong. There are real cases where a forward fix is safer than a rollback. It becomes dangerous when it stops being a judgment call and turns into a reflex, especially when it is used to justify continued customer impact while the team searches for certainty. Fail-forward is a recovery tactic, not a permission slip to keep customers broken while engineering looks for the answer.
A lot of that instinct is inherited from a world where releases were expensive. You shipped rarely, rollbacks were scary, and deployment was a ceremony, so pushing another fix could feel reasonable because every release was already painful. In a modern CI/CD world, that thinking is legacy baggage. If releases are cheap, rollbacks should be cheap too. And if rollbacks are not cheap, that is not an excuse. That is the problem.
If You Can Only Understand Production in Production
"We can't reproduce it outside production" should not end the conversation. It should start a different one. It often points to missing test data, weak environment parity, thin logging, poor observability, hidden dependencies, unsafe configuration drift, or a system whose failures only show up under real production load.
That does not mean every production defect can be perfectly reproduced somewhere safe. Real systems are messy, and some behavior only shows up at scale. But if a team repeatedly needs production to understand production, that is not bad luck. It is an engineering hygiene problem. Not being able to reproduce a problem outside production is not just an inconvenience. It is a signal that the organization has been borrowing confidence from the customer.
Shipping a Microscope Is Not Recovery
There is a particularly revealing version of this, where instead of mitigating, the team ships a new release that adds more instrumentation to help find the problem. Sometimes that is genuinely necessary, but it is worth being clear about what it means. The customer is still absorbing the failure while the team improves its view of the failure. You are turning production into a better microscope while the customer is still under the glass.
A release that only helps you understand the outage is not a recovery plan. It is a confession that the system was too opaque when it mattered. That does not make instrumentation releases always wrong, but they should feel like a last resort, not a standard step in the incident playbook.
The Bill You Never See
This is where it connects to systems thinking. Engineering teams tend to optimize for the costs they can measure: time on the bridge, number of incidents, mean time to recovery, deploy counts, tickets closed. Those numbers are real, but some of the most expensive damage never shows up in them. Customer confidence drops, support volume climbs, and the account team has to explain what happened. Sales loses a little credibility, renewals get harder, and executives start quietly asking whether the platform is stable. Customers build workarounds, and some of them stop trusting the product at exactly the moment they needed it to work.
That is the real bill, and it is hard to act on precisely because it is hard to see. Customer confidence is expensive because you usually do not get a clean invoice for losing it.
Mitigation Before Diagnosis
The better model is not "always roll back." That is too simplistic. The better model is that when production is harming customers, mitigation comes before diagnosis. Diagnosis still matters, root cause still matters, and the permanent fix still matters, but the order matters more than teams admit.
A mature production system gives you options when something breaks. You can roll back the release, disable the feature, turn off the integration, route traffic away, revert the config change, hit a kill switch, restore a known-good path, or degrade gracefully. Protect the customer first, then debug with less pressure. That is boring software in incident form. It is not just software that does not break. It is software that gives you safe moves when it does.
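To make "safe moves" concrete, here is a minimal sketch of one of them: a kill switch paired with a known-good fallback path. Everything in it is an assumption for illustration. `KillSwitches`, `guarded`, and the in-memory flag store are hypothetical names, and a real system would read its flags from shared configuration that can be changed without a deploy.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, TypeVar

T = TypeVar("T")


@dataclass
class KillSwitches:
    """Hypothetical in-memory flag store, standing in for shared runtime config."""

    disabled: Dict[str, bool] = field(default_factory=dict)

    def disable(self, feature: str) -> None:
        # Flipping this is the mitigation: no build, no deploy, no diagnosis yet.
        self.disabled[feature] = True

    def enable(self, feature: str) -> None:
        self.disabled[feature] = False

    def is_disabled(self, feature: str) -> bool:
        # Features are on unless someone has explicitly switched them off.
        return self.disabled.get(feature, False)


def guarded(
    switches: KillSwitches,
    feature: str,
    primary: Callable[[], T],
    fallback: Callable[[], T],
) -> T:
    """Run `primary` unless the feature is switched off or it raises.

    In both of those cases the customer gets the known-good `fallback`
    result instead of the failure.
    """
    if switches.is_disabled(feature):
        return fallback()
    try:
        return primary()
    except Exception:
        # Degrade gracefully. Real code would also log and alert here,
        # so the failure stays visible to the team.
        return fallback()
```

During an incident, an operator would call something like `switches.disable("new_pricing")`, and every later `guarded(switches, "new_pricing", ...)` call would skip the broken path entirely. The design choice worth noticing is that the fallback has to exist before the incident does. The switch is only a safe move if there is a known-good path behind it.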
The Customer Is Not Your Debugger
The point of boring software is not that nothing ever goes wrong. Things go wrong. The point is that when something does go wrong, the customer does not have to sit there while engineering discovers how the system actually works.
You can hear the difference in how a team talks on the bridge. The weaker posture is "we are still trying to figure out what happened." The stronger one is "we have stopped the customer impact, and now we are figuring out what happened." Same incident, very different operating model. And when the only way to understand a system is to break it in front of customers, the problem is not just the defect. The problem is the operating model.
A mature engineering organization does not prove itself by how intensely it can debug production. It proves itself by how quickly it can protect customers when production starts lying. Rollback, feature flags, kill switches, observability, test data, and lower-environment parity all exist for the same reason: to keep production from becoming the place where the team finally learns the truth. The customer is not part of your debugging process. Recovery is your job, not theirs.
Frequently Asked Questions
Is this article saying you should always roll back?
- No. Always rolling back is too simplistic, and sometimes a forward fix really is the safest way to reduce customer impact. The point is about order: when production is harming customers, mitigation comes before diagnosis. Stop the bleeding first, then debug with less pressure.
What is wrong with "fail forward"?
- Nothing, when it is genuinely the safest way to reduce customer impact. It becomes dangerous when it turns into a reflex that justifies continued customer pain while engineering looks for certainty. Fail-forward is a recovery tactic, not a permission slip to keep customers broken.
We genuinely can't reproduce the issue outside production. Now what?
- That is a real situation, but it should start a different conversation rather than end one. Repeatedly needing production to understand production points to gaps in logging, observability, test data, or environment parity. Not being able to reproduce a problem outside production is a signal that the organization has been borrowing confidence from the customer.
Isn't shipping more instrumentation during an incident a good thing?
- Sometimes it is necessary, but it should feel like a failure mode, not a normal step. A release that only helps you understand the outage is not a recovery plan. It is a confession that the system was too opaque when it mattered, and the customer is still absorbing the failure while you improve your view of it.
Why focus on customer cost instead of MTTR and incident counts?
- Because the most expensive damage is the hardest to measure. Time on the bridge is visible; lost customer confidence, higher support volume, harder renewals, and eroded trust are not. Customer confidence is expensive because you usually do not get a clean invoice for losing it.
How does this connect to boring software?
- This is boring software in incident form. Boring software is not just software that does not break. It is software that gives you safe moves when it does: rollback, flags, kill switches, known-good paths, and graceful degradation. The point is not that nothing ever goes wrong, but that when something does, the customer does not have to wait inside your debugging loop.