What Major Incident Command Actually Buys You
A SEV at company scale isn’t a bigger version of an outage on a small team. It’s a different problem. Once you have dozens of services paging, multiple product surfaces affected, regulators watching, executives asking questions, and customer comms running in parallel with the engineering response, the thing that breaks first is coordination, not engineering capability.
This post is about what the IMAG / Incident Command System framework actually buys you in that environment, where I’ve seen it get misapplied, and the part most teams skip: making the post-mortem loop close.
Why ad-hoc response breaks
The default mode of incident response at most companies is “everyone joins the bridge and starts debugging.” That works at small scale. It does not work when:
- The set of people qualified to debug exceeds the size of a useful Slack channel
- Ops, comms, and product engineering need to be doing different things in parallel without stepping on each other
- The information one team needs to act is sitting in another team’s heads
- An exec joins the bridge and unintentionally redirects investigation by asking a high-priority question
What you get is a loud, smart, well-intentioned room full of people whose collective output is less than any single one of them debugging in isolation. MTTR climbs. Customer comms get stale. The post-mortem reveals that the engineers who could have fixed it in 10 minutes were on the bridge for 90 minutes answering questions.
IMAG in one paragraph
The IMAG model (Incident Management at Google), borrowed from emergency response’s Incident Command System (ICS), is just the recognition that during a major incident, the work is fundamentally split across roles. The Incident Commander (IC) owns the response and makes decisions. Operations does the actual technical work to mitigate. Comms owns customer-facing and internal updates. A Scribe captures the timeline. The IC is not the smartest engineer in the room. The IC is the person whose only job is to keep the response moving.
The framework is mundane on paper. The discipline is not.
The IC’s job is not to fix the issue
The most common failure mode I’ve seen (and one I made myself early on) is the IC quietly drifting into being an executor. Someone says “I think it’s the cache layer,” the IC says “let me check,” and now the IC is heads-down in Datadog while the response loses its coordinator.
The IC’s actual job during a SEV:
- Decide what to investigate next when ops gives multiple candidate hypotheses
- Decide when to escalate — to leadership, to other on-calls, to legal/comms
- Decide when to roll back vs keep investigating, and own that decision
- Hold the timeline so when the question “what did we already try” comes up six times, the answer is already on the screen (sketched below)
- Protect the responders from drive-by exec questions and redirect those to the comms lead
If the IC is mitigating, no one is doing the IC’s job. And the moment a second hypothesis emerges, the response stalls.
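To make “hold the timeline” concrete, here is a minimal sketch of the kind of structured log a scribe (or a chat-bot command) can keep during a SEV. The `TimelineEntry` fields, the entry kinds, and the incident id are illustrative assumptions, not any particular tool’s schema; the point is that “what did we already try” becomes a query instead of an interruption.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class TimelineEntry:
    """One scribe entry: what happened, who owned it, and what kind of event it was."""
    kind: str      # e.g. "hypothesis", "action", "decision", "escalation", "comms"
    summary: str   # one line, written so it still makes sense three hours later
    owner: str     # who took the action or made the call
    at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class IncidentTimeline:
    def __init__(self, incident_id: str):
        self.incident_id = incident_id
        self.entries: list[TimelineEntry] = []

    def log(self, kind: str, summary: str, owner: str) -> None:
        self.entries.append(TimelineEntry(kind, summary, owner))

    def already_tried(self) -> list[str]:
        # Answers "what did we already try" without pulling anyone off the keyboard.
        return [e.summary for e in self.entries if e.kind == "action"]

    def render(self) -> str:
        return "\n".join(
            f"{e.at:%H:%M}Z [{e.kind:<10}] {e.owner}: {e.summary}"
            for e in self.entries
        )


# During the SEV, entries get appended as they happen (incident id is hypothetical).
tl = IncidentTimeline("SEV-1234")
tl.log("hypothesis", "cache hit rate dropped at 14:02, suspect cache layer", "ops")
tl.log("action", "rolled back the 13:55 config push in us-east-1", "ops")
tl.log("decision", "hold the rollback, do not restart the cache fleet yet", "IC")
print(tl.render())
```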
Comms is a load-bearing role, not a courtesy
Customer-facing status updates, internal exec updates, and cross-team awareness are three different audiences with three different cadences. None of them tolerate silence. None of them are served by waiting until “we know more.”
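One way to see why silence fails every audience: each one has its own clock. A small sketch of that, where the audience names and intervals are made-up placeholders (the real cadences are a policy choice, not a constant):

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Illustrative cadences only; the real intervals are decided per audience, per org.
UPDATE_INTERVAL = {
    "customer_status_page": timedelta(minutes=30),
    "internal_exec_update": timedelta(minutes=15),
    "cross_team_channel": timedelta(minutes=10),
}


def overdue_updates(last_update: dict[str, datetime],
                    now: Optional[datetime] = None) -> list[str]:
    """Audiences whose silence has exceeded their cadence and need an update now."""
    now = now or datetime.now(timezone.utc)
    never = datetime.min.replace(tzinfo=timezone.utc)
    return [
        audience
        for audience, interval in UPDATE_INTERVAL.items()
        if now - last_update.get(audience, never) > interval
    ]
```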
A comms lead who is good at this writes the update before the next milestone instead of after it. They translate the ops-language status into customer-language status without losing accuracy. They push back on the IC when an update is too speculative or too vague. And critically, they hold a line on when not to escalate to leadership — because every escalation is an interruption to ops, and not every five-minute status change is escalation-worthy.
In a mature org, the comms lead is roughly as senior as the IC and not afraid to disagree with them in the war room.
Blameless post-mortems that close the loop
Most companies write post-mortems. Far fewer companies actually act on them.
The failure mode is structural: post-mortems generate remediation items, those items go into a tracker, and the tracker is owned by no one with authority to insist they happen. Six months later the same class of incident recurs and the post-mortem cites the prior post-mortem.
The fix isn’t more rigorous writing. The fix is treating remediation as scheduled engineering work — budgeted into team plans, tracked alongside feature work, with named owners and dates. When a post-mortem produces a remediation item, that item competes for next-sprint capacity exactly like a feature would. If it loses that competition, that’s a decision — surfaced to leadership, with the tradeoff visible — instead of a quiet drift into never being done.
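If you want “named owners and dates” to be machine-checkable rather than aspirational, the tracker entry itself can carry the policy. A minimal sketch, assuming a simple in-house schema (`RemediationItem` and the escalation rule below are illustrative, not a specific tool’s fields):

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class RemediationItem:
    incident_id: str
    title: str
    owner: Optional[str] = None        # a named person, not a team alias
    due: Optional[date] = None         # the sprint/date it is actually planned into
    done: bool = False
    deferred_by: Optional[str] = None  # who accepted the risk, if deliberately deferred


def needs_leadership_attention(items: list[RemediationItem], today: date) -> list[RemediationItem]:
    """Open items that are unowned, unscheduled, or past due without an explicit deferral."""
    return [
        i for i in items
        if not i.done
        and i.deferred_by is None
        and (i.owner is None or i.due is None or i.due < today)
    ]
```

The useful property is that an item with no owner or no date is itself a signal leadership sees, rather than a silent default.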
When the loop closes, the same incident class doesn’t recur. When it doesn’t close, you’re paying for the post-mortem twice.
Reliability as a scheduled outcome
The deeper point underneath all of this: incident response is the visible part of reliability work, but the invisible part is whether reliability is treated as a scheduled outcome or a recurring fire drill.
A scheduled outcome means SLO/SLI design before launch, production-readiness reviews with teeth, observability built for on-call to detect issues before customers do, and a remediation pipeline that closes. A fire drill means heroics on Tuesday, a post-mortem on Thursday, and a slightly different version of the same incident in three weeks.
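As a worked example of the arithmetic behind “SLO/SLI design before launch”: an error budget is just the failure rate the SLO permits, and how much of it is left fits in a few lines. The numbers below are illustrative; the mechanism is the point, because an exhausted budget is what turns reliability work into scheduled work.

```python
def error_budget_remaining(slo_target: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget left in the current SLO window.

    slo_target: e.g. 0.999 for a 99.9% availability SLO.
    Returns 1.0 when none of the budget is spent, <= 0.0 when it is exhausted.
    """
    allowed_bad = (1.0 - slo_target) * total_events  # failures the SLO permits this window
    actual_bad = total_events - good_events
    if allowed_bad == 0:
        return 0.0  # a 100% target leaves no budget to spend
    return 1.0 - actual_bad / allowed_bad


# Illustrative numbers: 10M requests in the window at a 99.9% SLO allows ~10,000 failures.
# 7,000 failures so far leaves roughly 30% of the budget.
print(error_budget_remaining(0.999, good_events=9_993_000, total_events=10_000_000))
```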
IMAG is not the framework that produces reliability. It’s the framework that lets you respond well when the underlying reliability work hasn’t been done yet — and gives you the data you need to do that work.
Currently interviewing for senior SRE / infrastructure roles. If you want to talk incident response, observability, or reliability program design, I’m at joesindel@gmail.com.