Your On-Call Rotation Is Writing Requirements. Nobody Is Reading Them.
That moment is worth examining, because it reveals something about how reliability work actually moves — or doesn't — through most technology organizations.
The on-call engineers usually have detailed, accurate knowledge of where the system is fragile. They know which failure modes the runbooks don't fully cover, which write patterns cause CPU plateaus the monitoring doesn't catch, which edge cases will eventually surface as customer escalations. That knowledge gets expressed in incident tickets and post-mortems. But post-mortems aren't backlog items. Incident tickets don't have product owners. So the knowledge exists, and planning doesn't find it, and the on-call rotation keeps absorbing the gap.
The Missing Input Channel
Most planning processes have well-worn channels for feature work. Product managers talk to customers, engineers write specs, roadmaps get built. Reliability requirements don't arrive through those channels. They arrive through pages and escalations and engineers quietly doing extra work to keep things stable — and by the time a reliability gap surfaces loudly enough to get roadmap attention, it's usually already cost something.
What's broken isn't the engineers, the post-mortem format, or the incident tooling. The input channel is missing. There's no regular path from "this failure mode keeps recurring" to "this is a Jira issue a product owner can review, refine, and prioritize."
Fixing that doesn't require a new process framework. It requires a decision that on-call observations are a legitimate source of product requirements, and then the lightweight infrastructure to act on it. In practice: a named owner who reviews on-call retros before each planning cycle and asks what failure modes the team is repeatedly absorbing that haven't been explicitly addressed. A reliability backlog alongside the feature backlog, populated with actual scoped issues — tasks, defects, whatever fits your taxonomy — not incident tickets dressed up as requirements, but work with an owner and a place in prioritization. And a shared norm that recurring pages on the same failure mode are a signal of unshipped work, not operational background noise.
The specific format matters less than the commitment behind it. A Jira task that names the failure mode, scopes the fix, and defines what done looks like is worth more than a detailed post-mortem that nobody acts on.
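One way to make "recurring pages are a signal of unshipped work" concrete is to count pages per failure mode and draft a backlog candidate for anything that repeats. The sketch below is illustrative only. The `Page` and `BacklogCandidate` shapes, the `failure_mode` labels, and the `min_pages` threshold are all assumptions, not the schema of any incident tool or of Jira, and the sample data is invented for the example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Page:
    """One on-call page. Fields are hypothetical, not a real tool's schema."""
    failure_mode: str   # short label assigned during the on-call retro
    minutes_spent: int  # engineer time absorbed handling the page


@dataclass
class BacklogCandidate:
    """A draft reliability item for a product owner to review and refine."""
    failure_mode: str
    page_count: int
    minutes_absorbed: int
    scope: str = "TODO: scope the fix"
    done_when: str = "TODO: define what done looks like"


def recurring_failure_modes(pages, min_pages=2):
    """Return a draft candidate for each failure mode paged at least min_pages times."""
    counts = Counter(p.failure_mode for p in pages)
    minutes = Counter()
    for p in pages:
        minutes[p.failure_mode] += p.minutes_spent
    candidates = [
        BacklogCandidate(mode, n, minutes[mode])
        for mode, n in counts.items()
        if n >= min_pages
    ]
    # Most frequently paged failure modes first.
    return sorted(candidates, key=lambda c: c.page_count, reverse=True)


# Invented sample data for illustration.
pages = [
    Page("replica lag on bulk writes", 45),
    Page("cert expiry", 20),
    Page("replica lag on bulk writes", 60),
    Page("replica lag on bulk writes", 30),
]
for c in recurring_failure_modes(pages):
    print(f"{c.failure_mode}: {c.page_count} pages, {c.minutes_absorbed} min absorbed")
```

The `scope` and `done_when` fields are deliberately left as placeholders. The script can surface which failure modes recur, but scoping the fix and defining done is still the job of the named owner and the product owner.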
What This Doesn't Fix
A formal input channel doesn't resolve the underlying prioritization tension. Reliability work still competes with feature work, resources are still finite, and there will be quarters where known failure modes stay in the backlog because other things got sequenced ahead of them. What changes is the nature of that tradeoff. Instead of reliability debt accumulating out of sight, absorbed by engineers and never seen by planning, it becomes a visible choice. You can look at the backlog and say: we know about this failure mode, we've scoped the fix, and we're explicitly choosing to sequence it after these other things. That's a real tradeoff made deliberately, and it's meaningfully different from discovering the failure mode at 2 a.m. after a customer escalation that didn't have to happen.
The on-call team already knows what needs to be built. The question is whether your planning process is designed to hear them.
