Your On-Call Rotation Is Writing Requirements. Nobody Is Reading Them.

There's a specific moment most on-call engineers recognize. You're in a post-mortem, walking through a memory exhaustion failure that took longer to diagnose than it should have. Someone mentions they've seen this pattern before. Someone else says they filed a ticket after the last one. You go looking for the ticket. It's closed — marked resolved after the immediate fix, never connected to anything in the roadmap. The underlying failure mode is still there. You're just meeting it again.

That moment is worth examining, because it reveals something about how reliability work actually moves — or doesn't — through most technology organizations.

The on-call engineers usually have detailed, accurate knowledge of where the system is fragile. They know which failure modes the runbooks don't fully cover, which write patterns cause CPU plateaus the monitoring doesn't catch, which edge cases will eventually surface as customer escalations. That knowledge gets expressed in incident tickets and post-mortems. But post-mortems aren't backlog items. Incident tickets don't have product owners. So the knowledge exists, and planning doesn't find it, and the on-call rotation keeps absorbing the gap.

The Missing Input Channel

Most planning processes have well-worn channels for feature work. Product managers talk to customers, engineers write specs, roadmaps get built. Reliability requirements don't arrive through those channels. They arrive through pages and escalations and engineers quietly doing extra work to keep things stable — and by the time a reliability gap surfaces loudly enough to get roadmap attention, it's usually already cost something.

What's broken isn't the engineers, the post-mortem format, or the incident tooling. The input channel is missing. There's no regular path from "this failure mode keeps recurring" to "this is a Jira issue a product owner can review, refine, and prioritize."

Fixing that doesn't require a new process framework. It requires a decision that on-call observations are a legitimate source of product requirements, and then the lightweight infrastructure to act on it. In practice, that means three things. First, a named owner who reviews on-call retros before each planning cycle and asks which failure modes the team keeps absorbing without anyone explicitly addressing them. Second, a reliability backlog alongside the feature backlog, populated with scoped issues (tasks, defects, whatever fits your taxonomy): not incident tickets dressed up as requirements, but work with an owner and a place in prioritization. Third, a shared norm that recurring pages on the same failure mode are a signal of unshipped work, not operational background noise.
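The retro review can start as something very small. Here is a minimal sketch in Python, assuming each incident gets tagged with a failure-mode label during the retro. The `Incident` fields, the ticket IDs, and the labels are all made up for illustration and don't come from any particular incident tool.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Incident:
    ticket_id: str
    failure_mode: str  # hypothetical label applied during the retro
    opened: date


def recurring_failure_modes(incidents, min_count=2):
    """Return (failure_mode, count) pairs seen at least min_count times,
    most frequent first, ties broken alphabetically."""
    counts = Counter(i.failure_mode for i in incidents)
    return sorted(
        ((mode, n) for mode, n in counts.items() if n >= min_count),
        key=lambda pair: (-pair[1], pair[0]),
    )


# Illustrative data only.
incidents = [
    Incident("INC-101", "memory-exhaustion-on-bulk-write", date(2024, 1, 9)),
    Incident("INC-117", "cert-expiry", date(2024, 2, 2)),
    Incident("INC-130", "memory-exhaustion-on-bulk-write", date(2024, 3, 14)),
]

for mode, n in recurring_failure_modes(incidents):
    print(f"{mode}: {n} incidents, candidate for the reliability backlog")
```

The counting is the easy part. The useful output is the list a named owner walks into planning with: every failure mode on it either becomes a scoped backlog item or gets an explicit reason why not.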

The specific format matters less than the commitment behind it. A Jira task that names the failure mode, scopes the fix, and defines what done looks like is worth more than a detailed post-mortem that nobody acts on.
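One way to hold that line is to treat a backlog item as incomplete until those fields are filled in. The sketch below is an assumption about what such a record might look like, not a Jira schema: the `ReliabilityItem` class, its field names, and the example values are hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass
class ReliabilityItem:
    failure_mode: str  # what keeps breaking
    scope: str         # what the fix covers
    done_when: str     # how the team will know it is fixed
    owner: str         # who carries it through prioritization


def missing_fields(item):
    """Return the names of fields that are still empty or whitespace-only."""
    return [f.name for f in fields(item) if not getattr(item, f.name).strip()]


# Illustrative draft: the failure mode is named, but nobody owns it yet
# and "done" is undefined.
draft = ReliabilityItem(
    failure_mode="Memory exhaustion during bulk writes",
    scope="Bound the write buffer and add backpressure",
    done_when="",
    owner="",
)

print(missing_fields(draft))
```

An item that still reports missing fields is closer to an incident ticket than a requirement, and it probably isn't ready for a planning conversation.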

What This Doesn't Fix

A formal input channel doesn't resolve the underlying prioritization tension. Reliability work still competes with feature work, resources are still finite, and there will be quarters where known failure modes stay in the backlog because other things got sequenced ahead of them. What changes is the nature of that tradeoff. Instead of reliability debt accumulating out of sight, absorbed by engineers and never seen by planning, it becomes a visible choice. You can look at the backlog and say: we know about this failure mode, we've scoped the fix, and we're explicitly choosing to sequence it after these other things. That's a real tradeoff made deliberately, and it's meaningfully different from discovering the failure mode at 2am after a customer escalation that didn't have to happen.

The on-call team already knows what needs to be built. The question is whether your planning process is designed to hear them.
