Observability Is Instrumentation

Effective site reliability engineering demands two distinct skills that rarely get named together: the ability to act decisively under pressure with incomplete information, and the ability to extract meaning from data that wasn't designed with you in mind.

The first is the ER physician's skill. Triage doesn't wait for complete information. You're reading vitals, interpreting presenting symptoms, and making consequential decisions in compressed time. Speed matters, but so does pattern recognition. The wrong call isn't just inefficient — it can make things worse.

The second is the data scientist's skill. The raw data exists. The challenge is knowing which of it means something. Signal extraction requires understanding not just what the instruments are reporting, but why they were placed there, what assumptions they encode, and where their blind spots are. A metric you don't understand is noise with a label on it.

Effective SRE work requires both modes — sometimes simultaneously. You're triaging under pressure while also asking whether the telemetry you're relying on is actually telling you what you think it is. That tension doesn't resolve cleanly. It has to be managed.

Most SRE conversations stop at tooling. The harder conversation is about instrumentation — what gets measured, why, and whether the people responding to alerts have enough context to act on what they're seeing.

Signal Extraction Is a Skill, Not a Feature

Receiving telemetry is not the same as understanding it.

When something goes wrong at 2am, the on-call engineer is working with whatever instrumentation exists — metrics, logs, traces — and trying to construct a coherent picture of a system they may not have built. If the instrumentation was designed with a particular failure mode in mind, and that's not the failure mode occurring, the data can actively mislead.

It's not enough to know that latency spiked. You need to know which layer it spiked in, whether it's correlated with a deployment, a traffic pattern, or something upstream — and whether the metric you're looking at actually captures what you think it captures.

That last part matters more than it sounds. Metrics can be technically correct and operationally useless. A p99 latency metric tells you something degraded for someone. It doesn't tell you whether it was a hot partition, a lock wait, a slow query plan, or a network hiccup. Each of those has a different remediation path. The ability to reason through that distinction, under pressure, is the skill. The telemetry is just the raw material.
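That collapse can be made concrete. Below is a minimal sketch using synthetic data; the two distributions and the nearest-rank percentile are illustrative assumptions, not a real workload:

```python
def p99(samples):
    """Nearest-rank 99th percentile of a list of latency samples."""
    s = sorted(samples)
    return s[int(0.99 * (len(s) - 1))]

# Cause A: a hot partition -- most requests fast, a small fixed set slow.
hot_partition = [5.0] * 985 + [400.0] * 15

# Cause B: lock waits -- a broad tail creeping upward under contention.
lock_waits = [i * 0.4 for i in range(1000)]

# Both series report a p99 near 400 ms, despite entirely different shapes.
print(p99(hot_partition))
print(p99(lock_waits))
```

Both series produce a headline p99 near 400 ms, yet one calls for rebalancing a partition and the other for finding a contended lock. The percentile alone cannot tell you which.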

Instrumentation Is a Development Activity

This is where the signal extraction problem originates.

Instrumentation decisions happen at write time — when a developer is building a feature, adding a service, or modifying a data path. They're the ones who know what this code is supposed to do, what failure looks like, and which operations are worth tracking. That context is rarely written down anywhere useful.

What gets emitted is a function of what the developer thought to instrument. That's not a criticism — it reflects a real constraint. You can't instrument what you haven't reasoned about yet. But it does mean that the coverage of your observability stack is shaped by a series of individual decisions made under deadline pressure, with no guarantee of coherence across teams or time.

The result is an instrumentation layer that reflects development intent, not operational need. Those aren't always the same thing. And the on-call engineer inherits whatever gap exists between them.

Database Layers Make This Harder

Database instrumentation is a category of problem that deserves its own treatment, because the persistence layer behaves in ways that aren't intuitive — even to experienced infrastructure engineers.

Consider a few common gaps. Query latency metrics tell you a query was slow. They rarely tell you why — whether the plan changed, whether statistics are stale, whether an index stopped being used. Wait events and lock contention are frequently unmonitored or underreported, which means lock-related degradation looks like generic slowness until someone digs into engine internals. Persistence layer behavior — write amplification, flush behavior, durability tradeoffs under load — is often invisible unless explicitly instrumented, and it's rarely something application developers think to surface.
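The lock-contention gap in particular is easy to demonstrate. A minimal sketch, assuming a hypothetical wrapper that separates time spent waiting for a lock from time spent holding it; without that separation, the wait time simply disappears into overall request latency:

```python
import threading
import time

class InstrumentedLock:
    """Wrap a lock so time spent waiting is measured separately from
    time spent doing work. Without this split, contention is
    indistinguishable from generic slowness."""

    def __init__(self):
        self._lock = threading.Lock()
        self.wait_seconds = 0.0  # cumulative time threads spent blocked

    def __enter__(self):
        start = time.monotonic()
        self._lock.acquire()
        # Safe to update here: we hold the lock while incrementing.
        self.wait_seconds += time.monotonic() - start
        return self

    def __exit__(self, *exc):
        self._lock.release()

lock = InstrumentedLock()

def worker():
    with lock:
        time.sleep(0.05)  # hypothetical work done while holding the lock

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Four workers each do 50ms of "work", but serialize on the lock:
# most of the observed latency is waiting, and only this counter shows it.
print(f"cumulative lock wait: {lock.wait_seconds:.3f}s")
```

Real engines expose the same idea as wait events; the point is that unless someone chose to surface them, the on-call engineer sees only the total.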

The result is that database-related incidents tend to run longer than they should. Not because the data isn't there, but because the instrumentation wasn't designed with operational interpretation in mind, and the person on call doesn't have the context to ask the right questions of what exists.

A Partial Answer

Full remediation here is a longer conversation, one that spans both your development and SRE practices. But a starting point is recognizing that instrumentation has a dual audience: the developer who writes it and the operator who interprets it.


In practice, that means instrumentation decisions should carry intent. Not just what is being measured, but what question this metric is meant to answer and what it does not capture. Even a short annotation in a runbook, tied to a specific metric or alert, meaningfully changes the operational picture for someone who wasn't in the room when the code was written.
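One way to carry that intent is to attach it to the metric definition itself. A minimal sketch; the structure, metric name, and annotations are hypothetical, not a real system's schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MetricIntent:
    """Hypothetical write-time annotation attached to a metric."""
    name: str
    answers: str           # the question this metric is meant to answer
    does_not_capture: str  # known blind spots, stated explicitly

CHECKOUT_P99 = MetricIntent(
    name="checkout_latency_p99_ms",
    answers="Is the checkout path degraded for the slowest 1% of users?",
    does_not_capture=(
        "Which layer is slow: a hot partition, a lock wait, a bad query "
        "plan, and a network issue all look identical here."
    ),
)

def runbook_entry(intent: MetricIntent) -> str:
    """Render the intent as a short runbook annotation."""
    return (
        f"{intent.name}\n"
        f"  Answers: {intent.answers}\n"
        f"  Blind spots: {intent.does_not_capture}"
    )

print(runbook_entry(CHECKOUT_P99))
```

The rendered entry is the kind of two-line annotation the paragraph above describes: cheap to write at development time, and disproportionately valuable to whoever pages in at 2am.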

This is especially true for database instrumentation, where the gap between what's emitted and what's needed to diagnose is widest. The people who understand the persistence layer deeply are rarely the same people who respond to incidents. Closing that gap requires deliberate handoff, not better dashboards.

Observability isn't a dashboard problem. It isn't even purely a tooling problem. It's a knowledge transfer problem — and instrumentation is where that transfer either happens or doesn't.
