Your Dashboard Is Not Observability
A green dashboard and a healthy database are not the same thing. That distinction sounds obvious until you're forty minutes into an incident where every panel looks nominal and your storage layer is silently throttling writes for a specific tenant on a specific shard.
That scenario is not hypothetical. It's the shape of the failures dashboards are structurally unable to surface — not because the tooling is bad, but because dashboards encode what you already know to look for. Observability is the capability to answer questions you haven't thought to ask yet. Those are different things, and treating one as the other leaves you blind at exactly the moment it matters most.
Here's the structural problem: dashboards encode past hypotheses. That replication lag panel exists because you once had a replication lag problem. That connection pool chart was built after a connection exhaustion incident. The query throughput tile reflects the questions your team had during whatever postmortem prompted its creation. These panels are a record of known failure modes that got resolved. Novel failures — by definition — are outside that set.
In practice, dashboard-driven triage works well for known operational patterns. A spike in deadlock events, a checkpoint frequency anomaly, a sudden jump in temp space consumption — if you've seen it before and you've instrumented it, a dashboard can get the right person looking at the right thing quickly. That's real value. Work queues, escalation routing, SLA tracking — dashboards support all of that reasonably well when the failure mode is familiar.
But that workflow has a ceiling. When a new indexing strategy behaves unexpectedly under a specific query shape, or a schema migration interacts badly with a workload pattern that only occurs in one account's data distribution, or a connection pooler starts queuing in a way that only manifests above a particular concurrency threshold — none of that shows up cleanly on a panel built from last year's known problems. The dashboard doesn't lie. It just only knows what you knew when you built it.
Real observability means your database layer is instrumented with enough context that you can ask arbitrary questions at the time of failure. In practice, this means high-cardinality event data: query execution traces with plan details, wait event breakdowns sliced by database, schema, workload type, or even specific parameter patterns. It means when something fails in a novel way, you can interrogate the failure — not just confirm or deny a hypothesis the dashboard already encodes.
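To make "high-cardinality event data" concrete, here is a minimal sketch of emitting one wide, structured event per query execution. The field names (tenant_id, shard, plan_hash, the wait-event keys) and the in-memory sink are illustrative assumptions, not a specific product's schema; the point is that every dimension you might later want to slice on rides along with the event, so new questions don't require new instrumentation.

```python
import json
import time


def emit_query_event(sink, *, tenant_id, shard, statement, plan_hash,
                     wait_events, duration_ms):
    """Emit one wide, structured event per query execution.

    Carrying the high-cardinality context (tenant, shard, plan, wait
    breakdown) on the event itself is what lets you interrogate a novel
    failure after the fact, rather than confirming a pre-built panel.
    """
    event = {
        "ts": time.time(),
        "tenant_id": tenant_id,      # high-cardinality: one value per tenant
        "shard": shard,
        "statement": statement,
        "plan_hash": plan_hash,      # lets you group executions by plan
        "wait_events": wait_events,  # e.g. {"log_waits_ms": 140, ...}
        "duration_ms": duration_ms,
    }
    sink.append(json.dumps(event))   # any structured log pipeline works here
    return event


# Usage: during an incident you can now ask "which tenants on which shards
# spent the most time waiting on log writes?" even if no panel exists for it.
sink = []
emit_query_event(sink, tenant_id="t-4821", shard="shard-07",
                 statement="INSERT INTO orders ...", plan_hash="a91f",
                 wait_events={"log_waits_ms": 140, "buffer_busy_ms": 2},
                 duration_ms=162)
```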
And this is where intimate familiarity with each measure becomes the actual differentiator. Knowing that a spike in log_waits is meaningfully different from a spike in buffer_busy_waits — and what each implies about where contention is actually occurring — is what turns telemetry into a next action. High-cardinality data without that fluency is noise. With it, you can cut MTTR significantly on novel failures, because you're reasoning about the system rather than pattern-matching against past incidents.

The test I use: when a database failure mode appears for the first time, can your team explain why it failed — not just that it failed? Can they identify whether it's a specific workload, a specific schema, a specific resource pressure? Or does diagnosis look like grepping logs and triangulating from indirect signals until something clicks? That gap — between "the alert fired" and "we understand the failure mode" — is where the dashboard dependency costs you.
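The difference between pattern-matching and reasoning shows up in how you query the telemetry. A hedged sketch of the second half of that loop, assuming the wide per-query events described above: the grouping dimensions are chosen at question time, not at instrumentation time, so "which tenant on which shard is eating all the log-wait time?" is a one-liner even if nobody anticipated it. The event shape and measure names here are illustrative.

```python
from collections import defaultdict


def top_wait_contributors(events, wait_key, group_by=("tenant_id", "shard")):
    """Aggregate one wait measure across arbitrary dimensions.

    'events' is a list of wide query-event dicts; 'wait_key' names the
    measure under investigation (e.g. "log_waits_ms"). Because each event
    carries its full context, group_by can be any dimension combination.
    """
    totals = defaultdict(float)
    for e in events:
        key = tuple(e[d] for d in group_by)
        totals[key] += e.get("wait_events", {}).get(wait_key, 0.0)
    # Highest-contributing dimension combination first.
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


events = [
    {"tenant_id": "t-4821", "shard": "shard-07",
     "wait_events": {"log_waits_ms": 140.0}},
    {"tenant_id": "t-4821", "shard": "shard-07",
     "wait_events": {"log_waits_ms": 95.0}},
    {"tenant_id": "t-0003", "shard": "shard-02",
     "wait_events": {"log_waits_ms": 4.0}},
]
# The throttled tenant/shard pair surfaces immediately:
print(top_wait_contributors(events, "log_waits_ms"))
# → [(('t-4821', 'shard-07'), 235.0), (('t-0003', 'shard-02'), 4.0)]
```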
More panels won't close that gap. Structured telemetry, high-cardinality instrumentation, and a team that knows what each measure actually means — that's what closes it.
