What DBaaS Reliability Actually Means to Measure

Nobody files a support ticket to say the database was up all day. 

Users don't praise reliability. They don't notice it at all — until it's gone. And when it goes, they don't think "the SLO was breached." They think "I lost data" or "my app was down." They think it once, they think it twice, and then they start looking for alternatives.

That asymmetry is especially sharp in database products, where reliability isn't one thing — it's several, and they can fail independently.

Not a Single Dial

When teams talk about database reliability, availability usually dominates the conversation. Is the endpoint reachable? Is the cluster healthy? Can we fail over?

Those questions matter. But availability failures have a particular character: they're visible, they're shared, and recovery brings relief. An outage is a trauma event. Everyone knows it's happening, everyone mobilizes, and when the cluster comes back, there's a collective exhale. It gets a postmortem. It gets fixed.

The failures that actually move application teams toward alternatives are usually quieter.

A replica that fell behind and served stale reads for six hours before anyone noticed. A backup that completed successfully according to the logs, but couldn't restore cleanly. A write that returned success and then disappeared during a node failover because the durability guarantee wasn't what the user assumed.

These don't generate incident pages. They generate doubt. And doubt accumulates differently than outages — it doesn't resolve with a status page update. It compounds, quietly, until an architect is making the case to their team that the switching cost is worth it.

Three Dimensions, One Instrumentation Problem

In a DBaaS, reliability has at least three dimensions worth tracking separately. They require different instrumentation — and only one of them comes for free:

Availability — Can users connect and execute operations? Uptime percentage, connection success rate, error rate by type, p99 query latency. Most monitoring stacks cover this reasonably well out of the box. The risk isn't under-measurement — it's that availability dominates SLA conversations, creating a false sense of coverage for the dimensions below.
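The availability signals above are mostly arithmetic over data your stack already collects. As a minimal sketch (the function names and the nearest-rank percentile method are my assumptions, not anything prescribed here; production stacks typically compute p99 from histograms instead):

```python
import math

def connection_success_rate(attempts: int, failures: int) -> float:
    """Fraction of connection attempts that succeeded in a window."""
    return 1.0 - failures / attempts

def p99_latency(latencies_ms: list[float]) -> float:
    """Nearest-rank 99th-percentile query latency over raw samples.
    (Assumption: nearest-rank method; real monitoring systems usually
    approximate this from bucketed histograms.)"""
    ordered = sorted(latencies_ms)
    rank = math.ceil(0.99 * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]
```

The point is not the arithmetic but that these numbers fall out of data the monitoring stack already has; the next two dimensions don't.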

Data consistency — Are reads reflecting the state users expect? Nothing in your default observability stack will tell you this. Replication lag is the most common proxy, but lag alone doesn't tell you whether stale reads are reaching users or how stale they are when they do. Meaningful measurement requires intention: tracking read-after-write success rates on replicas, monitoring lag distribution rather than averages, or running sentinel writes — a known value written to primary and read back from replicas on a schedule, with drift tracked over time. None of this happens unless someone decides to build it.
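The sentinel-write idea above can be sketched in a few lines. Everything here is illustrative: the `Primary`/`Replica` classes are in-memory stand-ins for real database connections, and the lag simulation (one replica simply never receives the write) is a deliberate contrivance to show how the probe classifies results:

```python
import time

class Replica:
    """Stand-in for a read replica (assumption: a dict-backed store)."""
    def __init__(self):
        self.store = {}
    def read(self, key):
        return self.store.get(key)

class Primary:
    """Stand-in for the primary. Replication is asynchronous in real
    systems; here we copy to the first replica only, so the second
    replica plays the role of a lagging one."""
    def __init__(self, replicas):
        self.store = {}
        self.replicas = replicas
    def write(self, key, value):
        self.store[key] = value
        if self.replicas:
            self.replicas[0].store[key] = value

def sentinel_probe(primary, replicas, key="__sentinel__"):
    """Write a timestamped value to the primary, read it back from each
    replica, and report freshness: ('fresh', 0.0), ('stale', age_seconds),
    or ('missing', None)."""
    now = time.time()
    primary.write(key, now)
    results = []
    for r in replicas:
        seen = r.read(key)
        if seen == now:
            results.append(("fresh", 0.0))
        elif seen is None:
            results.append(("missing", None))
        else:
            results.append(("stale", now - seen))
    return results
```

Run on a schedule, the interesting output isn't any single probe but the distribution of `stale` ages over time, which is exactly the lag-distribution signal averages hide.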

Durability — When a write is acknowledged, does it survive? Standard instrumentation tells you whether the pipeline is functioning, not whether the guarantee held under real failure conditions. Genuine confidence in durability requires deliberate testing: regular failover drills that verify no acknowledged writes were lost, restore tests that confirm backups recover to a consistent state, fault injection that exercises edge cases in your replication acknowledgment path. Point-in-time recovery isn't a durability guarantee until someone has exercised it under conditions that resemble production. That test rarely gets scheduled unless it's treated as a standing commitment.
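The accounting behind a failover drill is simple once the data exists: the client records every write ID it received an acknowledgment for, and after the failover you diff that list against what the new primary actually holds. A sketch of that check (the IDs and the "surviving set" are hypothetical inputs; collecting them from a real cluster is the hard part the drill exists to force):

```python
def lost_acknowledged_writes(acknowledged_ids, surviving_ids):
    """Return write IDs that were acknowledged to the client but are
    absent after failover. A non-empty result means the durability
    guarantee did not hold: each entry is a write the user was told
    succeeded and then lost."""
    return sorted(set(acknowledged_ids) - set(surviving_ids))
```

The same diff works for restore tests: acknowledged writes up to the backup's recovery point versus rows present after the restore. Either way, an empty result from one drill is evidence, not proof, which is why the text treats these drills as a standing commitment rather than a one-off.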

The Gap Is a Choice

Availability metrics exist because the tooling ships with them. Consistency and durability signals exist only if your team decided they were worth building.

These dimensions interact, but they don't move together. A system can be highly available while serving inconsistent reads. It can acknowledge writes that won't survive a failure event. The dials are separate, and the gaps between them are where trust erodes — not in a single event, but across a hundred small ones that never made it into a postmortem.

In practice, that decision to instrument gets deferred until the absence of those signals shows up in a customer conversation nobody wanted to have.
