Scaling Reads vs. Writes: Where Systems Tell the Truth

Most OLTP databases serve predominantly read-heavy workloads. SELECT statements outnumber INSERT, UPDATE, and DELETE by a wide margin — sometimes by an order of magnitude. Users browse and display far more often than they create or modify. Read scaling gets attention first because read pressure arrives first, and the solutions are well understood: replicas, caches, connection pooling. It works.

Writes arrive later and are more disruptive when they do. In the previous post on schema as interface, the argument was that your schema has clients who depend on certain access patterns being cheap. Writes are a new kind of client the original design often didn't account for. Indexes that accelerate reads add overhead to every write. Constraints that protect integrity can increase contention. The schema doesn't change. The workload does. And suddenly the interface is expensive in new ways.

Writes expose the actual shape of your system. Not the architecture diagram shape — the real shape. Where state lives. Who owns it. What happens when two things try to change it at the same time. You don't discover these things when read traffic increases. You discover them when writes do.

The Constraints That Only Writes Reveal

Contention is the first thing that bites you, before any architectural decisions are made. The signal shows up in infrastructure metrics first — CPU climbing without a corresponding increase in throughput, IO wait times spiking, transaction rates flattening while queue depth grows. It looks like a capacity problem, and the first response is usually to treat it like one. Resize the instance. Add more vCPUs. Increase IOPS. In an elastic cloud environment this is easy, and it often provides enough relief to close the incident. But the problem is still there, now running on more expensive hardware. Contention doesn't respond to capacity the way genuine resource exhaustion does — the bottleneck is serialization, not compute or disk. The bigger instance just means more threads waiting on the same lock.
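Amdahl's law gives a rough model of why capacity doesn't fix serialization: if some fraction of each transaction is spent waiting on one lock, adding workers runs into a hard ceiling. A minimal sketch (the 50% serial fraction is an illustrative assumption, not a measurement):

```python
def speedup(serial_fraction: float, workers: int) -> float:
    """Amdahl's law: upper bound on speedup when a fraction of
    each transaction is serialized (e.g. waiting on one hot lock)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / workers)

# If half of each transaction serializes on a hot row, quadrupling
# vCPUs from 8 to 32 buys almost nothing: the ceiling is 2x.
for n in (8, 32):
    print(n, round(speedup(0.5, n), 2))  # prints 1.78, then 1.94
```

The numbers are the point: genuine resource exhaustion scales with the bigger instance; a serialized workload converges on the ceiling and stops.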

The cause is usually narrow. Multiple transactions updating the same row — an order status column every fulfillment worker touches, a balance field every payment modifies, an updated_at timestamp written on every change regardless of what changed. Or a sequence object acting as a global bottleneck because every INSERT needs the next value before it can proceed. One table, sometimes one row, is holding everything else back.

Consistency and coordination are a separate class of problem — they don't surface from contention alone, but from the decision to move to a multi-writer architecture. With a single writer, consistency is free; there's nothing to reconcile. Introduce multiple writers — sharding, multi-region, active-active — and you have to make explicit decisions you were previously making implicitly. Strong consistency or eventual? Most teams answer this wrong the first time because the question felt theoretical until it wasn't. The answer also varies by entity: some data tolerates lag, some doesn't, and that nuance gets lost when the conversation happens under pressure.

Coordination is what you pay for that consistency decision. Distributed transactions, consensus protocols, saga orchestration — each enforces some guarantee across multiple writers at a cost that grows non-linearly with participants and distance. The contention problem was local and observable. This one is systemic and often only visible under specific failure conditions.

The Real Problem Is Usually the Write Path Itself

The infrastructure metrics are symptoms. The root cause is almost always schema design or unmanaged write paths — and the two compound each other.

Schema problems look like application behavior, not database behavior. Wide rows rewritten in full on partial updates. Status machines as nullable columns where every state transition touches the same row. Audit triggers firing on every change regardless of what changed. None of these were wrong at the time. Under write pressure, they become the bottleneck.

Unmanaged write paths are the application-side mirror. Transactions holding locks longer than necessary because they do real work — API calls, file operations, validation logic — inside an open transaction. Bulk operations updating thousands of rows without batching. Write patterns not designed with concurrency in mind because concurrency wasn't a concern when the feature was built. The database is doing exactly what it was told. What it was told is now expensive.
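One of the cheapest fixes for unmanaged bulk writes is bounding batch size, so each commit holds locks briefly instead of one giant statement holding them for the duration. A minimal sketch of a chunking helper; the `db.transaction()` usage in the comment is hypothetical, standing in for whatever data-access layer the application uses:

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")

def chunked(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield bounded batches so each commit is short-lived."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch

# Hypothetical usage: each batch runs in its own short transaction
# instead of one statement locking thousands of rows at once.
# for batch in chunked(order_ids, 500):
#     with db.transaction():
#         db.execute("UPDATE orders SET status = 'archived' WHERE id = ANY(%s)",
#                    (batch,))
```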

What Actually Helps

The highest-leverage fixes are usually in the application, not the infrastructure.

Optimizing write paths means auditing what your transactions are actually doing and tightening the boundaries. Move any work that doesn't need to hold a lock outside the transaction. Break large batch writes into smaller, bounded operations. Replace a single high-contention row — an order counter, a global status flag — with a design that distributes the write across multiple rows and aggregates on read. These changes are unglamorous and often require touching code that's been stable for years. They also tend to produce the most durable improvements.
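The hot-row replacement mentioned above can be sketched as a sharded counter: writers touch one of N rows at random, and reads aggregate. This in-memory version only illustrates the shape; in a database the shards would be N counter rows and the read a SUM:

```python
import random

class ShardedCounter:
    """Distribute a hot counter across N shards; aggregate on read.
    Writers pick a random shard, so no single row serializes them."""

    def __init__(self, shards: int = 8):
        self.shards = [0] * shards

    def increment(self, amount: int = 1) -> None:
        # In SQL this would be roughly:
        # UPDATE counters SET value = value + %s WHERE id = %s
        self.shards[random.randrange(len(self.shards))] += amount

    def value(self) -> int:
        # Aggregation moves to read time:
        # SELECT SUM(value) FROM counters
        return sum(self.shards)

c = ShardedCounter()
for _ in range(100):
    c.increment()
assert c.value() == 100  # total is exact; only the write path changed
```

The tradeoff is explicit: reads get slightly more expensive so that writes stop queuing behind one row.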

Event-driven architectures are worth considering when writes don't need to be synchronous from the user's perspective. Writing to an append-only log — a Kafka topic, an outbox table, an event stream — doesn't contend with reads, and downstream consumers process at their own pace. The tradeoff is eventual consistency: other parts of the system may lag behind what was just written. That's acceptable for many workloads — notifications, audit trails, analytics, search indexes — and unacceptable for others. The discipline is knowing which is which before committing to the pattern.
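A minimal outbox sketch, using in-memory SQLite to stand in for the application database (table and column names are illustrative): the business write and the event commit in one transaction, and a relay drains the outbox asynchronously.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, "
             "payload TEXT, processed INTEGER DEFAULT 0)")

# The business write and the event land in ONE transaction, so the
# event cannot be lost or orphaned relative to the state change.
with conn:
    conn.execute("INSERT INTO orders (id, status) VALUES (1, 'placed')")
    conn.execute("INSERT INTO outbox (payload) VALUES (?)",
                 (json.dumps({"type": "order_placed", "order_id": 1}),))

# A relay polls the outbox at its own pace and publishes downstream.
row = conn.execute(
    "SELECT id, payload FROM outbox WHERE processed = 0 ORDER BY id LIMIT 1"
).fetchone()
# ... publish row[1] to the stream, then mark it done:
conn.execute("UPDATE outbox SET processed = 1 WHERE id = ?", (row[0],))
```

The lag between the commit and the relay's publish is exactly the eventual-consistency gap the next section says to measure.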

If you adopt this model, lag becomes a metric you own. Consumer group lag in Kafka — the gap between the latest offset produced and consumed — is the most direct measure of how far behind your eventual state is. Track it per consumer group, alert when it exceeds a meaningful threshold, and distinguish between lag that's growing (a processing problem) and lag that's stable (an acceptable steady state). Outbox patterns have their equivalent: the age of the oldest unprocessed event. Eventual consistency isn't a property you declare and forget — it's a gap you measure and manage.
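Both metrics are simple to compute once the offsets or timestamps are in hand. A sketch with made-up numbers:

```python
import time

def consumer_lag(latest_offsets: dict, committed_offsets: dict) -> dict:
    """Per-partition lag: how far the consumer group trails the log head.
    This is the same quantity Kafka tooling reports as LAG."""
    return {p: latest_offsets[p] - committed_offsets.get(p, 0)
            for p in latest_offsets}

def oldest_unprocessed_age(created_at_epochs, now=None) -> float:
    """Outbox equivalent: age in seconds of the oldest pending event."""
    if not created_at_epochs:
        return 0.0
    now = time.time() if now is None else now
    return now - min(created_at_epochs)

# Partition 0 trails by 20 messages; partition 1 is caught up.
lag = consumer_lag({0: 1500, 1: 980}, {0: 1480, 1: 980})
```

Alerting on the absolute number alone misses the distinction the text draws: a stable lag of 20 is a steady state, a lag of 20 that was 5 an hour ago is a processing problem.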

Idempotency is the design property that makes both of these approaches more resilient. If a write can be safely retried without producing duplicate side effects, you can simplify failure handling and remove coordination overhead that exists purely to prevent double-writes. It's significantly easier to build in than to retrofit.
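A minimal sketch of the idempotency-key pattern, with an in-memory store standing in for what would normally be a table with a unique constraint on the key:

```python
class IdempotentWriter:
    """Dedupe writes by a client-supplied idempotency key.
    The dict stands in for a keyed table with a unique constraint."""

    def __init__(self):
        self._results = {}    # idempotency_key -> stored result
        self.side_effects = 0

    def charge(self, key: str, amount: int) -> dict:
        if key in self._results:
            return self._results[key]  # retry: replay the stored result
        self.side_effects += 1         # the real, non-repeatable work
        result = {"charged": amount}
        self._results[key] = result
        return result

w = IdempotentWriter()
w.charge("req-42", 100)
w.charge("req-42", 100)   # network retry: safe, no double charge
assert w.side_effects == 1
```

The caller generates the key once per logical operation and reuses it on retry; that single convention is what lets the server drop coordination that existed only to prevent double-writes.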

When the Data Platform No Longer Fits the Use Case

Optimized write paths and event-driven patterns will take most systems further than teams expect. But there are workloads where the single-writer ceiling is real and the application-level options are exhausted. At that point the question isn't really architecture — it's platform. The current database is the wrong tool for the write pattern the system has evolved into.

Aurora PostgreSQL with Global Database and SQL Server Always On availability groups both optimize for a single-writer model with strong consistency. They're well-suited to most OLTP workloads, but neither solves multi-region write distribution. Azure SQL Hyperscale raises the I/O ceiling for write-heavy SQL Server workloads without changing the engine's locking behavior — the right choice when throughput is the constraint, not contention. Aurora DSQL is the choice when you genuinely need active-active multi-region writes and your workload has low conflict rates; its optimistic concurrency control works well there and poorly under high contention.

The decision to change platforms is expensive and disruptive. It warrants serious consideration only after the application-level options are genuinely exhausted — not as a first response to write pressure.

The Final Word

Write scaling problems almost always arrive in the same order. First as an infrastructure alert you treat as a capacity problem. Then as a recurring incident that hardware doesn't fully resolve. Then as a conversation about schema, transaction design, and whether the application is doing things it doesn't need to hold a lock for. Most systems never need to go further than that. The teams that get into trouble are usually the ones who skip the middle step — who reach for a new database platform before they've understood what the current one is actually telling them.

The write path is where your system's assumptions live. Scaling it means surfacing those assumptions, not hiding them behind bigger instances or more capable platforms. That's unglamorous work. It's also usually the right work.


Series: Databases & Data Systems (Unsexy, Critical Work)

* Part 4: Scaling Reads vs. Writes: Where Systems Tell the Truth
