← Browse essays

The Price of Vigilance

Why confidence isn't open source.


Reading production postmortems changed how I think

Over the past few years, I have developed the habit of running through production postmortems. AWS, Cloudflare, Azure, GitHub, Stripe. If a distributed system fails in public and the engineering team takes time to explain why, I will usually spend some time going through the write-up.

I do not always understand everything, but my goal is to understand how large systems scale. How they handle millions of requests per minute. How they partition data. How they survive traffic spikes.

The thing that has stood out to me, though, is that the incidents rarely began because the systems could not handle more traffic. They usually began with something smaller. Even trivial. A configuration change that behaved differently than expected. A dependency change that behaved differently. An operation that exposed an assumption nobody realized they were making.

Different companies. Different technologies. Different engineering cultures. Same patterns.

The first domino always looked surprisingly familiar.

This also forced me to question something I had quickly assumed for years. Maybe I had been thinking about production systems the wrong way.

The pattern

What stuck with me is not that the incidents happened. Complex systems fail and that is part of software at scale.

What stuck with me was how they failed. Across companies. With the best of the engineering talent.

It almost always started with unstated assumptions. Configurations. Failover mechanisms. Maintenance operations with a large blast radius. None of these problems were solved by adding more servers or compute.

They required understanding how a growing system behaves when its assumptions no longer hold.

That distinction kept bothering me. For years, I casually grouped these incidents under a single mental model called "scaling problems." But the more postmortems I read, the less accurate that label felt.

Scale was certainly part of operating a large system. It just did not seem to be the thing that was causing the outages I was reading about.

This realization sent me looking for a better mental model.

Two different problems

I had been using the word scale to describe two very different engineering problems.

The first one is familiar. Can the system handle more. More users. More requests. More data. More traffic.

This is the engineering problem we usually mean when we talk about scaling. It is about increasing capacity while maintaining performance. Over the last two decades, our industry has become remarkably good at solving it. We know how to shard databases, add replicas, distribute traffic, scale horizontally.

The second problem asks a completely different question. What happens when reality is worse than we expected.

A configuration behaves differently in production. A dependency fails. The failover mechanism does not fail over. More hardware does not solve these problems. They require understanding how an increasingly complex system behaves when its assumptions are violated.

That is when I found it useful to separate these two ideas.

Scale is about handling more. Resilience is about surviving worse.

The distinction seems small but it changes how you read these production incidents. This is a flip you cannot unflip.

Capacity problems usually become more predictable as systems mature. We accumulate playbooks, patterns, and decades of engineering experience around them.

Resilience, on the other hand, does not improve in quite the same way.

Every dependency, automation, integration, and safety mechanism we add makes the system more capable. It also creates new interactions and patterns we have not seen before. The number of ways a system can surprise us keeps growing.

Looking back at these postmortems, I realized I have not been reading stories about systems that could not scale. I have been reading stories about systems encountering reality in ways no one ever anticipated.

Another puzzle

Once I started separating scale from resilience, another question began bothering me. At first, it seemed completely unrelated. One question was about engineering, the other was about economics.

The engineering question was straightforward. If production systems spend so much of their time dealing with unexpected interactions instead of raw capacity, then resilience deserves far more attention than we usually give it.

The economic question was harder.

Why do companies routinely pay significant premiums for managed databases, search engines, and messaging systems when high-quality open source alternatives already exist?

On paper, the comparison often looks difficult to justify.

The software is frequently the same. The APIs are compatible. The underlying capabilities are remarkably similar. Yet organizations continue to choose the managed offerings. For a long time, I assumed the answer was convenience. Maybe teams simply did not want to operate the software themselves. But the more production postmortems I read, the less satisfying that explanation became.

Convenience might explain the first week. It does not explain why some of the world's largest engineering organizations are willing to spend millions of dollars every year on managed infrastructure.

There had to be something more valuable being exchanged here. If resilience was genuinely harder than scale, maybe that was what customers were actually paying for.

I was not sure. But it gave me somewhere to look.

Not at pricing pages. But at product pages.

Reading the product pages

I started with a simple exercise. Sometimes manually browsing is more helpful than agents, at least for me so far.

I opened the product pages for a few managed services. MongoDB Atlas, Amazon RDS, others. I was not looking at pricing or benchmarks. I simply wanted to understand what these companies believed was worth talking about. At first glance, the pages looked exactly as I expected. Managed databases. Managed messaging. The software itself was not particularly surprising.

Then I started reading the feature lists more carefully. They talk about backups, point-in-time recovery, managed failovers, monitoring, observability, security, cross-region replication, compliance, and encryption. Very little of the marketing was about making the database fundamentally better. Almost everything was about operating it reliably. That observation stopped me for a moment.

I had been thinking of these products as managed databases. The product pages were describing something else entirely. They were describing an operational system built around a database. The database was simply the center of that system.

Suddenly the postmortems started making more sense. The failures I had been reading about were rarely failures of the database itself. They were failures of the operational systems surrounding it. The same system these managed platforms were investing most of their engineering effort into. That was the connection I had been missing.

Managed platforms are not primarily monetizing software. They are monetizing the continuous operational attention required to keep that software reliable. Customers experience that investment differently. They experience it as confidence. Confidence that upgrades will not become incidents, that backups will work, that failovers will happen.

The software was the same. The operational system around it was not.

The market eventually noticed

Once I looked at managed platforms through this lens, a series of industry decisions suddenly started making more sense.

Several infrastructure companies changed how they licensed their software. At the time, those decisions sparked debates about open source, cloud providers, and licensing models.

Looking back, I think they were all revealing the same shift. Software itself had become widely available. Operating that software reliably at scale had not.

The economic value was not accumulating inside the database, the search engine, or the cache anymore. It was accumulating in the operational system surrounding them.

The companies building these products seemed to reach the same conclusion. The asset they were protecting was not simply the source code. It was the managed platform. The operational expertise behind it. The confidence that customers associated with running it in production.

Whether you agree with every licensing decision is almost beside the point. The interesting part is that they all point toward the same shift. As software matured and became easier to obtain, differentiation moved somewhere else.

It moved into continuous operation.

Into resilience.

Into the people and systems that quietly absorb operational complexity before customers ever notice it.

Choosing who owns the failure modes

This is not an argument for managed services. It is an argument for being explicit about ownership.

Every production system has failure modes. Some are about capacity. Many are about operations. None of these disappear because you have open source software or you choose a managed platform either way. The only thing that changes is who owns them.

That is a more useful question than asking whether managed services are "worth it."

Start by mapping your realistic failure modes. Which ones are simply a matter of adding capacity. Which ones require years of operational experience, mature tooling, or people who have seen failure before.

Once you answer these questions, the technology choice may become much clearer.

Some organizations will decide that operating these systems is a strategic capability worth building in-house. Others will decide their advantage lies somewhere else.

Neither answer is universally right. The mistake is assuming you are only choosing software. You are also choosing who carries the operational burden that surrounds it.

Production postmortems taught me something I was not looking for.

I thought I was studying how large-scale systems work. What I actually studied was how they survive. That changed how I think about managed infrastructure and where the value sits.

Now I no longer see managed platforms as companies selling databases, search engines, or message queues. I see them as organizations making a continuous investment in operational attention.

Most customers will never see that work. And that is the whole point. They experience something much more valuable: operational confidence. Confidence that things will work when everything that can go wrong will go wrong.

Open source gives us remarkable software. Managed platforms give us confidence that the software will still be running tomorrow.

Customers do not buy operational attention. They buy the confidence that the operational attention creates.

The software may be open source. Confidence isn't.


References

A few of the postmortems in the back of my mind while writing this:

A few of the license changes referenced in the second half:

The most comprehensive public collection I know of is danluu/post-mortems. Hundreds of write-ups indexed by category, all linked, all public.

I'm choosing the next essay Friday. Vote on what's next.

Study Notes
TL;DR

Scale is about handling more. Resilience is about surviving worse. Managed services sit in the gap, and the value they keep is the operational attention no software license can ship.