Imagine waking up one morning to find that millions of people can’t access the internet—not because of a hack, not because of a server crash, but because someone changed the order of items in a list. A list that, according to official specifications, wasn’t supposed to have a meaningful order in the first place.

This actually happened. And the story behind it reveals something profound about how the internet works—or more accurately, how fragile the assumptions holding it together really are.

The Restaurant Menu Nobody Wrote Down

Let me start with an analogy that’ll make this easier to grasp.

Imagine a restaurant where the menu has always listed entrées in the same order: pasta first, then chicken, then fish, then steak. The menu never explicitly stated that this order matters—it was just how the chef liked to organize things. But over decades of operation, the kitchen staff accidentally started relying on this ordering.

Cooks mentally associated “position 1 = pasta station,” “position 2 = grill station,” “position 3 = fish station,” and so on. New hires learned this pattern through observation, not written procedures. Nobody documented it as a rule because it seemed like an implementation detail that shouldn’t matter.

One day, a new manager takes over and decides to reorganize the menu, putting the most popular item first to highlight it. The manager thinks this is purely cosmetic—after all, each menu item still has a name and description, so servers and cooks should be able to find what they need regardless of order.

But when the new menus roll out, chaos ensues. Orders go to the wrong stations. The pasta station receives steak orders and doesn’t know what to do. The grill station expects chicken orders but gets pasta orders instead. Service grinds to a halt.

The menu order was never supposed to matter. But through years of informal practice, it became a critical dependency that the entire kitchen operation relied on, even though nobody wrote it down or consciously designed it that way.

That’s exactly what happened with DNS record ordering.

What DNS Actually Does

Before we dive into what went wrong, let’s talk about what the Domain Name System (DNS) actually does.

When you type “google.com” into your browser, your computer needs to translate that human-readable name into a numerical IP address—something like 142.250.185.78. That’s what DNS does. It’s essentially the internet’s phone book, converting names into numbers.

But here’s the interesting part: DNS doesn’t just return a single IP address. For popular websites that need to handle millions of visitors, DNS typically returns a list of IP addresses. Each address points to a different server that can serve the same website.

Query: What's the IP address for example.com?

DNS Response:
- 192.0.2.1
- 192.0.2.2
- 192.0.2.3
- 192.0.2.4

This list serves multiple purposes. It provides redundancy—if one server is down, clients can try another. It enables load balancing—spreading traffic across multiple servers. And it allows for geographic optimization—directing users to the nearest server.

The Assumption That Broke Everything

Here’s the critical detail: the DNS specification explicitly states that the order of IP addresses in this list shouldn’t matter. Client software is supposed to be able to use any address in the list.

In practice, DNS servers often rotate the order of addresses (a technique called “round-robin DNS”) to distribute load. One client might see the list ordered as [A, B, C, D], while another sees [C, D, A, B]. This is entirely legal and has been part of DNS design since the beginning.
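To make the rotation concrete, here is a toy sketch of server-side round-robin, under the simplifying assumption of one shared answer list rotated one position per query (the function names are hypothetical, not from any real DNS server):

```javascript
// Illustrative round-robin: the "server" rotates the answer list one
// position on every query, so successive clients see different
// orderings of the same set of records.
function makeRoundRobinResolver(records) {
  let offset = 0;
  return function resolve() {
    // Start the answer at `offset` and wrap around the end of the list.
    const answer = records.slice(offset).concat(records.slice(0, offset));
    offset = (offset + 1) % records.length;
    return answer;
  };
}

const resolve = makeRoundRobinResolver([
  '192.0.2.1', '192.0.2.2', '192.0.2.3', '192.0.2.4',
]);
resolve(); // first client sees  ['192.0.2.1', '192.0.2.2', '192.0.2.3', '192.0.2.4']
resolve(); // second client sees ['192.0.2.2', '192.0.2.3', '192.0.2.4', '192.0.2.1']
```

Notice that even a client which naively takes the first address still gets spread across servers over time, because the server keeps moving which address comes first. That only works as load balancing if the order actually rotates.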

But something insidious happened over decades of DNS operation. Through years of informal practice, most DNS infrastructure providers maintained relatively stable record ordering. They didn’t promise to keep the order consistent—they just happened to. It was an implementation detail that seemed harmless.

And countless pieces of client software—routers, smart TVs, IoT devices, network appliances, old operating systems—quietly developed an undocumented assumption: they started always using the first IP address in the list and ignoring the rest.

Nobody explicitly coded this behavior as a rule. It just emerged naturally: “Why bother with the complexity of trying multiple addresses when the first one always works?” The engineers who wrote this code probably never even consciously thought about it as an assumption worth documenting.
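The cost of that shortcut shows up in aggregate. A toy simulation (not taken from any real client) of a thousand clients all applying the "just take the first one" rule to an identically ordered answer:

```javascript
// What the "always take the first record" shortcut does at scale:
// if the answer list is stable, every client lands on the same server.
function pickFirst(records) {
  return records[0];
}

const answer = ['192.0.2.1', '192.0.2.2', '192.0.2.3', '192.0.2.4'];

// Simulate 1000 independent clients resolving the same name.
const picks = Array.from({ length: 1000 }, () => pickFirst(answer));
const serversUsed = new Set(picks);
// serversUsed contains only '192.0.2.1' — three of the four servers
// never receive any traffic at all.
```

The redundancy is still present in the data, but the shortcut throws it away: the other three addresses might as well not exist.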

The Day Everything Broke

When Cloudflare, one of the internet’s major infrastructure providers, decided to optimize its DNS infrastructure by changing how it ordered DNS records, it was changing an implementation detail, not a contract. The goal was to improve performance by strategically ordering addresses to steer clients toward better servers.

From a technical standpoint, this was perfectly legal: the DNS specification never guaranteed stable record ordering. Cloudflare was playing by the rules.

But suddenly, millions of users couldn’t access websites. Not all users—which made it even more confusing—just users whose devices contained software with that hidden assumption about always using the first DNS record.

A router firmware written in 2008, sitting in someone’s home office, would suddenly break. A corporate network appliance that hadn’t been updated in years would fail. A smart TV’s network stack would stop working. All because they expected the first DNS record to be “the correct one,” even though DNS had never promised any such thing.

The change didn’t break well-written software that properly implemented DNS specifications. It broke software that had been quietly relying on an undocumented behavior for years—software that worked perfectly fine until someone changed what “shouldn’t” matter.

Why This Keeps Happening

This incident illustrates a broader principle about complex systems: implicit assumptions and undocumented behaviors can become architectural dependencies over time.

The technical term for this is “Hyrum’s Law,” which states: With a sufficient number of users of an API, it does not matter what you promise in the contract: all observable behaviors of your system will be depended on by somebody.

In other words, if your system behaves a certain way—even if you never promised or documented that behavior—someone, somewhere will write code that depends on it. And when you change that behavior, even if the change is technically correct, their code will break.

The Hidden Cost of Compatibility

This is why internet infrastructure is so conservative. Engineers can’t simply “fix” inefficiencies or modernize systems, even when changes are technically legal and would improve performance, because they don’t know what invisible dependencies might break.

It’s like discovering that a building’s foundation has been slowly sinking for 30 years, but all the interior walls and plumbing have adapted to the tilt. Now you can’t fix the foundation without potentially destroying everything built on top of it.

The internet is held together by countless unwritten assumptions buried in old software that nobody remembers exists. Each assumption creates a potential point of failure that won’t surface until someone changes what “shouldn’t” matter.

The Brittleness of “Good Enough”

What makes this particularly interesting is that the broken software wasn’t completely wrong—it was just not robust.

Using the first DNS record in a list works most of the time. It’s simpler to implement. It requires less code, less testing, and less complexity. For years, it was “good enough.”

But “good enough” in software engineering often means “works under current conditions.” When conditions change—even in ways that were always theoretically possible—“good enough” can suddenly become “completely broken.”

The engineers who wrote that simplified DNS client code probably thought they were being pragmatic. They saw that the first record always worked, so why add complexity? They couldn’t have known that a decade later, their assumption would become a critical dependency affecting millions of users.

What DNS Records Actually Look Like

To understand this better, let’s look at what DNS responses actually contain. When a DNS server responds to a query, it includes several sections:

;; ANSWER SECTION:
example.com.    300    IN    A    192.0.2.1
example.com.    300    IN    A    192.0.2.2
example.com.    300    IN    A    192.0.2.3
example.com.    300    IN    A    192.0.2.4

Each line is an “A record” (Address record) that maps the domain name to an IPv4 address. The number 300 is the TTL (Time To Live) in seconds, telling clients how long they may cache this information.

According to DNS specifications (particularly RFC 1035, which defines DNS), the order of these records carries no semantic meaning. Clients should be prepared to receive them in any order and should ideally try multiple addresses if the first one fails.
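To make that structure concrete, here is a small sketch that parses one answer line like those above into its named fields. It assumes simple whitespace-separated output in the style shown; real zone-file syntax has more cases than this, and the function name is hypothetical:

```javascript
// Parse one dig-style answer line, e.g.
//   "example.com.    300    IN    A    192.0.2.1"
// into its five fields: owner name, TTL, class, record type, address.
function parseARecord(line) {
  const [name, ttl, cls, type, address] = line.trim().split(/\s+/);
  return { name, ttl: Number(ttl), cls, type, address };
}

parseARecord('example.com.    300    IN    A    192.0.2.1');
// → { name: 'example.com.', ttl: 300, cls: 'IN', type: 'A', address: '192.0.2.1' }
```

Note that nothing in the parsed record says anything about position: each record is self-describing, which is exactly why the specification can treat ordering as meaningless.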

But here’s what many DNS clients actually do:

// Simplified example of broken DNS client logic
function resolveHost(hostname) {
  const records = queryDNS(hostname);

  // Assumption: the first record is "the" address.
  // Never tries the other addresses, never implements failover,
  // never considers that the order might change.
  return records[0];
}

This code works perfectly—until the day the DNS server changes the record order.

The Ripple Effects

The DNS record ordering incident didn’t just affect individual users. It revealed systemic brittleness in internet infrastructure:

IoT Devices and Embedded Systems

Many IoT devices—smart thermostats, security cameras, connected appliances—contain minimal DNS client implementations. These devices often can’t be easily updated, meaning the broken assumptions are effectively permanent. When DNS record ordering changed, thousands of IoT devices lost connectivity, and in many cases, the only fix was physically replacing the device.

Enterprise Network Appliances

Corporate firewalls, load balancers, and network monitoring tools sometimes contain DNS client code written years ago. These appliances might be business-critical infrastructure that companies are reluctant to update due to stability concerns. A DNS ordering change could suddenly break internal corporate networks.

Mobile Applications

Some mobile apps bundle their own DNS resolution code rather than using the operating system’s DNS client. If that code makes assumptions about record ordering, a DNS infrastructure change could break the app for millions of users—even if the rest of the device works fine.

The Update Problem

The most insidious aspect is that fixing this problem requires updating all the broken client software. But some of that software is running on devices that are difficult or impossible to update. Some devices are no longer supported by their manufacturers. Some are in remote locations. Some are embedded in critical infrastructure where updates are risky.

Why This Matters Beyond DNS

The DNS record ordering incident is particularly important because it exemplifies a much broader challenge in software engineering and system design: the problem of emergent dependencies.

When you build a system, you make explicit guarantees about how it will behave. These guarantees form a contract with users of your system. But your system also has countless behaviors that you never explicitly promised—implementation details, side effects, performance characteristics, ordering properties, and timing behaviors.

If your system is successful and widely used, people will inevitably build on top of those undocumented behaviors. Not maliciously, but pragmatically. Because it works. Because it’s simpler. Because they didn’t know better. Because the documentation was incomplete.

Years later, when you want to improve your system or fix a bug or optimize performance, you discover that you can’t change those undocumented behaviors without breaking things. Your implementation details have become de facto specifications, locked in by the weight of all the code that depends on them.

The Engineering Dilemma

This creates a genuine dilemma for infrastructure providers:

Option 1: Never change anything. Maintain backward compatibility forever, even for undocumented behaviors. This keeps existing systems working but prevents optimization, security improvements, and architectural evolution. Your infrastructure gradually becomes a museum of historical accidents.

Option 2: Change things carefully and deal with breakage. Make necessary improvements even though they’ll break systems that depended on undocumented behaviors. This enables progress but causes real pain for real users whose devices and applications suddenly stop working.

Option 3: Comprehensive communication and gradual migration. Announce changes well in advance, provide migration tools, and phase transitions slowly. This is the ideal approach but requires enormous coordination and still won’t reach every affected system.

Cloudflare ultimately chose option 2, deciding that the long-term health of the internet infrastructure was worth the short-term pain of exposed assumptions. Other providers might choose differently.

What Robust DNS Clients Look Like

So what should DNS client software actually do? Here’s a more robust approach:

// Better DNS client implementation
// (queryDNS and connectTo stand in for real lookup and connect calls)
function resolveHostRobust(hostname) {
  const records = queryDNS(hostname);

  // Try addresses in order, but don't assume the order is meaningful
  for (const address of records) {
    try {
      return connectTo(address);
    } catch (error) {
      // This address failed; fall through and try the next one
    }
  }

  // All addresses failed
  throw new Error('Could not connect to any address');
}

This implementation:

  • Doesn’t assume the first address is special
  • Tries multiple addresses if the first fails
  • Handles cases where DNS order changes
  • Implements proper failover behavior

It’s more complex than just using records[0], but that complexity provides resilience against DNS infrastructure changes.
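That resilience can be demonstrated directly. Here is a variant of the same failover loop with the connection step injected as a parameter so it can be exercised deterministically; the fake connector and function names are illustrative, not real networking code:

```javascript
// Order-independent failover: try each address until one works.
// tryConnect is injected so the behavior can be tested without a network.
function resolveRobust(records, tryConnect) {
  for (const address of records) {
    try {
      return tryConnect(address);
    } catch (err) {
      // This address is unreachable; move on to the next one.
    }
  }
  throw new Error('Could not connect to any address');
}

// Fake connector: only 192.0.2.3 is reachable.
const tryConnect = (addr) => {
  if (addr !== '192.0.2.3') throw new Error('unreachable');
  return addr;
};

// Same result for either ordering — reordering the records cannot break it.
resolveRobust(['192.0.2.3', '192.0.2.1'], tryConnect); // → '192.0.2.3'
resolveRobust(['192.0.2.1', '192.0.2.3'], tryConnect); // → '192.0.2.3'
```

The loop succeeds whether the working address comes first or last, which is precisely the property the broken `records[0]` client lacks.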

The Lesson for Software Engineering

The DNS record ordering incident teaches us several important lessons about building software:

Document Your Assumptions

If your code assumes something about external systems—even if it seems obvious—document it. Future maintainers need to know what your code depends on, especially when those dependencies are subtle.

Implement Specifications, Not Behaviors

When integrating with external systems, implement what the specifications promise, not what the current behavior happens to be. Today’s behavior might be an implementation detail that could change tomorrow.

Design for Change

Assume that anything that can change will change. DNS record ordering can change. API response structures can add new fields. Network latencies can vary. Build your code to handle these variations gracefully.

Test Uncommon Cases

Many bugs hide in code paths that rarely execute. What happens when the first DNS record is unreachable? What happens when records appear in unexpected orders? Testing these scenarios reveals brittleness before it breaks in production.
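One practical way to surface ordering assumptions is to shuffle DNS answers in tests before handing them to the code under test. A minimal sketch using a Fisher–Yates shuffle (a standard algorithm; the helper itself is hypothetical test scaffolding):

```javascript
// Test helper: Fisher-Yates shuffle. Randomizing the order of DNS
// answers in tests exposes code that only works for one particular
// ordering — run the test many times and assert the result never
// depends on where each address happened to land.
function shuffle(records) {
  const out = records.slice(); // don't mutate the caller's array
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [out[i], out[j]] = [out[j], out[i]];
  }
  return out;
}
```

A resolver that passes repeated shuffled runs has, in effect, been tested against the very change that caused the incident: a reordering that was always legal but rarely observed.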

The Internet’s Fragile Harmony

The internet works remarkably well most of the time. We can video chat with people on the other side of the planet, stream movies instantly, and access billions of web pages with barely a thought.

But beneath that smooth surface lies extraordinary complexity held together by a mix of formal specifications, informal conventions, and unspoken assumptions. The DNS record ordering incident pulled back the curtain on one such assumption—one that millions of systems depended on without anyone realizing it was a dependency at all.

The internet is both more robust and more fragile than most people imagine. It’s robust because it was designed with redundancy, flexibility, and adaptability. It’s fragile because decades of accumulated assumptions have created hidden dependencies that even experts don’t fully understand.

Moving Forward

So what happens now? The DNS record ordering incident has made infrastructure providers more cautious about seemingly innocent changes. It’s prompted discussions about how to better document behavioral expectations. It’s led to efforts to identify and fix brittle DNS client implementations.

But the deeper problem remains: complex systems evolve in ways that create emergent dependencies. We can’t fully prevent this phenomenon—it’s an inherent property of systems with many users and many years of operation.

What we can do is:

  • Build software that follows specifications rather than observed behaviors
  • Test our code against variations we might not see in normal operation
  • Document assumptions explicitly, especially ones that seem obvious
  • Communicate changes carefully when modifying widely-used infrastructure
  • Accept that some breakage is inevitable when improving systems

The internet will continue to reveal hidden assumptions. New incidents will expose new dependencies we didn’t know existed. Each incident teaches us something about the invisible complexity we’ve built into our infrastructure.

The Bigger Picture

The story of DNS record ordering is ultimately about how human systems evolve. We build tools and infrastructure with certain assumptions. Those tools work well, so people build on top of them. Over time, the accumulated weight of what’s been built constrains what we can change, even when changes would be improvements.

This happens in software. It happens in physical infrastructure. It happens in organizations and societies. Every system accumulates “technical debt” in the form of historical decisions that seemed reasonable at the time but constrain future possibilities.

Understanding DNS record ordering fragility helps us appreciate not just the internet’s technical architecture, but the broader challenge of managing complex systems over time. It reminds us that “technically correct” and “practically deployable” are different things.

And it teaches us humility: even something as fundamental and well-understood as DNS—technology that’s been working reliably since the 1980s—can surprise us with hidden assumptions that only become visible when someone changes what “shouldn’t” matter.

The next time your internet mysteriously stops working and then mysteriously starts working again, remember: somewhere in the vast complexity of internet infrastructure, someone probably just changed the order of a list.