Cloudflare's 'Fail Small' Initiative: Building a More Resilient Network for the Future

Over the past two and a half quarters, Cloudflare has undertaken an intensive engineering project internally known as "Code Orange: Fail Small". The goal was straightforward yet ambitious: make our infrastructure more resilient, secure, and reliable for every customer. Earlier this month, we completed the core work of this initiative. While network resiliency is never truly "finished"—it remains a continuous priority across our development lifecycle—this project specifically targeted the root causes that led to two global outages on November 18, 2025 and December 5, 2025. The completed work would have prevented both incidents entirely.

This effort focused on four main pillars: safer configuration changes, reducing the impact of failures, revising our "break glass" procedures and incident management, and introducing mechanisms to prevent drift and regressions. We also strengthened how we communicate with customers during outages. Below, we detail what we shipped and what it means for you.

Safer Configuration Changes

Before this project, many Cloudflare internal configuration changes propagated instantly across the network. Now, in most cases, changes no longer reach the entire network at once. Instead, they are rolled out progressively with real-time health monitoring. This allows our observability tools to detect problems and automatically revert changes before they affect your traffic.

Source: blog.cloudflare.com

We identified high-risk configuration pipelines, such as those directly involved in the November and December outages, and built new tools to manage these changes more safely. For products that process customer traffic and receive configuration updates, we now require a "health-mediated deployment" methodology for all configuration changes. This is the same approach we already use for software releases, now extended to configuration. It has been adopted by every product team affected by the incidents, as well as others.

Introducing Snapstone

Central to this improvement is a new internal component called Snapstone. Snapstone packages configuration changes into bundles and then releases them gradually, automatically checking health metrics at each stage. Before Snapstone, applying health-mediated deployment to configuration was technically possible but required significant, custom effort for each team. Consequently, the practice wasn't consistently applied across the network. Snapstone closes that gap by providing a unified system that brings progressive rollout, real-time health monitoring, and automated rollback to every configuration deployment by default.
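
To make the idea of a health-mediated rollout concrete, here is a minimal sketch in Python. It is not Snapstone's implementation or API, which this post does not describe. The stage names, the `progressive_rollout` function, and the `apply`, `is_healthy`, and `revert` callbacks are all hypothetical, chosen only to illustrate the pattern of releasing in stages, checking health after each one, and reverting automatically on a failed check.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence


@dataclass
class RolloutResult:
    completed: bool
    rolled_back: bool
    stages_applied: List[str] = field(default_factory=list)


def progressive_rollout(
    stages: Sequence[str],
    apply: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    revert: Callable[[str], None],
) -> RolloutResult:
    """Apply a change stage by stage, stopping at the first failed health check.

    All names here are illustrative assumptions, not a real Cloudflare interface.
    """
    applied: List[str] = []
    for stage in stages:
        apply(stage)
        applied.append(stage)
        if not is_healthy(stage):
            # Undo every stage that received the change, newest first.
            for done in reversed(applied):
                revert(done)
            return RolloutResult(completed=False, rolled_back=True, stages_applied=applied)
    return RolloutResult(completed=True, rolled_back=False, stages_applied=applied)


if __name__ == "__main__":
    live = set()  # stages currently running the new configuration
    result = progressive_rollout(
        stages=["canary", "one-region", "global"],  # hypothetical stage names
        apply=live.add,
        is_healthy=lambda stage: stage != "one-region",  # simulate a failed check
        revert=live.discard,
    )
    print(result, live)
```

In this sketch the simulated failure at the second stage means the "global" stage never receives the change, which is the property that keeps a bad configuration from reaching the entire network at once.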

What makes Snapstone particularly powerful is its flexibility. Rather than being a fix for specific past failures, it allows teams to dynamically define any unit of configuration that needs health mediation—whether that's a data file (like the one that caused the November 18 outage) or a control flag in our global configuration system (as involved in the December 5 outage). Teams create these configuration units on demand, and Snapstone handles the rest.

Reducing the Impact of Failures

Even with safer deployments, failures can still occur. Code Orange also focused on limiting the blast radius when something goes wrong. We introduced architectural changes to isolate services so that a failure in one part of the network doesn't cascade to others. For example, we redesigned certain shared data stores and control planes to fail gracefully, ensuring that only a small subset of customers (or a single service) is affected.

We also implemented circuit breaker patterns and bulkheads in critical subsystems, alongside improved rate limiting and throttling at our edge. These measures automatically cut off problematic traffic or dependency calls, preventing a local issue from becoming a global outage.
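
For readers unfamiliar with the circuit breaker pattern, the following is a generic, minimal sketch in Python. It is not taken from Cloudflare's systems. The class name, thresholds, and the injectable `clock` parameter are assumptions made for illustration. After a configured number of consecutive failures the breaker "opens" and rejects calls immediately, and once a timeout has passed it lets a single trial call through to test whether the dependency has recovered.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(RuntimeError):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._failure_threshold = failure_threshold
        self._reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_timeout:
                # Still open: fail fast without touching the dependency.
                raise CircuitOpenError("circuit open; call skipped")
            # Timeout elapsed: fall through and allow one trial call (half-open).
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self._failure_threshold:
                # Open (or re-open) the circuit and restart the timeout.
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

The point of failing fast is that callers stop piling load onto a struggling dependency, which is one way a local fault is kept from spreading.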

Revised Break Glass Procedures and Incident Management

When anomalies occur, a quick and safe response is crucial. We overhauled our "break glass" procedures, the emergency overrides that allow engineers to bypass normal safeguards during critical incidents. The goal was to ensure that such overrides are both fast to use and safe, reducing human error during high-pressure situations.

Our incident management process was also updated. We now have clearer escalation paths, predefined roles (e.g., incident commander, communications lead), and mandatory post-incident reviews that feed directly into our engineering backlog. This ensures that lessons learned are systematically applied to prevent recurrence.

Preventing Drift and Regressions Over Time

Complex systems naturally drift away from their intended secure and resilient configurations. To counter this, we introduced a set of automated checks and periodic audits. New configuration changes are automatically compared against a set of policy rules that encode best practices for safety and consistency. If a change would create a regression or violate a policy, it is either blocked or flagged for review.
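
A policy check of this kind might look roughly like the Python sketch below. The rule names, the dictionary shape of a change, and the `evaluate_change` function are hypothetical. The post does not describe the actual rules or tooling, so treat this only as an illustration of the "block or flag for review" decision.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

Change = Dict[str, Any]


@dataclass(frozen=True)
class PolicyRule:
    name: str
    check: Callable[[Change], bool]  # returns True when the change complies
    blocking: bool                   # True: violation blocks; False: violation flags


def evaluate_change(change: Change, rules: List[PolicyRule]) -> Tuple[str, List[str]]:
    """Return a verdict and the names of the rules that produced it."""
    violated = [rule for rule in rules if not rule.check(change)]
    blocked = [rule.name for rule in violated if rule.blocking]
    flagged = [rule.name for rule in violated if not rule.blocking]
    if blocked:
        return "blocked", blocked
    if flagged:
        return "needs_review", flagged
    return "allowed", []


# Hypothetical example rules, invented for this sketch.
RULES = [
    PolicyRule(
        name="must_use_progressive_rollout",
        check=lambda c: c.get("rollout") == "progressive",
        blocking=True,
    ),
    PolicyRule(
        name="should_name_an_owner",
        check=lambda c: bool(c.get("owner")),
        blocking=False,
    ),
]

if __name__ == "__main__":
    print(evaluate_change({"rollout": "instant", "owner": "team-a"}, RULES))
    print(evaluate_change({"rollout": "progressive"}, RULES))
```

Separating blocking rules from advisory ones lets strict safety requirements stop a change outright while softer conventions only route it to a human reviewer.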

We also deployed regression test suites that run continuously in our test environments, simulating major failure scenarios. These tests verify that previous fixes remain effective and that no new changes reintroduce old vulnerabilities.

Improved Customer Communication During Outages

We understand that during an outage, timely and transparent communication is essential. We overhauled our status page and internal communication tools to provide faster, more granular updates. We now have a dedicated incident communications team that works alongside engineers to craft clear statements, including estimated times to resolution (ETR), root cause explanations, and affected services. We also integrated our monitoring systems directly into our communication channels so that alerts automatically generate status page updates.

What This Means for You

As a result of Code Orange: Fail Small, Cloudflare's network is now more resilient, secure, and reliable. Configuration changes are safer, failures are contained more effectively, and our team can respond faster and more accurately. We will continue to build on this foundation, but the work completed this month represents a major milestone in our commitment to delivering a rock-solid platform for every customer.

Thank you for your trust and patience as we made these improvements. We remain dedicated to keeping the Internet fast, secure, and reliable.
