Anatomy of a Disaster: Why Did Cloudflare Go Silent? A Technical Analysis of the November 18 Incident
Radio Silence
On November 18, 2025, at exactly 11:20 UTC, network engineers around the world froze. Network traffic charts, usually pulsing with life, suddenly nosedived or—in the case of error codes—shot vertically upwards. Users trying to access their favorite services, from e-commerce platforms to simple blogs, hit a wall in the form of an "Internal Server Error" (Error 500) message. Discord was down, Shopify was having issues, and even the Cloudflare status page itself seemed to be sputtering.
This event, which many of us watched with growing anxiety, fits into a grim series of autumn outages that we covered in our summary, Déjà vu: Is the Internet Going Through a "Dark Age"?. We observed a similar paralysis a month earlier, during the major Amazon Web Services outage.
Today, now that the dust has settled and the servers are humming (or rather, whirring with fans) in their normal rhythm again, we have a chance to peek under the hood of this incident. Thanks to the transparency of the Cloudflare team, we know exactly what happened. And no, it wasn't a hacker attack, although everything pointed to that initially. It was a lesson in humility for software engineering—a story of how a "safe" database permission change and one unfortunate logic error in the code can stop the internet.
I invite you on a technical journey into Cloudflare's "black box." Grab a coffee, because this is going to be a long, deep dive.
"It Must Be an Attack!" – The Fog of War
Every large-scale incident looks like chaos in the first phase. When monitoring systems started flashing red at 11:20, the first thought in Cloudflare's Network Operations Center (NOC) wasn't "bug in the code." Suspicion fell on a massive DDoS attack.
Why? Because the symptoms were misleading. Systems didn't fail completely and instantly. Strange fluctuations were observed—traffic dropped, then returned, then dropped again. This is the classic image of defensive systems battling a botnet that is changing attack vectors. Moreover, internal engineer chats raised concerns that this might be continued activity from the "Aisuru" botnet, which had been flexing its muscles in recent weeks. Even Matthew Prince, Cloudflare's CEO, expressed concern: "I'm worried this is that big botnet showing its strength."
The situation was worsened by the fact that the Cloudflare Status page also went down. Theoretically, it is hosted outside Cloudflare's own infrastructure precisely so that it remains available during outages. This coincidence led engineers to suspect a sophisticated, targeted attack hitting multiple points at once.
Only a deeper analysis of the logs revealed that the problem wasn't coming from the outside. Its source lay at the very heart of the system.
Understanding ClickHouse: The Architecture of the Problem
To understand what really happened, we must dive into the database architecture used by Cloudflare. The company heavily utilizes ClickHouse—a columnar OLAP database, ideal for analyzing massive amounts of data in real-time.
In a ClickHouse cluster, data is split across multiple shards. To query all shards at once, so-called Distributed tables are used; these live in a database named default. The physical data, however, is stored on the individual nodes in underlying local tables, which at Cloudflare live in a separate database named r0.
Until the day of the outage, users (and systems) querying table metadata (e.g., a list of columns) saw only what was in the default database. Access to the r0 database was implicit.
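As a rough illustration of that layout (the table name below is hypothetical; only the default/r0 split mirrors Cloudflare's description):

```rust
// Rough illustration of the two-level layout described above. The table name
// is invented; only the default/r0 split mirrors the published post-mortem.

/// What clients query: the logical, cluster-wide Distributed table.
const LOGICAL: &str = "SELECT count() FROM default.request_features";

/// What each shard actually runs once ClickHouse fans the query out:
/// the same query against the local table holding the physical rows.
const PER_SHARD: &str = "SELECT count() FROM r0.request_features";

fn main() {
    println!("client sends:    {LOGICAL}");
    println!("each shard runs: {PER_SHARD}");
}
```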
The Change That Changed Everything
At 11:05 UTC, a change was deployed aimed at improving security and reliability. Ironic, isn't it?
The engineers wanted distributed queries to operate based on more precise permissions of the user initiating the query. To achieve this, the permission configuration was changed so that access to tables in r0 became explicit. This allowed the system to better control resource limits and permissions for individual sub-queries. On paper—a great, almost textbook change in the spirit of "least privilege" and better control.
In practice, this change had one unintended side effect.
The Query That Killed the System
Cloudflare's system has a Bot Management module. It is responsible for assessing every HTTP request to determine if it comes from a human or a bot. This system relies on machine learning (ML) models. These models need "features"—specific traffic attributes upon which they base their assessment.
The configuration of these features is not static. It is generated dynamically every 5 minutes based on traffic analysis, allowing Cloudflare to adapt to new threats almost in real-time.
The process of generating this configuration (the feature file) executes a query to the ClickHouse database to fetch a list of available columns (features).
The key error was that this query did not filter results by database name. Before 11:05, this query returned a list of columns only for the table in the default database. After 11:05, when permissions to r0 became explicit, the same query began returning the column list twice: once for the table in the default database, and a second time for the table in the r0 database.
Instead of a list of unique features, the system received a list full of duplicates.
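To make this concrete, here is a minimal sketch of what such a metadata query and its fix might look like. The query text is a reconstruction based on the post-mortem's description, the feature names are invented, and the deduplication helper is an extra precaution of my own, not something Cloudflare describes:

```rust
use std::collections::BTreeSet;

// The original query shape: fetch the column list for the features table.
// Before the permission change this returned one row per column, because
// only the default database was visible to the querying user.
const FEATURE_COLUMNS_QUERY: &str = "\
    SELECT name, type
    FROM system.columns
    WHERE table = 'http_requests_features'
    ORDER BY name";

// The defensive version: scope the metadata lookup to a single database,
// so newly visible schemas (such as r0) cannot change the result set.
const FEATURE_COLUMNS_QUERY_SCOPED: &str = "\
    SELECT name, type
    FROM system.columns
    WHERE database = 'default'
      AND table = 'http_requests_features'
    ORDER BY name";

/// Belt-and-braces on the application side: deduplicate whatever the
/// database returns before building the feature file.
fn unique_features(rows: Vec<(String, String)>) -> Vec<(String, String)> {
    let mut seen = BTreeSet::new();
    rows.into_iter()
        .filter(|(name, _)| seen.insert(name.clone()))
        .collect()
}

fn main() {
    // Simulate the post-change behaviour: every column shows up twice,
    // once from default.* and once from r0.* (feature names invented).
    let duplicated = vec![
        ("ua_score".to_string(), "Float64".to_string()),
        ("ua_score".to_string(), "Float64".to_string()),
        ("ja4_fingerprint".to_string(), "String".to_string()),
        ("ja4_fingerprint".to_string(), "String".to_string()),
    ];
    assert_eq!(unique_features(duplicated).len(), 2);
    println!("{FEATURE_COLUMNS_QUERY}\n---\n{FEATURE_COLUMNS_QUERY_SCOPED}");
}
```

The essential fix is a single predicate: scope the metadata lookup to one database, so that a change in schema visibility can never change the shape of the result.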
Memory and Fatal Error Handling
Now we move from the database layer to the application layer. Cloudflare's main proxy system (internally called "FL" or "FL2" in the newer version) is written in Rust. Rust is famous for memory safety and performance, but it is also very strict when it comes to error handling.
For performance optimization purposes, the Bot Management module in the proxy pre-allocates memory for features. Engineers set a hard limit of 200 on the number of features that could be handled, well above the roughly 60 features in use at the time. It seemed like a safe margin: more than a threefold buffer.
However, when the database query started returning duplicate rows, the resulting configuration file "bloated." The number of defined features (due to duplicates) exceeded the limit of 200.
Here we get to the core of the code problem. The developer used a construct that assumed the operation of adding features to the buffer would always succeed. In Rust, this is typically expressed with the unwrap() method, which says in effect: "trust me, this cannot fail, and if it somehow does, abort the whole program."
Because the configuration file was oversized, the operation did fail, unwrap() triggered a so-called panic, and the thread handling network traffic simply died.
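A hypothetical reconstruction of that failure mode is sketched below; the types are invented, and only the 200-feature cap and the unwrap-then-panic pattern come from the published account:

```rust
/// Hypothetical reconstruction of the failure mode. The types are invented;
/// only the 200-feature cap and the unwrap-then-panic pattern come from the
/// published account.
const MAX_FEATURES: usize = 200;

#[derive(Debug)]
struct CapacityExceeded;

struct FeatureBuffer {
    // Pre-allocated once so the request hot path never has to reallocate.
    slots: Vec<String>,
}

impl FeatureBuffer {
    fn new() -> Self {
        Self { slots: Vec::with_capacity(MAX_FEATURES) }
    }

    /// Appending is fallible by design: the caller has to decide what
    /// "too many features" should mean.
    fn try_push(&mut self, name: String) -> Result<(), CapacityExceeded> {
        if self.slots.len() >= MAX_FEATURES {
            return Err(CapacityExceeded);
        }
        self.slots.push(name);
        Ok(())
    }
}

fn load_feature_config(names: impl Iterator<Item = String>) -> FeatureBuffer {
    let mut buffer = FeatureBuffer::new();
    for name in names {
        // The fatal assumption: "this cannot fail." When the duplicated feature
        // file pushes the count past 200, try_push returns an Err, unwrap()
        // panics, and the whole worker thread goes down with it.
        buffer.try_push(name).unwrap();
    }
    buffer
}

fn main() {
    // Simulate an oversized, duplicated feature file: 150 names, each twice.
    let duplicated = (0..150).flat_map(|i| {
        let name = format!("feature_{i}");
        [name.clone(), name]
    });
    let _ = load_feature_config(duplicated); // panics here, just like FL2 did
}
```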
The Loop of Death and Domino Effect
If this had happened once, the system would have restarted and recovered. The problem was that the corrupted configuration file was propagated to all servers in the Cloudflare network.
1. A server downloads the new configuration (with duplicates).
2. The proxy attempts to load it.
3. The code hits the 200-feature limit.
4. The process panics and terminates.
5. The supervisor system restarts the proxy process.
6. The process comes back up, downloads the configuration... and we are back at step 1.
This caused a global restart loop on thousands of machines. For the end-user, this looked like a 500, 502, or 504 error, because the proxy was unable to handle the request.
Interestingly, the older version of the proxy ("FL") didn't "panic," but due to the faulty configuration, it simply stopped working correctly, assigning a "bot score" of zero to all traffic. This meant that clients relying on bot blocking might have been flooded with spam, or (if they had rules blocking low scores) blocked real users.
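A tiny illustration of why a default score of zero is so destructive for customers who block "low score" traffic (the rule and the threshold below are invented for the example):

```rust
/// Illustrative stand-in for a customer firewall rule such as "block requests
/// with a bot score below 30". The threshold is invented for the example.
fn should_block(bot_score: u8, threshold: u8) -> bool {
    bot_score < threshold
}

fn main() {
    // Normal operation: a human visitor with a high score passes.
    assert!(!should_block(92, 30));
    // During the incident, FL assigned a score of 0 to all traffic, so every
    // request, human or not, matched the "low score" rule and was blocked.
    assert!(should_block(0, 30));
    println!("with a default score of 0, every request trips the block rule");
}
```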
Why Did the Dashboard and Other Services Fail?
Cloudflare is a system of interconnected vessels. The failure of the main proxy dragged other services down with it:
- Workers KV: This service is the foundation of many Cloudflare systems. Since access to it occurs via the main proxy, the avalanche of errors cut off access to data.
- Cloudflare Access: This service relies on Workers KV. No KV meant no possibility of login and authorization.
- Dashboard (Client Panel): Cloudflare's admin panel uses Turnstile (their alternative to CAPTCHA) for login. Since Turnstile wasn't loading (because the proxy was down), no one could log in to check what was happening. Even the employees themselves had trouble with internal tools.
This is a classic example of a circular dependency in a crisis situation.
The Road to Recovery: From Chaos to Stabilization
The repair process was not trivial. Remember the "fog of war."
- 11:20 - 13:05: The team fought the symptoms, thinking it was a KV performance issue or an attack. Attempts were made to reroute traffic and limit accounts.
- 13:05: Breakthrough. A bypass was successfully deployed for Workers KV and Access so that these services would skip the main, broken proxy. This allowed engineers to regain control over tools and reduce the impact of the outage on key internal systems.
- 13:37: The culprit was identified—the Bot Management configuration file. It was understood that this was causing the crash in the code.
- 14:24: Automatic generation and propagation of new (faulty) files was stopped.
- 14:30: The last known good version of the configuration file was manually "injected" into the distribution system.
- Restart: A proxy restart was forced. Since the file was now correct (fitting within the 200 limit), the processes came up correctly and stopped restarting.
Until 17:06, engineers were "cleaning up" after the outage, restarting the remaining services that had entered an error state during the main incident.
Conclusions: A Lesson for the Whole Industry
The Cloudflare incident is a brutal reminder of several fundamental principles of distributed systems engineering that are easily forgotten in the rush for innovation.
First: Input validation isn't just protection against the user. Cloudflare treated data from its own database as "trusted." The code assumed the database would return unique columns because "it always had." A change in another part of the system shattered that assumption. Treat data from internal systems just as suspiciously as data from a user.
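In practice this can start with a validation gate on the producer side, run before a freshly generated file is ever propagated. A minimal sketch, with assumed names and error types (only the 200 limit comes from the incident report):

```rust
use std::collections::HashSet;

/// Producer-side gate: validate a freshly generated feature file before it is
/// ever propagated, treating our own database output as untrusted input.
/// The limit matches the post-mortem; the names and error type are assumptions.
const MAX_FEATURES: usize = 200;

#[derive(Debug)]
enum ConfigError {
    TooManyFeatures(usize),
    DuplicateFeature(String),
}

fn validate_feature_file(features: &[String]) -> Result<(), ConfigError> {
    if features.len() > MAX_FEATURES {
        return Err(ConfigError::TooManyFeatures(features.len()));
    }
    let mut seen = HashSet::new();
    for name in features {
        if !seen.insert(name.as_str()) {
            return Err(ConfigError::DuplicateFeature(name.clone()));
        }
    }
    Ok(())
}

fn main() {
    let good = vec!["ua_score".to_string(), "ja4_fingerprint".to_string()];
    let bad = vec!["ua_score".to_string(), "ua_score".to_string()];
    assert!(validate_feature_file(&good).is_ok());
    // A duplicated file is rejected here, on one machine, instead of crashing
    // every proxy process in the fleet.
    assert!(matches!(validate_feature_file(&bad), Err(ConfigError::DuplicateFeature(_))));
}
```

Rejecting a bad file on the single machine that generates it is far cheaper than asking millions of proxy processes around the world to cope with it.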
Second: Errors shouldn't kill the process. In production code, a panic in the main thread handling network requests is unacceptable. A configuration error should result in logging the error and possibly operating in a fallback mode, but never in a total shutdown of the service.
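The consumer side deserves the same defensiveness: parse the file, and on failure log the problem and keep serving with the last known good configuration. A sketch of that "degrade, don't die" pattern, with illustrative types and a stand-in parser:

```rust
/// "Degrade, don't die": on a bad configuration, log the error and keep serving
/// with the last known good one. The types and parser are illustrative.
struct FeatureConfig {
    features: Vec<String>,
}

#[derive(Debug)]
struct ConfigError(String);

fn parse_config(raw: &str) -> Result<FeatureConfig, ConfigError> {
    // Stand-in parser: one feature per line, rejected (not panicked on)
    // when the hard limit is exceeded.
    let features: Vec<String> = raw.lines().map(|line| line.to_string()).collect();
    if features.len() > 200 {
        return Err(ConfigError(format!("{} features exceed the limit", features.len())));
    }
    Ok(FeatureConfig { features })
}

struct BotModule {
    active: FeatureConfig,
}

impl BotModule {
    /// Called each time a new feature file arrives (every five minutes).
    fn reload(&mut self, raw: &str) {
        match parse_config(raw) {
            Ok(config) => self.active = config,
            Err(err) => {
                // Log and keep the previous configuration; traffic keeps flowing.
                eprintln!("rejected feature file ({err:?}); keeping last known good");
            }
        }
    }
}

fn main() {
    let mut module = BotModule { active: FeatureConfig { features: vec![] } };
    let oversized: String = (0..250).map(|i| format!("f{i}\n")).collect();
    module.reload(&oversized);            // rejected, old config kept
    module.reload("ua_score\nja4_hash");  // accepted
    println!("active features: {}", module.active.features.len());
}
```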
Third: Observability in extreme conditions. The fact that debugging systems consumed vast amounts of resources during the outage (trying to report errors) further worsened the situation by increasing latency. Monitoring systems cannot burden the system they are trying to diagnose.
Fourth: Gradual rollout. The database change that caused the problem and the propagation of the faulty file had global consequences. The lack of a gradual rollout mechanism (canary release) for configuration files meant the error hit everyone at once. Cloudflare has already announced that it will introduce testing procedures for configuration files identical to those for code.
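Conceptually, a staged rollout of configuration artifacts is not complicated; the hard part is wiring it to real health signals. A simplified sketch, in which the wave sizes, bake time, and health check are all assumptions:

```rust
use std::{thread, time::Duration};

/// Minimal sketch of a staged (canary) rollout for configuration artifacts.
/// The wave sizes, bake time, and health signal are assumptions for illustration.
struct Machine {
    name: String,
}

fn push_config(machine: &Machine, config: &str) {
    // Stand-in for the real distribution mechanism.
    println!("pushed {} bytes to {}", config.len(), machine.name);
}

fn wave_healthy(_wave: &[Machine]) -> bool {
    // Stand-in health signal: in reality this would watch error rates,
    // panic counters and restart loops on the machines that just updated.
    true
}

/// Roll the configuration out in growing waves; halt at the first unhealthy one.
fn staged_rollout(fleet: &[Machine], config: &str) -> Result<(), String> {
    let mut done = 0;
    for percent in [1usize, 5, 25, 100] {
        let target = (fleet.len() * percent + 99) / 100; // ceiling division
        let wave = &fleet[done..target];
        for machine in wave {
            push_config(machine, config);
        }
        thread::sleep(Duration::from_millis(10)); // bake time before judging the wave
        if !wave_healthy(wave) {
            return Err(format!("rollout halted at {percent}% of the fleet"));
        }
        done = target;
    }
    Ok(())
}

fn main() {
    let fleet: Vec<Machine> = (0..20)
        .map(|i| Machine { name: format!("proxy-{i}") })
        .collect();
    match staged_rollout(&fleet, "feature-file-v2") {
        Ok(()) => println!("rollout complete"),
        Err(reason) => eprintln!("{reason}"),
    }
}
```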
Summary
The autumn of 2025 will go down in the history of the internet as a dark chapter. After the AWS outage in October, Cloudflare now joins the infamous "total paralysis" club. It shows how heavily our digital world rests on just a few pillars. When one of them wobbles, whether from a malicious attack or from one redundant row in a database, the entire global economy shudders.
What happened on November 18 was not the result of hackers. It was a human error, a process error, and a code error. And while it sounds banal, at the scale Cloudflare operates, clichés turn into catastrophes.
Will it happen again? Certainly. Not necessarily at Cloudflare, not necessarily in the same way. But systems of such complexity are inherently prone to errors. Our task—as security specialists and administrators—is to build resilient systems that can survive the fall of a giant. Or at least have a Plan B ready when we suddenly see a 500 error on half the sites we visit.
You can read more about the context of this "Black Autumn" and other outages in our summary article: Déjà vu: Is the Internet Going Through a "Dark Age"?. It's also worth reading our analysis of cloud risks in strategic sectors.
Stay safe and... check your error handling systems!
Source: Official Cloudflare Blog.
Aleksander
About the Author

Chief Technology Officer at SecurHub.pl
PhD candidate in cognitive neuroscience. Psychologist and IT expert specializing in cybersecurity.