
10 Critical Lessons from Cloudflare's ClickHouse Billing Bottleneck

Cloudflare's ClickHouse billing pipeline slowed after a migration. The cause was a hidden bottleneck in ClickHouse's merge sort, which was fixed with three patches.

Published 2026-05-17 19:46:16 • Paintou Staff

When your billing pipeline handles hundreds of millions of dollars in revenue, even a slight slowdown can trigger chaos. At Cloudflare, we rely on ClickHouse, an open-source OLAP database, to process millions of queries daily so that invoices go out on time. But after a routine migration, our daily aggregation jobs slowed dramatically. Every usual suspect (I/O, memory, rows scanned) looked fine. The real culprit was a hidden bottleneck buried deep inside ClickHouse’s internals. This article breaks down the ten key lessons we learned, from the architecture of our Ready-Analytics system to the three patches that saved the pipeline, so you can avoid similar surprises.

1. The High-Stakes Billing Pipeline

Cloudflare processes millions of ClickHouse queries each day to calculate usage-based bills. If the daily aggregation jobs don't finish on time, invoices cannot go out on schedule. The pipeline powers not only revenue but also fraud detection and other critical systems. So when it slowed down after a migration, it wasn't just a performance issue; it was a business emergency.

10 Critical Lessons from Cloudflare's ClickHouse Billing Bottleneck — Source: blog.cloudflare.com

2. The Scale: Petabytes and Millions of Rows per Second

We store over 100 petabytes of data across several dozen clusters. The Ready-Analytics table alone had grown to over 2 PiB by December 2024, ingesting millions of rows per second. This scale means even micro-inefficiencies can snowball into major delays—exactly what happened during the bottleneck.

3. Ready-Analytics: Simplicity at Scale

In early 2022, we built Ready-Analytics to simplify onboarding. Instead of designing custom tables, internal teams stream data into a single massive table with a standard schema (20 float fields, 20 string fields, a timestamp, and an indexID). This reduced complexity but also created hidden dependencies that would later become problematic.
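To make the standard schema concrete, here is a minimal Python sketch of one row. It is an illustration only, not the actual table definition: the class name, the field names, the `namespace` field, and the choice of a string type for indexID are all assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

NUM_FLOATS = 20   # from the article: 20 float fields
NUM_STRINGS = 20  # from the article: 20 string fields


@dataclass
class ReadyAnalyticsRow:
    """Hypothetical in-memory model of one row in the shared table."""
    namespace: str       # assumed: identifies the team or dataset that owns the row
    index_id: str        # assumed type; the article only names the field
    timestamp: datetime
    floats: list = field(default_factory=lambda: [0.0] * NUM_FLOATS)
    strings: list = field(default_factory=lambda: [""] * NUM_STRINGS)

    def __post_init__(self):
        # Every row has the same fixed shape, which is what removes the
        # need for a custom table per team.
        if len(self.floats) != NUM_FLOATS:
            raise ValueError(f"expected {NUM_FLOATS} float fields")
        if len(self.strings) != NUM_STRINGS:
            raise ValueError(f"expected {NUM_STRINGS} string fields")


row = ReadyAnalyticsRow(
    namespace="billing",
    index_id="account-42",
    timestamp=datetime(2024, 12, 1, tzinfo=timezone.utc),
)
```

Each team decides what its float and string slots mean, which keeps onboarding simple but also means every team shares one table's sort order and retention rules.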

4. The Primary Key: Namespace + IndexID + Timestamp

ClickHouse’s performance relies heavily on data sorting. Our primary key uses (namespace, indexID, timestamp) so each namespace’s data is sorted optimally for its queries. While this design boosts typical lookups, it also means that any change to the indexID field—like during a migration—can disrupt the sorting order and slow down merges.
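The effect of a compound sort key can be sketched in plain Python. This is a toy model and not how ClickHouse stores data: the rows, the namespace names, and the `namespace_slice` helper are made up for illustration. The point is that once rows are sorted by (namespace, indexID, timestamp), all rows for one namespace sit in one contiguous range that a binary search can locate.

```python
import bisect

# Hypothetical rows as (namespace, indexID, timestamp) tuples.
rows = [
    ("billing", "idx-2", 1700000300),
    ("dns", "idx-1", 1700000100),
    ("billing", "idx-1", 1700000200),
    ("billing", "idx-1", 1700000100),
]

# Tuples compare field by field, which mirrors a compound sort key.
rows.sort()


def namespace_slice(sorted_rows, namespace):
    """Return the contiguous run of rows belonging to one namespace."""
    # (namespace,) sorts before every longer tuple that starts with namespace.
    lo = bisect.bisect_left(sorted_rows, (namespace,))
    # namespace + "\x00" is the smallest string greater than namespace.
    hi = bisect.bisect_left(sorted_rows, (namespace + "\x00",))
    return sorted_rows[lo:hi]


billing_rows = namespace_slice(rows, "billing")
```

Within the returned slice, rows are further ordered by indexID and then timestamp, so a query that filters on all three fields touches an even narrower range.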

5. The One-Size-Fits-All Retention Policy

Before ClickHouse had native TTL, we built retention by dropping daily partitions older than 31 days. This was fine for many teams, but some needed to retain data for years (legal or contractual obligations) while others needed only days. The fixed window forced those teams back to the more complex conventional setup with custom tables, defeating the purpose of Ready-Analytics.
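The partition-dropping rule is simple enough to sketch. The function below is a hypothetical illustration of the selection logic only; it does not issue any commands to ClickHouse, and the function and parameter names are assumptions.

```python
from datetime import date, timedelta

RETENTION_DAYS = 31  # the flat window described in the article


def partitions_to_drop(partition_days, today, retention_days=RETENTION_DAYS):
    """Return the daily partitions that are older than the retention window."""
    cutoff = today - timedelta(days=retention_days)
    # A partition exactly retention_days old is kept; anything older is dropped.
    return sorted(day for day in partition_days if day < cutoff)


partitions = [date(2024, 11, 29), date(2024, 11, 30), date(2024, 12, 30)]
expired = partitions_to_drop(partitions, today=date(2024, 12, 31))
```

Because the rule works on whole daily partitions, it applies the same window to every namespace in the table, which is exactly the limitation described above.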

6. The Migration That Triggered the Slowdown

We migrated the billing cluster to a newer version of ClickHouse and scaled horizontally to handle growth. Immediately, daily aggregation jobs slowed dramatically. Standard diagnostics showed healthy I/O, memory, and CPU usage—nothing unusual. The slowdown was puzzling and dangerous for downstream billing.

7. Investigation: All Usual Suspects Were Clean

We checked rows scanned, parts read, merge frequencies, and partition sizes. Everything looked normal. The query profiles didn’t highlight any obvious hotspot. This forced us to dig deeper into ClickHouse’s internals—beyond the metrics we normally monitor—to find the real problem.

8. The Hidden Bottleneck Inside ClickHouse Internals

The culprit turned out to be a subtle interaction between the migration’s data layout and ClickHouse’s merge sort algorithm. Newly inserted data had a different indexID ordering due to namespace changes, causing the merge process to produce many temporary files and excessive disk seeks. This wasn’t visible in standard counters but accumulated across millions of rows.
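To build intuition for why data layout matters to a merge, here is a toy k-way merge in Python. It is not ClickHouse's merge implementation, and the "source switches" counter is a made-up proxy rather than a real ClickHouse metric. It only shows that when sorted parts cover overlapping key ranges, the merge has to jump between inputs far more often than when each part covers its own range.

```python
import heapq


def merge_parts(parts):
    """Merge sorted parts; return (merged_rows, source_switches).

    source_switches counts how often consecutive output rows come from
    different input parts (a hypothetical proxy for interleaved reads).
    """
    # Tag every row with the number of the part it came from.
    tagged = [
        [(row, part_no) for row in part]
        for part_no, part in enumerate(parts)
    ]
    merged, switches, last_part = [], 0, None
    for row, part_no in heapq.merge(*tagged):
        if last_part is not None and part_no != last_part:
            switches += 1
        last_part = part_no
        merged.append(row)
    return merged, switches


# Each part covers its own key range: the merge reads them almost sequentially.
clustered = [[("a", 1), ("a", 2)], [("b", 1), ("b", 2)]]
# Every part spans the full key range: the merge alternates between parts.
interleaved = [[("a", 1), ("b", 1)], [("a", 2), ("b", 2)]]

_, clustered_switches = merge_parts(clustered)
_, interleaved_switches = merge_parts(interleaved)
```

Both inputs produce the same merged output, but the interleaved layout needs three switches where the clustered one needs a single switch. At millions of rows per second, that kind of per-row overhead is what standard counters can miss.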

9. Three Patches That Fixed the Pipeline

We wrote three targeted patches to ClickHouse: first, optimizing the merge logic to handle ordering skew; second, improving the way data parts are sorted during ingestion; third, adding a new configuration that allows per-namespace retention without sacrificing performance. These changes brought query times back to normal within days.
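The third patch, per-namespace retention, can be illustrated with a small lookup. This sketch is not the patch itself and does not reflect any ClickHouse setting: the namespace names, the override values, and the `is_expired` helper are hypothetical.

```python
from datetime import date, timedelta

DEFAULT_RETENTION_DAYS = 31  # the original flat window

# Hypothetical overrides: one long-lived namespace, one short-lived one.
RETENTION_OVERRIDES = {
    "legal-archive": 365 * 7,
    "debug-logs": 3,
}


def is_expired(namespace, row_day, today, overrides=RETENTION_OVERRIDES):
    """Return True if a row for this namespace is past its retention window."""
    days = overrides.get(namespace, DEFAULT_RETENTION_DAYS)
    return row_day < today - timedelta(days=days)


today = date(2024, 12, 31)
old_day = date(2024, 10, 1)  # 91 days before today
```

With these values, a row from `old_day` is expired for a namespace on the default window and for "debug-logs", but is still retained for "legal-archive".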

10. Lessons Learned: Design for Retention Flexibility

The bottleneck taught us that even the simplest design choices (like a flat 31-day retention) can become critical constraints at scale. Now we're building per-namespace retention into the core of Ready-Analytics instead of treating it as an afterthought. Monitoring deeper metrics, such as merge latency and disk seek patterns, has become standard practice.

Conclusion: A hidden bottleneck can stop even the most robust pipeline. By understanding the interplay between schema design, retention policies, and ClickHouse internals, and by being willing to patch the database itself, Cloudflare prevented a billing disaster. The three patches we wrote are now part of our production stack, ensuring hundreds of millions of dollars in invoices go out without delay.