2026-05-05

How Cloudflare Built an AI Code Review System That Actually Works at Scale

Cloudflare scaled AI code review by building a CI-native orchestration system in which a coordinator agent deduplicates and prioritizes findings from seven specialized reviewers, reducing cycle time while catching real bugs.

Code review is critical for catching bugs and sharing knowledge, but at Cloudflare’s scale, it became a major bottleneck. Merge requests sat in queues for hours, waiting for human reviewers to context-switch. After testing off-the-shelf AI tools that lacked flexibility, and a naive LLM approach that produced noisy results, Cloudflare engineered a custom solution. This Q&A breaks down the journey, architecture, and results of building a CI-native orchestration system that coordinates specialized AI agents to review code efficiently and accurately.

What specific problem did Cloudflare encounter with traditional code review?

Cloudflare’s engineering teams faced a classic bottleneck: merge requests queuing up for human review. The median wait time for a first review was often measured in hours, not minutes. Each review cycle—context-switching, reading diffs, leaving nitpicks about variable naming, waiting for author responses—slowed down delivery. At Cloudflare’s scale, with thousands of repositories and hundreds of engineers, this inefficiency compounded. The goal was to reduce cycle time without sacrificing code quality, knowledge sharing, or security oversight.

Source: blog.cloudflare.com

Why didn’t existing AI code review tools meet Cloudflare’s needs?

Cloudflare experimented with several third-party AI code review tools. Many performed well and offered reasonable configurability, but none provided the deep flexibility required for an organization of Cloudflare’s size and complexity. The tools couldn’t be tailored to enforce internal engineering standards (like the Engineering Codex) or integrate seamlessly into their CI/CD pipelines. They also lacked the ability to spawn multiple specialized agents per review. In short, off-the-shelf solutions were too rigid for the nuanced demands of a global-scale tech company.

What was the naive LLM approach and why did it fail?

As a pragmatic next step, Cloudflare tried a straightforward method: take a git diff, feed it into a basic LLM prompt, and ask the model to find bugs. The results were overwhelmingly noisy. The model produced vague suggestions, hallucinated syntax errors, and even recommended “consider adding error handling” for functions that already had it. The approach lacked context about the project’s conventions, dependencies, and coding standards. It became clear that a naive, single-prompt review couldn’t handle complex codebases; it was too unreliable to put in a CI/CD critical path.
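As a rough illustration of that first attempt (not Cloudflare’s actual code), a single-prompt diff review looks roughly like the sketch below; the endpoint, model name, and prompt wording are assumptions:

```typescript
// Hypothetical sketch of the naive single-prompt approach described above.
// Endpoint, model name, and prompt wording are illustrative assumptions,
// not Cloudflare's actual code.
import { execSync } from "node:child_process";

async function naiveReview(baseRef: string): Promise<string> {
  // Grab the raw diff with no additional project context.
  const diff = execSync(`git diff ${baseRef}...HEAD`).toString();

  // One generic prompt over the whole diff: no conventions, dependencies,
  // or internal standards are visible to the model.
  const response = await fetch("https://llm.example.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.LLM_API_KEY}`,
    },
    body: JSON.stringify({
      model: "generic-llm",
      messages: [
        { role: "system", content: "You are a code reviewer. Find bugs in this diff." },
        { role: "user", content: diff },
      ],
    }),
  });

  const data = await response.json();
  return data.choices[0].message.content;
}
```

With nothing but the raw diff as input, the model has no way to know which conventions apply or which “missing” error handling already exists elsewhere, which is exactly the noise problem described above.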

How did Cloudflare design its AI code review system?

Instead of building a monolithic code review agent from scratch, Cloudflare created a CI-native orchestration system around OpenCode, an open-source coding agent. The system launches up to seven specialized reviewers per merge request, covering security, performance, code quality, documentation, release management, and compliance with Cloudflare’s internal Engineering Codex. A coordinator agent manages these specialists: it deduplicates findings, judges actual severity, and posts a single structured review comment. This modular, agent-based architecture allows the system to scale across thousands of repositories without overwhelming engineers with noise.
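The coordination flow can be pictured with a short sketch. Everything here is a hypothetical illustration of the fan-out, deduplicate, and post pattern described above; the real system is built on OpenCode and CI plugins, and none of these names come from Cloudflare’s code:

```typescript
// Hypothetical sketch of the coordinator flow: fan out to specialized
// reviewers, deduplicate overlapping findings, rank by severity, and post
// one structured comment. Names and types are illustrative, not Cloudflare's.
type Severity = "blocker" | "warning" | "nit";

interface Finding {
  agent: string;      // e.g. "security", "performance"
  file: string;
  line: number;
  severity: Severity;
  message: string;
}

type Reviewer = (diff: string) => Promise<Finding[]>;

const rank = (s: Severity): number => ({ nit: 0, warning: 1, blocker: 2 }[s]);

async function coordinateReview(
  diff: string,
  reviewers: Reviewer[],
  postComment: (body: string) => Promise<void>,
): Promise<boolean> {
  // Run every specialized reviewer against the same diff in parallel.
  const findings = (await Promise.all(reviewers.map((r) => r(diff)))).flat();

  // Deduplicate findings that target the same file and line, keeping the
  // most severe version so specialists don't repeat each other.
  const unique = new Map<string, Finding>();
  for (const f of findings) {
    const key = `${f.file}:${f.line}`;
    const existing = unique.get(key);
    if (!existing || rank(f.severity) > rank(existing.severity)) unique.set(key, f);
  }

  // Order by severity and post a single structured review comment.
  const ordered = [...unique.values()].sort((a, b) => rank(b.severity) - rank(a.severity));
  await postComment(
    ordered.map((f) => `[${f.severity}] ${f.agent}: ${f.file}:${f.line}: ${f.message}`).join("\n"),
  );

  // Signal the CI job to block the merge only when a genuine blocker exists.
  return ordered.some((f) => f.severity === "blocker");
}
```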


What are the specific roles of the seven specialized agents?

Each agent focuses on a distinct aspect of code quality:

- Security flags vulnerabilities like injection flaws or weak cryptography.
- Performance spots inefficient algorithms or resource leaks.
- Code quality checks for maintainability, readability, and adherence to best practices.
- Documentation ensures that changes include clear docs.
- Release management checks for version bumps, changelogs, and migration steps.
- Compliance enforces Cloudflare’s Engineering Codex, including internal policies and coding standards.
- A seventh agent acts as a generalist to catch anything else.

The coordinator then synthesizes their outputs, avoiding duplicate or contradictory advice.
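Purely as an illustration of that division of labor, the roles could be expressed as a small prompt configuration consumed by the orchestrator; the structure and wording below are assumptions, not Cloudflare’s actual prompts:

```typescript
// Illustrative mapping of the seven reviewer roles to their focus areas.
// The prompt text is a paraphrase of the description above; Cloudflare's
// actual prompts and the Engineering Codex contents are internal.
const reviewerRoles: Record<string, string> = {
  security: "Flag vulnerabilities such as injection flaws or weak cryptography.",
  performance: "Spot inefficient algorithms and resource leaks.",
  codeQuality: "Check maintainability, readability, and adherence to best practices.",
  documentation: "Ensure the change ships with clear documentation.",
  releaseManagement: "Check version bumps, changelogs, and migration steps.",
  compliance: "Enforce internal policies and standards from the Engineering Codex.",
  generalist: "Catch anything the specialized reviewers miss.",
};
```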

What results has Cloudflare seen after deploying this system?

Cloudflare has run this system internally across tens of thousands of merge requests. It has proven effective at approving clean code with minimal false positives, while flagging real bugs with impressive accuracy. It actively blocks merges when it detects genuine serious problems, such as security vulnerabilities or compliance violations. Engineers report faster review cycles and fewer interruptions. The system is a key component of Cloudflare’s broader “Code Orange: Fail Small” initiative, which aims to improve engineering resiliency by catching issues early and automatically.

What is the underlying architecture of the system?

The architecture is designed for extensibility, described internally as “plugins all the way to the moon.” It is built around OpenCode, an open-source coding agent, and integrates natively with Cloudflare’s CI/CD pipeline. The system abstracts version control (GitHub, GitLab) and repository-specific configurations via plugins. Each specialized reviewer is a separate plugin that receives the diff and relevant context. The coordinator agent is also a plugin that aggregates results. This modular design allows teams to add or remove reviewers, tweak prompts, or adjust severity thresholds without rewriting core logic. The entire orchestration runs inside the CI environment, triggering automatically on each merge request.
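A rough sketch of what such a plugin surface could look like follows; the interface names and shapes are assumptions made for illustration, not OpenCode’s or Cloudflare’s actual APIs:

```typescript
// Hypothetical plugin surfaces for the extensible architecture described
// above. These interfaces are illustrative assumptions, not the real
// OpenCode or Cloudflare APIs.
type Severity = "blocker" | "warning" | "nit";

interface Finding {
  agent: string;
  file: string;
  line: number;
  severity: Severity;
  message: string;
}

interface VcsPlugin {
  // Abstracts GitHub/GitLab specifics behind one interface.
  fetchDiff(mergeRequestId: string): Promise<string>;
  postReview(mergeRequestId: string, body: string): Promise<void>;
  blockMerge(mergeRequestId: string, reason: string): Promise<void>;
}

interface ReviewerPlugin {
  name: string;   // e.g. "security", "compliance"
  review(diff: string, repoConfig: Record<string, unknown>): Promise<Finding[]>;
}

interface CoordinatorPlugin {
  // Aggregates specialist output into one structured review.
  synthesize(findings: Finding[]): { comment: string; shouldBlock: boolean };
}

// Hypothetical CI entry point, triggered automatically on each merge request.
async function runOnMergeRequest(
  mrId: string,
  vcs: VcsPlugin,
  reviewers: ReviewerPlugin[],
  coordinator: CoordinatorPlugin,
): Promise<void> {
  const diff = await vcs.fetchDiff(mrId);
  const findings = (await Promise.all(reviewers.map((r) => r.review(diff, {})))).flat();
  const { comment, shouldBlock } = coordinator.synthesize(findings);
  await vcs.postReview(mrId, comment);
  if (shouldBlock) {
    await vcs.blockMerge(mrId, "Serious issue detected by the AI review");
  }
}
```

Keeping version control, reviewers, and the coordinator behind separate plugin boundaries is what lets teams add reviewers, tweak prompts, or adjust severity thresholds without touching the core orchestration logic.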