What we learned running the industry’s first AI code review benchmark
When AI is both the thing you measure and the thing that measures
Earlier this year, I used Cursor to build a benchmark suite that tests every major AI code review tool. What started as a controlled experiment turned into a case study in how AI can become part of the infrastructure of engineering itself. We had two primary motives going into the project:
First, we wanted to see how LinearB stacked up against other tools in the wild. We wanted the comparison to be holistic: reputation, measurable outcomes, accuracy, noise control, and how each tool handled change over time.
Second, we needed a way to test our own review engine as it evolved. Like any AI system, it drifts: models update, prompts and rules change. We wanted an evaluation harness that could run the same pull requests over and over (just like regression tests for code) to make sure our updates made the reviewer smarter, not noisier.
In other words, we didn’t set out to build a marketing campaign, but rather an instrument. Something we could tune and play to produce many different renditions of a concept.
This is a dev log about that build: what the harness looked like, the unexpected lessons that forced a redesign, and how I’d re-architect it today as a set of agentic pieces wired together by a declarative manifest. I’ll also take a brief detour into something that turned out to be the weirdest part of this work: the psychology of getting an LLM to write plausible bugs and then behave like it didn’t.
Engineering the right context for the AI
The repo started as pure orchestration: scripts, docs, and a set of camouflaged scenarios. The main branch kept only the framework; every test lived in a branch deployed by scripts. The control surface was simple: deploy a base project, overlay a scenario, push a branch, and open a PR for the reviewer to act on. The sequence, sketched in code after the list below, was intentionally short:
pick a base project
copy one of the available scenarios into a new branch
push branch and open a PR
trigger the reviewer and collect comments
apply a follow-up commit to simulate a fix and observe whether the reviewer updates or withdraws comments
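To make that loop concrete, here is a rough Go rendering of it. The real harness was a handful of bash scripts; the paths, branch name, and commit message below are hypothetical, and it assumes git and the GitHub CLI (gh) are installed and authenticated.

```go
// A sketch of the deploy loop, not the original bash scripts.
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// run executes a command in dir and fails loudly, since every step
// depends on the previous one succeeding.
func run(dir string, name string, args ...string) {
	cmd := exec.Command(name, args...)
	cmd.Dir = dir
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		fmt.Fprintf(os.Stderr, "%s %v failed: %v\n", name, args, err)
		os.Exit(1)
	}
}

func main() {
	repoDir := "projects/golang-weave"        // base project (hypothetical path)
	scenario := "scenarios/buffered-channels" // camouflaged scenario (hypothetical path)
	branch := "feature/buffered-channel-support"

	// 1. Branch off the base project.
	run(repoDir, "git", "checkout", "-b", branch)
	// 2. Overlay the scenario files onto the working tree.
	run(".", "cp", "-R", scenario+"/.", repoDir)
	// 3. Commit and push the "feature".
	run(repoDir, "git", "add", "-A")
	run(repoDir, "git", "commit", "-m", "Add buffered channel support")
	run(repoDir, "git", "push", "-u", "origin", branch)
	// 4. Open the PR the reviewer under test will act on.
	run(repoDir, "gh", "pr", "create",
		"--title", "Add buffered channel support",
		"--body", "Improve packet throughput with buffered channels.")
	// Collecting comments and applying the follow-up commit happen in later steps.
}
```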
We wrote scenarios as believable features like “add buffered channel support” instead of “insert bug” because we wanted autonomous reviewers to behave as if they were in a real codebase. We also collected language-specific pain points; for Go, these included zero values and nil semantics, method receiver misuse, channel buffering, goroutine lifecycles, context propagation, nondeterministic map iteration order, and reflection in hot paths. These are the kinds of issues that separate pattern-matching bots from systems that understand runtime semantics.
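To show what I mean by runtime semantics, here is a classic Go trap of the kind we leaned on (illustrative, not lifted from the benchmark): an error interface holding a typed nil pointer compares as non-nil, so the bug only reveals itself through Go's interface semantics, not through any suspicious-looking line.

```go
package main

import "fmt"

type validationError struct{ field string }

func (e *validationError) Error() string { return "invalid field: " + e.field }

// validate returns a *validationError, which is nil on success...
func validate(ok bool) *validationError {
	if ok {
		return nil
	}
	return &validationError{field: "name"}
}

func main() {
	// ...but assigning that nil pointer to an error interface produces a
	// non-nil interface value: it has a concrete type (*validationError)
	// wrapping a nil pointer.
	var err error = validate(true)
	if err != nil {
		fmt.Println("bug: reported an error on valid input") // this branch runs
	} else {
		fmt.Println("ok")
	}
}
```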
Here’s an example of one such scenario file for the buffered channel support bug in Golang:
PROJECT: golang-weave
VERSION
2025-07-24: AI Assistant, Initial scenario creation
DESCRIPTION
Add buffered channel support for better packet throughput by implementing buffered channels for packet aggregation and control message communication between goroutines. This enhancement will improve performance by reducing blocking and improving concurrency in packet processing by allowing multiple packets to be queued before processing.
EXPECTED REVIEW
A thorough code review should catch channel buffering issues that can cause deadlocks and performance problems. The reviewer should identify that:
1. The code uses unbuffered channels (`chan<- aggregatorFrame`) which can cause deadlocks when senders and receivers are not perfectly synchronized
2. The code doesn’t consider appropriate buffer sizes for the expected message volume
3. Unbuffered channels force synchronous communication, which can bottleneck high-throughput packet processing
4. Missing buffer size considerations for different types of messages (control vs data)
The review should suggest:
- Use buffered channels when senders and receivers operate at different speeds: `make(chan aggregatorFrame, bufferSize)`
- Choose appropriate buffer sizes based on expected message volume and processing speed
- Consider using different buffer sizes for different message types
- Ensure buffer sizes are not too large to avoid memory issues or too small to avoid blocking
- Add monitoring or metrics to track channel usage and identify bottlenecks
A good review comment would explain that unbuffered channels require both sender and receiver to be ready simultaneously, which can cause deadlocks in complex concurrent systems.
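For readers who don't live in Go, here is a stripped-down illustration of the failure mode this scenario fishes for (not the actual golang-weave code): an unbuffered send blocks until a receiver is ready, so queuing frames before the consumer starts deadlocks, while a buffered channel absorbs the burst.

```go
package main

import "fmt"

type aggregatorFrame struct{ payload []byte }

func main() {
	// Unbuffered: make(chan aggregatorFrame). A send blocks until a receiver
	// is ready, so queuing frames before starting the consumer deadlocks:
	//
	//   frames := make(chan aggregatorFrame)
	//   frames <- aggregatorFrame{} // fatal error: all goroutines are asleep - deadlock!
	//
	// Buffered: the sender can queue up to bufferSize frames without a
	// receiver, which is what the scenario's "enhancement" should have done.
	const bufferSize = 4
	frames := make(chan aggregatorFrame, bufferSize)

	for i := 0; i < bufferSize; i++ {
		frames <- aggregatorFrame{payload: []byte{byte(i)}} // does not block
	}
	close(frames)

	for f := range frames {
		fmt.Println("processed frame of", len(f.payload), "byte(s)")
	}
}
```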
Tricking Claude into writing believable bugs without leaving clues
There’s a psychological angle here that’s worth noting. Getting a model to “cooperate” in creating a bug requires you to think like a social engineer of prompts: you frame intent, disguise motive, and tune for believability. The better the model gets at understanding context, the more subtle the prompts must become. It’s a cat-and-mouse game that’s as much about designing evaluation scenarios as it is about adversarial prompt craft.
We didn’t want synthetic, obviously fake bugs. We wanted mistakes that felt human: subtle misuses of channel buffering, a late-allocated slice that blows up under load, an interface nil check that only fails with certain call patterns. The goal was to generate defects that would test semantic reasoning rather than pattern spotting.
There are three constraints that make this tricky.
First, you need plausibility. Ask Claude to “write a bug” and it will often produce toy examples or disclaimers. So we wrap the request: ask Claude to propose a legitimate-looking feature change that, when applied, introduces a subtle regression. Give the scaffolding: file context, function names, and surrounding comments pulled from the actual repo. The model’s job is to generate a commit diff that looks like feature work but introduces a corner-case error.
Second, you need stealth. If the prompts sound like “introduce a bug,” some reviewers or tools will pick up on that meta-clue. So the manifest instructs scenarios to be camouflaged as enhancements, because in real development feature work is the vector for regressions. This camouflage also avoids spamming reviewers with obvious markers. Your prompt might ask for “a refactor to preallocate buffers for throughput” and the model returns a change that, under some timing, causes a deadlock.
Third, you need iteration. Early experiments showed Claude would sometimes include obvious hints or leave comments that were out-of-band. So the process became a little game: refine the prompt, ask for multiple candidate commits, pick the one that looks most human, and sometimes hand-edit to remove telltale signs. Over time Claude got better at producing low-hint commits; it also got harder to trick as its safety layers and instruction-following improved. That evolution became part of the artifact.
“It’s alive! It’s alive!” and how I created Frankenstein’s harness
The experiment worked. Our contraption was capable. We could run a scenario and produce CSVs of findings. But as we added more scenarios, the scripts and branches multiplied and the orchestration debt grew. Each new test required hand-editing names, branch cleanup, and slight variations in how we staged commits. More importantly, every step in the pipeline knew what to do but didn’t retain why it had been done. Context was an external bucket we filled over and over.
That pattern exposed two practical problems. The first was scale: procedural orchestration couldn’t keep up with the ambition of the benchmark. The second was conceptual: the thing we were building wasn’t only measuring reviewers; it was a small closed loop that could itself be automated. We already had the parts of an agentic pipeline: code to create faults, code to request analysis, code to collect signals, and code to score results as pass/fail. We were manually wiring a system that could run itself, and we wanted to run it many, many more times, so we set out to automate it next.
Finding the most important product metric via benchmarking
The obvious metrics (i.e., precision and recall) only told part of the story. Three dimensions repeatedly mattered in practice:
Statefulness of the review. Did the reviewer update or withdraw its advice after the follow-up commit? This was the most actionable measure for engineering teams. A reviewer that remembers context is one you can trust to reduce noise over time.
Noise ratio and comment entropy. How many total comments were produced versus unique, actionable findings? Tools can flood PRs with low-value signals; the benchmark needed to count that as a loss.
Time to useful signal. How long between PR open and first correct, non-trivial suggestion? Faster is better, but not at the cost of precision.
Ultimately, this gave us a signal-to-noise ratio as the North Star.
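As a sketch of how those dimensions could fold into a single score, here is a toy version using the same weights as the example manifest later in this post; the struct and formula are illustrative, not our production scorer.

```go
package main

import "fmt"

// RunMetrics is a sketch of the per-run numbers the harness collects;
// field names are illustrative, not the production schema.
type RunMetrics struct {
	Precision    float64 // correct findings / all findings
	Recall       float64 // correct findings / planted defects
	Statefulness float64 // fraction of comments updated or withdrawn after the fix commit
	Noise        float64 // low-value comments / total comments (lower is better)
}

// Score folds the four dimensions into one number using the weights from
// the example manifest below; noise is inverted so that less noise scores higher.
func Score(m RunMetrics) float64 {
	return 0.35*m.Precision +
		0.35*m.Recall +
		0.20*m.Statefulness +
		0.10*(1-m.Noise)
}

func main() {
	m := RunMetrics{Precision: 0.8, Recall: 0.6, Statefulness: 0.9, Noise: 0.25}
	fmt.Printf("composite score: %.3f\n", Score(m)) // 0.28 + 0.21 + 0.18 + 0.075 = 0.745
}
```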
Our scripts produced CSVs and JSONs. That was fine for a first pass, but the setup was ad hoc: each run depended on environment variables, branch names, and tool versions living outside the repo. It meant we could collect results, but we couldn’t reliably reproduce them. Small changes in configuration made historical comparisons noisy. The hard lesson was that if we wanted reproducible results, we needed a single, versioned declaration of each run, like a manifest that captured every input and weight so the same benchmark could be replayed, diffed, and trusted.
How I would build this now with unit tests
Building and running these evaluation manifests is parallel to writing unit tests. As Hamel Husain puts it, “Unit tests for LLMs are assertions (like you would write in pytest).” You compose small, scoped checks, automate them, and run them on every change. And, as he notes, “These unit tests are crucial to getting feedback quickly” when you’re iterating on an AI system.
Imagine writing a unit test for an AI code review. Since our tool is bespoke, the unit test could be a YAML manifest that lists repositories, scenarios, tools with pinned versions, the scoring rubric, and whether a scenario requires a follow-up commit. The manifest becomes the single source of truth for any run.
In practice, a minimal manifest might look like this:
```yaml
benchmark:
  id: ai-code-review-golang-2025.10
  repos:
    - id: hello-world
      seed: projects/hello-world
  scenarios:
    - id: interface-nil-check
      path: scenarios/interface-nil-check
      follow_up_commit: true
  scoring:
    weights: { precision: 0.35, recall: 0.35, statefulness: 0.20, noise: 0.10 }
  outputs: [csv, sarif, manifest]
```
Conceptually, the manifest does three things:
Defines inputs for testing: repo seeds, scenario IDs, tool versions, the rubric weights
Applies capabilities to test: `seed_scenario`, `open_pr`, `trigger_review`, `collect_comments`, `score_findings`, `emit_results`
Links inputs and capabilities to outputs: SARIF, CSV, plus a run manifest that contains a spec hash and tool locks
The practical result is reproducibility and auditability. One manifest, one run hash: identical inputs produce identical outputs (modulo stochastic model variance, which we control by pinning temperature and model ID). You can diff manifests to see how a change in tool version or scenario materially affects behavior. You can also gate model changes with this harness in your CI pipeline.
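Here is one way the run hash could be derived, as a sketch rather than our actual implementation: hash the manifest that drove the run, so any change to repos, scenarios, pinned tool versions, or weights yields a new run identity. The manifest filename is hypothetical.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"os"
)

// runHash hashes the exact manifest bytes that drove a run. A real
// implementation would canonicalize the YAML first so that
// whitespace-only edits don't change the hash.
func runHash(manifestPath string) (string, error) {
	data, err := os.ReadFile(manifestPath)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:]), nil
}

func main() {
	h, err := runHash("benchmark.yaml") // hypothetical manifest filename
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("run hash:", h)
}
```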
Teaching the system how to judge itself
Once you treat the benchmark as a system of capabilities, the re-architecture reads like a small distributed system, or a collection of specialized agents:
Orchestrator: reads manifest, computes a directed acyclic graph (DAG), and invokes capabilities. Initially the executor can shell out to the same bash scripts you have now; later it can call dedicated MCP endpoints or local capability runners.
Capability layer: a small set of actions; each capability is a black box with typed inputs/outputs that logs every call and response.
Scheduler: a spec-based planning agent can ingest the manifest and produce an execution plan that is auditable.
Evaluator (LLM-as-a-judge): This is the part of the system that interprets feedback, not just tallies it. The evaluator ingests every review comment and grades it against a rubric: true positive, false positive, duplicate, or obsolete. Deterministic rules can handle the obvious patterns (“did this query return 1 expected result?”) but gray areas require judgment (“did this answer all of the user’s questions?”). Here, the LLM acts as an internal reviewer of reviewers, deciding when a comment adds value or just adds noise. The output is structured data emitted as SARIF and CSV for later analysis.
Observability: tagged artifacts (manifest hash, tool versions), raw comment stores, and run-level metrics for trend graphs.
The key point is modularity. You can replace the evaluator or add a new language track, and the manifest remains the contract.
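To make the contract idea concrete, here is a sketch of what the capability layer could look like in Go; the interface, orchestrator, and names are illustrative, not a published API.

```go
package harness

import (
	"context"
	"fmt"
)

// Capability is a sketch of the contract each black-box step could satisfy;
// the method set and names are illustrative.
type Capability interface {
	Name() string // e.g. "seed_scenario", "open_pr", "collect_comments"
	// Run consumes the accumulated run state and returns new outputs for
	// downstream capabilities; every call and response also goes to the
	// run's audit log (omitted here).
	Run(ctx context.Context, in map[string]any) (map[string]any, error)
}

// Orchestrator invokes capabilities in the order the planner derived from
// the manifest. Swapping the evaluator or adding a new language track means
// registering a different Capability; the manifest stays the contract.
type Orchestrator struct {
	capabilities map[string]Capability
}

func (o *Orchestrator) Execute(ctx context.Context, plan []string, state map[string]any) error {
	for _, name := range plan {
		c, ok := o.capabilities[name]
		if !ok {
			return fmt.Errorf("manifest references unknown capability %q", name)
		}
		out, err := c.Run(ctx, state)
		if err != nil {
			return fmt.Errorf("capability %q failed: %w", name, err)
		}
		for k, v := range out {
			state[k] = v // outputs become inputs for downstream capabilities
		}
	}
	return nil
}
```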
Some closing notes from the devlog margins
I built this to answer a pragmatic question: which AI code review tools actually help engineers ship? This piece was the story behind the benchmark: how it was built, what we learned, and why reproducibility is the next frontier for AI engineering. But are you curious about the benchmark results? The full rundown (i.e., how each tool performed, what we tested, and where LinearB stood out) is covered in a companion article about the best AI code review tool.
Benchmarks like this start as measurement exercises and end as design patterns. The more I worked on it, the clearer it became that we’re not just evaluating AI systems, we’re learning how to build with them. When a model can create its own tests, review its own code, and score its own reasoning, the boundary between automation and authorship starts to blur. What we’re really building here is a new kind of development environment, one where reproducibility, reasoning, and creation all run on the same loop.
So where do we go from here? You made it this far but the frontier is still unexplored, my friend. On-demand benchmarks are just one example of using a coding agent more like an everything agent. What other new systems can we now build? One of the most interesting things about this work is that AI stops being merely the subject of the test. It becomes the substrate that makes the test possible at speed and scale.
Are you building agentic systems to produce and evaluate real-world tests? We’d love to hear about them in the comments.






