Whether tokenmaxxing or tokenminimizing, you’re measuring the wrong thing
Cameron Etezadi (LaunchDarkly), James Everingham (Guild), and Alex Salazar (Arcade) on the metric that replaces the leaderboard
For 30 days, the busiest leaderboard at Meta ranked its engineers by the tokens they burned. What shipped, and whether it held up, never made the board. An internal tool nicknamed Claudeonomics logged 60.2 trillion AI tokens in a single month, a volume worth roughly $900 million at list prices, though a buyer Meta’s size pays a steep discount on that figure. The count kept climbing toward 73.7 trillion before the screenshots leaked, the backlash landed, and Meta retired the board.
It is the cleanest picture we have of a habit that spread through engineering organizations over the past year. Disney built its own version, an internal adoption dashboard that surfaced the heaviest Cursor and Claude users across roughly 4,800 product and engineering staff, where one person logged about 460,000 invocations in nine days. Call it tokenmaxxing, the assumption that AI consumption is the same thing as AI progress, that a team burning more tokens is a team getting more done. Real velocity gains from AI exist. They are not found on that leaderboard, and the leaders capturing them have stopped counting the things it counted. Their executives are now asking what the investment produced, and whether it held up once it shipped. No token leaderboard was built to answer that.
The AI questions shifted to impact
We asked five engineering leaders what their executives are pressing them on right now. Their answers rhymed.
Cameron Etezadi, CTO of LaunchDarkly, put the shift in two sentences. “Six months ago, the question was ‘how do we adopt AI faster?’” he told us. “Now it’s ‘how do we know if it’s actually working?’” Boards want AI investment to show up as shipping velocity, incident reduction, and cost efficiency, he said, not adoption counts. “That’s a harder question to answer, and honestly, it’s the right one.”
James Everingham, former head of DevInfra at Meta and now CEO of Guild, framed the same move as a measurement gap. “The question I’m asking is: are we measuring work or value?” he said. “An agent doing more work doesn’t necessarily mean the business is getting more value. Most organizations can tell you what they’re spending. Very few can tell you what they’re getting in return. That’s the gap I think the industry needs to close.”
The executive question shifted from activity to outcome when the inference bills hit executives’ inboxes. In Futurum’s 1H 2026 survey of enterprise software buyers, direct financial impact nearly doubled to 21.7 percent as the primary measure of AI return, while self-reported productivity gains fell to 18 percent. The people holding the purse stopped accepting effort as evidence almost immediately.
The scoreboard fails when costs spiral
The metrics most teams reach for first share one flaw: they count effort. Adoption rates measure who logged in. Token consumption measures who burned the most, which rewards waste by design. Seats activated measure procurement. The volume of code generated was a discredited proxy even before AI made code cheap to produce. Each answers how busy the team was, and none answers whether the system produced something an executive can carry into a board meeting. MIT’s research on enterprise AI delivered the bluntest version of the failure data, finding that 95 percent of initiatives showed no measurable return within six months.
What stood out across our conversations is that every leader had already retired the same metrics, independently. Alex Salazar, CEO of Arcade, cut token count first. “High usage can actually be a bad signal,” he said. “If people are spending more time massaging outputs they can’t trust unsupervised, that’s not actual agentic adoption.” Everingham retired seat count and raw velocity. Etezadi retired token spend, dollars per employee, and consumption. “Neither turned out to be the right metric,” he said, “and it’s pretty obvious it never was.”
The finance side reached the same conclusion, which is worth digesting. This month, as GitHub Copilot moved to metered pricing, OnlyCFO, the most-read CFO newsletter on Substack, published “Tokenmaxxing is Dead,” a walk through six ways to cut AI spend, from routing work to cheaper models and caching prompts to capping the heaviest users and teaching prompt hygiene. Every engineering org should work through that list, and doing so lowers the bill.
But look at what the cost playbook can measure and what it can’t. Tokenmaxxing inflated a fake numerator, treating usage as a stand-in for work. The cost crackdown shrinks the denominator, lowering the spend per task. Both optimize one half of a ratio, but neither knows whether the output shipped, held up, or created rework downstream.
The sharpest proof comes from Meta itself. Weeks after the Claudeonomics board came down, Meta told staff it would cap token usage and set budgets, and built an internal dashboard called AI Gateway to track spend in real time with alerts for unusual spikes. Meta’s own CTO, Andrew Bosworth, put the lesson to staff plainly: “All motion is not progress and token usage alone is not a measure of impact of any kind.” The industry already has a name for the move, tokenminimizing, the mirror image of the game it replaced. The companies that ran the tokenmaxxing leaderboards are now running an inverted tokenminimizing one, which still fails to capture outcomes.
Fundamentals like review time, rework, change-failure rate, cycle time, and the stability of what reaches production are the closest proxy for value an engineering organization can put in front of an executive. They don’t claim to be revenue, but they answer the question a spend dashboard cannot: whether the work was worth doing that way.
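For readers who want to see how little machinery those fundamentals need, here is a minimal sketch in Python. It assumes a hypothetical `Change` record with four fields; the field names, the sample data, and the definition of "reworked" are illustrative assumptions, not any vendor's schema or the way the leaders quoted here compute their numbers.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median


@dataclass
class Change:
    """One change that reached production (hypothetical record shape)."""
    opened_at: datetime     # when the pull request was opened
    deployed_at: datetime   # when the change reached production
    caused_failure: bool    # led to a rollback, hotfix, or incident
    reworked: bool          # substantially rewritten shortly after merge (assumed definition)


def change_failure_rate(changes):
    """Share of deployed changes that caused a failure in production."""
    if not changes:
        return 0.0
    return sum(c.caused_failure for c in changes) / len(changes)


def rework_rate(changes):
    """Share of deployed changes that had to be substantially redone."""
    if not changes:
        return 0.0
    return sum(c.reworked for c in changes) / len(changes)


def median_cycle_time_hours(changes):
    """Median hours from pull request opened to running in production."""
    if not changes:
        return 0.0
    hours = [(c.deployed_at - c.opened_at).total_seconds() / 3600 for c in changes]
    return median(hours)


# Illustrative data only.
changes = [
    Change(datetime(2026, 1, 5, 9), datetime(2026, 1, 5, 17), False, False),  # 8 h
    Change(datetime(2026, 1, 6, 9), datetime(2026, 1, 7, 9), True, False),    # 24 h
    Change(datetime(2026, 1, 7, 9), datetime(2026, 1, 9, 9), False, True),    # 48 h
    Change(datetime(2026, 1, 8, 9), datetime(2026, 1, 8, 13), False, False),  # 4 h
]

print(change_failure_rate(changes))      # 0.25
print(rework_rate(changes))              # 0.25
print(median_cycle_time_hours(changes))  # 16.0
```

None of these functions takes a token count as input, which is the point: the numbers describe what happened to the work, not how much was consumed producing it.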
The bottleneck moved downstream to code reviews
AI changed the speed of one stage of the software lifecycle and left the rest where it was. Writing code got far faster. Reviewing it, testing it, securing it, and deploying it have remained closer to the pace they always have.
In LinearB’s 2026 Software Engineering Benchmarks Report, agentic AI pull requests wait 5.25x longer to get picked up for review than unassisted ones (17.6 hours versus 3.4). The leaders living it described the same wall. “The generation side got faster, the judgment side didn’t,” Everingham said. “AI has dramatically increased the rate at which code is generated, but it hasn’t increased the rate at which humans can evaluate it.”
Etezadi located the slowdown precisely: “the queue isn’t in writing anymore; it’s in validating what was written.” At Arcade, Salazar made the bottleneck deliberate: “PR review remains the main bottleneck right now, by design,” he said, with every merge human-approved, because the cost of a bad merge runs higher than the cost of a human looking first.
There is a second front behind the queue, and it’s trust. “It’s never been easier to go from idea to prototype,” Everingham said. “What’s harder is building confidence that the output is actually correct.” Because the same prompt can return different results across runs, teams have to learn where an agent performs reliably and where it does not before they hand it production. Evaluation, he argues, is becoming the discipline that separates a demo from a deployment.
Again, recent benchmark data fills in the cost of that congestion. AI-assisted pull requests run 2.6x larger than unassisted ones, so each one drops more on a reviewer’s desk. AI pull requests are accepted within 30 days at less than half the rate of unassisted ones, so more of what gets generated is reworked or abandoned rather than shipped. And the stakes of waving code through rose, with Veracode finding that 45 percent of AI-generated code introduced an OWASP Top 10 vulnerability. That pattern matches what the 2025 DORA report found at scale, that AI amplifies the system it lands in. “The teams getting real velocity from AI already had deployment discipline before AI showed up,” Etezadi said. “The teams struggling thought AI would solve their release-process problems. It didn’t. It amplified them.”
Solve the constraints first
The leaders seeing genuine gains share something they built before the tool arrived.
Etezadi’s high performers “already had deployment discipline before AI showed up. They had feature flags, progressive rollouts, and fast rollbacks. AI gave them more code to ship, but the infrastructure to ship it safely was already there.” Everingham starts a step upstream, at the constraint. “You need to understand your constraints first,” he said. “The biggest mistake I see is teams assuming AI is the bottleneck when the real one is elsewhere. Reviews take four days. Deployments require multiple approvals. If you don’t know what’s slowing work down today, adding agents just accelerates the chaos.” Salazar’s team is deliberate about where AI belongs. They put engineering time upfront into behavior-driven development docs that agents then build against, and they keep people on the decisions that set direction. “We don’t use AI for ideation,” he said. “Figuring out what to build, and why, is where human judgment and insight compound.”
Then the convergence returns, on the other side of the ledger. The same leaders who independently retired the same metrics independently named the same replacement. Salazar trusts one signal. “Did the agent complete a real action in production that didn’t get rolled back or kicked to a human?” Etezadi watches “behavior in production. Are changes stable? Are rollbacks happening? Are we catching drift before users feel it?” Everingham tracks cycle time, handoff delays, rework rates, and how quickly teams resolve issues. Normalize the phrasing and the three answers name a single metric in three voices: work that reached production and stayed there.
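A rough way to make that single metric concrete is to count only the work that held up, then divide spend by that count. The Python sketch below does so under loud assumptions: the `AgentTask` record, its field names, and the two sample teams are hypothetical, and "held up" is coded as one reading of Salazar's signal (reached production, not rolled back, not kicked to a human).

```python
from dataclasses import dataclass


@dataclass
class AgentTask:
    """One unit of AI-assisted work (hypothetical record shape)."""
    tokens: int
    cost_usd: float
    reached_production: bool
    rolled_back: bool
    escalated_to_human: bool


def held_up(task):
    """True if the work reached production and stayed there unaided."""
    return (task.reached_production
            and not task.rolled_back
            and not task.escalated_to_human)


def durable_rate(tasks):
    """Share of tasks that reached production and stayed there."""
    if not tasks:
        return 0.0
    return sum(held_up(t) for t in tasks) / len(tasks)


def cost_per_durable_change(tasks):
    """Total spend divided by the number of tasks that held up.

    Returns None when nothing held up, since the ratio is undefined.
    """
    durable = sum(held_up(t) for t in tasks)
    if durable == 0:
        return None
    return sum(t.cost_usd for t in tasks) / durable


# Illustrative data only: team A tops a token leaderboard, team B would not.
team_a = [
    AgentTask(900_000, 40.0, True, True, False),    # shipped, then rolled back
    AgentTask(800_000, 40.0, True, False, True),    # shipped, handed to a human
    AgentTask(700_000, 40.0, False, False, False),  # never shipped
    AgentTask(600_000, 40.0, True, False, False),   # shipped and held up
]
team_b = [
    AgentTask(100_000, 10.0, True, False, False),
    AgentTask(120_000, 10.0, True, False, False),
    AgentTask(90_000, 10.0, True, True, False),     # shipped, then rolled back
    AgentTask(110_000, 10.0, True, False, False),
]

for name, tasks in (("A", team_a), ("B", team_b)):
    total_tokens = sum(t.tokens for t in tasks)
    print(name, total_tokens, durable_rate(tasks), cost_per_durable_change(tasks))
```

In this made-up data, team A wins a tokenmaxxing board and loses a tokenminimizing one, and neither board would show that team B's durable rate is three times higher. The outcome field is the only input that separates them.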
Lead the transition, or inherit it
The uncomfortable truth underpinning all of this is that the finance organization is ahead of engineering on spend control. Your CFO has likely already stopped believing in adoption metrics, now that the board is asking what financial impact that adoption produced, a shift OnlyCFO covers in great detail.
If engineering doesn’t bring an outcome model to the table, finance will solve the problem with the tool it can reach most easily, which is usually a cost cap. Meta already imposed one, only weeks after running its leaderboard. A hard limit on spend per developer is fast, defensible, and blunt, but it flattens the wrong thing because it says nothing about whether the work was worth doing.
The alternative is to lead with a framework that measures the system. “Stop measuring how much AI your team is using,” Etezadi said, “and start measuring what’s happening in production.” The engineering leaders who can show what is happening in production have the credibility to ask for the next AI budget. The ones who can’t will have the answer decided for them.
You have to solve your org’s problem before someone else does it for you. Don’t know where to start? Join our workshop next week, where you’ll learn how to map your own delivery data to the outcome questions finance is already asking. You’ll leave with a thesis you can defend in your budget review, and a strategy that scales across your engineers. And your RSVP comes with a companion guide that covers the same ground in writing (mmm, delicious agent food 🍖).





