
AI Spend Management: How to Cut Your Token Bill Without a Cap


For most of 2025, AI spend management was an afterthought. The message from the top was simple: use AI, use a lot of it, and do not slow down to count. A few companies even ran internal leaderboards to see who could push the most through the models. Heavy usage looked like progress, and nobody was watching the meter.

 

Then the bills arrived. Uber put AI coding tools in front of its engineers and ran through its entire annual budget in four months. Finance teams across the industry started opening invoices several times larger than the forecast, with no clear sense of what had driven them.

 

Four months: how long Uber’s full-year AI budget lasted.
$1,500 a month: the per-engineer cap it set in response.

 

The reaction now forming in most companies is the spending cap: a hard dollar limit on what each person, team, or tool can spend on AI in a month. It is simple, it makes the next invoice predictable, and it feels responsible. In our experience building and running these systems, it points at the wrong target.

 

Most of an AI bill is not the work your team is doing. It is the token spend wrapped around the work, the automatic and mostly invisible consumption that every task drags behind it. A cap cannot tell that wrapper apart from the real work, so it cuts by the only thing it can measure, which is volume. Your highest-volume people are often your most productive. We covered why the bill climbs even as per-token prices fall in the first piece of this series. This is what to do once it has climbed, and it begins with seeing where the money actually goes.

 

Where Your Token Bill Actually Goes

A chatbot answers a question and stops. One call, a little input, a short reply.

 

An agent works differently. It reads context, makes a plan, runs a step, checks the result, revises, pulls more context, and loops until the job is done. The model keeps no memory between steps, so every loop resends the whole conversation as new input. A task that runs forty steps pays for the same context forty times.
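The resend effect is easy to put into numbers. The sketch below is a minimal illustration, not a measurement: the function name and the token counts (a 2,000-token starting context, 500 tokens added per step) are assumptions chosen only to show how billed input grows when every step resends the full history.

```python
def cumulative_input_tokens(base_context: int, added_per_step: int, steps: int) -> int:
    """Total input tokens billed when each step resends the whole history."""
    total = 0
    context = base_context
    for _ in range(steps):
        total += context           # the full context is sent as input on this step
        context += added_per_step  # this step's output and tool results join the history
    return total


# Hypothetical numbers, for illustration only.
single_call = cumulative_input_tokens(2_000, 0, 1)      # 2000
flat_agent = cumulative_input_tokens(2_000, 0, 40)      # 80000: same context, paid 40 times
growing_agent = cumulative_input_tokens(2_000, 500, 40)  # 470000: context also grows each step

print(single_call, flat_agent, growing_agent)
```

Even with a context that never grows, forty steps bill forty copies of it. Once each step appends its own output to the history, the total grows much faster than the step count.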

 

The simplest way to see what that does to a bill is to watch one ordinary task run two ways. The task is to pull six facts from an incoming customer email: the sender, their company, the order number, the issue, the sentiment, and the action they want.

Both runs are costed at published Anthropic rates: Opus 4.8 at $5 per million input tokens and $25 per million output tokens, Sonnet 4.6 at $3 and $15.

The same six facts reach the same person either way. One run cost roughly four hundred times the other, and none of that gap was the work. It was the wrapper: the model chosen, the context resent, the reasoning paid for and discarded, the output left to ramble.
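The arithmetic behind a comparison like this one can be sketched in a few lines. The rates are the ones quoted above, on the assumption that each pair is an input rate followed by an output rate. The token counts are invented for illustration and are not the figures behind the two runs described here, and `call_cost` is a hypothetical helper name.

```python
# Dollars per million tokens as (input_rate, output_rate), taken from the rates quoted above.
RATES_PER_MILLION = {
    "Opus 4.8": (5.00, 25.00),
    "Sonnet 4.6": (3.00, 15.00),
}


def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a run, given its total input and output token counts."""
    input_rate, output_rate = RATES_PER_MILLION[model]
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


# Hypothetical token counts: one direct call versus a long agent loop.
lean_run = call_cost("Sonnet 4.6", input_tokens=800, output_tokens=120)
heavy_run = call_cost("Opus 4.8", input_tokens=200_000, output_tokens=10_000)

print(f"lean run:  ${lean_run:.4f}")
print(f"heavy run: ${heavy_run:.4f}")
print(f"ratio:     {heavy_run / lean_run:.0f}x")
```

The per-token price differs by less than a factor of two between the two models. Nearly all of the gap comes from how many tokens each run sends and generates.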

 

That wrapper has a few usual hiding places. Here is each one, and whether a spending cap does anything about it.

A Cap Sorts People by Volume. Your Heaviest Users Are Often Your Best.

A monthly dollar limit ranks everyone by how many tokens they burn and trims from the top. The assumption underneath is that the biggest spenders are the biggest wasters.

Sometimes that holds. Often it is the reverse. The person at the top of the usage report is frequently the one who rebuilt their workflow around agents and now carries the output of more than one person. We wrote about that person in the piece on who leaves after a deployment. A cap tells them that their most productive month registered as a billing problem.

The reflex is industry-wide right now, and the whole conversation has moved from “use everything, move fast” to “make it stop.” The moves all follow the same shape.

Company | The move | What it removes
Uber | Capped each engineer at $1,500 a month per tool, after spending its annual budget in four months | Headroom for the heaviest work, whatever it produces
Microsoft | Winding down most of its internal Claude Code use | The freedom to pick the right tool for the task
GitHub | Shifted its coding assistant to token-based billing | Predictable monthly costs

Every one of those lowers a number. Not one of them touches a row in the table above. The retry loops keep looping, the routine work keeps running on frontier models, the reasoning keeps getting billed and thrown away. You lower the bill, keep everything that built it, and tell your strongest people that depth is a liability.

 

See Where the Money Goes Before You Decide What to Cut

Control is the right goal. What most teams get backward is the order: they set the limit before they can see what the limit will hit.

Keep your stack. Your model contracts, your tools, your team, and your approved workflows do not have to move. The one layer that changes first is visibility, and it is narrower than it sounds. For every meaningful use of AI, you want three things in view: what it cost, which model ran it, and what it produced.
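One way to picture that visibility layer is a single record per AI call that carries those three things, plus a way to total spend along any of them. This is a minimal sketch under assumptions: `UsageRecord`, its field names, the outcome labels, and the sample data are all hypothetical, and a real system would fill these records from provider usage reports rather than by hand.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class UsageRecord:
    task: str        # what the call was for
    model: str       # which model ran it
    cost_usd: float  # what it cost
    outcome: str     # what it produced, e.g. "delivered" or "failed"


def spend_by(records: list[UsageRecord], field: str) -> dict[str, float]:
    """Total cost grouped by one field of the record, e.g. 'model' or 'outcome'."""
    totals: dict[str, float] = defaultdict(float)
    for record in records:
        totals[getattr(record, field)] += record.cost_usd
    return dict(totals)


# Hypothetical sample data, for illustration only.
records = [
    UsageRecord("extract email fields", "frontier", 2.00, "delivered"),
    UsageRecord("extract email fields", "frontier", 1.50, "failed"),
    UsageRecord("refactor billing module", "frontier", 4.00, "delivered"),
    UsageRecord("classify ticket", "small", 0.25, "delivered"),
]

print(spend_by(records, "model"))
print(spend_by(records, "outcome"))
```

Grouping by outcome surfaces spend that returned nothing. Grouping by task and model together would show routine work running on an expensive model.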

Once that exists, the waste names itself, and almost all of it maps straight back to the table:

  • Route the routine work down to a cheaper model, where most of it belonged in the first place.
  • Turn caching on for the instructions you send again and again.
  • Stop the retry loops that bill in full and return nothing.
  • Reserve step-by-step reasoning for the problems that need it, and let the rest answer directly.

Each of those lowers the bill without lowering the output, because each one removes wrapper while leaving the work intact. That is the line a cap cannot draw.
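Two of those levers, routing and a retry budget, can be sketched without tying the example to any provider. Everything named here is an assumption for illustration: the task labels, the model tier names, and the `call` function standing in for a real model request. Caching and reasoning settings are left out because how they are switched on depends on the provider.

```python
from typing import Callable, Optional, Tuple

# Hypothetical labels for work that rarely needs a frontier model.
ROUTINE_TASKS = {"extract_fields", "classify", "summarize"}


def pick_model(task_type: str) -> str:
    """Send routine work to a cheaper tier and everything else to the frontier tier."""
    return "cheap-model" if task_type in ROUTINE_TASKS else "frontier-model"


def run_with_retry_budget(
    call: Callable[[], Optional[str]], max_attempts: int = 3
) -> Tuple[Optional[str], int]:
    """Run `call` until it returns a result or the attempt budget is spent.

    Returns the result (None if every attempt failed) and the attempts used,
    so failed-but-billed work shows up in the usage record instead of looping.
    """
    attempts = 0
    for attempts in range(1, max_attempts + 1):
        result = call()
        if result is not None:
            return result, attempts
    return None, attempts


# Usage with a stand-in call that always fails.
def always_fails() -> Optional[str]:
    return None


print(pick_model("classify"))               # cheap-model
print(pick_model("architecture_review"))    # frontier-model
print(run_with_retry_budget(always_fails))  # (None, 3)
```

The point of returning the attempt count is that a stopped loop becomes a visible line item rather than a silent one.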

 

The Cap Lowers the Number. It Never Tells You Why.

You came into this worried about a number, and the number is real. Most of it is not your team’s work. It is the wrapper around the work: the wrong models, the context resent on every loop, the reasoning paid for and discarded, the jobs that failed and billed anyway. A cap leaves all of it running and trims the people doing the most instead.

See where the money goes first. The cut almost always gets smaller, and a great deal smarter, once you can see what you are cutting.

This is the work we do. We help you see exactly what your AI is costing and where every dollar goes, then bring the bill down without slowing your team. If your company is reaching for a spending cap this quarter, let’s talk before you set it.
