Agent Loop Cost: 11x Your Per-Call Quote, in 40 Lines


Agent loop cost is what you pay per task, not per call, and it runs multiples higher than your per-call quote because every tool-call re-bills the whole system prompt, every tool description, and the growing transcript of prior results. A 40-line offline forecaster reads one JSONL trace and prices the full loop before you ship.

On my own bloated fixture, the forecaster measured an effective cost of $2.26 per task against a $0.20 per-invocation quote — an 11.29x gap — with a cumulative-cost curvature of k = 2.14, which is the literal meaning of “the bill grows quadratically with the number of tool-calls.” That’s my run on my fixture, not a claim about your agent. The whole point of the tool is that it tells you your number, and on a short healthy loop it says the gap is basically zero.

I want to be careful with one number up front. You may have seen the 30x figure going around — Muskan’s June 2026 Dev.to writeup, “Why Claude Agent Loops Cost 30x a Single Inference” (dev.to/muskan_8abedcc7e12). That 30x is their aggregate over 10,000 invocations a day, not a single-task measurement, and it’s not mine. My single-task gap on a deliberately bloated 14-step loop came out to 11.29x. I’m titling this post with the number I actually measured, rounded down. If your loop is worse, the tool will say 30x or 40x — but I won’t put a number in a headline that my own run didn’t produce.

TL;DR

  • A team budgets an agent at the price of one call; the task runs N calls, and each call re-bills the fixed system tax (system prompt + every tool description) plus the growing tail of prior results. The cumulative bill is the area under that curve.
  • loop_forecast.py is a ~40-line offline, keyless, read-only script. Feed it a JSONL trace; it prints effective $/task, the forecast_gap against your per-invocation quote, and a curvature k (cumulative bill ~ calls^k).
  • On my bloated fixture: $2.26/task vs $0.20 quote = 11.29x, k = 2.14, exit 1. On a clean 3-step loop: gap 0.04x, k 1.18, exit 0 — the contrarian claim falsified on its own terms.
  • Exit code is a pre-execution CI gate: 0 if the loop is cheap or linear, 1 if it’s both super-linear (k >= 1.3) and more than 8x over quote, 2 on usage error.
  • This is a forecaster from a trace, not a runtime cap. It does not block your agent. It blocks your build.

Why your per-call quote is the wrong unit

The mistake is unit confusion, and it’s an easy one to make. You open the pricing page, you see $3.00 / 1M input tokens, you estimate a call at ~8k tokens of input (about $0.024 at that rate), pad it generously for output and headroom, and write down $0.20 per agent action. Multiply by your daily call volume and you have a budget. It feels rigorous. It’s wrong by an order of magnitude. Not because the per-call price is wrong; that part is right. The error is that the task is not one call.

Here’s what actually gets billed. An agent loop re-sends its entire working context as input on every single step. The system prompt goes out again. Every tool description in the inventory goes out again. And the transcript of everything the agent has done so far — every prior tool result — goes out again, growing each turn. So step 1 bills a small payload, step 8 bills a large one, and the task cost is the sum of all of them. That sum is the area under a rising curve. The area under a rising line is quadratic.
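To make "the area under a rising line is quadratic" concrete, here is a minimal sketch under a simplifying assumption that is not in the fixtures: every tool result adds the same number of tokens. The names `fixed_tax`, `result_tokens`, and `steps`, and the 2,000 / 5,000 token values, are illustrative only.

```python
def total_billed_tokens(fixed_tax: int, result_tokens: int, steps: int) -> int:
    """Sum the input billed across a loop where step n re-sends the fixed
    tax plus the (n - 1) results produced before it."""
    return sum(fixed_tax + (n - 1) * result_tokens for n in range(1, steps + 1))


def closed_form(fixed_tax: int, result_tokens: int, steps: int) -> int:
    """Arithmetic-series form of the same sum: a linear term for the fixed
    tax and a quadratic term for the growing tail."""
    return steps * fixed_tax + result_tokens * steps * (steps - 1) // 2


if __name__ == "__main__":
    for steps in (1, 2, 4, 8, 16):
        total = total_billed_tokens(2_000, 5_000, steps)
        assert total == closed_form(2_000, 5_000, steps)
        print(f"{steps:>2} steps -> {total:>9,} billed input tokens")
```

With these made-up sizes, going from 8 to 16 steps takes the total from 156,000 to 632,000 tokens: double the calls, roughly four times the bill.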

Muskan’s writeup put a concrete shape on the replay: 8 tools at ~200 tokens each is 1,600 tokens of inventory replayed on every step, and by step 8 the single-step context had ballooned to roughly 83,000 input tokens. Augment Code’s guide on agent loop token cost makes the same structural claim independently: naive loops “rebill prior context on every call,” so input cost rises quadratically and a “20-step loop” can run “more than 10x the naive per-step estimate”. Both of those are their measurements, cited as third-party support, not my numbers. My contribution is a tool that computes your own version of it from a trace you already have.

The contrarian claim, stated so it can be wrong

Here’s the position, sharp enough to argue with: you budget your agent at the price of one call and you pay the area under a curve. And that area is quadratic in the number of tool-calls, because the fixed system tax replays on every step while the result tail keeps growing — and you can compute the whole curve from a trace before you run it in production, without ever touching your provider’s billing API.

That claim has a clean failure mode, and I built it into the tool. If your loop is short, your tool inventory is small, and the model exits early, the per-step curve barely rises, the cumulative cost is near-linear, and the gap to your quote is small. In that case the claim is false for you, and loop_forecast.py returns exit 0 and says so. I ran exactly that case. The clean fixture — 420-token system, three tool descriptions, three steps, small results — came back with gap 0.04x and k 1.18. No breach. If your real traces all look like that, this whole thesis doesn’t apply to you, and I’d rather the tool tell you than have you take my word.

What this is NOT

Three adjacent costs already have their own tools in this series, and this one is none of them.

  • It is not the re-bill tax on a transcript. The context tax that re-bills your transcript every step takes a flat list of conversation turns and measures the compounding of the message history — turns 1..N re-sent as input. That tool’s input is role/content turns; this tool’s input is tool-call records with a manifest, and it models the replay of the loop’s structure (system + tool descriptions as a fixed tax) plus the result tail, and it compares to an external quote. Different input, different metric, different question.
  • It is not the static MCP inventory tax. The token tax of your connected MCP server inventory measures the one-time context cost of having tools connected — the descriptions sitting in your context window before the agent does anything. This tool takes that same inventory and shows what it costs when it’s replayed across every step of a multi-call loop. The inventory tax is the standing charge; this is the metered usage.
  • It is not a spend cap. This is a forecast from a finished trace. It blocks nothing at runtime. If you want execution to actually halt when a dollar ceiling is hit, that’s a sliding-window spend guard at runtime. loop_forecast.py is the pre-execution variant, a gate for AI agents that fails your CI build before the expensive loop ever ships.

The tool: loop_forecast.py

The constraints first, because they’re why you can run this on a production trace without a security review: offline, keyless, read-only, zero network. No vendor SDK, no API key, nothing leaves your machine. It reads one JSONL file and prints. Tokenization is real — tiktoken with the o200k_base encoding — with an honest len/4 fallback if tiktoken isn’t installed; that fallback is roughly ±15% off true BPE, and the output says which one ran.
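A tokenizer shim of the kind described above could look roughly like this. It is a sketch, not the script's actual code: `approx_tokens` and `make_counter` are hypothetical names, and the broad `except` is a deliberate choice so that both a missing package and a failed load of the encoding file (tiktoken may need to fetch that file the first time an encoding is used, unless it is already cached) end up on the fallback path.

```python
def approx_tokens(text: str) -> int:
    """Fallback estimate: roughly four characters per token."""
    return max(1, len(text) // 4) if text else 0


def make_counter():
    """Return (count_fn, label). Prefer tiktoken's o200k_base; else len/4."""
    try:
        import tiktoken  # optional dependency

        enc = tiktoken.get_encoding("o200k_base")

        def count(text: str) -> int:
            # disallowed_special=() treats strings like "<|endoftext|>" in a
            # tool result as ordinary text instead of raising.
            return len(enc.encode(text, disallowed_special=()))

        return count, "tiktoken/o200k_base"
    except Exception:
        # Missing package or unavailable encoding file: use the estimate.
        return approx_tokens, "len/4 fallback"
```

Calling `count, label = make_counter()` once at startup and printing `label` in the header is one way to make the output say which tokenizer ran.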

The input is a JSONL trace. One manifest record, then one record per tool-call:

{"type":"manifest","system_tokens":1200,"tool_descriptions":[210,190,230,200,180,220,205,195,215,185],"quoted_usd_per_invocation":0.20,"input_usd_per_mtok":3.0}
{"type":"call","tool":"read_file","result_tokens":5400}
{"type":"call","tool":"run_tests","result_tokens":8100}

Every field is something you already have or can count: the system-prompt size, the size of each tool description, the per-invocation price you budgeted with, and the input price per million tokens. If a call record has no result_tokens, the tool tokenizes the result text itself.
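Reading that format takes only a few lines. The sketch below is an assumption about how the parsing could work, not the script's own parser: `load_trace` is a hypothetical helper, and `result_text` is a made-up field name for the raw tool output, since the record format above does not name it.

```python
import json


def load_trace(lines):
    """Split JSONL lines into (manifest, per-call result token counts).

    Assumption: a call without `result_tokens` carries its raw output in a
    `result_text` field (hypothetical name), estimated here as len/4.
    """
    manifest, tails = None, []
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        rec = json.loads(raw)
        if rec.get("type") == "manifest":
            manifest = rec
        elif rec.get("type") == "call":
            if "result_tokens" in rec:
                tails.append(int(rec["result_tokens"]))
            else:
                tails.append(len(rec.get("result_text", "")) // 4)
    if manifest is None:
        raise ValueError("trace has no manifest record")
    return manifest, tails


sample = [
    '{"type":"manifest","system_tokens":1200,"tool_descriptions":[210,190,230,200,180,220,205,195,215,185],"quoted_usd_per_invocation":0.20,"input_usd_per_mtok":3.0}',
    '{"type":"call","tool":"read_file","result_tokens":5400}',
    '{"type":"call","tool":"run_tests","result_tokens":8100}',
]
manifest, tails = load_trace(sample)
fixed = manifest["system_tokens"] + sum(manifest["tool_descriptions"])
print(fixed, tails)  # 3230 [5400, 8100]
```

The 3,230 is the fixed tax: 1,200 system tokens plus 2,030 tokens of tool descriptions, billed again on every step.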

The forecast is four deterministic rules, no model in the loop, no randomness:

import math

# sys_t, desc (summed tool-description tokens), tails, price, quote,
# gate_x and max_k come from the manifest and CLI parsing (not shown).
fixed = sys_t + desc                       # system tax replayed every step
billed, cum, tail, run = [], [], 0, 0
for rt in tails:                           # step n bills fixed tax + accumulated prior results
    b = fixed + tail
    billed.append(b)
    run += b
    cum.append(run)
    tail += rt
eff_usd = run / 1e6 * price                # area under the curve = cumulative billed input
gap = eff_usd / quote if quote else float("inf")

# curvature: slope of log(cumulative billed) vs log(step) -> total bill ~ calls^k.
def fit(ys):
    xs = [math.log(i + 1) for i in range(len(ys))]
    yl = [math.log(v) for v in ys]
    mx, my = sum(xs) / len(xs), sum(yl) / len(yl)
    den = sum((x - mx) ** 2 for x in xs)
    return sum((x - mx) * (y - my) for x, y in zip(xs, yl)) / den if den else 0.0

k = fit(cum)
rc = 1 if (gap > gate_x and k >= max_k) else 0   # breach only if over the gap AND super-linear

Rule one: per-step billed input is the fixed tax plus the accumulated tail. Rule two: effective $/task is the sum of all steps, which is the area under the curve. Rule three: the gap is that effective cost divided by the per-invocation quote you budgeted with. Rule four: fit the cumulative bill to calls^k and gate on it.

Two honesty notes. First, this counts input tokens only, so the gap is fair only when the per-invocation quote you compare against is also an input budget; output-token pricing is out of scope (see the limits section). Second, I fit the cumulative cost, not the per-step cost. The per-step curve is roughly linear (k ≈ 1.4 on the spiral); its integral, the running total you actually pay, is the quadratic one (k = 2.14). I had this backwards in my first pass, fitting the per-step curve and reading k ≈ 1.0, which made a clearly quadratic loop look linear. Fitting the cumulative series is the fix.
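The per-step versus cumulative distinction is easy to check on a synthetic loop. This sketch assumes a fixed tax of 3,230 tokens and a constant 9,000-token result on every step (illustrative numbers, not either fixture) and fits both series with the same log-log least-squares slope; `loglog_slope` is a hypothetical name for that fit.

```python
import math


def loglog_slope(ys):
    """Least-squares slope of log(y) against log(step), steps numbered from 1."""
    xs = [math.log(i + 1) for i in range(len(ys))]
    yl = [math.log(v) for v in ys]
    mx, my = sum(xs) / len(xs), sum(yl) / len(yl)
    den = sum((x - mx) ** 2 for x in xs)
    return sum((x - mx) * (y - my) for x, y in zip(xs, yl)) / den if den else 0.0


# Synthetic loop: step n bills the fixed tax plus (n - 1) prior results.
per_step = [3_230 + 9_000 * n for n in range(14)]
cumulative, running = [], 0
for b in per_step:
    running += b
    cumulative.append(running)

k_step = loglog_slope(per_step)
k_cum = loglog_slope(cumulative)
print(f"per-step k = {k_step:.2f}, cumulative k = {k_cum:.2f}")
```

On this synthetic series the per-step exponent lands well below the cumulative one, which sits just above 2. Gating on the per-step fit would understate the shape of the bill you actually pay.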

The full file is 72 lines with the CLI argument handling, the tokenizer shim, and the output formatting. The forecast itself — the part above plus parsing the manifest — is about 40. I’d rather keep the readable version than golf it down to make a headline literal.

What it prints on a healthy loop (exit 0)

Run it on the clean fixture — three tool-calls, a small inventory, an early exit:

$ python3 loop_forecast.py fixtures/loop_clean.jsonl
loop_forecast.py | tokenizer: tiktoken/o200k_base
steps: 3 | system: 420t | tool_descriptions: 360t (replayed every step)
per-step billed input (tokens): [780, 960, 1110]
effective $/task: $0.0086  (area under replay curve @ $3.00/Mtok)
naive per-invocation quote: $0.2000
forecast_gap: 0.04x  (effective / quote)
curvature k: 1.177  (cumulative bill ~ calls^k; 1.0=linear, 2.0=quadratic)
gate: gap>8x AND k>=1.3 -> PASS
exit: 0

The effective cost is below the quote here — a three-step loop with tiny results is cheaper than the single-call budget assumed, because the budget over-provisioned. Gap 0.04x, k 1.18, gate PASS. This is the falsification working. A cheap loop reads as cheap.

What it prints on a spiral (exit 1)

Now the bloated fixture — a 1,200-token system prompt, ten tool descriptions (2,030 tokens of inventory replayed every step), and fourteen steps whose tool results grow into the thousands as the agent reads files, runs tests, and re-parses output:

$ python3 loop_forecast.py fixtures/loop_spiral.jsonl
loop_forecast.py | tokenizer: tiktoken/o200k_base
steps: 14 | system: 1200t | tool_descriptions: 2030t (replayed every step)
per-step billed input (tokens): [3230, 6430, 11830, 18630, 26730, 36230, 43430, 54430, 64230, 76730, 86930, 95530, 108530, 120030]
effective $/task: $2.2588  (area under replay curve @ $3.00/Mtok)
naive per-invocation quote: $0.2000
forecast_gap: 11.29x  (effective / quote)
curvature k: 2.139  (cumulative bill ~ calls^k; 1.0=linear, 2.0=quadratic)
gate: gap>8x AND k>=1.3 -> BREACH
exit: 1

Read the per-step list. Step 1 bills 3,230 tokens. Step 8 bills 54,430, in the same order of magnitude as Muskan’s ~83k step-8 figure, and I kept mine deliberately under theirs so I’m not borrowing their drama. By step 14 a single step bills 120,030 input tokens, almost all of it re-sent context. The task costs $2.26. You quoted $0.20. That’s 11.29x, the 11x in the title. It’s the number the tool actually produced, not a target I reverse-engineered.
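The printed per-step list is enough to re-derive every headline number by hand. The following sketch recomputes them from that list alone, using the $3.00/Mtok price and $0.20 quote from the manifest; the constant names are illustrative.

```python
import math

# Per-step billed input copied from the spiral run above.
billed = [3230, 6430, 11830, 18630, 26730, 36230, 43430, 54430,
          64230, 76730, 86930, 95530, 108530, 120030]
PRICE_PER_MTOK = 3.00   # input $/Mtok from the manifest
QUOTE = 0.20            # per-invocation quote from the manifest

cum, run = [], 0
for b in billed:
    run += b
    cum.append(run)

eff_usd = run / 1e6 * PRICE_PER_MTOK
gap = eff_usd / QUOTE

# Log-log least-squares slope of the cumulative series.
xs = [math.log(i + 1) for i in range(len(cum))]
ys = [math.log(v) for v in cum]
mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
k = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)

print(f"total billed input: {run:,} tokens")   # 752,920 tokens
print(f"effective $/task:   ${eff_usd:.4f}")   # $2.2588
print(f"forecast_gap:       {gap:.2f}x")       # 11.29x
print(f"curvature k:        {k:.3f}")          # 2.139
```

The fourteen steps add up to 752,920 billed input tokens, which is where the $2.2588 comes from.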

The gate trips because both conditions hold: gap over 8x and cumulative k at or above 1.3. A loop that’s expensive but linear (a long, simple, single-tool job) won’t trip it; neither will a curvy but cheap one. You want both signals before you fail someone’s build.
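The two-condition gate can be written as a small truth table. In this sketch the thresholds 8 and 1.3 are taken from the printed gate line; treating them as function defaults is an assumption, and the "expensive but linear" and "curvy but cheap" rows use made-up values.

```python
def gate(gap: float, k: float, gate_x: float = 8.0, max_k: float = 1.3) -> int:
    """Return 1 (breach) only when the loop is both over the gap threshold
    and super-linear; otherwise 0."""
    return 1 if (gap > gate_x and k >= max_k) else 0


cases = {
    "spiral (11.29x, k=2.139)": (11.29, 2.139),
    "clean (0.04x, k=1.177)": (0.04, 1.177),
    "expensive but linear": (12.0, 1.05),   # illustrative values
    "curvy but cheap": (1.5, 2.0),          # illustrative values
}
for name, (gap, k) in cases.items():
    print(f"{name:<26} -> exit {gate(gap, k)}")
```

Only the spiral row returns 1; each of the other three fails one of the two conditions.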

Determinism, because a flaky gate is worse than no gate

A CI gate that returns different numbers on the same input is useless — you’ll mute it the first week. So the arithmetic is integer token counts with no randomness and no network. I hashed the stdout twice on each fixture:

clean  run1: 455a86ce7e1df9cdca74f072c5d5e2919dac8f91889d950769673e7998bd506d
clean  run2: 455a86ce7e1df9cdca74f072c5d5e2919dac8f91889d950769673e7998bd506d
spiral run1: 450d51f471b747c224c3782c6d8b4af8acddc1db677b073389e5de0a09ff74f3
spiral run2: 450d51f471b747c224c3782c6d8b4af8acddc1db677b073389e5de0a09ff74f3

Byte-identical. Same trace in, same gate out. Bad JSON returns exit 2 with the parser’s error, and no arguments prints usage and exits 2 — so the gate fails loud on a broken trace instead of silently passing.
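A determinism check like this is simple to automate. The sketch below hashes a command's stdout across repeated runs; `stdout_sha256` and `is_deterministic` are hypothetical helper names, and the demo command is a stand-in so the snippet runs without the script or fixtures present.

```python
import hashlib
import subprocess
import sys


def stdout_sha256(cmd):
    """Run a command and return the SHA-256 hex digest of its stdout bytes."""
    out = subprocess.run(cmd, capture_output=True, check=False).stdout
    return hashlib.sha256(out).hexdigest()


def is_deterministic(cmd, runs=2):
    """True when every run of `cmd` produces byte-identical stdout."""
    return len({stdout_sha256(cmd) for _ in range(runs)}) == 1


# Stand-in command; swap in the real invocation, for example
# [sys.executable, "loop_forecast.py", "fixtures/loop_spiral.jsonl"].
demo = [sys.executable, "-c", "print('forecast_gap: 11.29x')"]
print(is_deterministic(demo))
```

`check=False` is intentional: the forecaster exits 1 on a breach, and a non-zero exit should not stop the hash comparison.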

Where this is wrong, and where I’m guessing

The model assumes input-token replay is the dominant cost, and on a tool-heavy agent loop it usually is — but if your steps generate large outputs (long generations, not long contexts), output pricing matters and this tool ignores it. It also assumes you can name your per-invocation quote honestly; garbage quote in, garbage gap out. And the curvature fit needs enough steps to mean anything — on a 3-step loop the k value is noisy (which is why the gate also requires the gap condition). I’d trust the gap number on any trace and treat k as a shape hint, not a precise exponent.

The fixtures here are constructed to be realistic, not harvested from a specific production run: the tool sizes and step counts come from Muskan’s and Augment’s published figures; the math is mine. Run it on your own exported traces and the numbers stop being illustrative.

Run it on your loop

Save the script, write one manifest line and one line per tool-call from a trace you already have, and run it. If your gap comes back under 2x, your per-call budgeting was fine and you can ignore all of this. If it comes back at 11x like mine — or worse — you now have a number to put in front of whoever signs the cloud bill, computed before the loop ever shipped.

Here’s the open question I don’t have a clean answer to: prompt caching changes this math, because a cached system-prefix isn’t re-billed at full input price. My forecaster assumes no cache (worst case). What’s the right way to fold a partial cache hit-rate into the per-step replay cost without making the tool lie in either direction? If you’ve modeled that, I want to see how.
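This is not an answer to that question, only one possible way to frame it: report the no-cache figure next to a cached estimate instead of a single number. The sketch rests on assumptions that are all hypothetical: only the fixed prefix (system prompt plus tool descriptions) is cacheable, a hit bills that prefix at `price * cached_multiplier`, and cache-write surcharges are ignored. `cost_band`, `cached_multiplier`, and `hit_rate` are made-up names, and the 0.1 multiplier is a placeholder to replace with your provider's published rate.

```python
def cost_band(fixed, tails, price_per_mtok, cached_multiplier, hit_rate):
    """Return (no_cache_usd, with_cache_usd) for one loop.

    Assumptions (hypothetical): only the fixed prefix is cacheable, a hit
    bills it at price * cached_multiplier, cache-write fees are ignored.
    """
    fixed_total, tail_total, tail = 0, 0, 0
    for rt in tails:
        fixed_total += fixed   # replayed prefix, once per step
        tail_total += tail     # prior results re-sent at full price
        tail += rt
    full = (fixed_total + tail_total) / 1e6 * price_per_mtok
    # Blend of full-price misses and discounted hits on the prefix only.
    blended = 1.0 - hit_rate + hit_rate * cached_multiplier
    cached = (fixed_total * blended + tail_total) / 1e6 * price_per_mtok
    return full, cached


full, cached = cost_band(3230, [5400, 8100], 3.0, cached_multiplier=0.1, hit_rate=0.5)
print(f"no cache: ${full:.4f}  with cache: ${cached:.4f}")
```

Under this narrow assumption the saving is bounded by the prefix's share of the bill. On the spiral fixture the fixed tax accounts for 14 × 3,230 = 45,220 of the 752,920 billed tokens, about 6%, so the result tail would have to be cacheable too before the gap moved much.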


Written by Alexey Spinov. AI-assisted, human-verified: the tool, both fixtures, and every number above come from a real local run on 2026-06-21 (Python 3.13.5, tiktoken 0.13.0, o200k_base). I ran it, checked the exit codes (0 / 1 / 2), hashed the output twice to confirm determinism, separated my numbers from Muskan’s and Augment’s cited figures, and edited every line. Offline, keyless, read-only, zero network.

Follow for the next numbers from production agent traces. What’s the worst per-task-vs-quote gap you’ve found on a real loop — and did anything in CI catch it before the invoice did?