The Context Tax: Why Step 12 Costs 42x Step 1 (Measure It in 40 Lines)
In short: the context tax is what you pay when every agent step re-sends the whole session transcript as input again, so step N re-bills turns 1..N and total cost grows with n(n+1)/2. Cheaper tokens lower the unit, not the shape. context_tax.py meters the re-bill multiplier offline; one debugging session measured 42.8x.
AI disclosure: I drafted this with an AI writing assistant. The tool, the fixtures, and every number below come from a real local run of the script in this post on tiktoken o200k_base. I reviewed and edited it before publishing.
Token prices have been sliding all year. Your agent bill probably hasn’t.
I kept running into the same confusion in my own FinOps notes: per-token rates drop, and the monthly number goes the other way. The usual answers (“you’re using a bigger model,” “you have more users”) didn’t explain a single session getting more expensive as it ran. So I wrote a 40-line meter to look at the one thing nobody charts: the session transcript itself. On a synthetic-but-realistic debugging session, the last step billed 42.8x the input of the first step. Same model. Same task. No new users.
That gap has a boring cause and an annoying consequence. Here’s both, plus the script.
TL;DR. Every step of an agent loop re-sends the whole conversation so far (history plus tool outputs) as input. So step N pays for turns 1..N again, and total input grows roughly with n(n+1)/2. Cheaper tokens don’t fix the shape; they just lower the unit on a number that’s still climbing. context_tax.py (below, keyless, offline) meters three things from a session JSON: the re-bill curve, the re-bill multiplier, and a dead-weight estimate. On my bloated fixture it reported a 42.8x multiplier and 19.3% dead weight, and exited 1 as a CI gate.
Why a transcript gets billed again every single step
Here’s the part that trips people up. An LLM call is stateless. The model doesn’t “remember” turn 3 when you make turn 12. Your framework re-sends turns 1 through 11 as input so the model can see them. Every. Single. Step.
So the cost of one step isn’t the cost of that step’s new text. It’s the cost of the entire history up to that point. Step 1 bills a short user message. Step 12 bills the user message plus a file dump plus a wide grep plus a stack trace plus every assistant reply in between. The new tokens at step 12 might be tiny. The billed input is not.
Logan (Waxell) put the shape plainly in The Compounding Math Your Architecture Is Hiding: “total cost grows roughly with n(n+1)/2,” and a turn-10 context can sit at 80,000–200,000 tokens. That post nails the problem and then points you at a proprietary runtime. I wanted the opposite: a tiny script I can run on my own transcript and check into CI. So that’s what this is.
And it’s why “tokens got cheaper” is the wrong consolation. Edwin Lisowski’s Token Prices Are Falling. So Why Is Your AI Bill Going Up? lists the drivers: full context re-sent each step, tool schemas eating 30–60% of the window before any user content, retries and sub-agents running around the clock. That schema overhead is a sibling tax worth metering on its own — I did exactly that for MCP servers in measure your MCP server’s token tax, where the tool definitions are billed on every call before a single user token. His illustrations are blunt. He cites AT&T going from 1B to 27B tokens/day over 18 months. (Those are Lisowski’s examples, not my measurements; I’m attributing them to him.) Cheaper unit, bigger n. The unit lost.
You can’t estimate this. You have to measure it.
There’s a second reason to meter instead of guess. Agents are bad at predicting their own spend.
The arXiv paper How Do AI Agents Spend Your Money? (Bai, Huang, Wang, Sun, Mihalcea, Brynjolfsson, Pentland, Pei) measured agentic coding tasks and found three things worth pinning to the wall: agentic runs burn roughly 1000x the tokens of a plain code-chat; the same task can vary up to 30x in cost run to run; and models “fail to accurately predict their own token usage,” with correlations up to just 0.39. A 0.39 correlation is barely better than a shrug.
So the takeaway writes itself: meter the transcript, don’t trust the estimate. If the model can’t call its own number, your gut can’t either.
The tool: 40 lines, no API key
context_tax.py reads one JSON file: a session transcript as a list of turns (role + content, tool results included). It tokenizes with tiktoken’s o200k_base and reports four things.
- Re-bill curve — billed input at each step, i.e. the cumulative history 1..N re-sent on that call.
- Re-bill multiplier — billed input at the last step ÷ billed input at the first step. How much more “one call” costs at the end versus the start.
- Dead-weight % — tokens from early turns whose terms basically never resurface in later turns (a turn counts as dead if under 15% of its terms reappear downstream). Stuff you keep paying to re-send that the model isn’t really using. It’s the same dead-weight idea I applied to persistent stores in auditing your agent’s memory tax — there the stale entries ride along on every retrieval; here they ride along on every step.
- $ / session — total billed input across all steps × a rate you pass in. The rate is a parameter, not a hardcoded vendor price — this is a compounding illustration, not your invoice.
The exit code is the point. 0 if the multiplier is under threshold (a disciplined session), 1 if it’s over (the architecture is compounding, so fail the build), 2 for usage. Drop it in CI and a session that balloons becomes a red check, not a surprise line item.
#!/usr/bin/env python3
"""context_tax.py - meter the re-bill tax on a single agent session's transcript."""
import json, re, sys
THRESHOLD = 12.0 # re-bill multiplier above this = compounding architecture
DEAD_OVERLAP = 0.15 # a turn is dead weight if <15% of its terms resurface later
STOP = set("the a an of to in is it on for and or but with as at by from this that be are was you your i we they it's".split())
try:
import tiktoken
_enc = tiktoken.get_encoding("o200k_base")
def count(t): return len(_enc.encode(t))
TOKENIZER = "tiktoken o200k_base (exact)"
except Exception: # honest fallback, ~+-15% vs real BPE
def count(t): return max(1, round(len(t) / 4))
TOKENIZER = "len/4 heuristic (tiktoken not installed; ~+-15%)"
def words(t): return {w for w in re.findall(r"[a-z0-9_]{4,}", t.lower()) if w not in STOP}
def main(argv):
if len(argv) < 2:
print("usage: context_tax.py <session_transcript.json>"); return 2
s = json.load(open(argv[1], encoding="utf-8"))
rate = float(s.get("input_usd_per_mtok", 3.0)) # $/1M input tok; configurable, NOT a vendor quote
turns = [t["content"] for t in s["turns"]]
tok = [count(c) for c in turns]
later = [words(" ".join(turns[i + 1:])) for i in range(len(turns))]
billed, dead = [], 0
for n in range(len(turns)): # step n re-bills the running history 1..n
billed.append(sum(tok[: n + 1]))
if n < len(turns) - 1:
w = words(turns[n])
overlap = len(w & later[n]) / len(w) if w else 1.0
if overlap < DEAD_OVERLAP:
dead += tok[n]
mult = billed[-1] / billed[0] if billed[0] else 0
total_billed = sum(billed)
print(f"context_tax | {argv[1]} | tokenizer: {TOKENIZER} | rate=${rate}/Mtok | threshold x{THRESHOLD}")
print("-" * 78)
for n, b in enumerate(billed):
bar = "#" * round(b / billed[-1] * 40)
print(f" step {n + 1:>2} billed_input={b:>6}t {bar}")
print("-" * 78)
print(f" re-bill multiplier (step {len(billed)} / step 1) : x{mult:.1f}")
print(f" dead-weight (never referenced later) : {dead}t = {dead / billed[-1] * 100:.1f}% of the final payload")
print(f" total billed input across session : {total_billed}t (${total_billed / 1_000_000 * rate:.4f} at ${rate}/Mtok)")
print(f" exit : {1 if mult > THRESHOLD else 0}")
return 1 if mult > THRESHOLD else 0
if __name__ == "__main__":
sys.exit(main(sys.argv))
No key, no network, read-only. pip install tiktoken, point it at a transcript JSON, done. If tiktoken isn’t installed it falls back to a len/4 heuristic and says so out loud (~±15% off real BPE). I’d rather print the caveat than pretend the number is exact.
The real run
Two fixtures ship with the script. Both are synthetic coding sessions (no private data) but shaped like the real thing.
session_lean.json is a disciplined session: small tool outputs, and a deliberate scope reset before the second task. Here’s the actual output:
context_tax | session_lean.json | tokenizer: tiktoken o200k_base (exact) | rate=$3.0/Mtok | threshold x12.0
------------------------------------------------------------------------------
step 1 billed_input= 25t ####
step 2 billed_input= 46t ########
step 3 billed_input= 75t #############
...
step 10 billed_input= 235t ########################################
------------------------------------------------------------------------------
re-bill multiplier (step 10 / step 1) : x9.4
dead-weight (never referenced later) : 56t = 23.8% of the final payload
total billed input across session : 1335t ($0.0040 at $3.0/Mtok)
exit : 0
Multiplier 9.4x, under the 12x threshold, exit 0. Green. Note the dead weight is still 23.8%: that’s the first task’s context the model no longer needs in the second task. Even a clean session carries dead weight until you actually trim. The scope reset kept the multiplier down; it didn’t zero the waste.
session_bloated.json is the one that hurts. A 12-step debugging session that never trims: a full module dump, a wide repo grep, a long stack trace, and the kicker, a verbose pip check dependency log that gets re-sent on every step after it. Real output:
context_tax | session_bloated.json | tokenizer: tiktoken o200k_base (exact) | rate=$3.0/Mtok | threshold x12.0
------------------------------------------------------------------------------
step 1 billed_input= 40t #
step 2 billed_input= 72t ##
step 3 billed_input= 421t ##########
step 4 billed_input= 480t ###########
step 5 billed_input= 857t ####################
...
step 12 billed_input= 1713t ########################################
------------------------------------------------------------------------------
re-bill multiplier (step 12 / step 1) : x42.8
dead-weight (never referenced later) : 331t = 19.3% of the final payload
total billed input across session : 11774t ($0.0353 at $3.0/Mtok)
exit : 1
42.8x. Over threshold, exit 1: a failed build. Watch step 3 in the curve. The full file dump jumps billed input from 72 to 421 tokens, and you pay that bump again on every one of the nine steps that follow. The 331 dead-weight tokens are mostly that pip check log (boto3 versions, urllib3 pins) that never came up again but kept riding along in the payload.
Both numbers are reproducible. I hashed two consecutive bloated runs with shasum -a 256 and got identical digests, so the output is deterministic, not a fluke of one run.
One honest correction. I’d guessed the multiplier would land near 16x when I started (that’s the figure floating around the n(n+1)/2 discussions). The real run said 42.8x. The bloated fixture front-loads a big file dump on a small first turn, which stretches the ratio. The lesson isn’t “16x vs 42x.” It’s that the number depends entirely on your transcript shape, which is exactly why you measure your own instead of borrowing mine.
What to actually do about it
The fixes aren’t exotic. The point of the meter is to tell you which one you need, and to prove it worked.
- Scope-reset between tasks. The lean fixture does this: drop the prior task’s context before starting the next one. It’s the difference between 9.4x and 42.8x here.
- Trim or summarize fat tool outputs. That
pip checkdump was 19.3% dead weight. Replace a 300-token log with a one-line “deps OK, no conflicts” and you stop re-billing it nine times. - Rolling summarization for long sessions: collapse old turns into a short recap once they’re settled, instead of carrying them verbatim.
Then re-run the meter. If the multiplier drops back under threshold, the exit code flips to 0 and your CI gate goes green. That’s the whole loop: measure, cut, prove. Not “trust me, I optimized it,” but a number that moved. This meter slots in alongside the other checks in my pre-execution gate for AI agents — same philosophy, fail fast before the spend, not after the invoice.
What this is NOT (so I don’t oversell it)
- It does not block or cap anything at runtime. It’s a meter and a CI gate, not a spend guard. If you want the runtime brake that stops a session mid-loop, that’s a different tool — see the sliding-window spend guard, which caps cumulative cost over a window instead of just measuring it after the fact.
- It does not compute your real provider invoice. The
$/sessionfigure uses a rate you pass in, to illustrate compounding. Your actual bill depends on caching, batching, output tokens, and your vendor’s pricing — none of which this models. - Dead-weight is a lexical heuristic, and it has false positives. “Under 15% of terms resurface later” is a proxy for “the model stopped using this,” not proof of it. The model may have leaned on an early turn implicitly without repeating its words. On my bloated fixture the stack trace landed at 0.16 overlap, just above the line, correctly kept, because the fix really did reference it. Treat the % as a flag to go look, not a verdict.
- It does not optimize your context for you. It tells you where the tax is. The cutting is still your call.
What’s the worst re-bill multiplier you’ve measured on one of your own long sessions? Run the script on a real transcript and tell me in the comments. I’m collecting shapes, and I read every reply. Follow for the next number from the next run.