Passing the Eval Isn't Solving the Task: 3 Leaks, 60 Lines


Passing the eval is not solving the task: a green agent eval certifies nothing if the agent can write the files the grader reads, or reach the reference answer. This 60-line static probe read one harness spec, flagged 3 contamination points (2 write-read, 1 reference leak), and exited 1 without running the agent.

I keep seeing the same screenshot in agent-eval threads: a wall of green checkmarks, “98% pass rate,” ship it. Then the same agent face-plants in production on a task that looked identical to a passing eval case.

The usual explanation is “the eval set is too easy” or “distribution shift.” Sometimes. But there’s a quieter failure that nobody screenshots, because it never shows up as red: the harness graded a number the agent itself wrote.

Here’s the claim I’ll defend with a runnable tool: passing the eval is not the same as solving the task. A green run only certifies the agent if two things hold. The channel the agent can write into is disjoint from the channel the grader reads from. And the reference answer is not sitting somewhere the agent can open before grading. Break either one and your “98%” is, by construction, undecidable. The agent could have aced it. It could also have written {"passed": true} to the file the grader trusts. The score can’t tell you which.

You don’t need to rerun the eval to catch this. You can read the wiring.

What contamination looks like in an eval harness

Forget the model for a second. An agent eval harness is mostly file plumbing. There’s a set of paths the agent may write (its workspace, its output dir, a shared scratch DB). There’s a set of paths the grader reads to decide pass or fail (a results file, an exit-code dump, an oracle score). And there’s the reference answer, the gold solution the grader compares against.

Three of these sets are supposed to be carefully separated. In real harnesses they leak into each other for the dumbest reasons: someone pointed the grader at run/results.json to “reuse the artifact,” and run/ happened to be agent-writable. Someone stuffed the expected answer into task_config.json for convenience, and that config is exactly what the agent reads to understand the task.

Two leak shapes cover most of what I’ve seen:

  • C1, write-read overlap. A path the agent can write intersects a path the grader reads. The agent can fabricate the very artifact the grader inspects. This is state pollution. The grader thinks it’s reading ground truth; it’s reading the agent’s homework, graded by the agent.
  • C2, reference leak. The reference or gold answer is reachable inside the agent’s read-set or write-set. The agent can copy the answer instead of deriving it. This is the WebArena-style pattern people have flagged for a while: the expected answer lives in the task config that the agent is handed. (I’m describing the shape here, not quoting a benchmark’s pass-rate; the numbers later in this post are from my own probe, not from any public harness.)

Both shapes are invisible to the eval’s own score. The eval comes back green either way. That’s the whole problem. A contaminated harness can’t report its own contamination, because the contamination is upstream of the number it reports.

The probe: intersect declared paths, don’t run anything

So I wrote the boring thing instead of the clever thing. No sandbox, no instrumentation, no model in the loop. Just take the harness’s declaration of who-can-touch-what and intersect the sets.

The input is one JSON file describing four path sets:

{
  "name": "contaminated-webarena-task-042",
  "agent_write_set":  ["run/results.json", "run/output/*", "shared/state.db"],
  "agent_read_set":   ["config/task_config.json", "run/output/log.txt"],
  "grader_read_set":  ["run/results.json", "shared/state.db", "grader/oracle/score.txt"],
  "reference_paths":  ["config/task_config.json"]
}

The probe matches paths with three rules, because real path declarations aren’t naive string equality. Exact match. Glob match in both directions (so out/* catches out/score.json). And directory containment (so a write scope of logs/ covers a grader read of logs/run.txt). Here’s the matcher, which is the only part with any judgment in it:

def overlap(a, b):
    """True if two declared path patterns can refer to the same location."""
    a, b = norm(a), norm(b)
    if a == b or fnmatch.fnmatch(a, b) or fnmatch.fnmatch(b, a):
        return True
    if b.startswith(a + "/") or a.startswith(b + "/"):
        return True  # directory containment
    for x, y in ((a, b), (b, a)):
        if x.endswith("/*") and y.startswith(x[:-1]):
            return True  # glob-dir vs file
    return False

The analysis is two loops. Cross agent-write against grader-read for C1. Cross the reference paths against both agent sets for C2.

def analyze(spec):
    aw  = [norm(p) for p in spec.get("agent_write_set", [])]
    ar  = [norm(p) for p in spec.get("agent_read_set", [])]
    gr  = [norm(p) for p in spec.get("grader_read_set", [])]
    ref = [norm(p) for p in spec.get("reference_paths", [])]
    out = []
    for w in aw:
        for g in gr:
            if overlap(w, g):
                out.append(("C1", w, g, "agent can write the artifact the grader inspects"))
    for r in ref:
        for a in ar:
            if overlap(a, r):
                out.append(("C2", a, r, "reference answer is in the agent read-set"))
        for w in aw:
            if overlap(w, r):
                out.append(("C2", w, r, "reference answer is in the agent write-set"))
    return aw, gr, ref, out

That’s it. The whole tool, including a CLI and three exit codes, is about 60 lines of logic. It imports sys, json, fnmatch. Nothing else. No network, no subprocess, no eval, no model call. It cannot run your agent or your grader even if you asked it to, which is the point: it’s safe to drop into CI as a pre-merge or pre-publish gate on a repo you don’t fully trust.

The exit code is the gate:

  • 0 clean wiring, the channels are disjoint.
  • 1 contamination found, at least one overlap, treat the green eval as undecided and block the merge or the result.
  • 2 bad input, the spec is missing, not valid JSON, or not a JSON object.

Running it: 0 points clean, 3 points contaminated

I ran the probe on three fixtures. These are verbatim terminal transcripts, copied straight from my terminal, not retyped. The $ command line and the trailing exit= line are the shell, not the program; everything between them is the probe’s own stdout.

A clean harness: agent writes only its own workspace, grader reads an isolated oracle the agent can’t see, reference kept out of reach.

$ python3 eval_contamination_probe.py fixtures/fixture_clean.json
eval: clean-swe-task-001
agent_write_set=2 grader_read_set=2 reference_paths=1
contamination_points=0 density=0.0
by_class=none
verdict=CLEAN
exit=0

Zero points. Density zero. Exit 0. That’s necessary, not sufficient: exit 0 only says the declared path sets don’t intersect under this matcher. It can’t see runtime paths, symlinks, .. aliases, or channels you never declared. So read a clean run as “the wiring I declared is disjoint,” not “the eval is provably valid.”

Now the contaminated harness from the JSON above:

$ python3 eval_contamination_probe.py fixtures/fixture_contaminated.json
eval: contaminated-webarena-task-042
agent_write_set=3 grader_read_set=3 reference_paths=1
contamination_points=3 density=100.0
by_class=C1:2,C2:1
  [C1] run/results.json <-> run/results.json (agent can write the artifact the grader inspects)
  [C1] shared/state.db <-> shared/state.db (agent can write the artifact the grader inspects)
  [C2] config/task_config.json <-> config/task_config.json (reference answer is in the agent read-set)
verdict=CONTAMINATED
exit=1

Three contamination points. Two C1: the grader reads run/results.json and shared/state.db, both of which the agent can write. So the agent can hand the grader a results file that says it passed. The third is C2: the reference answer lives in config/task_config.json, which is in the agent’s read-set. The agent can read the answer.

Density is contamination points divided by the count of grader read-sources, times 100. Here it’s 3 / 3 = 100.0. Be careful how you read that number: it is a severity ratio, not a bounded percentage. It happens to be 100 here because there are 3 points and 3 grader-read sources, but the numerator and denominator are not the same thing. The numerator counts every overlap, including the C2 reference leak, which isn’t a grader-read source at all. So density can climb past 100 (three agent paths all hitting one grader glob gives density=300.0). Treat it as “how alarming,” not “what fraction.” What it tells you here is blunt enough: two of the three things this grader reads are agent-writable, and the gold answer is sitting in the agent’s read-set. This harness measures almost nothing about the agent’s ability. Mostly it measures whether the agent can write files, which every agent can.

And bad input, so CI fails loud instead of silently passing a broken spec:

$ python3 eval_contamination_probe.py fixtures/fixture_badinput.json
error: fixtures/fixture_badinput.json is not valid JSON: Expecting value: line 1 column 42 (char 41)
exit=2

The output is deterministic. I piped the clean run’s stdout (the program’s output, without the shell’s $ and exit= lines) into shasum -a 256 twice and got the same digest both times: c6accb251262e8911422145c5d462318e2b29cb548ded150dc7f15398484c448. The contaminated run hashed to e281f8fc31e9ada72177186f2d7d6c567636c237b4f3cb1e8f0e8493bc9858ac, identical across two runs. Determinism matters for a gate. A flaky gate gets disabled within a week, and then you’re back to trusting green checkmarks.

How this differs from “are my tests testing anything”

If you’ve read my earlier post on whether your tests test anything, this looks adjacent. It isn’t the same target. That tool asks whether your application’s unit tests actually exercise the code or just mirror it back. The object there is the test body.

This probe never looks at the application or its tests. It looks at the eval harness itself, the rig that grades the agent. The object is the wiring: which path sets touch which. You can have perfect application tests and a totally contaminated eval harness sitting on top of them. They’re different layers, and a green checkmark on the wrong layer is the trap.

It’s also not the dependency-gap auditor that intersects imports against declared dependencies. Same genre, static read-only pre-merge gate with an exit code, different intersection: declared paths in an eval spec, not import graphs. And it’s a different question than your agent returning a confident 200 while lying about a single call; this is about the integrity of the whole grading process, not one response. It’s also not the deterministic pre-gate that replaces a flaky LLM judge, which is about the cost of grading, not its integrity.

The franchise underneath all of these is the same opinion I keep defending: gate before you trust. Logging that an eval ran green is not control. Proving the eval could have failed is.

What this is NOT

I’d rather you reject this tool for the right reasons than adopt it for the wrong ones. So, the limits, plainly.

It’s static, not dynamic. The probe reads declarations of who-can-touch-what. It does not watch a real run. If your harness lies about its own path sets, or computes paths at runtime that aren’t in the spec, the probe can’t see them. It catches contamination you declared into existence, which in my experience is most of it, but not all of it. A runtime tracer would catch more and cost more.

A flag is a risk, not a verdict on the agent. An overlap means the wiring allows contamination. It does not prove the agent exploited it. Your agent might be writing run/results.json with perfectly honest content. The point is that the eval can no longer distinguish honest from fabricated, so the green result is undecided. That’s a meaningful and actionable thing to know. It’s not an accusation.

The glob matching is an approximation. I match with fnmatch plus directory containment. That’s deliberately a little eager, it will treat out/* as overlapping out/anything. One gotcha worth naming: Python’s fnmatch lets * cross /, so out/* also matches out/sub/deep.json, not just one segment, which makes my separate “glob-dir vs file” rule mostly redundant and the matcher broader than a shell glob. It also reads *, ?, and [...] as wildcards even inside a literal path name, so a path that contains those characters can match by accident. Expect false positives where a glob is broader than the files that actually land there, and don’t expect canonical-path resolution, symlink, or case handling. I’d rather over-flag a gate than miss a real leak, but if you have a path scheme my rules don’t model, you’ll get noise. The matcher is 10 lines; tune it for your repo.

No path manifest, no opinion. If your harness doesn’t declare its path sets, the probe has nothing to intersect and stays quiet. It will not invent contamination it can’t see. That silence is honest, not a pass. The gate is only as good as the spec you feed it, which is itself a nudge to make your harness declare its channels explicitly.

I built this in an afternoon and ran it on hand-built fixtures, not on a thousand real harnesses. So treat the matcher as a starting point, not gospel. If you wire it into a real eval repo and the directory-containment rule misfires, that’s the first thing I’d loosen.

The fixtures and the full script are in this post. Drop the file into your eval repo, write a spec describing your four path sets, and wire exit != 0 into the same CI step that runs the eval. If the probe exits 1, the green eval doesn’t count yet.


AI disclosure: I wrote and ran this tool myself. An AI assistant helped draft and edit the prose. Every number in this post (0 and 3 contamination points, density, the SHA-256 hashes, the exit codes) is from the actual run shown above, on Python 3.13, offline. Nothing here is borrowed from any public benchmark’s results.

Follow for the next probe in this series, and tell me the worst eval-harness leak you’ve personally hit. Did a grader ever read a file your agent wrote? I read every comment.