Your Agent Returns 200 and Lies. Verify Before You Trust


A success gate verifies an AI agent’s claimed success before your system accepts it. SuccessGate runs three read-only checks — schema/contract, claim-vs-evidence against the actual tool-call trace, and an optional post-condition probe — and turns a silent 200 into an explicit REJECTED with reasons. It’s stdlib Python, needs no API key, moves nothing, and ships with a self-test you run in one command.

Here’s the failure that started this for me. An agent in a CRM workflow reported {"status": "sent", ...} for an invoice. Clean run. Green dashboard. 200 OK. The invoice went to a customer id that wasn’t on our allow-list — a near-miss hallucination the model was completely sure about. Nothing crashed. No exception, no stack trace. We found it days later, downstream, the expensive way.

That’s not a rare bug. It’s the default failure mode of agents in production, and it has a name now: silent-success drift. Cycles’ writeup, “200 OK Is the Most Dangerous Response in Production,” put it bluntly: “The most dangerous failures look like success.” And the measurements back it up. The Berkeley Function-Calling Leaderboard (BFCL v3) puts the rate of structurally invalid tool calls from frontier models at 2–5% even on clean benchmark prompts, and higher in noisy production (via Future AGI). The arXiv paper Agent Behavioral Contracts reports that across 1,980 sessions, contracted agents caught 5.2–6.8 soft violations per session that uncontracted baselines missed entirely.

So the question is not how do I see failures sooner. It’s how do I stop accepting a success the agent never actually achieved.

TL;DR

  • A 200 and a "status": "done" are claims, not proof. Agents return both while doing the wrong thing — or nothing.
  • Observability is tracking: it tells you a call happened. It can’t tell you the result was correct. That’s a control problem.
  • verify() runs three checks before you accept success: (1) schema/contract (shape, types, enums/allow-list), (2) claim-vs-evidence (did the agent actually call the tool it says it used?), (3) optional read-only post-condition probe (is the effect really there?).
  • Stdlib only; requests optional and only for the level-3 probe. No keys, no funds, no blockchain.
  • Built-in self-test on 4 fixtures: python success_gate.py. No agent, no key.
  • This is a gate, not an eval suite. Limits at the bottom.

This is the third piece in a small series on controlling agents before they execute, not after. Part one was a hard pre-execution spend-cap for runaway loops — don’t let the agent spend. Part two was a pre-send transaction canary — don’t let it send a bad transaction. This one is the third point on the same pre-execution layer: don’t accept a success that isn’t real. Spend, send, accept — same idea, three places.

Your dashboard tracks. It doesn’t control.

Every agent-observability tool I’ve used does the same useful thing: it records what happened. Spans, traces, a timeline of tool calls, latency, a green checkmark when the run returns 200. I’m not knocking it — you need it. But look at what it actually knows.

It knows a call was made. It does not know the call did the right thing.

There’s a survey number that captures the gap. A 2026 poll of 1,300+ AI professionals found 89% of organizations have observability in place, but only 62% can actually inspect what their agents do at each step (Cycles). Tracking is near-universal. The ability to inspect each step lags it by 27 percentage points. That delta is exactly where silent-success drift lives.

And it compounds. The arithmetic everyone quotes is real: at 95% per-step correctness, a 10-step workflow lands right only ~60% of the time; at 85%, about 20%. Structured output helps a lot — Agentmelt’s production data puts action-success at 95–99% with structured output versus 70–85% parsing unstructured text (Agentmelt). But notice what structured output fixes: it guarantees the shape. It says nothing about whether the agent did the thing the shape describes. A perfectly-formed {"status": "updated"} from an agent that never called update is exactly as well-formed as an honest one.
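The compounding figures above are just per-step correctness raised to the number of steps. A minimal check, assuming every step succeeds or fails independently (a simplification; real workflows can have correlated failures), with workflow_success as an illustrative name:

```python
def workflow_success(per_step: float, steps: int) -> float:
    """Chance that every step is correct, assuming independent steps."""
    return per_step ** steps


for p in (0.95, 0.85):
    rate = workflow_success(p, 10)
    print(f"{p:.0%} per step over 10 steps -> {rate:.1%} end to end")
# 95% per step over 10 steps -> 59.9% end to end
# 85% per step over 10 steps -> 19.7% end to end
```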

So you can have observability, structured output, and a green dashboard, and still ship a lie. The missing piece is a step that sits between “agent finished” and “we accept it” and asks: prove it.

The shape of the fix: gate before accept

The idea fits in a sentence. Before you treat an agent’s result as success, run it past a contract — and if it can’t back up its claim, mark it REJECTED instead of letting a silent 200 through.

Three checks, deliberately asymmetric, because the three ways a result lies are different:

  • Check 1 — schema / contract. Is the output even shaped like success? Required fields present, right types, and values inside the allowed set — including an allow-list for things like target_id. This is the cheap one that catches the BFCL 2–5%: malformed JSON, a missing field, an id that was never permitted. The invoice-to-cust_9999 near-miss dies here.
  • Check 2 — claim vs evidence. Did the agent do what it says it did? Walk the actual tool-call trace. If the result claims "updated" but update_record was never called, that’s a claim without an action. This is the check that catches the agent narrating an effect it never produced — the failure observability literally cannot see, because no failing call ever happened.
  • Check 3 — post-condition probe (optional, read-only). Is the effect actually there? You hand it a small read-only function — a GET that reads the record back, a query that counts the row. If it can’t confirm the effect, REJECTED. This is the only check that touches the network, and it’s optional precisely because it needs a read-only way to verify.

The point of running all applicable checks and collecting every reason — instead of bailing on the first — is that a drifted result usually trips more than one. You want the full report, not the first red flag.

SuccessGate — the whole thing

Stdlib for the first two checks, so it runs on a bare machine with nothing installed. requests is an optional import, wrapped in a try/except at the top of the file, and the gate itself never calls it; only a level-3 probe you write would use it. Even then, the probe is your read-only function, not something the tool invents. No keys. No funds. No blockchain. It reads; it never writes.

#!/usr/bin/env python3
"""SuccessGate - verify an AI agent's claimed success BEFORE you accept it.
Three asymmetric checks between "the agent says it's done" and "we trust it":
  1. schema/contract   - shaped right (required fields, types, enums)
  2. claim-vs-evidence - every claimed effect maps to a real tool-call in the trace
  3. post-condition    - (optional) a read-only probe confirms the effect happened
Stdlib only; `requests` optional (level-3 probe). No keys, no funds. Read-only.
Run the self-test:  python success_gate.py
"""
from __future__ import annotations
import json, re
from dataclasses import dataclass, field
from typing import Any, Callable, Optional

try:
    import requests  # noqa: F401  -- OPTIONAL, only the level-3 probe uses it
    _HAS_REQUESTS = True
except ImportError:
    _HAS_REQUESTS = False


@dataclass
class GateResult:
    verdict: str = "ACCEPTED"
    reasons: list = field(default_factory=list)
    def fail(self, reason): self.verdict = "REJECTED"; self.reasons.append("FAIL: " + reason)
    def ok(self, note): self.reasons.append("ok: " + note)
    @property
    def accepted(self): return self.verdict == "ACCEPTED"


@dataclass
class Contract:
    """Machine-checkable definition of 'this result is real'.
    required: field -> type | enums: field -> allowed set (allow-list)
    claim_requires_action: claimed verb -> tool that MUST appear in the trace."""
    required: dict = field(default_factory=dict)
    enums: dict = field(default_factory=dict)
    claim_requires_action: dict = field(default_factory=dict)


def check_schema(result, contract, res):
    if isinstance(result, str):
        try:
            result = json.loads(result)
        except json.JSONDecodeError as e:
            res.fail(f"output is not valid JSON: {e}"); return None
    if not isinstance(result, dict):
        res.fail(f"output is {type(result).__name__}, expected an object"); return None
    for name, typ in contract.required.items():
        if name not in result:
            res.fail(f"missing required field '{name}'"); continue
        if not isinstance(result[name], typ):
            res.fail(f"field '{name}' is {type(result[name]).__name__}, "
                     f"expected {getattr(typ, '__name__', typ)}")
    for name, allowed in contract.enums.items():
        if name in result and result[name] not in allowed:
            res.fail(f"field '{name}'={result[name]!r} not in allowed set {sorted(allowed)}")
    if res.accepted:
        res.ok(f"schema valid ({len(contract.required)} required fields present)")
    return result


def check_claim_vs_evidence(result, trace, contract, res):
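    # Count only trace steps that actually name a tool. A step without one
    # would put None in the set and make sorted() raise in the failure message.
    called = {s["tool"] for s in trace
              if isinstance(s, dict) and isinstance(s.get("tool"), str)}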
    hay = (str(result.get("status", "")) + " " + str(result.get("summary", ""))).lower()
    for verb, tool in contract.claim_requires_action.items():
        if re.search(rf"\b{re.escape(verb)}\b", hay):
            if tool not in called:
                res.fail(f"claims '{verb}' but never called '{tool}' "
                         f"(tools actually called: {sorted(called) or 'none'})")
            else:
                res.ok(f"claim '{verb}' backed by a real '{tool}' call")


def check_post_condition(probe, res, label="post-condition"):
    if probe is None:
        return
    try:
        if probe():
            res.ok(f"{label}: effect confirmed by read-back")
        else:
            res.fail(f"{label}: effect NOT present on read-back (wrong-effect)")
    except Exception as e:
        res.fail(f"{label}: probe could not complete: {e}")


def verify(result, contract, trace=None, probe=None):
    """Accept the agent's success only if every applicable check passes."""
    res = GateResult()
    parsed = check_schema(result, contract, res)
    if parsed is not None:
        if trace is not None and contract.claim_requires_action:
            check_claim_vs_evidence(parsed, trace, contract, res)
        check_post_condition(probe, res)
    return res


def _print(label, r):
    print(f"[{r.verdict}] {label}")
    for line in r.reasons:
        mark = "    FAIL -" if line.startswith("FAIL: ") else "       ok -"
        print(f"{mark} {line.split(': ', 1)[1]}")
    print()


if __name__ == "__main__":
    print("SuccessGate self-test - no agent, no key, offline\n")
    contract = Contract(
        required={"status": str, "target_id": str},
        enums={"status": {"created", "updated", "sent"},
               "target_id": {"cust_1001", "cust_1002", "cust_1003"}},  # allow-list
        claim_requires_action={"created": "create_record",
                               "updated": "update_record",
                               "sent": "send_invoice"},
    )

    print("--- fixture 1: valid - created, with a real create_record call ---")
    _print("agent created cust_1001",
           verify({"status": "created", "target_id": "cust_1001",
                   "summary": "created the customer record"},
                  contract, trace=[{"tool": "create_record", "args": {"id": "cust_1001"}}]))

    print("--- fixture 2: invalid JSON - the agent returned a broken string ---")
    _print("agent output is malformed JSON",
           verify('{"status": "created", "target_id": "cust_1001"',  # missing }
                  contract, trace=[{"tool": "create_record"}]))

    print("--- fixture 3: claim without action - says updated, never called it ---")
    _print("agent claims updated, trace has only read_record",
           verify({"status": "updated", "target_id": "cust_1002",
                   "summary": "updated the record as requested"},
                  contract, trace=[{"tool": "read_record", "args": {"id": "cust_1002"}}]))

    print("--- fixture 4: wrong effect - status sent, target_id off the list ---")
    def probe_invoice_present():
        return False  # stand-in for a read-only GET that finds no such invoice
    _print("agent claims sent to cust_9999 (not allow-listed)",
           verify({"status": "sent", "target_id": "cust_9999",
                   "summary": "invoice sent successfully"},
                  contract, trace=[{"tool": "send_invoice", "args": {"id": "cust_9999"}}],
                  probe=probe_invoice_present))

How you’d wire it in

One call, right where you currently trust the agent’s word. You already have the four things it needs: the agent’s structured result, your contract, the tool-call trace your framework records, and (optionally) a read-only probe.

gate = verify(
    result=agent_result,           # what the agent returned
    contract=my_contract,          # what 'real success' means here
    trace=run.tool_calls,          # the actual calls your framework logged
    probe=lambda: crm.get(agent_result["target_id"]) is not None,  # read-only
)
if not gate.accepted:
    raise RuntimeError("SuccessGate rejected: " + "; ".join(gate.reasons))
# only here do you mark the task done / advance the workflow

The raise is your point of no return in reverse: if the gate rejects, the “mark done” code never runs. The reasons list is what shows up in your logs, so when something is rejected at 3am you see why (claims 'updated' but never called 'update_record'), not just a red number.
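One possible refinement, sketched under assumptions: the RuntimeError above joins every entry in reasons, including the ok: lines. A small exception type can carry only the failures as a list, so a log handler or retry policy can inspect them. GateRejected and accept_or_raise are hypothetical names, not part of success_gate.py, and the GateResult below is a trimmed stand-in for the one in the listing so the snippet runs on its own.

```python
from dataclasses import dataclass, field


@dataclass
class GateResult:
    """Trimmed stand-in for the GateResult in success_gate.py."""
    verdict: str = "ACCEPTED"
    reasons: list = field(default_factory=list)

    @property
    def accepted(self):
        return self.verdict == "ACCEPTED"


class GateRejected(RuntimeError):
    """Hypothetical exception that keeps only the FAIL reasons, as a list."""

    def __init__(self, reasons):
        prefix = "FAIL: "
        self.failures = [r[len(prefix):] for r in reasons if r.startswith(prefix)]
        super().__init__("SuccessGate rejected: " + "; ".join(self.failures))


def accept_or_raise(gate):
    """Return the gate result if accepted; otherwise raise GateRejected."""
    if not gate.accepted:
        raise GateRejected(gate.reasons)
    return gate
```

In the wiring above, accept_or_raise(verify(...)) would replace the if/raise pair, and an except GateRejected handler could log e.failures one entry per line.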

The contract is the whole game, and it’s the part you write. required is the shape. enums is where allow-lists live — the single most useful field, because “an id the agent invented” is the most common drift I see. claim_requires_action maps each claim verb to the tool that must back it. Start with the one workflow that’s burned you, and grow it.

Run it

python success_gate.py

No account, no key, no network. The first two checks are pure stdlib; the self-test’s level-3 probe is a local stub that returns False so you can see a wrong-effect rejection without standing up a server. pip install requests only if you want the optional probe to make a real read-only GET in your own integration.
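If you do want a real read-back, here is one way a probe could look. This is a sketch, not part of success_gate.py: it uses urllib from the stdlib in place of requests so it still runs with nothing installed, and the /invoices/<id> path, the target_id field in the response, and the names http_get_json and make_invoice_probe are all assumptions about your own API.

```python
import json
import urllib.error
import urllib.parse
import urllib.request


def http_get_json(url, timeout=5.0):
    """Read-only GET. Returns parsed JSON, or None if the server answers 404."""
    req = urllib.request.Request(url, method="GET")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return json.loads(resp.read().decode("utf-8"))
    except urllib.error.HTTPError as e:
        if e.code == 404:
            return None  # record does not exist: the effect is absent
        raise  # other HTTP errors propagate; the gate reports them as a failed probe


def make_invoice_probe(base_url, invoice_id, expected_target, fetch=http_get_json):
    """Build a zero-argument probe for verify(..., probe=...).

    `fetch` is injectable so the probe can be exercised without a network.
    The URL layout and the 'target_id' field are hypothetical.
    """
    def probe():
        safe_id = urllib.parse.quote(str(invoice_id), safe="")
        record = fetch(f"{base_url}/invoices/{safe_id}")
        return isinstance(record, dict) and record.get("target_id") == expected_target
    return probe
```

Timeouts and connection errors are deliberately not swallowed here: check_post_condition already catches any exception from the probe and turns it into a FAIL reason.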

Output

SuccessGate self-test - no agent, no key, offline

--- fixture 1: valid - created, with a real create_record call ---
[ACCEPTED] agent created cust_1001
       ok - schema valid (2 required fields present)
       ok - claim 'created' backed by a real 'create_record' call

--- fixture 2: invalid JSON - the agent returned a broken string ---
[REJECTED] agent output is malformed JSON
    FAIL - output is not valid JSON: Expecting ',' delimiter: line 1 column 47 (char 46)

--- fixture 3: claim without action - says updated, never called it ---
[REJECTED] agent claims updated, trace has only read_record
       ok - schema valid (2 required fields present)
    FAIL - claims 'updated' but never called 'update_record' (tools actually called: ['read_record'])

--- fixture 4: wrong effect - status sent, target_id off the list ---
[REJECTED] agent claims sent to cust_9999 (not allow-listed)
    FAIL - field 'target_id'='cust_9999' not in allowed set ['cust_1001', 'cust_1002', 'cust_1003']
       ok - claim 'sent' backed by a real 'send_invoice' call
    FAIL - post-condition: effect NOT present on read-back (wrong-effect)

What’s deterministic here isn’t a number — it’s the verdicts. Fixture 1 is ACCEPTED: the schema is valid and the created claim is backed by a real create_record call. Fixtures 2, 3, and 4 are each REJECTED, for three different reasons: broken JSON, a claim with no matching tool-call, and an effect that the contract’s allow-list and the probe both refuse. Three distinct lies, three distinct catches, none of them a silent 200.

What this is not

I’d rather undersell this than have you trust it past its range. A gate that gives false confidence is worse than no gate.

  • It’s not a replacement for evals or tests. SuccessGate is a runtime gate on one result at a time. It doesn’t tell you your agent is good in aggregate, doesn’t measure quality across a dataset, and doesn’t replace the offline eval suite that tells you whether a prompt change regressed anything. Run both. They answer different questions.
  • It can’t catch semantically-correct-but-undesirable without a contract that says so. If the agent updates the right record with the wrong business decision — a legal-but-dumb action — and your contract doesn’t encode a rule that forbids it, the gate passes it. The gate is exactly as good as the contract you write. An empty contract verifies nothing.
  • The level-3 probe needs a read-only way to check the effect. “Did the invoice send” is only verifiable if you have a read endpoint to ask. For effects with no read-back — fire-and-forget side effects, third-party actions you can’t query — check 3 simply doesn’t apply, and you’re leaning on checks 1 and 2.
  • The self-test is a mechanics demo, not a benchmark. The four fixtures show how each check fires. They are not a measurement of how often real agents drift — for that, see the arXiv and BFCL numbers up top, which are real studies, not my four hand-built cases.

None of that dents the core claim: running the agent’s claimed success past a contract before you accept it is the difference between catching cust_9999 at the gate and finding it downstream days later.

The one question I’m still chewing on

The hard line is between effects a contract can check and effects only a human can judge. target_id in allow_list is crisp — machine-checkable, no argument. “Is this invoice correct” is not, at least not obviously: the amount, the line items, the tax, the customer’s actual intent. I’ve been encoding more of “correct” as post-conditions — read the record back, assert the total matches the order it references — and it works until the assertion itself needs judgment. Where do you draw the line? How does your team encode “the invoice is right” as a machine-checkable post-condition instead of a human eyeballing it? If you’ve pushed that boundary further than an allow-list and a read-back, I’d genuinely like to see how — drop it in the comments.
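For reference, the kind of post-condition I mean by “the total matches the order it references” can be sketched like this. Everything named here is an assumption for illustration: read_invoice and read_order stand for read-only lookups you would supply, the order_id, total, lines, unit_price and quantity fields are a made-up record shape, and tax and discounts are left out on purpose.

```python
from decimal import Decimal


def make_total_probe(read_invoice, read_order, invoice_id):
    """Build a probe that reads the invoice back and checks its total
    against the order it references. Both readers are hypothetical
    read-only callables that return a dict, or None when nothing is found."""
    def probe():
        invoice = read_invoice(invoice_id)
        if invoice is None:
            return False  # nothing to read back: the effect is absent
        order = read_order(invoice["order_id"])
        if order is None:
            return False  # invoice points at an order that does not exist
        # Decimal via str avoids float rounding when comparing money amounts.
        expected = sum(
            (Decimal(str(line["unit_price"])) * line["quantity"]
             for line in order["lines"]),
            Decimal("0"),
        )
        return Decimal(str(invoice["total"])) == expected
    return probe
```

This is exactly the sort of check that works until judgment creeps in: the moment “expected” depends on a negotiated discount or a tax rule with exceptions, the assertion stops being crisp.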

This is part three of a series on pre-execution control for agents: don’t let them overspend, don’t let them send a bad transaction, and don’t accept a success they never achieved.


Written with AI assistance and reviewed/edited by a human. The code in this post was run before publishing; the output block above is from a real run (2026-06-08). External figures — BFCL v3’s 2–5% invalid tool calls (via Future AGI), Agentmelt’s 95–99% vs 70–85% structured-vs-unstructured action success, the arXiv 2602.22302 figures (1,980 sessions, 5.2–6.8 soft violations/session, 200 scenarios), and Cycles’ 89%/62% survey split — are as reported by those sources; check the originals before quoting.