A 58% Win-Rate Over Zero Closed Trades: Recompute Agent Scorecard
Recompute the agent scorecard from its primary event journal before you trust it: a self-reported metric is the actor grading itself. scorecard_reconcile.py re-derives every metric independently and flags divergence. On the divergent fixture a claimed 58% win-rate sits over zero closed trades, yielding 5 DIVERGENT and 1 UNSUPPORTED metric and exit 1, blocking the add-capital decision.
AI disclosure: I wrote scorecard_reconcile.py with an AI assistant and ran it myself, offline, before publishing. Every number in the output blocks below is pasted from a real local run on Python 3.13.5, stdlib only, on the synthetic fixtures included in this post. I checked the exit codes, hashed the STDOUT twice to confirm it is byte-for-byte deterministic, and edited every line. The external figures (the SEC’s $12.3M case) are the SEC’s numbers, not mine, and I link the primary source. I label which numbers are theirs and which are mine.
In short:
- A self-reported metric and a real one look identical on a dashboard. The agent prints both.
- scorecard_reconcile.py reads the scorecard the agent claimed plus the journal of events it logged, then recomputes each metric from the journal and flags anything that does not reconcile.
- On the clean fixture, all 6 metrics MATCH, 0 flagged, exit 0. On the divergent fixture, a claimed 12-trade, 58.3% win-rate scorecard sits over a journal with zero closing events: 5 metrics DIVERGENT, 1 UNSUPPORTED, exit 1.
- The 58.3% is the dangerous one. A win-rate over zero closed trades is not 58%. It is undefined, and a number you cannot disprove is worse than a number that is merely wrong.
- Stdlib only (json, sys, hashlib). No network, no model, no exec. The run is byte-for-byte deterministic. The code and both fixtures are in this post.
The case that makes this concrete
On 28 May 2026 the SEC charged Nathan Fuller, founder of Privvy Investments, over a crypto scheme it valued at about $12.3 million raised from roughly 150 investors (SEC litigation release LR-26558, reported by CoinDesk on 30 May 2026). The pitch was proprietary AI trading bots running high-frequency arbitrage. According to the complaint, only about $380,000, roughly 3% of the money, ever bought crypto at all; those trades ran without the advertised bots and made no profit. Investors were kept calm with fake account statements and fabricated correspondence. Those are the SEC’s figures, from their filing, not mine.
Strip out the fraud and a quieter version of the same shape shows up in honest setups every week. A trading agent prints a dashboard. The dashboard says 4 active days, 12 trades, 58.3% win-rate. Nobody re-derives those numbers from the exchange fills. The statement and the reality were never reconciled. Fuller’s was a deliberate fabrication; most are just an agent confidently reporting work that the journal does not back. The defense is the same in both: do not accept the scorecard the actor printed about itself. Recompute it from the primary events first.
The claim, sharp enough to argue with
Here is the falsifiable version: a self-reported metric is the actor grading itself, and a scorecard you did not recompute is not evidence. Win-rate, trade count, “ran for N days,” realized PnL: every one of those is emitted by the agent, from the agent’s own view of what it did. If that view is wrong, optimistic, or invented, the metric inherits the error and presents it as a clean number on a chart.
If this claim were false, the recomputed scorecard would always equal the claimed one, and a tool like this would be pointless. You could read the dashboard and move on. The tool earns its place precisely when the two disagree, and the divergent fixture below is one keystroke of proof that they can disagree completely.
The control is not “log the metrics and watch them.” That is tracking. Tracking tells you what the agent said about itself. Control is recomputing the metric from the primary journal and gating the next decision (keep it running, scale it, add capital) on whether the recomputation agrees. This is the post-hoc sibling of the pre-execution gate for AI agents: the same idea, a check that has to pass before an action, applied to the aggregate report instead of a single call.
What the tool recomputes, and how
The input is two small JSON files. The first is claimed.json, the scorecard the agent printed about itself. The second is evidence.json, the primary journal: an events list of open and close records, each with a timestamp and, for closes, a PnL. Think exchange fills, not the agent’s summary of them.
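For orientation, here is a minimal sketch of the shape of the two documents, built as Python dicts and serialized with json. The key names (agent, scorecard, events, type, ts, pnl, and the six metric names) are the ones the tool reads. The agent name and every value are invented for illustration and are not one of the post's fixtures.

```python
import json

# Hypothetical example data, NOT one of the post's fixtures.
claimed = {
    "agent": "example-agent",
    "scorecard": {
        "trades": 2,
        "active_days": 1,
        "wins": 1,
        "losses": 1,
        "win_rate": 0.5,
        "realized_pnl": 4.25,
    },
}

# The journal: open and close records, each with a timestamp;
# closes also carry a PnL.
evidence = {
    "events": [
        {"type": "open", "ts": "2026-01-05T09:30:00Z"},
        {"type": "close", "ts": "2026-01-05T10:10:00Z", "pnl": 10.00},
        {"type": "open", "ts": "2026-01-05T11:00:00Z"},
        {"type": "close", "ts": "2026-01-05T11:45:00Z", "pnl": -5.75},
    ]
}

claimed_text = json.dumps(claimed, indent=2)
evidence_text = json.dumps(evidence, indent=2)
print(claimed_text)
print(evidence_text)
```

Saving the two strings as claimed.json and evidence.json would give the tool a pair to reconcile. In this made-up pair the scorecard agrees with the journal: two closes, one above zero and one below, summing to 4.25.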
The tool ignores the claimed numbers and rebuilds each metric from the journal with one defensible definition per metric:
- A trade is a closed position, a close event. Open attempts that never fill are not trades.
- A win is a close with PnL above zero, a loss is a close below zero. A close at exactly zero PnL counts as a trade but is neither.
- Active days are the count of distinct ISO dates across all events, taken as the first ten characters of the timestamp. No clock, no date library, no timezone math.
- Win-rate is wins over decided trades. Over zero decided trades it is None, undefined, not zero.
- Realized PnL is the sum of PnL over closes, rounded to a cent.
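These definitions are small enough to check by hand. The sketch below is a condensed stand-in for the tool's recompute step, without its input validation, run over a hypothetical four-event journal that I made up so each rule gets exercised once: an open that never closes, a break-even close, and timestamps that straddle midnight.

```python
def derive(events):
    """Rebuild the metrics from a list of event dicts (keys: type, ts, pnl)."""
    closes = [e for e in events if e.get("type") == "close"]
    # Active days: first ten characters of each timestamp, de-duplicated.
    days = {str(e["ts"])[:10] for e in events if e.get("ts")}
    wins = sum(1 for e in closes if e.get("pnl", 0) > 0)
    losses = sum(1 for e in closes if e.get("pnl", 0) < 0)
    decided = wins + losses
    return {
        "trades": len(closes),
        "active_days": len(days),
        "wins": wins,
        "losses": losses,
        # Undefined over zero decided trades, never 0.0.
        "win_rate": wins / decided if decided else None,
        "realized_pnl": round(sum(e.get("pnl", 0) for e in closes), 2),
    }

# Hypothetical journal, invented for this example.
journal = [
    {"type": "open", "ts": "2026-02-01T23:59:10Z"},
    {"type": "close", "ts": "2026-02-02T00:03:00Z", "pnl": 3.10},
    {"type": "close", "ts": "2026-02-02T08:00:00Z", "pnl": 0.0},
    {"type": "close", "ts": "2026-02-02T09:00:00Z", "pnl": -1.05},
]
result = derive(journal)
print(result)
print(derive(journal[:1])["win_rate"])  # only an open in the journal
```

The full journal gives 3 trades, 2 active days, 1 win, 1 loss, a 0.5 win-rate, and 2.05 realized. Cut down to the single open event, the win-rate comes back as None rather than a number.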
Tolerances are explicit so the verdict is not a matter of taste: counts must match exactly, ratios within 0.005 (half a percentage point), money within one cent. Here is the whole tool.
#!/usr/bin/env python3
"""scorecard_reconcile.py - recompute an agent's self-reported scorecard from its
primary event journal BEFORE you trust the report (or add capital / scale / keep it on).

A self-reported metric is the actor grading itself: win-rate, "ran for N days",
trade count, realized PnL - printed by the same agent whose work they describe.
This tool ignores the claimed numbers and INDEPENDENTLY recomputes each one from
the primary event journal (the fills / closes / activity the agent actually
logged), then flags every metric that does not reconcile.

Two failure shapes, not one:
  DIVERGENT   - claimed value != value recomputed from evidence.
  UNSUPPORTED - claimed value cannot be derived from the journal at all
                (a win-rate over zero closed trades is not 58%, it is
                undefined; the number is unfalsifiable, which is worse).

offline / keyless / read-only / zero-network. stdlib only (json, sys, hashlib).
The journal is read, never run. Nothing is fetched, nothing is sent.

Exit 0 = every claimed metric is supported by evidence AND matches -> report trustworthy
Exit 1 = at least one metric DIVERGENT or UNSUPPORTED -> do NOT trust the report
Exit 2 = bad input

Usage:
  python3 scorecard_reconcile.py <claimed.json> <evidence.json>
"""
import json
import sys
import hashlib

RATIO_TOL = 0.005  # 0.5 percentage points
MONEY_TOL = 0.01   # one cent


class BadInput(Exception):
    pass


def load_json(path):
    try:
        with open(path, "r") as fh:
            return json.load(fh)
    except FileNotFoundError:
        raise BadInput("file not found: %s" % path)
    except json.JSONDecodeError as exc:
        raise BadInput("invalid json in %s: %s" % (path, exc))


def as_number(value, field):
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise BadInput("metric %r is not numeric: %r" % (field, value))
    return float(value)


def recompute(events):
    """Independently rebuild every metric from the primary journal."""
    if not isinstance(events, list):
        raise BadInput("evidence.events must be a list")
    closes, days = [], set()
    for ev in events:
        if not isinstance(ev, dict):
            raise BadInput("event is not an object: %r" % (ev,))
        ts = ev.get("ts")
        if ts:
            days.add(str(ts)[:10])  # ISO date portion; no clock, no parse libs
        if ev.get("type") == "close":
            closes.append(ev)
    wins = sum(1 for e in closes if as_number(e.get("pnl", 0), "pnl") > 0)
    losses = sum(1 for e in closes if as_number(e.get("pnl", 0), "pnl") < 0)
    decided = wins + losses
    win_rate = (wins / decided) if decided > 0 else None  # None => undefined / unsupported
    realized = round(sum(as_number(e.get("pnl", 0), "pnl") for e in closes), 2)
    return {
        "trades": float(len(closes)),
        "active_days": float(len(days)),
        "wins": float(wins),
        "losses": float(losses),
        "win_rate": win_rate,
        "realized_pnl": realized,
        "_supporting_events": len(events),
        "_closing_events": len(closes),
    }


# metric name -> comparison kind
SPEC = [
    ("trades", "count"),
    ("active_days", "count"),
    ("wins", "count"),
    ("losses", "count"),
    ("win_rate", "ratio"),
    ("realized_pnl", "money"),
]


def compare(name, kind, claimed, recomputed):
    if recomputed is None:
        return "UNSUPPORTED"
    if kind == "count":
        return "MATCH" if int(round(claimed)) == int(round(recomputed)) else "DIVERGENT"
    if kind == "ratio":
        return "MATCH" if abs(claimed - recomputed) <= RATIO_TOL else "DIVERGENT"
    if kind == "money":
        return "MATCH" if abs(claimed - recomputed) <= MONEY_TOL else "DIVERGENT"
    raise BadInput("unknown metric kind: %s" % kind)


def fmt(kind, value):
    if value is None:
        return "UNDEFINED"
    if kind == "count":
        return str(int(round(value)))
    if kind == "ratio":
        return "%.3f" % value
    return "%.2f" % value


def build_report(claimed_doc, evidence_doc):
    if not isinstance(claimed_doc, dict) or "scorecard" not in claimed_doc:
        raise BadInput("claimed.json must contain a 'scorecard' object")
    sc = claimed_doc["scorecard"]
    if not isinstance(sc, dict):
        raise BadInput("'scorecard' must be an object")
    if not isinstance(evidence_doc, dict) or "events" not in evidence_doc:
        raise BadInput("evidence.json must contain an 'events' list")
    truth = recompute(evidence_doc["events"])
    agent = str(claimed_doc.get("agent", "unknown-agent"))
    lines = []
    lines.append("SCORECARD RECONCILE - %s" % agent)
    lines.append("primary journal: %d events (%d closing) - recomputed independently"
                 % (truth["_supporting_events"], truth["_closing_events"]))
    lines.append("")
    lines.append("%-14s %12s %12s %s" % ("metric", "claimed", "from-journal", "verdict"))
    lines.append("%-14s %12s %12s %s" % ("-" * 14, "-" * 12, "-" * 12, "-------"))
    divergent, unsupported = 0, 0
    for name, kind in SPEC:
        if name not in sc:
            raise BadInput("scorecard missing metric: %s" % name)
        claimed = as_number(sc[name], name)
        recomputed = truth[name]
        verdict = compare(name, kind, claimed, recomputed)
        if verdict == "DIVERGENT":
            divergent += 1
        elif verdict == "UNSUPPORTED":
            unsupported += 1
        note = ""
        if verdict == "UNSUPPORTED":
            note = " (0 closing events -> ratio is undefined / unfalsifiable)"
        lines.append("%-14s %12s %12s %s%s"
                     % (name, fmt(kind, claimed), fmt(kind, recomputed), verdict, note))
    flagged = divergent + unsupported
    lines.append("")
    lines.append("flagged: %d (%d divergent, %d unsupported)" % (flagged, divergent, unsupported))
    if flagged == 0:
        lines.append("DECISION GATE: scorecard reconciles with evidence -> trust permitted (exit 0)")
        code = 0
    else:
        lines.append("DECISION GATE: scorecard does NOT reconcile -> do NOT trust report; "
                     "block continue/add-capital (exit 1)")
        code = 1
    return "\n".join(lines), code


def main(argv):
    if len(argv) != 3:
        sys.stderr.write("usage: scorecard_reconcile.py <claimed.json> <evidence.json>\n")
        return 2
    try:
        claimed_doc = load_json(argv[1])
        evidence_doc = load_json(argv[2])
        body, code = build_report(claimed_doc, evidence_doc)
    except BadInput as exc:
        sys.stderr.write("bad input: %s\n" % exc)
        return 2
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    sys.stdout.write(body + "\n")
    sys.stdout.write("REPORT-SHA256: %s\n" % digest)
    return code


if __name__ == "__main__":
    sys.exit(main(sys.argv))
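The ratio tolerance in the listing is easy to sanity-check with arithmetic. A minimal sketch, assuming a hypothetical journal that really did hold 7 wins and 5 losses (the divergent fixture's journal holds neither): a dashboard that rounds 7/12 to three places should still reconcile, while a figure rounded up to the next whole percent should not.

```python
RATIO_TOL = 0.005  # same value as in the tool: half a percentage point

def ratio_matches(claimed, recomputed):
    """Mirror of the tool's ratio rule: absolute difference within RATIO_TOL."""
    return abs(claimed - recomputed) <= RATIO_TOL

exact = 7 / 12  # 0.58333..., the hypothetical recomputed win-rate

print(ratio_matches(0.583, exact))  # True: off by about 0.0003
print(ratio_matches(0.59, exact))   # False: off by about 0.0067
```

The tolerance forgives display rounding and nothing more. A claim that drifts by a full percentage point is flagged DIVERGENT.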
A scorecard that reconciles
The honest fixture has a journal of 14 events, 10 of them closes, and a scorecard that genuinely describes them. Run it:
$ python3 scorecard_reconcile.py fixtures/clean_claimed.json fixtures/clean_evidence.json
SCORECARD RECONCILE - trader-honest
primary journal: 14 events (10 closing) - recomputed independently

metric              claimed from-journal verdict
-------------- ------------ ------------ -------
trades                   10           10 MATCH
active_days               3            3 MATCH
wins                      6            6 MATCH
losses                    4            4 MATCH
win_rate              0.600        0.600 MATCH
realized_pnl         125.50       125.50 MATCH

flagged: 0 (0 divergent, 0 unsupported)
DECISION GATE: scorecard reconciles with evidence -> trust permitted (exit 0)
REPORT-SHA256: 15a33ee3894142fdcec6c06589e0d9aff09d1331ca7a91b88229d2d3b6a10aba
All 6 metrics MATCH, exit 0. This is what passing looks like, and you want it to be boring. The claimed 0.600 win-rate is the same 0.600 the tool derived from the fills, and the claimed $125.50 is the sum it computed over the 10 closes. Nothing to argue with. The gate opens.
12 trades claimed, zero in the journal
Now the divergent fixture. The scorecard claims 4 active days, 12 trades, 7 wins, 5 losses, a 58.3% win-rate, and $842.00 realized. The journal underneath it holds three open events on a single day, all rejected, and not one close.
$ python3 scorecard_reconcile.py fixtures/divergent_claimed.json fixtures/divergent_evidence.json
SCORECARD RECONCILE - trader-x (self-reported dashboard)
primary journal: 3 events (0 closing) - recomputed independently

metric              claimed from-journal verdict
-------------- ------------ ------------ -------
trades                   12            0 DIVERGENT
active_days               4            1 DIVERGENT
wins                      7            0 DIVERGENT
losses                    5            0 DIVERGENT
win_rate              0.583    UNDEFINED UNSUPPORTED (0 closing events -> ratio is undefined / unfalsifiable)
realized_pnl         842.00         0.00 DIVERGENT

flagged: 6 (5 divergent, 1 unsupported)
DECISION GATE: scorecard does NOT reconcile -> do NOT trust report; block continue/add-capital (exit 1)
REPORT-SHA256: 7d1a10fd12a95a13329ffd7c34d6d869411a51d2e4196c0be6c20dec4d06f781
Six metrics flagged, exit 1. The dashboard told a four-day story. The events show one day of failed attempts and zero realized anything. If you had read the scorecard and added capital, you would have funded an $842 PnL that does not exist in the record the agent itself kept.
Two failure shapes, and why one is worse
Most checks have a single failure mode: pass or fail. This one separates two, because they call for different reactions.
DIVERGENT is the loud one. Claimed 12, journal says 0. Claimed $842, journal says $0.00. The numbers disagree and the disagreement is concrete. You can chase it: a logging bug, a double-count, a wrong window, a fabrication. There is a thread to pull.
UNSUPPORTED is the quiet one, and it is worse. The claimed 58.3% win-rate cannot be derived from the journal at all, because there are zero decided trades. A win-rate is wins over decided trades, and over zero trades that ratio is undefined, not 58.3% and not 0%. The agent printed a precise-looking number for a quantity the evidence cannot define. You cannot disprove 58.3% by pointing at the fills, because the fills do not speak to it. A number you cannot falsify is not a weak metric, it is a non-metric wearing a metric’s clothes, and it is exactly the kind of figure that survives a review because it looks specific. The tool refuses to call it DIVERGENT (that would imply the right answer was some other number) and labels it UNSUPPORTED instead.
That distinction is the whole reason the tool prints two counters instead of one.
The exit code is the gate, not the chart
The point of returning 0, 1, or 2 is that a pipeline can read it without reading prose. Wire it before the decision that spends money or scope:
if python3 scorecard_reconcile.py claimed.json evidence.json; then
  echo "scorecard reconciles - proceed with the add-capital review"
else
  echo "scorecard does not reconcile - hold; do not scale on this report"
fi
Exit 0 lets the next step run. Exit 1 holds it. Exit 2 means the input was malformed (events that are not a list, a missing scorecard, a non-numeric metric), and a malformed reconciliation should never read as “passed.” Try it: feed it a bad file and it exits 2 with bad input: evidence.events must be a list; run it with no arguments and it prints usage and exits 2. A gate that cannot tell “all clear” from “I could not check” is not a gate.
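When the caller is Python rather than a shell, the same three-way contract can be kept explicit instead of collapsing into pass/fail. This is a sketch, not part of the tool: the gate function and its action labels are names I made up, and it assumes the script and both JSON files sit in the working directory.

```python
import subprocess
import sys

# Illustrative labels for the three documented exit codes.
ACTIONS = {0: "proceed", 1: "hold", 2: "recheck-input"}

def gate(cmd):
    """Run a reconcile command and map its exit code to an action label."""
    code = subprocess.run(cmd, capture_output=True, text=True).returncode
    # Anything outside the documented 0/1/2 contract (a crash, a signal)
    # is treated as "could not check", never as a pass.
    return ACTIONS.get(code, "recheck-input")

if __name__ == "__main__":
    # Assumes scorecard_reconcile.py, claimed.json and evidence.json exist here.
    print(gate([sys.executable, "scorecard_reconcile.py",
                "claimed.json", "evidence.json"]))
```

Only an exit code of exactly 0 maps to "proceed". Unknown codes fall through to the same label as malformed input, so a crash cannot be read as a pass.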
Deterministic, so it can live in CI
The report ends with a SHA-256 of its own body, the REPORT-SHA256 line. Separately, hashing the full STDOUT (the body plus that final line) gives 64ac7d45afb235bae3fcfac28e98fd3ae8b6a4c5f43e6696ebd2b723684159b6 for the clean fixture on both runs and 64246aad569193ce61d004e66d79cb256f6a8d8e7ebd8feda9dfb8c41bf8d52e for the divergent fixture on both runs. No timestamps in the output, no dependence on map ordering, no floating-point surprise past the cent. That matters because a reconciliation you cannot reproduce is just a second opinion. This one you can pin in a test and diff.
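One way to pin that in a test is to hash the raw STDOUT of repeated runs and require a single distinct digest. The helper names below are mine, not part of the tool, and the command under the main guard assumes the script and fixtures live at the paths used earlier in this post.

```python
import hashlib
import subprocess
import sys

def stdout_sha256(cmd):
    """Run cmd and return the SHA-256 hex digest of its raw STDOUT bytes."""
    out = subprocess.run(cmd, capture_output=True).stdout
    return hashlib.sha256(out).hexdigest()

def is_deterministic(cmd, runs=2):
    """True when every run of cmd produced byte-identical STDOUT."""
    return len({stdout_sha256(cmd) for _ in range(runs)}) == 1

if __name__ == "__main__":
    # Assumed paths; adjust to wherever the script and fixtures actually live.
    cmd = [sys.executable, "scorecard_reconcile.py",
           "fixtures/clean_claimed.json", "fixtures/clean_evidence.json"]
    print(is_deterministic(cmd))
```

Two identical empty outputs also count as deterministic, so a real CI check should compare the digest against a pinned value and assert on the exit code as well.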
What this is NOT
I would rather you know the edges than discover them on your own logs.
- It is not an audit of the trading logic. It checks that the scorecard agrees with the journal. It says nothing about whether the strategy is good, whether the fills were priced fairly, or whether the agent should have opened those positions at all. A perfectly reconciled scorecard can describe a terrible strategy.
- It trusts the journal. The whole method assumes evidence.json is the primary record, exchange fills or a settlement log, not a second file the same agent wrote. If the agent forges the journal too, this catches nothing. So the real question is upstream: is your evidence a source the agent cannot rewrite? Pull fills from the exchange API or the chain, not from the agent’s own summary.
- It is not on-chain attestation or settlement verification. It does no signature checks and reads no blockchain. For “did this fill really happen and settle,” you want the exchange’s records or a node, then feed those in as the journal. Pair it with the Grok tx canary, which gates a single transaction before broadcast, while this reconciles the aggregate after the fact.
- A flag is a signal, not a verdict on intent. DIVERGENT can be a logging bug as easily as a lie. The tool tells you the report and the record disagree. Why they disagree is your investigation.
- The metric definitions are mine, and arguable. I count a trade as a close, not an open. If your book counts entries, or partial fills, or funding events, change the definitions in recompute to match your venue. The contract is “recompute from primary events with explicit rules,” not these six exact rules.
Where this sits next to the other gates
This is one more pre-decision check in a series, and it is worth saying how it differs from its closest neighbors so you use the right one.
It is not your-agent-returns-200-and-lies. That tool verifies a single call: a clean 200 whose effect was wrong. This one verifies an aggregate over many events, the scorecard built from all of them. Different object, different scale.
It is not the green-checkmark auditor either. That one asks whether a passing test actually exercises the code or just mirrors it. Here the question is whether a claimed KPI is backed by the primary events. Both share a suspicion of green that was never earned, applied to different artifacts.
And it sits a step downstream of the waste-probe for tokens burned after a failure: that one measures cost the agent already spent, this one questions the success the agent claims for what it spent. Cost and truth are separate audits, and an agent can fail both at once.
The question I am still chewing on
The method is only as good as the journal you feed it, and that is the part I have not solved cleanly. For a centralized exchange you can pull fills from the venue’s API, a record the agent cannot rewrite. For an agent that logs its own activity to a file it also controls, the journal and the scorecard come from the same hand, and reconciling one against the other proves consistency, not truth.
So here is the real open question for anyone running a trading or ops agent: what is the primary journal for your bot, the exchange’s fills and the chain’s settlements, or the activity log the agent itself prints? If it is the latter, what stops the agent from making both agree? I have a few ideas (signed fills, a journal written by a separate process the agent cannot reach) and no clean rule. Drop how you source your evidence in the comments; I read every one.
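To make the "signed fills" idea concrete, here is one possible shape for it, offered as a sketch rather than an answer. A separate writer process holds a secret key and publishes an HMAC over the events it recorded; the reconcile step refuses any journal whose tag does not verify. Everything here is an assumption of mine: the function names, the canonical JSON encoding, and above all that the agent never sees the key.

```python
import hashlib
import hmac
import json

def sign_events(events, key):
    """HMAC-SHA256 over a canonical JSON encoding of the events list."""
    payload = json.dumps(events, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hmac.new(key, payload, hashlib.sha256).hexdigest()

def journal_is_intact(events, signature, key):
    """Constant-time check that the events still match the writer's tag."""
    return hmac.compare_digest(sign_events(events, key), signature)

# Hypothetical key and journal, for illustration only.
writer_key = b"held-by-the-journal-writer-not-the-agent"
events = [{"type": "close", "ts": "2026-03-01T10:00:00Z", "pnl": 1.5}]
tag = sign_events(events, writer_key)

tampered = [{"type": "close", "ts": "2026-03-01T10:00:00Z", "pnl": 150.0}]
print(journal_is_intact(events, tag, writer_key))    # True
print(journal_is_intact(tampered, tag, writer_key))  # False
```

Because HMAC is symmetric, the verifier needs the key too, so this only helps if both the writer and the verifier sit outside the agent's reach. It moves the trust question to key custody without removing it.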
Follow for the next runnable check in this series on controlling agents before you trust them.
Written by Alexey Spinov. AI-assisted, human-verified: the tool, all five fixtures, and every number above come from a real local run on 2026-06-29 (Python 3.13.5, stdlib only, offline). I ran it, checked the exit codes (0 / 1 / 2), hashed the STDOUT twice to confirm determinism, and edited every line. The SEC figures ($12.3M, ~150 investors, ~$380K/3% actually traded) are the SEC’s, from litigation release LR-26558 and CoinDesk’s 30 May 2026 report, not my measurements. I label which numbers are theirs and which are mine.