Your LLM Router Logged the Wallet Key. It Already Left.


AI-agent secrets are in transit when a request hits a third-party LLM router or MCP proxy, and that router’s audit log is not a control: by the time it logs the request, the credential has already crossed your perimeter in plaintext. The fix is to redact at egress, before the bytes leave.

In short:

  • A secret going to your own model provider on an Authorization header is the expected path. The same secret going to a router, gateway, or MCP proxy is a leak, because that host reads your plaintext.
  • boundary_leak_probe.py reads one JSON egress map and classifies every secret-bearing field by destination trust. On the leaky fixture: 3 requests, 2 of them to third-party intermediaries, 5 fields crossing the boundary, 6 rule-hits, 2 critical wallet secrets. Exit 1.
  • The 2 critical ones are an Ethereum private key and a BIP-39 mnemonic, sitting in MCP tool-call arguments headed to a proxy. Signer material should never transit any middleman.
  • Stdlib only (sys, json, re). No network, no model, no exec. The run is byte-for-byte deterministic.
  • A hit is a SIGNAL, not a confirmed live secret. The code and both fixtures are in this post.

The incident that made this worth measuring

In April 2026 a group of researchers, including Chaofan Shou, published “Your Agent Is Mine: Measuring Malicious Intermediary Attacks on the LLM Supply Chain” (arXiv 2604.08407, posted 9 April 2026). They pointed real agent traffic at 428 commodity LLM routers. Nine of them injected code into the responses. Seventeen reached for the researchers’ own AWS credentials. Their framing of the mechanism is the part that stuck with me: these routers “operate as application-layer proxies with full plaintext access to every in-flight JSON payload,” and no provider enforces cryptographic integrity between the client and the upstream model.

CoinDesk covered it the same week and carried a blunter line from Shou: 26 routers were “secretly injecting malicious tool calls and stealing creds,” and one of them “drained our client’s $500k wallet” (CoinDesk, 13 April 2026). That $500k and those router counts are their numbers, from their measurement, not mine. I am citing them for context. Everything I claim about my own tool comes from a run I will paste in full.

I read that paper on a Tuesday and went looking for the part of my own stack that assumed the router was trusted. I found it fast. We route through a gateway for failover and cost tracking. The gateway has a dashboard. The dashboard has a request log. And I had quietly been treating that log as a safety net: if something leaks, I will see it there.

That assumption is backwards, and saying it out loud is the whole point of this post.

The claim, sharp enough to argue with

Here is the falsifiable version: a router’s audit log is a receipt, not a brake. By the time a credential shows up in the router’s log, the router process has already read it in plaintext. Logging happens on the far side of the boundary. The secret is gone. You cannot un-send it by reviewing a log entry, the same way you cannot un-mail a letter by reading the carbon copy.

If that claim were false, redaction would not matter and a probe like mine would be pointless. You could just watch the log and rotate after the fact. But “rotate after the fact” assumes the window between send and detection is harmless, and for a wallet private key that window is exactly long enough to sign one transaction. The signer secret is not like an API key you rotate on Monday. Once a third party has it, the funds are a sign_tx call away.

So the control has to move upstream of the send. Classify the destination first. Redact anything that should not cross. Then emit the bytes. The log, if you keep one, becomes a record of what you allowed out, not a tripwire you read after the damage.
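The ordering fits in a few lines. This is a minimal sketch, not the probe's code. `gate_then_send`, `redact_text`, the single AWS-key pattern, and the `send` callable (a stand-in for your HTTP client) are all hypothetical. To keep it short, the body is a flat string here; a real egress map nests it.

```python
import re

# Hypothetical single rule; the real probe carries a full SECRET_RULES table.
AWS_KEY = re.compile(r"\bAKIA[0-9A-Z]{16}\b")

def redact_text(text):
    """Mask every AWS-key-shaped substring in a string."""
    return AWS_KEY.sub("<REDACTED:aws_access_key>", text)

def gate_then_send(req, first_party_hosts, send):
    """Classify the destination, redact the body if it is third-party, then emit.

    Nothing reaches `send` before the redaction decision is made. Any log
    written at or after `send` therefore only ever sees the masked bytes.
    """
    if req["to"] in first_party_hosts:
        return send(req)                      # expected credential path
    masked = dict(req)                        # shallow copy; original untouched
    masked["body"] = redact_text(req.get("body", ""))
    return send(masked)

sent = []
gate_then_send({"to": "router.example", "body": "key=AKIAIOSFODNN7EXAMPLE"},
               first_party_hosts={"api.openai.com"}, send=sent.append)
print(sent[0]["body"])   # key=<REDACTED:aws_access_key>
```

The key value is the AWS documentation example `AKIAIOSFODNN7EXAMPLE`, not a real credential.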

What the probe actually does

The input is one JSON file I call an egress map: the outbound requests your agent emits, with their destination host, kind, headers, and body. You can dump this from a request interceptor, a test harness, or by hand. The probe never makes a request. It reads the map statically.
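For orientation, here is roughly what an egress map could look like. The top-level `first_party_hosts` and `requests` keys and the per-request `id`, `kind`, `to`, `headers`, and `body` fields are the ones `classify` reads further down. Everything else is a hypothetical placeholder: the hosts, the nesting under `body`, and the values. The author's actual fixtures may be shaped differently.

```python
import json

# A hypothetical egress map. Only the key names classify() reads come from
# the post; hosts and values are illustrative placeholders, not real secrets.
egress_map = {
    "first_party_hosts": ["api.openai.com"],
    "requests": [
        {
            "id": "r1", "kind": "llm", "to": "api.openai.com",
            "headers": {"Authorization": "Bearer ${OPENAI_KEY}"},
            "body": {"messages": [{"role": "user", "content": "hello"}]},
        },
        {
            "id": "r2", "kind": "llm-gateway", "to": "router.example.net",
            "headers": {"Authorization": "$VAULT_REF:router"},
            "body": {"messages": [{"role": "system",
                                   "content": "aws AKIAIOSFODNN7EXAMPLE"}]},
        },
    ],
}

# sort_keys keeps the dump byte-stable if you write it to a fixture file.
text = json.dumps(egress_map, indent=2, sort_keys=True)
print(text)
```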

Two ideas do the work.

First, destination trust. You declare first_party_hosts: the hosts you contract with directly, your own backend or the model provider itself. An Authorization header to one of those is the expected credential path, so the probe does not scream about it. Every other host, the routers and gateways and MCP proxies in the middle, is third-party by default. A secret sitting in the body or tool-call arguments headed there has crossed a boundary you do not own.

Second, signer material is always-leak. An Ethereum private key or a BIP-39 mnemonic must never transit any intermediary, first-party or not. There is no legitimate path where your agent mails a seed phrase through a proxy. If the probe sees one, it is CRITICAL regardless of destination.
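The two ideas reduce to a small decision table. The predicate below restates the leak rule from `classify` (shown further down) on its own, so all eight combinations can be printed. `is_leak` is a hypothetical name, not part of the probe.

```python
def is_leak(trust, is_auth, is_crit):
    """classify()'s leak rule as a standalone predicate: signer material
    always leaks, anything else leaks only to a third-party host."""
    leak = (trust == "third_party") or is_crit
    if trust == "first_party" and is_auth and not is_crit:
        leak = False          # expected first-party Authorization header
    return leak

for trust in ("first_party", "third_party"):
    for is_auth in (True, False):
        for is_crit in (False, True):
            print(f"{trust:<12} auth={is_auth!s:<5} critical={is_crit!s:<5} "
                  f"-> leak={is_leak(trust, is_auth, is_crit)}")
```

The table shows that trust is host-level. For a first-party host, only the critical rows come out as leaks, whether or not the field is the Authorization header.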

Here are the rules and the value scanner:

import sys, json, re

# Secret shapes. critical = signer material that must NEVER transit at all.
SECRET_RULES = [
    ("eth_private_key", re.compile(r"\b0x[0-9a-fA-F]{64}\b"),                  True),
    ("bip39_mnemonic",  re.compile(r"\b(?:[a-z]{3,8}\s+){11,23}[a-z]{3,8}\b"), True),
    ("aws_access_key",  re.compile(r"\bAKIA[0-9A-Z]{16}\b"),                   False),
    ("openai_key",      re.compile(r"\bsk-[A-Za-z0-9]{20,}\b"),                False),
    ("bearer_token",    re.compile(r"Bearer\s+[A-Za-z0-9._\-]{16,}"),          False),
    ("github_pat",      re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),                False),
]
# A value already neutralised before send: env ref / vault handle / masked.
SAFE_REF = re.compile(r"^(\$\{?[A-Z0-9_]+\}?|\$VAULT_REF:[\w:\-]+|sk-\*{3,}|\*{4,}|<REDACTED[:>])")

def scan_value(val):
    if SAFE_REF.match(val.strip()):
        return []                       # already redacted / handle-referenced
    return [(kind, crit) for kind, rx, crit in SECRET_RULES if rx.search(val)]

The SAFE_REF rule matters more than it looks. A value like ${OPENAI_KEY} or $VAULT_REF:openai is a handle, not a secret: the real value gets substituted at the trusted edge, not carried in your agent’s payload. If you already pass handle references to your router and let your own egress proxy swap them in, you are most of the way to safe. The probe rewards that by staying quiet.
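To see SAFE_REF at work, here is the scanner from above run on a few made-up values. The rules are copied verbatim from the block above, trimmed to three for brevity. The last sample shows a limitation that follows from the regex as written: SAFE_REF is anchored only at the start of the value, so a handle followed by a raw secret in the same string is skipped as a whole.

```python
import re

# Three of the post's SECRET_RULES, copied verbatim: (name, pattern, critical).
SECRET_RULES = [
    ("eth_private_key", re.compile(r"\b0x[0-9a-fA-F]{64}\b"),   True),
    ("aws_access_key",  re.compile(r"\bAKIA[0-9A-Z]{16}\b"),    False),
    ("openai_key",      re.compile(r"\bsk-[A-Za-z0-9]{20,}\b"), False),
]
SAFE_REF = re.compile(r"^(\$\{?[A-Z0-9_]+\}?|\$VAULT_REF:[\w:\-]+|sk-\*{3,}|\*{4,}|<REDACTED[:>])")

def scan_value(val):
    if SAFE_REF.match(val.strip()):
        return []                       # handle or mask: not scanned further
    return [(kind, crit) for kind, rx, crit in SECRET_RULES if rx.search(val)]

samples = [
    "${OPENAI_KEY}",                        # env handle: skipped
    "$VAULT_REF:openai",                    # vault handle: skipped
    "<REDACTED:aws_access_key>",            # already masked: skipped
    "aws AKIAIOSFODNN7EXAMPLE",             # raw key shape: flagged
    "${OPENAI_KEY} AKIAIOSFODNN7EXAMPLE",   # prefix-only anchor: skipped too
]
for s in samples:
    print(f"{s!r:42} -> {scan_value(s)}")
```

If that gap matters for your traffic, one option is to require the handle to be the whole value, for example with `SAFE_REF.fullmatch` and patterns rewritten to cover the full token. That change would need re-running against the fixtures.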

The classifier walks every string leaf in each request, scans it, and applies the leak rule:

def classify(spec):
    first_party = set(spec.get("first_party_hosts", []))
    rows = []
    for req in spec.get("requests", []):
        trust = "first_party" if req.get("to", "") in first_party else "third_party"
        payload = {k: v for k, v in req.items() if k not in ("to", "kind", "id")}
        for jp, val in walk(payload):
            hits = scan_value(val)
            if not hits:
                continue
            is_auth = jp.lower().startswith("headers.authorization")
            is_crit = any(c for _, c in hits)
            leak = (trust == "third_party") or is_crit
            # an expected first-party Authorization is NOT a leak (unless critical)
            if trust == "first_party" and is_auth and not is_crit:
                leak = False
            rows.append({"id": req.get("id", "?"), "kind": req.get("kind", "?"),
                         "trust": trust, "path": jp, "kinds": [k for k, _ in hits],
                         "critical": is_crit, "leak": leak})
    return len(spec.get("requests", [])), rows

The walk helper is the obvious recursive descent over dicts and lists, yielding a JSON path and a string for every leaf. The full file, including the --redact mode I show below, is about 95 lines. I am skipping the boilerplate here, not hiding it.
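Since `walk` is only described, here is one way it could look. This is a sketch under an assumption: the post does not spell out the path syntax, so I inferred it from the printed paths (`headers.Authorization`, `body.tool_calls[0].arguments.private_key`), with dotted keys and `[i]` for list indices. Non-string leaves are skipped because the scanner only reads strings.

```python
def walk(node, prefix=""):
    """Yield (json_path, string) for every string leaf under node.

    Dict keys are joined with '.', list items get '[i]', matching the
    paths in the probe's output.
    """
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            yield from walk(value, path)
    elif isinstance(node, list):
        for i, item in enumerate(node):
            yield from walk(item, f"{prefix}[{i}]")
    elif isinstance(node, str):
        yield prefix, node
    # ints, floats, bools and None carry no secret text; skip them.

payload = {"headers": {"Authorization": "Bearer x"},
           "body": {"tool_calls": [{"arguments": {"mnemonic": "a b c"}}]}}
print(list(walk(payload)))
# [('headers.Authorization', 'Bearer x'), ('body.tool_calls[0].arguments.mnemonic', 'a b c')]
```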

The run, pasted whole

Two fixtures. The clean one still talks to two third-party intermediaries, but it routes real secrets only to first-party hosts and sends those intermediaries handle references instead, so nothing crosses. The leaky one is shaped like a real agent that got lazy: an API gateway in the middle carrying a GitHub PAT on its Authorization header, an AWS key buried in a system message, an OpenAI key in metadata, and an MCP proxy receiving a wallet private key plus a mnemonic in tool-call arguments.

$ python3 boundary_leak_probe.py fixtures/egress_clean.json
requests=3  third_party_intermediaries=2
secret_bearing_fields_crossing_boundary=0  rule_hits=0  critical_signer_material=0
redaction_gate_would_block=0  router_audit_log_sees_them_only_AFTER_egress=0
exit=0

$ python3 boundary_leak_probe.py fixtures/egress_leaky.json
requests=3  third_party_intermediaries=2
secret_bearing_fields_crossing_boundary=5  rule_hits=6  critical_signer_material=2
redaction_gate_would_block=5  router_audit_log_sees_them_only_AFTER_egress=5
  leak      third_party  llm-gateway   req=r2   aws_access_key         body.messages[0].content
  leak      third_party  llm-gateway   req=r2   openai_key             body.metadata.upstream_key
  leak      third_party  llm-gateway   req=r2   bearer_token+github_pat headers.Authorization
  CRITICAL  third_party  mcp-proxy     req=r3   eth_private_key        body.tool_calls[0].arguments.private_key
  CRITICAL  third_party  mcp-proxy     req=r3   bip39_mnemonic         body.tool_calls[1].arguments.mnemonic
exit=1

Now the honesty about the numbers, because this is where a lazy headline would lie. The probe found 5 distinct fields crossing to third-party intermediaries, but 6 rule-hits. Why the mismatch? One field, the gateway’s Authorization header, tripped two rules at once: it matches the generic bearer-token shape, and the token inside it is also a GitHub PAT. That is one leak, two signals. It is not six different secrets, and I am not going to call it six. The number that matters most is the small one: 2 critical fields, the wallet private key and the mnemonic, both headed to an MCP proxy that has no business seeing either.
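The gap between fields and rule-hits falls straight out of the row shape `classify` returns. The probe's own summary code is not shown, so this is one reading consistent with the printed numbers. `summarize` is hypothetical, and it counts hits only on leaking rows. The five rows are hand-copied from the leaky table, trimmed to the keys used here.

```python
def summarize(rows):
    """Derive the three headline counts from classify()-style rows."""
    leaking = [r for r in rows if r["leak"]]
    return {
        "secret_bearing_fields_crossing_boundary": len(leaking),      # one per field
        "rule_hits": sum(len(r["kinds"]) for r in leaking),           # one per rule match
        "critical_signer_material": sum(1 for r in leaking if r["critical"]),
    }

# The five leaking rows from the leaky run, trimmed to what summarize() needs.
rows = [
    {"path": "body.messages[0].content", "kinds": ["aws_access_key"],
     "critical": False, "leak": True},
    {"path": "body.metadata.upstream_key", "kinds": ["openai_key"],
     "critical": False, "leak": True},
    {"path": "headers.Authorization", "kinds": ["bearer_token", "github_pat"],
     "critical": False, "leak": True},
    {"path": "body.tool_calls[0].arguments.private_key", "kinds": ["eth_private_key"],
     "critical": True, "leak": True},
    {"path": "body.tool_calls[1].arguments.mnemonic", "kinds": ["bip39_mnemonic"],
     "critical": True, "leak": True},
]
print(summarize(rows))
# {'secret_bearing_fields_crossing_boundary': 5, 'rule_hits': 6, 'critical_signer_material': 2}
```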

One more detail that is easy to miss. Request r1 in the leaky fixture sends a real-looking Bearer sk-... to api.openai.com, which is a first-party host. The probe does not flag it. That is the point of destination trust: an auth header to your provider is the credential doing its job. The same shape to a router is the credential getting stolen. A flat secret scanner cannot tell those two apart. This one is built to. One honest caveat: that trust is host-level, not header-level. The probe trusts the destination, so a non-critical secret that lands anywhere in a first-party request, body included, also gets a pass; only signer material overrides the trust and leaks regardless. So your first_party_hosts list is the whole ballgame. Keep it tight, because the tool trusts those hosts with whatever you send them.

Bad input is a third exit code, so a CI step can branch on it:

$ python3 boundary_leak_probe.py            # no argument
usage: boundary_leak_probe.py [--redact] <egress_map.json>
exit=2
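Because the three outcomes have distinct codes, a CI step can branch on them. Here is a minimal Python sketch, assuming the script sits in the working directory as `boundary_leak_probe.py`. `gate`, `interpret`, and `OUTCOMES` are hypothetical names.

```python
import subprocess
import sys

# Exit codes as shown in the post: 0 clean, 1 leak found, 2 bad input.
OUTCOMES = {0: "clean", 1: "leak", 2: "usage"}

def interpret(code):
    """Map the probe's exit code to an outcome; anything else is unexpected."""
    return OUTCOMES.get(code, "unexpected")

def gate(egress_map_path, probe="boundary_leak_probe.py"):
    """Run the probe, echo its report, and return the outcome.

    subprocess.run does not raise on a non-zero exit here, so exit 1 and
    exit 2 come back as ordinary outcomes the caller can branch on.
    """
    proc = subprocess.run([sys.executable, probe, egress_map_path],
                          capture_output=True, text=True)
    print(proc.stdout, end="")
    return interpret(proc.returncode)
```

In a pipeline you would call `gate(path)` and fail the step unless it returns `"clean"`. Keeping `"usage"` separate means a malformed map does not get reported as a leak.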

And the run is deterministic. I hashed the leaky STDOUT twice:

$ python3 boundary_leak_probe.py fixtures/egress_leaky.json | shasum -a 256
28c5eb9ff8e7ad0abc6b1ad67a617cdd5fdaa09bfce26d3f9f00022217e0a6c5  -
$ python3 boundary_leak_probe.py fixtures/egress_leaky.json | shasum -a 256
28c5eb9ff8e7ad0abc6b1ad67a617cdd5fdaa09bfce26d3f9f00022217e0a6c5  -

Same bytes both times. That matters for a gate: a check that flickers is a check people disable.
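The same check can live inside a test instead of a shell pipeline. This sketch hashes the output of a zero-argument callable several times and compares the digests. `stable_digest` and `report` are hypothetical helpers. The sketch only compares runs inside one process, so it cannot catch variation across interpreter runs. A common Python source of that kind of variation is iterating a set of strings, whose order can change between processes because string hashing is randomized. Sorting before printing removes it, and the shell version above is the check that actually spans processes.

```python
import hashlib

def stable_digest(produce, runs=2):
    """Call produce() `runs` times and return the SHA-256 hex digest of its
    output, raising AssertionError if any two runs differ."""
    digests = {hashlib.sha256(produce().encode("utf-8")).hexdigest()
               for _ in range(runs)}
    if len(digests) != 1:
        raise AssertionError(f"output is not deterministic: {sorted(digests)}")
    return digests.pop()

HOSTS = {"mcp-proxy.partner.io", "router.3rdparty.ai", "api.openai.com"}

def report():
    # sorted() pins the order; raw set iteration may differ between processes.
    return "\n".join(sorted(HOSTS)) + "\n"

print(stable_digest(report, runs=3))
```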

Redact at the boundary, then prove the gate closed

Reporting a leak is the easy half. The thesis was that the control belongs at egress, so the probe has a --redact mode that prints the masked map a boundary gate would actually emit. It leaves the first-party Authorization alone and masks everything that would cross:
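Here is a minimal sketch of what such a masking pass could look like. It is not the probe's --redact code, which is not shown. It applies the same leak rule as `classify` leaf by leaf and rebuilds the payload instead of editing it in place. To stay short, it carries only two rules and skips the SAFE_REF check. The `<REDACTED:kind>` token format comes from the output that follows. `hits`, `mask`, and `redact_request` are hypothetical names, and the sample key is a made-up 64-hex-digit string.

```python
import re

# Two of the post's rules, for brevity: (name, pattern, critical).
RULES = [
    ("eth_private_key", re.compile(r"\b0x[0-9a-fA-F]{64}\b"), True),
    ("aws_access_key",  re.compile(r"\bAKIA[0-9A-Z]{16}\b"),  False),
]

def hits(val):
    """(rule name, critical) for every rule whose pattern appears in val."""
    return [(name, crit) for name, rx, crit in RULES if rx.search(val)]

def mask(node, trust):
    """Return a copy of node with every leaking string leaf replaced."""
    if isinstance(node, dict):
        return {k: mask(v, trust) for k, v in node.items()}
    if isinstance(node, list):
        return [mask(v, trust) for v in node]
    if isinstance(node, str):
        found = hits(node)
        # Same rule as classify(): signer material always leaks,
        # anything else only when the destination is third-party.
        if found and (trust == "third_party" or any(c for _, c in found)):
            return f"<REDACTED:{found[0][0]}>"   # first matching rule names the mask
    return node                                  # non-leaking or non-string leaf

def redact_request(req, first_party_hosts):
    """Masked copy of one egress-map request; routing fields are kept as-is."""
    trust = "first_party" if req.get("to", "") in first_party_hosts else "third_party"
    return {k: v if k in ("to", "kind", "id") else mask(v, trust)
            for k, v in req.items()}

req = {"id": "r3", "kind": "mcp-proxy", "to": "mcp-proxy.example",
       "body": {"tool_calls": [{"arguments": {"private_key": "0x" + "11" * 32,
                                              "note": "ok"}}]}}
print(redact_request(req, {"api.openai.com"}))
```

Replacing the whole string, rather than only the matched span, matches the output that follows, where an entire message content becomes a single mask token.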

$ python3 boundary_leak_probe.py --redact fixtures/egress_leaky.json
...
    "to": "router.3rdparty.ai",
    "headers": { "Authorization": "<REDACTED:bearer_token>" },
    "body": {
      "messages": [ { "role": "system", "content": "<REDACTED:aws_access_key>" } ],
      "metadata": { "upstream_key": "<REDACTED:openai_key>" }
    }
...
    "to": "mcp-proxy.partner.io",
    "arguments": { "private_key": "<REDACTED:eth_private_key>", ... }
    "arguments": { "mnemonic": "<REDACTED:bip39_mnemonic>" }

Then the part I like. Feed that masked map back into the probe:

$ python3 boundary_leak_probe.py --redact fixtures/egress_leaky.json | python3 boundary_leak_probe.py /dev/stdin
requests=3  third_party_intermediaries=2
secret_bearing_fields_crossing_boundary=0  rule_hits=0  critical_signer_material=0
redaction_gate_would_block=0  router_audit_log_sees_them_only_AFTER_egress=0
exit=0

Exit 0, and notice it still lists two third-party intermediaries: the hops are still there, but zero secrets now cross to them. The masked tokens match SAFE_REF, so the second pass sees nothing to flag. And because --redact masks every field the audit scans, not just headers and body, the round trip holds for more than this one fixture’s exact shape. That round trip is the difference between watching a log and holding a brake. The log tells you a secret left. The redact pass means it never did. It is still a static regex heuristic, though, not a proof your bytes are clean.
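The round-trip claim can also be checked as a property of the two regexes alone, independent of any fixture. Every mask token must match SAFE_REF, and no mask token may itself look like a secret. This checks the patterns, not the probe's --redact code. It assumes every mask takes the `<REDACTED:kind>` form seen in the output above. The rules and SAFE_REF are copied verbatim from the post.

```python
import re

# Copied from the post: the six SECRET_RULES and SAFE_REF.
SECRET_RULES = [
    ("eth_private_key", re.compile(r"\b0x[0-9a-fA-F]{64}\b"),                  True),
    ("bip39_mnemonic",  re.compile(r"\b(?:[a-z]{3,8}\s+){11,23}[a-z]{3,8}\b"), True),
    ("aws_access_key",  re.compile(r"\bAKIA[0-9A-Z]{16}\b"),                   False),
    ("openai_key",      re.compile(r"\bsk-[A-Za-z0-9]{20,}\b"),                False),
    ("bearer_token",    re.compile(r"Bearer\s+[A-Za-z0-9._\-]{16,}"),          False),
    ("github_pat",      re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),                False),
]
SAFE_REF = re.compile(r"^(\$\{?[A-Z0-9_]+\}?|\$VAULT_REF:[\w:\-]+|sk-\*{3,}|\*{4,}|<REDACTED[:>])")

for kind, _, _ in SECRET_RULES:
    token = f"<REDACTED:{kind}>"
    # 1) the second pass short-circuits on the mask token ...
    assert SAFE_REF.match(token), token
    # 2) ... and even without SAFE_REF, no rule would fire on it.
    assert not any(rx.search(token) for _, rx, _ in SECRET_RULES), token
print("every mask token is SAFE_REF-clean and rule-clean")
```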

Where this sits, and what I have already written about

This is the fifth tool in a series, and I keep the axes deliberately separate so they stack instead of overlapping. Earlier ones looked at a secret that ships in a build artifact (what npm pack actually publishes), the blast radius of a key if it leaks (how much breaks, by scope), the identity and version of an MCP manifest, and contamination in an eval harness. None of those asked the question this one asks: of the requests my agent is about to send, which destinations are trusted, and which secret-bearing fields are about to cross to a host I do not control? The object here is the outbound trace and its destination, not a file on disk, not a manifest, not a scope score. New axis, new tool.

What this is NOT

I would rather you trust the limits than oversell the wins.

It is not a live secret scanner. Every hit is a SIGNAL, a regex match on a shape. The 0x... could be a transaction hash someone pasted, not a private key. Confirm anything that matters against your own vault. The probe will not tell you whether a key is real or revoked.

It is not a runtime interceptor. It reads a static egress map. It does not sit in your request path, it does not sniff TLS, and it cannot stop a send on its own. To make it a real gate, you wire its exit code into the place that emits the bytes, or you run the --redact transform there. The probe is the policy; the plumbing is yours.

It is not a replacement for mTLS or a gateway’s own controls. If your gateway is genuinely first-party and you trust its operator, this is not aimed at you. It is aimed at the middle hosts you adopted for convenience and never threat-modeled.

And the matching is heuristic. The loudest false positive is the mnemonic rule: it matches any run of twelve or more consecutive lowercase words of three to eight letters each, with no wordlist or checksum check. A long enough stretch of ordinary prose in a prompt can therefore trip a CRITICAL bip39_mnemonic hit and force exit 1, even on a first-party request, because signer material overrides the trust model. A hex blob that is not a key trips eth_private_key the same way. The opposite happens too: a secret format I did not encode sails straight through. The first_party_hosts list is exact-string, so a typo in a hostname silently downgrades a host to third-party, which fails safe but will annoy you. A flag is a reason to look, not a verdict.
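Both false positives are easy to reproduce with the post's own patterns. The sentence and the hash-shaped string below are made up, and neither is a secret.

```python
import re

BIP39 = re.compile(r"\b(?:[a-z]{3,8}\s+){11,23}[a-z]{3,8}\b")  # from SECRET_RULES
ETH_PK = re.compile(r"\b0x[0-9a-fA-F]{64}\b")                   # from SECRET_RULES

prose = "please review these notes about the deploy plan before friday and then ship"
tx_hash = "tx " + "0x" + "ab" * 32 + " confirmed"   # 32-byte hash shape, not a key

print(bool(BIP39.search(prose)))     # True: 13 lowercase words of 3-8 letters
print(bool(ETH_PK.search(tx_hash)))  # True: any 0x + 64 hex digits matches

# Short words, capitals, or words longer than 8 letters break the run, so
# this sentence never reaches twelve qualifying words in a row.
print(bool(BIP39.search("I think it is a good idea to ship the fix today")))  # False
```

A wordlist membership check would cut the prose hits sharply. The probe does not do that, which is why the mnemonic hit stays a signal.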

AI disclosure: I wrote boundary_leak_probe.py with AI assistance and ran it myself, offline, before publishing. Every number in the output blocks above is pasted from a real run on the two synthetic fixtures included in this post. No real keys exist in them: the 0x4c08... private key is a well-known public test key from web3 tutorials and the legal winner thank... phrase is BIP-39 test vector #2 from the spec itself, both burned and never tied to real funds; every other value is a placeholder. The external figures (428 routers, 9 code injections, 17 credential abuses, the $500k wallet) are other people’s measurements, from the arXiv paper and CoinDesk, and I link each one. I label which numbers are mine and which are theirs.

The open question I have not answered for myself: handle references like $VAULT_REF:openai only stay safe if the substitution happens at a trusted edge you control, after the probe runs. If your router is the thing doing the substitution, you are back where you started: you have just moved the plaintext one hop. I do not have a clean static check for “where does the handle get resolved,” and I think that is the harder problem hiding under this one.
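This does not answer the question, but it shows what "resolve at an edge you control" could mean in code. The handle is swapped for the real value only inside your own egress step, and only when the destination is first-party. Any other host gets the request unchanged, handle and all. `VAULT`, `HANDLE`, and `resolve_handles` are hypothetical; a real deployment would call its secret manager here, and the sketch only resolves top-level header values.

```python
import re

HANDLE = re.compile(r"\$VAULT_REF:([\w:\-]+)")
VAULT = {"openai": "sk-placeholder-not-a-real-key"}   # hypothetical secret store

def resolve_handles(req, first_party_hosts, vault=VAULT):
    """Swap $VAULT_REF handles in header values for real values, but only
    when the destination is first-party. A handle missing from the vault
    raises KeyError instead of sending the literal handle."""
    if req.get("to") not in first_party_hosts:
        return req                      # third-party hosts only ever see the handle
    headers = {k: HANDLE.sub(lambda m: vault[m.group(1)], v)
               for k, v in req.get("headers", {}).items()}
    return {**req, "headers": headers}

req = {"to": "api.openai.com", "headers": {"Authorization": "Bearer $VAULT_REF:openai"}}
print(resolve_handles(req, {"api.openai.com"})["headers"]["Authorization"])
# Bearer sk-placeholder-not-a-real-key
```

The static probe can confirm that what leaves for a third-party host carries only the handle. It still cannot see whether some later hop resolves it, which is the gap the paragraph above describes.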

If you run agents through a router or an MCP proxy, dump one real egress map and run this against it before you read the next router-breach headline. Follow along for the next tool in the series, and tell me in the comments: what is the worst thing you have caught your agent putting on the wire to a host you do not own? I read every reply.