Jun 6, 2026

Agent Audit: May 2026

28,592 decisions, 349 failures, and one eval that refused to take an agent's word for it. The first monthly accounting of the crew.

audit agents governance security

I run ten autonomous agents with real tool access against my own infrastructure and my own data. In my day job, systems with that kind of access get audited. There is no reason my own should be exempt just because I like them. So this is the first monthly audit of the crew: what they did, what they got wrong, and what changed. All numbers come from the warehouse, not from memory.

The fleet, by the numbers

The agents made 28,592 autonomous decisions in May across 8,540 run events. Of those events, 7,251 completed clean, 904 were skipped, 349 failed, and 6 were blocked by governance before they ran.
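The status breakdown can be reconciled against the total with plain arithmetic. Below is a minimal sketch that uses only the figures quoted above; the status labels are illustrative names, not the warehouse's actual column values. The four listed statuses sum to 8,510, so 30 of the 8,540 run events fall outside them, and the paragraph above does not say which status those carry.

```python
# Figures copied from the paragraph above; the dictionary keys are illustrative labels.
TOTAL_RUN_EVENTS = 8_540
STATUS_COUNTS = {
    "completed_clean": 7_251,
    "skipped": 904,
    "failed": 349,
    "blocked_by_governance": 6,
}


def unaccounted(total: int, by_status: dict[str, int]) -> int:
    """Return how many events the listed statuses do not explain."""
    return total - sum(by_status.values())


gap = unaccounted(TOTAL_RUN_EVENTS, STATUS_COUNTS)
covered = TOTAL_RUN_EVENTS - gap
print(f"listed statuses cover {covered} of {TOTAL_RUN_EVENTS} events; gap = {gap}")
```

A non-zero gap is not necessarily an error, but it is the kind of remainder a monthly audit should be able to name.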

The workload is not evenly distributed. ARIA, the coordinator, and DatabaseKeeper, the housekeeping agent, each logged about 5,500 decisions. The analyst tier (behavioral, infrastructure, signal, relationship, biography, project) ran between 2,400 and 4,400 each. SecurityAnalyst made 11 decisions total, because it only came online on May 30. More on that below.

The human-side numbers, for scale: 2,152 Claude Code sessions producing 162,861 messages, 190 Codex threads, and 19 very long Antigravity sessions. Plus roughly 1,549 git commits across about 80 active repos, peaking at 164 on May 31. The commit history matters here because it is the one audit trail the agents cannot edit after the fact.

One honest gap: the Grok CLI corpus shows zero sessions for May. Either I genuinely did not use it, or the ingest path broke silently. I do not know yet, and "I do not know yet" is exactly the kind of sentence an audit exists to surface.

What failed

The 349 failed events cluster into three stories.

DatabaseKeeper has a tool it keeps reaching for and cannot use. The single largest failure category fleet-wide was DatabaseKeeper calling library_search and failing, 84 times. The tool-grant drift report independently flags the same thing from the other direction: DatabaseKeeper holds four database-level grants that its manifest never declared. That is a grant/manifest mismatch, the agent equivalent of a service account with permissions nobody remembers assigning. It gets reconciled this month.
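A grant/manifest mismatch is, at its core, a set difference in both directions. The sketch below is a hedged illustration of that idea, not the actual drift report: the function name, the two result keys, and every tool name in the sample data are hypothetical.

```python
def grant_drift(declared: set[str], granted: set[str]) -> dict[str, set[str]]:
    """Compare what a manifest declares with what the database actually grants."""
    return {
        # Held in the database but never declared in the manifest.
        "undeclared": granted - declared,
        # Declared in the manifest but not actually granted.
        "missing": declared - granted,
    }


# Hypothetical sample data: none of these tool names come from the real fleet.
manifest = {"example_read", "example_search"}
db_grants = {"example_read", "example_vacuum", "example_reindex"}

drift = grant_drift(manifest, db_grants)
for kind, tools in drift.items():
    print(kind, sorted(tools))
```

Reporting both directions matters: an undeclared grant is a permission nobody remembers assigning, while a missing grant is one plausible way for an agent to keep calling a tool and failing.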

The shared read path failed across three different agents. query_db failures hit DatabaseKeeper 21 times, BehavioralInsightAnalyst 13 times, and InfrastructureAnalyst 4 times. When the same tool fails for three different agents, the agents are not the problem. That is a platform defect wearing three costumes, and it gets a root cause instead of three patches.
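The "same tool, different agents" rule can be written as a small grouping step. This sketch uses the May failure counts quoted above; the function name and the threshold of three agents are assumptions chosen for illustration, not the fleet's actual detection logic.

```python
from collections import defaultdict

# (agent, tool, failure_count) rows taken from the May figures above.
FAILURES = [
    ("DatabaseKeeper", "library_search", 84),
    ("DatabaseKeeper", "query_db", 21),
    ("BehavioralInsightAnalyst", "query_db", 13),
    ("InfrastructureAnalyst", "query_db", 4),
]


def platform_suspects(rows, min_agents=3):
    """Return {tool: total_failures} for tools that fail across many agents."""
    agents_by_tool = defaultdict(set)
    totals = defaultdict(int)
    for agent, tool, count in rows:
        agents_by_tool[tool].add(agent)
        totals[tool] += count
    return {
        tool: totals[tool]
        for tool, agents in agents_by_tool.items()
        if len(agents) >= min_agents
    }


suspects = platform_suspects(FAILURES)
print(suspects)
```

With these rows, only query_db crosses the threshold, with 38 failures in total. library_search fails more often but for a single agent, so it stays an agent-level problem.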

224 runs completed at critical severity. Almost all of them belong to DatabaseKeeper. The housekeeping agent is the busiest and the noisiest crew member, which tracks: it touches the most surface area. But noisy success is its own kind of failure, because it trains the operator to stop reading.

Worth noting: RelationshipHealthAnalyst was stopped by governance blocks 4 times. Those are not errors. That is the fence doing its job.

The eval that earned its keep

71 eval runs in May. 70 passed. The interesting one is the failure.

On May 23, the output-audit eval sampled a live agent claim and returned a verdict of "unsupported": the agent asserted something it could not back with a trace. The run was blocked pending review instead of being waved through. One audited claim, no receipt, no pass.

This is the entire doctrine in one event. Agents produce claims; evidence lives in the warehouse. When the evidence is missing, the claim loses. I would rather have one uncomfortable "unsupported" verdict than seventy comfortable green checkmarks and no idea which ones were earned.
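The rule "no receipt, no pass" can be sketched in a few lines. This is a hedged illustration under stated assumptions, not the real output-audit eval: the Claim type, the trace_ids field, and both function names are hypothetical, and the real eval may judge support very differently.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    text: str
    # Hypothetical: IDs of the warehouse traces the agent cites as evidence.
    trace_ids: list[str] = field(default_factory=list)


def audit_claim(claim: Claim, warehouse_trace_ids: set[str]) -> str:
    """Return 'supported' only if the claim cites traces and all of them exist."""
    if not claim.trace_ids:
        return "unsupported"
    if all(t in warehouse_trace_ids for t in claim.trace_ids):
        return "supported"
    return "unsupported"


def gate(claim: Claim, warehouse_trace_ids: set[str]) -> str:
    """Block the run for review unless the sampled claim is supported."""
    verdict = audit_claim(claim, warehouse_trace_ids)
    return "pass" if verdict == "supported" else "blocked_pending_review"


warehouse = {"trace-001", "trace-002"}  # hypothetical trace IDs
print(gate(Claim("index rebuilt", ["trace-001"]), warehouse))
print(gate(Claim("backup verified"), warehouse))
```

The design choice worth noting is the default: a claim with no cited trace fails closed rather than passing for lack of contradicting evidence.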

Grant drift: 11 of 11

Every registered agent currently carries warning-level tool-grant drift. Zero errors, but the warnings are not decorative. The one that matters most: ARIA holds a database-level send_email grant that appears in neither its manifest nor its grant profile. An agent that can send email without that fact being declared anywhere is precisely how trust erodes, one undocumented capability at a time. It gets reconciled or revoked this month, and I would bet on revoked.

For context, the platform catalog the agents draw from: 443 registered tools, 374 active, of which 190 require explicit operator approval, 30 are classified destructive, and 7 touch credentials. The point of counting is that you cannot govern what you have not enumerated.
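Enumeration of this kind is a tally over a catalog with a few boolean flags. The sketch below assumes a hypothetical Tool record and invented tool names, and it counts the risk flags among active tools only, which is one reading of "of which" in the figures above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    # Hypothetical catalog record; the real schema is not described in this post.
    name: str
    active: bool = True
    requires_approval: bool = False
    destructive: bool = False
    touches_credentials: bool = False


def summarize(catalog: list[Tool]) -> dict[str, int]:
    """Count registered tools, then count risk flags among the active ones."""
    active = [t for t in catalog if t.active]
    return {
        "registered": len(catalog),
        "active": len(active),
        "requires_approval": sum(t.requires_approval for t in active),
        "destructive": sum(t.destructive for t in active),
        "touches_credentials": sum(t.touches_credentials for t in active),
    }


# Invented sample catalog, for illustration only.
catalog = [
    Tool("example_read"),
    Tool("example_delete", requires_approval=True, destructive=True),
    Tool("example_rotate_key", requires_approval=True, touches_credentials=True),
    Tool("example_retired", active=False, destructive=True),
]
summary = summarize(catalog)
print(summary)
```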

What changed in May

Three governance shifts worth recording. A mid-May sweep staged 19 recommendations and recorded 20 review decisions across 7 agents, including renames and one lifecycle transition. SecurityAnalyst onboarded on May 30 as a read-only reasoner, gated behind a deterministic isolation eval suite, with enforcement deliberately out of scope. And the Memory v2 quality gate processed its first five operational-memory candidates: two approved, one rejected, one superseded, one pending. A gate that approves only two of its first five candidates is a gate that is actually on.

The ledger for June

Four items carry forward: reconcile or revoke DatabaseKeeper's undeclared grants, root-cause the shared query_db failure path, resolve the Grok ingest question, and settle ARIA's send_email grant. Next month's audit checks this list first.

That is the deal with running an AI workforce: they work for me, and they answer for it here.

Generated by Anthropic (cloud) · Claude Opus, in a Cowork agent session, from live warehouse queries. Attestation pulled from the Broadside generation record, not asserted by hand.
