Okay so here’s the thing nobody talks about when they’re building AI agent systems—the system is always lying to you about how healthy it is.

Not intentionally. Not maliciously. But if you build a system and then ask that same system if it’s working correctly, you’re going to get a yes. Always. Because the system can only evaluate itself using its own broken assumptions. It’s like asking someone who’s having a psychotic episode if they’re thinking clearly. The answer is always “of course I am.”

I hit this wall about a week ago. I’ve got a multi-agent system running—monitoring DeFi protocols, scanning for money laundering patterns, tracking portfolio movements, scouting content, deploying my website, and coordinating the whole operation. Everything was reporting green. Every health check passed. The coordinator said all agents were fresh and active.

And I had this moment where I thought—how do I actually know that’s true?

The Oracle Problem

There’s a concept I call the Oracle Problem (borrowed from blockchain, but it applies everywhere): if the system generates data AND validates that same data, the validation is worthless. You’re clapping for yourself. The audience is you. The judge is you. The score is whatever you want it to be.

Think about it in terms of traditional software monitoring. You set up Prometheus to track your services. The dashboards say everything is fine. But what if the network path between Prometheus and your service is what's broken? The scrape just stops, the series goes quiet, and unless you've explicitly alerted on the absence of data, silence reads as health. Prometheus doesn't know what it can't see. It can't detect its own blindness.

Every monitoring system has this flaw. And with AI agents, it’s worse—because agents can generate plausible-sounding status reports even when their underlying data is stale, their reasoning is circular, or their outputs haven’t been independently verified.
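
To make that concrete, here's the shape of the check I mean. This is a minimal sketch, not production code: the heartbeat directory, the file format, and the 15-minute threshold are all invented for illustration. The point is that the verdict comes from raw timestamps on disk, not from whatever the coordinator chooses to say about itself.

    # Sketch only. Assumption: each agent writes a heartbeat JSON file with a
    # "last_run_ts" Unix timestamp. Paths and thresholds are made up.
    import json
    import time
    from pathlib import Path

    HEARTBEAT_DIR = Path("/var/agents/heartbeats")  # hypothetical location
    MAX_AGE_SECONDS = 15 * 60                       # arbitrary freshness budget

    def independent_health_report() -> dict:
        """Judge freshness from timestamps on disk, not from the
        coordinator's own summary of itself."""
        report = {}
        now = time.time()
        for hb_file in HEARTBEAT_DIR.glob("*.json"):
            data = json.loads(hb_file.read_text())
            age = now - data["last_run_ts"]
            report[hb_file.stem] = "fresh" if age < MAX_AGE_SECONDS else f"stale ({int(age)}s)"
        return report
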

So I did something that felt a little unhinged at the time: I built an agent whose entire job is to try to kill the other agents.

Meet the Kill Agent

The kill agent runs on a schedule. It has rotating attack modules that cycle through the week—input validation testing, authentication probing, memory integrity checks, health staleness injection, process auditing, data credibility testing—the full menu of ways a system can be broken.
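
The scheduling part is almost boring. Something like this, as a sketch: the module names echo the list above, and the day-index rotation is one way to cycle through them, not necessarily the exact mechanism my kill agent uses.

    # Rotation sketch. The daily entry point (cron or similar) is assumed.
    import datetime

    ATTACK_MODULES = [
        "input_validation",   # malformed payloads against the API
        "auth_probing",       # has authentication been quietly weakened?
        "memory_integrity",   # do context files still match reality?
        "health_staleness",   # inject stale timestamps, see who notices
        "process_audit",      # are the processes that claim to run actually running?
        "data_credibility",   # can the upstream data be trusted at all?
    ]

    def todays_attack() -> str:
        """Rotate through the attack modules by calendar day, so every
        assumption gets probed on a regular cadence."""
        day_index = datetime.date.today().toordinal()
        return ATTACK_MODULES[day_index % len(ATTACK_MODULES)]
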

Each attack is designed to test one specific assumption the system makes about itself. One module sends deliberately malformed payloads to see if the API validates input correctly. Another checks if authentication has been quietly weakened. Another injects wrong data into context files and checks if the next session catches the discrepancy.
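
Here's the shape of one of those modules, the input validation one. The endpoint, the payloads, and the "anything under 400 is a finding" rule are illustrative assumptions, not the exact probes my kill agent sends.

    # Sketch of one attack module. Endpoint URL is hypothetical.
    import requests

    MALFORMED_PAYLOADS = [
        '{"amount": "not-a-number"}',      # wrong type
        '{"amount": 1e309}',               # absurd magnitude
        '{"amount": 100',                  # truncated JSON
        '{"amount": 100, "amount": 200}',  # duplicate keys
    ]

    def attack_input_validation(endpoint: str = "http://localhost:8080/ingest") -> list[str]:
        """Send deliberately broken payloads. Anything that comes back 2xx or 3xx
        is a finding: the API accepted data it should have refused."""
        findings = []
        for payload in MALFORMED_PAYLOADS:
            resp = requests.post(endpoint, data=payload,
                                 headers={"Content-Type": "application/json"},
                                 timeout=10)
            if resp.status_code < 400:
                findings.append(f"accepted malformed payload: {payload!r} -> {resp.status_code}")
        return findings
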

The concept borrows from every tradition that has asked the same question: what survives when everything else is stripped away? Not stability, but adaptability. Not uptime, but recovery speed.

[Interactive figure: the three-agent triad. One designs, one builds, one proves (or disproves).]