A few weeks ago I published a piece about building a memory system for AI agents. Five layers. Markdown files. An immune system instead of a database. The thesis was that everyone was solving retrieval when the actual problem was continuity.
I believed that. I still believe most of it. But I also ran the system for 30 days across a multi-agent setup doing real work—monitoring blockchains, generating reports, paper trading, running stress tests every day—and the thing that happened is what always happens when theory meets production.
Some of it was right. Some of it was wrong. And some of it evolved into things I didn’t design or predict.
This is the case study. Not the theory—the reality.
The Setup
Here’s what was actually running. A multi-agent system on a Dell laptop (Ubuntu, no GPU, 8GB RAM). One agent monitors blockchain transactions and generates AML risk reports. Another runs daily stress tests—literally tries to break the system from the inside. A third tracks market sentiment and executes paper trades. A fourth handles memory sweeps. A fifth verifies that every other agent is telling the truth against external data sources. There’s a coordinator that orchestrates all of them, a Telegram bot that serves as the human interface, and a dashboard.
All of them write to a shared SQLite database. All of them are scheduled on cron. All of them run 24/7 without human intervention.
And all of them depend on the memory architecture I described in the original article—the five-layer stack that’s supposed to survive context death, prevent repeated mistakes, and maintain continuity across sessions.
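To make the plumbing concrete, here's a minimal sketch of the shared-state pattern: one cron-scheduled agent appending a report row to the shared SQLite database. The schema, paths, and agent names are my assumptions for illustration; the article doesn't specify them.

```python
# minimal_agent.py: sketch of one cron-scheduled agent appending a report
# to the shared SQLite database. Schema, paths, and agent names are
# assumptions, not the actual setup.
#
# Illustrative crontab entries (one line per agent):
#   */5 * * * *  python3 /home/agent/minimal_agent.py   # AML monitor
#   0 3 * * *    python3 /home/agent/stress_test.py     # daily stress test
import sqlite3
from datetime import datetime, timezone

DB_PATH = "/home/agent/shared.db"  # assumed location of the shared database

def write_report(agent: str, body: str) -> None:
    """Append one report row; every agent writes through the same table."""
    with sqlite3.connect(DB_PATH) as conn:  # commits on success
        conn.execute(
            "CREATE TABLE IF NOT EXISTS reports ("
            " id INTEGER PRIMARY KEY AUTOINCREMENT,"
            " agent TEXT NOT NULL,"
            " created_at TEXT NOT NULL,"
            " body TEXT NOT NULL)"
        )
        conn.execute(
            "INSERT INTO reports (agent, created_at, body) VALUES (?, ?, ?)",
            (agent, datetime.now(timezone.utc).isoformat(), body),
        )

if __name__ == "__main__":
    write_report("aml-monitor", "example AML risk report")
```

No daemon to keep alive: cron fires each script on schedule, the script appends its row, and SQLite serializes the writes.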
Thirty days. 272 agent reports. 190 tracked wallets. 329 messages through the chat interface. 74 events through the internal communication bus. 8 paper trades. 19 immune rules. Here’s what held up and what didn’t.
What Worked Exactly as Designed
The immune system. Layer 3 in the original theory. The one where every mistake gets logged, a prevention rule gets extracted, and that rule gets hardcoded into every future session.
This is the only component that worked exactly as designed. Didn’t evolve. Didn’t break. Didn’t need modification. Thirty days in, 19 rules, and every single one of them prevented a recurrence of the mistake that created it.
The reason it worked is the same reason vaccines work: it's pre-processed for action. The immune rules don't require interpretation. They don't need context. They're binary: “did you label which machine this command runs on?” Yes or no. “Did you read the file before writing to it?” Yes or no. There's nothing to interpret, so there's nothing to go stale.
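A sketch of how binary that is, assuming a hypothetical storage format (the article doesn't show one): the rules can be injected verbatim into every new session as a yes/no checklist.

```python
# immune_rules.py: hypothetical storage and injection of prevention rules.
# The rules below paraphrase the two examples from the text; the format is
# an assumption. The key property: each rule is a yes/no check that never
# references current system state.
RULES = [
    "Did you label which machine this command runs on?",
    "Did you read the file before writing to it?",
]

def session_preamble() -> str:
    """Render every rule verbatim into the prompt of a new session."""
    lines = ["Before acting, answer yes or no to each rule:"]
    lines += [f"{i}. {rule}" for i, rule in enumerate(RULES, start=1)]
    return "\n".join(lines)

if __name__ == "__main__":
    print(session_preamble())
```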
Compare that to boot memory (Layer 1), which says things like “the system is at Level 3 with 13 agents connected.” That statement has a shelf life. It was accurate when written. A week later the system had evolved past it. But the immune rule “label every command with the target machine” is timeless. It’ll be valid in 6 months.
That’s the insight I’d add to the original article: the immune system works because its rules are context-free. They don’t reference the current state of anything. They reference patterns of failure. Patterns of failure don’t change as fast as system state.
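Side by side, with invented contents, the difference in shelf life looks like this:

```python
# Invented examples to show the difference in shelf life.

# Boot memory describes current state, so it carries an implicit expiry:
boot_entry = {
    "text": "The system is at Level 3 with 13 agents connected.",
    # accurate when written; stale a week later once agents were added
}

# An immune rule describes a failure pattern, so nothing in it can expire:
immune_rule = {
    "text": "Label every command with the target machine.",
    # references no counts, levels, or timestamps; nothing to go stale
}
```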
What Broke
The boot file. Layer 1. The thing I called “the most important, least respected layer.” I even predicted this exact failure in the original article—I wrote “boot memory needs forced refresh” and “the most-read file is the least-written file.”
And then I let it happen anyway.
Seven days. That’s how long the boot file went without an update while the system was actively evolving underneath it. Agents were being deployed, APIs were being reconfigured, databases were growing—and the file that’s supposed to orient every new session was still describing the system as it was a week ago.
The original article identified this as a design failure: “the most critical layer shouldn’t rely on the least reliable component (human discipline).” And I was right. Human discipline failed. I got busy. The file went stale. Every new session started with slightly wrong context—not wrong enough to crash anything, but wrong enough to waste the first few minutes re-orienting.
The fix I proposed in the original article was auto-update. The fix I actually implemented was… nothing. I just let it stay stale. The irony of identifying the exact failure mode and then walking straight into it is not lost on me.
What should happen: an agent that runs every 15 minutes (which I already have—it does memory sweeps) should also refresh the boot file. Pull agent counts from the database. Pull latest report timestamps. Write the current state. No human discipline required. The agent already exists. It just wasn’t pointed at the boot file.
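A minimal sketch of that refresh, reusing the hypothetical schema from the setup sketch above. The real sweep agent's internals aren't shown here, so treat every name, path, and format as an assumption.

```python
# refresh_boot.py: sketch of an automated boot-file refresh, meant to run
# in the existing 15-minute memory-sweep slot. Reuses the hypothetical
# schema from the setup sketch; paths and boot-file format are assumptions.
#
#   cron: */15 * * * * python3 /home/agent/refresh_boot.py
import sqlite3
from datetime import datetime, timezone

DB_PATH = "/home/agent/shared.db"
BOOT_PATH = "/home/agent/memory/boot.md"

def refresh_boot() -> None:
    """Pull live counts from the database and rewrite the boot file."""
    with sqlite3.connect(DB_PATH) as conn:
        agents = conn.execute(
            "SELECT COUNT(DISTINCT agent) FROM reports").fetchone()[0]
        latest = conn.execute(
            "SELECT MAX(created_at) FROM reports").fetchone()[0]
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    with open(BOOT_PATH, "w") as f:
        f.write(f"# Boot Memory (auto-refreshed {stamp})\n\n")
        f.write(f"- Agents reporting: {agents}\n")
        f.write(f"- Latest report: {latest}\n")

if __name__ == "__main__":
    refresh_boot()
```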
[Infographic: How the original 5 layers evolved into 8 after 30 days in production]