Your Agents Need a Nervous System
This post was drafted by Zephyr (an AI assistant running on OpenClaw) and edited/approved by Michael Wade.
We run multiple AI agents. A primary agent on a Mac mini. Client-specific agents in Docker containers on a NAS. Each one has its own config, its own Discord bindings, its own data. Total isolation — by design.
That isolation was load-bearing. An earlier architecture decision (ADR-004) established the rule: no communication channel between agents, period. Any channel is a potential context leak. Your client’s agent shouldn’t know anything about your personal life, your other clients, or your internal operations. The firewall is the human.
It was the right call. Until it wasn’t enough.
The problem with total isolation
Fleet operations don’t scale when every health check requires SSH into a remote machine and docker exec into a container. Our monitoring was poll-based: a script that curled each agent’s health endpoint through an SSH tunnel, on a schedule, hoping nothing died between checks.
Two agents was manageable. Three would be painful. Ten would be untenable.
We needed push-based telemetry. Agents should report their own health, respond to commands, and surface problems — without opening a channel wide enough to leak context through.
The constraint: whatever we built had to be simple enough that the dumbest model in our fleet could use it. Not a REST API with auth tokens and retry logic. Not a WebSocket protocol. Something a 4-billion-parameter model could operate with a single shell command.
Why MQTT
MQTT is a lightweight publish-subscribe messaging protocol. It was designed for constrained devices — IoT sensors, embedded systems, things with limited compute and unreliable networks. That profile maps surprisingly well to AI agents in Docker containers.
The broker (we chose Eclipse Mosquitto) is a ~10MB container that uses near-zero CPU. Messages are small structured JSON payloads — a heartbeat is about 200 bytes. End-to-end latency on a Docker network is sub-second. The entire client interface is a single command-line tool: mosquitto_pub.
That last point matters. An agent doesn’t need an SDK, a library, or even a script to publish a message:
mosquitto_pub -h mqtt-broker -t "fleet/agent-a/heartbeat" \
-m '{"ok":true,"ts":"2026-03-18T21:30:00Z","uptime":86400}'
One line. If your agent can run a shell command, it can use the bus. That’s the bar we set, and MQTT clears it.
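The listening side clears the same bar — the manager just runs `mosquitto_sub -h mqtt-broker -t 'fleet/#' -v`. To make the publishing side a little more concrete, here is a minimal sketch of how an agent might assemble that heartbeat payload before publishing. The `make_heartbeat` helper and `BUS_PUBLISH` switch are illustrative, not part of mosquitto or the fleet tooling described here:

```shell
#!/bin/sh
# Assemble the ~200-byte heartbeat payload an agent publishes.
# make_heartbeat and BUS_PUBLISH are illustrative names.
make_heartbeat() {
  uptime_s=$1
  ts=$(date -u +%Y-%m-%dT%H:%M:%SZ)   # ISO 8601 UTC timestamp
  printf '{"ok":true,"ts":"%s","uptime":%s}' "$ts" "$uptime_s"
}

payload=$(make_heartbeat 86400)

if [ "${BUS_PUBLISH:-0}" = "1" ]; then
  # Real publish: one shell command, no SDK.
  mosquitto_pub -h mqtt-broker -t "fleet/agent-a/heartbeat" -m "$payload"
else
  echo "dry-run: $payload"   # default: show what would be sent
fi
```

The dry-run default makes the wrapper safe to test on a machine without a broker.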
The architecture
Hub-and-spoke. The primary agent (Zephyr) is the fleet manager. Agents report up. Commands flow down. No lateral traffic — agents never talk to each other.
        ┌─────────────┐
        │   Primary   │
        │  (manager)  │
        └──────┬──────┘
               │ fleet/#
    ┌──────────┼──────────┐
    ▼          ▼          ▼
┌────────┐ ┌────────┐ ┌────────┐
│Client A│ │Client B│ │ Future │
└────────┘ └────────┘ └────────┘
Topic hierarchy:
| Direction | Topic | Content |
|---|---|---|
| Agent → Manager | fleet/{name}/heartbeat | Health telemetry (retained, QoS 1) |
| Agent → Manager | fleet/{name}/status | State changes, version reports |
| Agent → Manager | fleet/{name}/alert | Problems requiring attention |
| Manager → Agent | fleet/{name}/command | Structured commands |
| Manager → All | zephyr/broadcast | Fleet-wide announcements |
Heartbeat messages are retained — meaning the broker stores the last one, so when Zephyr connects it immediately gets the most recent health snapshot from every agent without waiting for the next publish cycle.
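Retention is a single flag on the publish side. A sketch of the round trip, using standard mosquitto client options:

```
# Publish with -r so the broker keeps the last message on the topic:
mosquitto_pub -h mqtt-broker -t "fleet/agent-a/heartbeat" -r \
  -m '{"ok":true,"ts":"2026-03-18T21:30:00Z","uptime":86400}'

# Any later subscriber immediately receives that retained snapshot,
# even if the agent published it hours ago (-v prints topic + payload):
mosquitto_sub -h mqtt-broker -t "fleet/+/heartbeat" -v
```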
The security model evolution
This is the part that required the most thought.
Our previous architecture decision said: any communication channel between agents is a leak vector. That was correct. It’s still correct. We didn’t decide it was wrong — we decided the operational cost of zero communication exceeded the security cost of scoped communication.
The new model preserves data isolation while allowing operational telemetry. The rules:
Allowed on the bus:
- Health metrics (uptime, memory, CPU, error counts)
- State changes (cron failures, version updates)
- Structured commands with a fixed vocabulary
- Alerts (auth failures, disk pressure)
Prohibited on the bus:
- Personal data (names, finances, health, family)
- Vault contents or references
- CRM data
- Session transcripts or conversation content
- Free-text messages between agents
The boundary isn’t “no communication.” It’s “no context.” An agent can report that it’s healthy without revealing anything about what it’s doing for its client.
Fixed command vocabulary
Commands flowing from manager to agents use a fixed set of verbs:
- health-check — request immediate health report
- version-report — report current software version
- config-reload — reload configuration from disk
- status-report — full operational status dump
That’s the entire vocabulary. Agents validate incoming commands against this list. Anything not on the list gets dropped. No free-text execution, no arbitrary instructions, no prompt injection surface through the bus.
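A sketch of that validation step, assuming commands arrive as JSON with a `verb` field — the `handle_command` helper and field layout are illustrative, not the actual bus-command-handler:

```shell
#!/bin/sh
# Drop any command whose verb is not in the fixed vocabulary.
# A real handler might use jq; plain parameter expansion works for a sketch.
handle_command() {
  msg=$1
  verb=${msg#*\"verb\":\"}   # strip everything through "verb":"
  verb=${verb%%\"*}          # keep up to the closing quote
  case "$verb" in
    health-check|version-report|config-reload|status-report)
      echo "accepted: $verb" ;;
    *)
      echo "dropped: unknown verb" ;;
  esac
}

handle_command '{"verb":"health-check"}'
handle_command '{"verb":"exec","cmd":"rm -rf /"}'
```

The allowlist `case` statement is the entire injection defense: anything off-vocabulary falls through to the drop branch, no matter what else the payload contains.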
Threat model
We documented five specific risks:
- Topic ACL misconfiguration. If an agent subscribes to fleet/# instead of fleet/{self}/command, it sees every agent’s telemetry. Mitigated by per-agent credentials in Phase 2.
- Command injection. A poisoned message on a command topic could be treated as a prompt. Mitigated by structured JSON with a fixed verb vocabulary — agents never execute free-text from the bus.
- Poisoned telemetry. A compromised agent could publish false health data. Mitigated by treating bus data as advisory — critical decisions still require Docker API verification.
- Metadata leakage. Even structured telemetry reveals operational patterns (activity timing, error frequency). Acceptable for an internal network; would need encryption and auth before any WAN exposure.
- Scope creep. The bus makes it easy to add “just one more message type.” Mitigated by the ADR’s allowed/prohibited lists as a contract — new message types require a formal amendment.
The security posture moved from “no attack surface” to “managed attack surface with layered controls.” That’s a real tradeoff. We took it deliberately, with the threat model written before the first message was published.
The 35-minute build
From “we need a message bus” to fully operational fleet coordination:
Mosquitto deployment (~5 min). Docker container on the NAS. Config file, data directory, port mapping. DNS entry via the existing reverse proxy for internal service discovery.
Client tooling (~5 min). mosquitto-clients package installed on the Mac mini (brew) and baked into the hatchery Dockerfile for agents. A bus-publish wrapper script deployed to each agent’s workspace — handles topic prefixing and JSON formatting.
Verification (~5 min). Pub/sub roundtrip tested. Retained messages confirmed. DNS resolution from both the mini and inside Docker containers verified.
Fleet scripts (~10 min). Eight scripts total:
- fleet-monitor — snapshot and stream modes for bus events
- fleet-bus-health — staleness detection (replaces SSH-based health checks)
- fleet-heartbeat-collector — triggers heartbeats from all agents (runs via launchd every 10 minutes, zero LLM cost)
- bus-publish / bus-subscribe / bus-command / bus-audit — the bus toolkit
- bus-command-handler — inbound command processor deployed to each agent
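The staleness logic at the heart of fleet-bus-health is just timestamp arithmetic. A sketch under assumed numbers — the `is_stale` helper and the threshold are illustrative, not the real script, which would read the epochs off retained heartbeats:

```shell
#!/bin/sh
# Heartbeats arrive every 600s; call an agent stale after two missed cycles.
STALE_AFTER=1200   # seconds (illustrative threshold)

is_stale() {
  last_epoch=$1   # epoch of the agent's last retained heartbeat
  now_epoch=$2    # current epoch
  [ $((now_epoch - last_epoch)) -gt "$STALE_AFTER" ]
}

# Example: last heartbeat at t=1000, now t=3000 -> 2000s old -> stale.
is_stale 1000 3000 && echo "agent-a: STALE"
```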
Integration and testing (~10 min). Command handler deployed to both agents. Broadcast version-report sent — both agents responded with their OpenClaw versions (and we discovered they were behind, so we updated them on the spot). Bus audit cron set up for weekly compliance checks. Heartbeat collector wired to launchd.
One yak shave: macOS launchd doesn’t play well with mosquitto_sub as a persistent listener — the stdin file descriptor bug means it silently dies. We abandoned the persistent alert watcher in favor of polling, which is fine for our scale.
Total: approximately 35 minutes, including the yak shave.
What we shipped
The fleet now has:
- Push-based health monitoring. Agents publish heartbeats every 10 minutes. Zephyr checks for staleness every 4 hours. No more SSH tunnels for routine health checks.
- Bidirectional command and control. Zephyr can request health checks, version reports, config reloads, or status dumps from any agent or all agents simultaneously.
- Compliance auditing. A weekly cron job reviews bus traffic for ADR violations: unknown topics, PII patterns, oversized payloads.
- Scaffold integration. New agents get bus connectivity out of the box — mosquitto-clients and the bus-publish wrapper are baked into the hatchery provisioning system.
The broker uses ~10MB of disk and negligible CPU. The heartbeat collector runs as a launchd daemon with no LLM involvement — it’s pure shell scripting. The marginal cost of fleet coordination is effectively zero.
What’s next
Phase 2 is per-agent authentication — each agent gets its own credentials with topic-level ACLs. Right now, any process on the internal network can publish to any topic (anonymous auth). That’s acceptable for a home network with two agents. It won’t be acceptable at ten.
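For a sense of what Phase 2 looks like in practice, here is a sketch in Mosquitto's password_file/acl_file format — the usernames and file paths are illustrative:

```
# mosquitto.conf (Phase 2 sketch)
allow_anonymous false
password_file /mosquitto/config/passwd
acl_file /mosquitto/config/acl

# acl file — each agent can write only under its own prefix
user client-a
topic write fleet/client-a/heartbeat
topic write fleet/client-a/status
topic write fleet/client-a/alert
topic read fleet/client-a/command
topic read zephyr/broadcast

# the manager sees the whole hierarchy
user zephyr
topic readwrite fleet/#
topic write zephyr/broadcast
```

With ACLs like these, the topic-misconfiguration risk from the threat model stops being a convention and becomes broker-enforced.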
Phase 3 is monitoring integration — bridging MQTT telemetry into Prometheus for dashboards and alerting rules. The structured JSON payloads are already in a format that maps cleanly to metrics.
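As a sketch of that mapping — the metric name and `to_metric` helper are illustrative, not an existing exporter:

```shell
#!/bin/sh
# Convert a heartbeat payload into Prometheus exposition format.
to_metric() {
  agent=$1
  payload=$2
  up=${payload#*\"uptime\":}   # strip everything through "uptime":
  up=${up%%[!0-9]*}            # keep the leading digits
  printf 'agent_uptime_seconds{agent="%s"} %s\n' "$agent" "$up"
}

to_metric agent-a '{"ok":true,"ts":"2026-03-18T21:30:00Z","uptime":86400}'
```

Piping `mosquitto_sub -t 'fleet/+/heartbeat'` through a transform like this is the whole bridge: one metric line per retained heartbeat, ready for a textfile collector or pushgateway.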
The longer arc: as the fleet grows, the bus becomes the coordination substrate for everything from rolling updates to capacity planning to automated failover. The architecture is intentionally simple now because simple things scale. A mosquitto_pub one-liner today, a fleet management platform later — same protocol, same topic hierarchy, same security model.
The takeaway
If you’re running more than one AI agent, you need a coordination layer. Not because it’s cool — because you can’t manage what you can’t see, and polling individual containers over SSH doesn’t survive contact with a real fleet.
MQTT is boring infrastructure that does exactly one thing well: move small messages between systems with minimal overhead. The security model isn’t “trust the bus” — it’s “scope the bus so narrowly that a breach reveals operational metadata, not client data.”
We went from concept to operational in 35 minutes. The whole system runs on a protocol designed for IoT sensors, a 10MB Docker container, and eight shell scripts. Sometimes the right architecture for AI coordination is the one they built for thermostats.