The Kinshasa Manifesto
Event-based distributed architecture patterns antifragile to the agentic era
- architecture
- agents
Note: I am not a LinkedIn Guru™ nor do I wish to be. Some of these ideas I am confident in and have tested, others need the tires kicked. Like anything great that we build, I want this to be collaborative. Let’s human on this one. This is, and forever will remain, a rough draft.

Float like a butterfly, sting like a bee. Yes I am aware this is from a different fight with Sonny Liston but it’s a better picture and you get the point let’s move on.
October 30, 1974. Kinshasa, Zaire (now the Democratic Republic of the Congo). “Rumble in the Jungle” was an unwinnable fight, and Ali knew that. George Foreman was 7 years his junior and in the prime of his career; a fearsome puncher who overwhelmed his opponents with sheer power. So Ali did a Kobayashi Maru. Hacked the simulation. Foreman’s advantage would be made his greatest weakness. Spent months learning how to take a beating, up against the ropes. Rope-a-Dope. For seven rounds, Ali took that beating. Played possum. By the 8th round, Foreman was exhausted. As planned, Ali sprang from defense into offense, landed a choreographed sequence ending with a right hand to Foreman’s chin, dropping him. It was Foreman’s first professional defeat.
The following principles are not new. This is the purest form of distributed systems. What our craft has strived for and fallen short of. And now, not only can we, we must.
Welcome to Kinshasa.
We are prepared, and winning is the only option.
Goals
- Empower humans to do their best work
- Do not create more problems for humans
- Don’t just defend against agents — steer them
That’s the charter. The first two lines are the easy part: if a pattern doesn’t make a human’s best work more likely, or it quietly hands a human a new problem to babysit, it doesn’t belong here.
The third line is the one that turns defense into design. The best analogy I have is self-documenting code — code structured so the reader falls into the right understanding without a comment holding their hand. The names, the shapes, the boundaries are the documentation. A Kinshasa architecture aims to be self-governing in exactly that way: structured so the agent falls into the right move without a human standing over it. The guardrails aren’t bolted on top. They’re the shape of the thing.
A note on scope before we climb in: this is intentionally stack-agnostic, with two deliberate exceptions. The pattern leans hard on two concrete things, and naming them beats hand-waving.
The first is a durable, ordered, replayable event log — Kafka. Not the only option, just the most battle-proven distributed log we’ve got, and saying “a log” when I mean that would be coy. The second is the gateway, which I draw as GraphQL: the pattern needs a single typed front door where every mutation is already a declared command, and that’s exactly what a stitched GraphQL schema hands you — one place to mint the run ID and seat the circuit breaker, instead of twelve. Every other box — the services, the coordinator — is a role you can fill with whatever you like. But when I say “the log,” I mean Kafka, and when I say “the gateway,” I mean a GraphQL door, specifically.
And I’m deliberately not talking much about the agents themselves. This is about the deterministic part of the system — the part you actually control. The agents are the weather. This is about building a house that stands up in the weather.
Antifragility: architecture as governance
Here’s the move, and it’s the whole thing: you cannot out-power the agent.
The instinct, when you first hand real work to an autonomous agent, is to stand over it and throw counter-punches. Watch every move. Approve every step. Catch every mistake by hand. That’s a human being Foreman — trying to win on power. It doesn’t scale and it misses the point. The agent is faster than you, tireless, and capable of throwing a thousand writes while you’re still deciding how to block the first one.
So stop boxing. Lean on the ropes. Let the structure do the governing.
The principle underneath every section here is antifragility, and it’s worth being precise about the word, because it gets thrown around loosely. There’s a gradient:
- Fragile breaks under stress.
- Robust survives stress, unchanged.
- Antifragile improves under stress.
Most “resilient” architecture aims for robust: it shrugs off the hit and carries on exactly as it was. Kinshasa aims one notch past that. Ali didn’t just survive the eighth round. He used it. An antifragile system treats every agent screwup as fuel: the failure becomes a rule, a threshold, a test, a gate that wasn’t there before. The system gets harder to break the longer it runs.
That’s what I mean by architecture as governance. The structure isn’t a fence you build and forget. It’s the thing absorbing force, redirecting it, and tightening itself every time it gets hit. Control lives in the world the agent moves through, not in a human standing over it.
Attributes
Six principles make a system Kinshasa-shaped, and the first is the one the other five exist to produce:
- Antifragility — it gets stronger every time it’s hit.
- Observability — you can see what happened.
- Immutability — you can’t quietly erase what happened.
- Minimal blast radius — when it goes wrong, it goes wrong small.
- Recoverability — you can undo it, at scale.
- Punching — go on offense: steer the agent until the only move left is the right one.
Antifragility is the goal. Observability through recoverability are how you take the beating. Punching is how you win the fight.
Architecture overview
Before we go principle by principle, here’s the whole thing on one page.
The gateway is the membrane. Above it sits the only nondeterministic thing in the system — the agent. Everything behind the gateway is deterministic: replayable, ordered, idempotent. And the gateway doesn’t call services — it writes a command straight to the log. From there one deterministic core does the work, blind to who produced the command: the gateway for forward writes, the coordinator for compensations. Forward and reverse are the same machine, run by different hands.
It’s three moving parts and one boundary, and the boundary is the point. The gateway is the only door: every agent action comes through it, picks up a run ID, and passes the circuit breaker. Behind it, the system is unglamorous event-driven plumbing — the gateway writes a command to a Kafka topic, services consume commands and emit events back to Kafka, and a durable coordinator watches the event log so it can push compensations back through the same command log when a run goes bad.
The line across the middle is the whole game. The agent is the only nondeterministic actor in the picture. Everything behind the gateway is deterministic — ordered, replayable, idempotent — and that is not an accident of implementation. You make an agent’s blast radius governable by making everything it can touch behave like a machine instead of a coin flip. The rest of this post is six ways of saying that.
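To make "ordered, replayable, idempotent" concrete, here is a minimal sketch. It assumes an in-memory list standing in for the Kafka command topic, and the names `gateway_submit`, `AccountService`, and the `deposit` mutation are hypothetical, invented for illustration. The point it shows: the gateway only appends commands, and a consumer that remembers command IDs lands on the same state whether the log is delivered once or twice.

```python
import uuid

# Append-only list standing in for a Kafka command topic (assumption).
command_log = []


def gateway_submit(run_id, mutation, args):
    """The gateway never calls a service; it appends a command to the log."""
    command = {
        "command_id": str(uuid.uuid4()),
        "run_id": run_id,
        "mutation": mutation,
        "args": args,
    }
    command_log.append(command)
    return command["command_id"]


class AccountService:
    """Hypothetical consumer that applies each command at most once."""

    def __init__(self):
        self.balances = {}
        self.seen = set()  # command IDs already applied

    def consume(self, command):
        if command["command_id"] in self.seen:
            return  # duplicate delivery or replay: no effect
        if command["mutation"] == "deposit":
            account = command["args"]["account"]
            amount = command["args"]["amount"]
            self.balances[account] = self.balances.get(account, 0) + amount
        self.seen.add(command["command_id"])


run_id = str(uuid.uuid4())
gateway_submit(run_id, "deposit", {"account": "a1", "amount": 50})
gateway_submit(run_id, "deposit", {"account": "a1", "amount": 25})

# Deliver every command twice, as an at-least-once log might.
service = AccountService()
for cmd in command_log + command_log:
    service.consume(cmd)

# Rebuild a fresh consumer from the log alone.
fresh = AccountService()
for cmd in command_log:
    fresh.consume(cmd)
```

Both consumers end with the same balance, which is the property that lets the rest of the system treat the log as the source of truth.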
Observability
You can’t operate what you can’t see, and you definitely can’t undo what you can’t see.
The non-negotiable: every event tracks its actor. Human, service, or agent — every change carries the identity of who or what caused it. And events span across services, so a single causal chain that touches six systems reads as one story instead of six disconnected log lines.
This sounds like table-stakes telemetry, and the “structured logs, traces, metrics” part is. But there’s a sharper requirement hiding here that the rest of the architecture leans on entirely: you must be able to ask “show me everything this one actor did” and get a complete, queryable answer across the whole system. Hold onto that. It’s the linchpin that makes recovery possible at all.
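One way to picture the requirement is as an event envelope. This is a sketch under assumptions: the `Actor` and `Event` types and the `causation_id` field are hypothetical names, and a plain list stands in for whatever queryable store sits on top of the log.

```python
from dataclasses import dataclass, field
from typing import Optional
import uuid


@dataclass(frozen=True)
class Actor:
    kind: str  # "human", "service", or "agent"
    id: str


@dataclass(frozen=True)
class Event:
    type: str
    service: str
    actor: Actor                        # who or what caused the change
    run_id: str                         # ties the event to one run
    causation_id: Optional[str] = None  # event_id of the event that caused this one
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))


def events_by_actor(log, actor_id):
    """Answer 'show me everything this one actor did' across all services."""
    return [e for e in log if e.actor.id == actor_id]


def causal_chain(log, leaf):
    """Follow causation_id links back to the root; returned root first."""
    by_id = {e.event_id: e for e in log}
    chain = [leaf]
    while chain[-1].causation_id is not None:
        chain.append(by_id[chain[-1].causation_id])
    return list(reversed(chain))


agent = Actor(kind="agent", id="agent-7")
human = Actor(kind="human", id="alice")
first = Event("invoice.created", "billing", agent, run_id="run-1")
second = Event("ledger.posted", "accounting", agent, run_id="run-1",
               causation_id=first.event_id)
other = Event("invoice.created", "billing", human, run_id="run-2")
log = [first, second, other]
```

With that envelope, a chain that crosses billing and accounting reads as one story, and the per-actor question is a filter rather than a forensic project.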
Immutability
The cheapest way to make the past recoverable is to refuse to destroy it.
The rule: database entries can only be inserted or selected. No deletes, no in-place updates. A “delete” is a soft-delete — a tombstone, not an erasure. An “update” is a new version that supersedes the old one without obliterating it.
Immutability is what gives you something to roll back to. You can’t restore a value you overwrote into oblivion. Append-only is recoverability’s foundation, paid for up front.
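A rough sketch of the insert-or-select rule, using SQLite from the Python standard library. The table name, columns, and helper functions are assumptions made up for the example; the triggers are one possible way to make the rule structural rather than a convention.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer_versions (
        entity_id TEXT    NOT NULL,
        version   INTEGER NOT NULL,
        name      TEXT,
        deleted   INTEGER NOT NULL DEFAULT 0,  -- 1 marks a tombstone
        PRIMARY KEY (entity_id, version)
    );
    -- Reject in-place updates and hard deletes at the database level.
    CREATE TRIGGER no_update BEFORE UPDATE ON customer_versions
    BEGIN SELECT RAISE(ABORT, 'append-only: updates are not allowed'); END;
    CREATE TRIGGER no_delete BEFORE DELETE ON customer_versions
    BEGIN SELECT RAISE(ABORT, 'append-only: deletes are not allowed'); END;
""")


def append_version(entity_id, name, deleted=False):
    """Insert the next version; earlier rows are never modified."""
    (latest,) = conn.execute(
        "SELECT COALESCE(MAX(version), 0) FROM customer_versions WHERE entity_id = ?",
        (entity_id,),
    ).fetchone()
    conn.execute(
        "INSERT INTO customer_versions (entity_id, version, name, deleted) "
        "VALUES (?, ?, ?, ?)",
        (entity_id, latest + 1, name, int(deleted)),
    )


def current(entity_id):
    """Name from the latest version, or None if missing or tombstoned."""
    row = conn.execute(
        "SELECT name, deleted FROM customer_versions "
        "WHERE entity_id = ? ORDER BY version DESC LIMIT 1",
        (entity_id,),
    ).fetchone()
    if row is None or row[1]:
        return None
    return row[0]


def history(entity_id):
    """Every version ever written, oldest first."""
    return conn.execute(
        "SELECT version, name, deleted FROM customer_versions "
        "WHERE entity_id = ? ORDER BY version",
        (entity_id,),
    ).fetchall()


append_version("c1", "Acme")                     # create
append_version("c1", "Acme Corp")                # "update" is a new version
append_version("c1", "Acme Corp", deleted=True)  # "delete" is a tombstone
```

The current view says the customer is gone, while the history still holds every prior value, which is exactly what a later compensation needs.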
Minimal blast radius
When — not if — an agent goes sideways, the damage should be contained to one small room.
Three levers, all old, all underused:
- Least privilege. An agent gets exactly the access its task needs and nothing more. The blast radius of a compromised or confused actor is bounded by what it was allowed to touch.
- Sandboxing. Faults stay local. Per-service data boundaries, per-service object namespaces, scoped sessions that each carry their own constraints. A bad write in one domain can’t reach across and corrupt another.
- Rate limiting. An agent that’s malfunctioning will malfunction fast. A ceiling on writes-per-unit-time turns “ten thousand bad rows in four seconds” into “forty bad rows and a tripped alarm.” Velocity caps buy you time, and time is the whole game.
There’s a discipline that belongs here too: don’t batch operations in ways that multiply blast radius. A convenience that fans one action out across a dozen services has just made every failure twelve times worse. The smaller and more bounded each unit of work, the smaller the worst case.
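The rate-limiting lever can be sketched as a sliding-window ceiling per run. The class name and the numbers (40 writes per 4 seconds) are assumptions chosen to mirror the example above, and the clock is passed in explicitly so the behavior is deterministic.

```python
from collections import deque


class WriteRateLimiter:
    """Hypothetical sliding-window ceiling on writes, tracked per run."""

    def __init__(self, max_writes, window_seconds):
        self.max_writes = max_writes
        self.window_seconds = window_seconds
        self._stamps = {}  # run_id -> deque of accepted write timestamps

    def allow(self, run_id, now):
        stamps = self._stamps.setdefault(run_id, deque())
        # Drop timestamps that have aged out of the window.
        while stamps and now - stamps[0] >= self.window_seconds:
            stamps.popleft()
        if len(stamps) >= self.max_writes:
            return False  # over the ceiling: reject, and ideally alert
        stamps.append(now)
        return True


limiter = WriteRateLimiter(max_writes=40, window_seconds=4.0)

# A runaway run attempts 10,000 writes inside four seconds.
accepted = sum(limiter.allow("run-1", now=i * 0.0004) for i in range(10_000))
```

Only the first forty writes get through; the other 9,960 are rejected, and a different run is unaffected because the ceiling is scoped per run.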
Recoverability
This is the heart of it, so we’re going to spend real time here.
Reframe the word first. Recoverability is not “we can restore from backup.” Backups are the floor — the thing you fall back to when everything else has failed. Real recoverability is sharper: every agent-originated change is correlated, reversible, and rate-limited; and anything that isn’t reversible is gated before it ever happens.
The problem backups can’t solve
Picture the failure that keeps you up at night. An agent run issues thousands of incorrect writes, and they fan out across multiple services. Now try to fix it with a per-service backup restore. You can’t:
- Restoring one service to an earlier point throws away the good, concurrent writes that landed in the same window. You’d be punishing the whole system for one actor’s mistake.
- There’s no cross-service coordination — you’d be restoring six databases to six slightly different moments and praying they’re consistent.
- And replaying your event log just faithfully reproduces the bad data, because the events really happened.
Backups are a sledgehammer. This needs a scalpel.
Rollback doesn’t exist out here. Compensation does.
This is the single most important distinction in the whole post, and it’s the one people get wrong:
- Rollback restores prior state atomically, as if it never happened. It’s traceless. And it exists only inside a single transactional boundary — one database, one transaction. That’s it.
- Compensation is a new forward action that semantically counteracts a prior one. The undo of a posted charge isn’t the charge vanishing — it’s a refund record sitting next to it. Eventually consistent. Leaves a trail.
True distributed rollback across autonomous services and Kafka and external API calls cannot exist. Not “hasn’t been built.” Cannot. It would require a global two-phase commit holding locks across every participant, which is exactly the synchronous coupling event-driven systems are built to escape, and which can’t span an external payment API anyway. Events are immutable facts; you can’t un-publish one that’s already been consumed. You undo by appending corrective events. The workflow engines that look like they “roll back” are just orchestrating compensations you wrote; no engine conjures an undo out of nothing.
So: never reach for “distributed rollback.” Reach for compensation, plus a stack of cheaper containment layers underneath it.
The recovery gradient
Strongest and cheapest at the top; weakest and most expensive at the bottom. You want to catch as much as possible high on this list.
- Real rollback: within one service. If a multi-write operation fits inside one service’s transaction, keep it there. ACID does the work. It’s free, it’s atomic, it’s traceless. Don’t distribute what doesn’t need to be distributed.
- Pseudo-rollback: the quarantine window. This is the primary lever, and it’s beautiful in its simplicity. Writes buffer as pending and do not propagate (no events emitted, nothing downstream sees them) until a short delay elapses. Abort inside that window and the un-propagated writes are simply discarded. Nothing ever observed them, so it’s a true rollback with zero compensation needed. Tuning the width of that window is how you catch most runaway runs cheaply: the more damage you can keep inside the window, the more of it you can erase for free.
- Event-driven compensation: for what escaped. This is the saga, and it’s the expensive tail case, not the common path. Eventually consistent, trail-leaving, real work. We’ll walk it below.
Underneath all three sits a hard rule: irreversible egress is gated before commit, never compensated after. You cannot un-send an email or un-charge a card with a clever event. So anything with no inverse doesn’t get to ride the recovery gradient at all — it gets stopped at the door and held for approval.
That gives you two distinct controls, and conflating them is a classic mistake:
- The quarantine window optimizes for throughput: auto-release after a delay, with a kill switch during it. The default for normal work.
- The approval gate optimizes for safety: nothing commits without explicit human or policy sign-off. The default for the dangerous classes — irreversible egress, bulk deletes, anything over a threshold — regardless of what else is happening.
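The two controls can be sketched side by side. Everything here is an assumption for illustration: the `PendingWrite` and `Quarantine` names, the `irreversible` flag, and the 30-second window. Reversible writes auto-release once the window elapses; irreversible ones never release on a timer, only on explicit approval.

```python
from dataclasses import dataclass


@dataclass
class PendingWrite:
    run_id: str
    payload: dict
    written_at: float
    irreversible: bool = False  # e.g. sends an email or charges a card
    approved: bool = False      # set by a human or a policy engine


class Quarantine:
    """Hypothetical buffer: nothing in `pending` is visible downstream."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.pending = []
        self.propagated = []

    def submit(self, write):
        self.pending.append(write)

    def abort(self, run_id):
        """Kill switch: discard a run's un-propagated writes. No compensation."""
        before = len(self.pending)
        self.pending = [w for w in self.pending if w.run_id != run_id]
        return before - len(self.pending)

    def release_due(self, now):
        still_pending = []
        for w in self.pending:
            if w.irreversible:
                ready = w.approved  # approval gate: no auto-release, ever
            else:
                ready = now - w.written_at >= self.window_seconds
            (self.propagated if ready else still_pending).append(w)
        self.pending = still_pending


q = Quarantine(window_seconds=30)
q.submit(PendingWrite("run-1", {"op": "rename"}, written_at=0))
q.submit(PendingWrite("run-2", {"op": "archive"}, written_at=0))
email = PendingWrite("run-2", {"op": "send_email"}, written_at=0, irreversible=True)
q.submit(email)

discarded = q.abort("run-1")  # aborted inside the window: erased for free
q.release_due(now=31)         # window elapsed: reversible writes propagate
```

After the window, the archive has propagated, the aborted rename never existed as far as anything downstream is concerned, and the email is still sitting at the door waiting for sign-off.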
Correlation is the linchpin
Here’s where Observability cashes its check. The unit of control is the agent run, not the agent. An agent identity spans thousands of runs over its life; that’s far too coarse. A single run is the bad batch you actually want to grab.
So a single chokepoint — the gateway every write passes through — mints a run ID at the start of each run and propagates it into every downstream write. Each service stamps that ID onto the rows it touches. The payoff is enormous and simple: “everything this run did” becomes a query. One WHERE run_id = X across the whole system, and the blast becomes a selectable set. Every recovery mechanism below is built on top of that one capability.
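Here is a small sketch of that query. Two SQLite tables stand in for two services' private stores, which is an assumption made to keep the example in one process; in a real deployment this would more likely be one query per service or a query against a shared projection. The table and function names are hypothetical.

```python
import sqlite3
import uuid

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE billing_invoices (
        id INTEGER PRIMARY KEY, amount INTEGER, run_id TEXT NOT NULL);
    CREATE TABLE crm_contacts (
        id INTEGER PRIMARY KEY, name TEXT, run_id TEXT NOT NULL);
""")


def start_run():
    """The gateway mints one ID at the start of each agent run."""
    return str(uuid.uuid4())


def write_invoice(run_id, amount):
    # Each service stamps the run ID onto the rows it touches.
    db.execute("INSERT INTO billing_invoices (amount, run_id) VALUES (?, ?)",
               (amount, run_id))


def write_contact(run_id, name):
    db.execute("INSERT INTO crm_contacts (name, run_id) VALUES (?, ?)",
               (name, run_id))


def everything_run_did(run_id):
    """The blast radius of one run, as a selectable set."""
    return db.execute(
        """
        SELECT 'billing' AS service, id FROM billing_invoices WHERE run_id = ?
        UNION ALL
        SELECT 'crm' AS service, id FROM crm_contacts WHERE run_id = ?
        ORDER BY service, id
        """,
        (run_id, run_id),
    ).fetchall()


good_run = start_run()
bad_run = start_run()
write_invoice(good_run, 100)
write_invoice(bad_run, -5)
write_contact(bad_run, "???")

blast = everything_run_did(bad_run)
```

The good run's invoice landed in the same table in the same window, and the query leaves it alone. That selectivity is what a per-service backup restore cannot give you.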
The circuit breaker
Now the rope-a-dope gets mechanical. The circuit breaker is scoped per run and enforced at the gateway — the one place that sees the entire fan-out. And it plays two roles at once, which is the whole point: it’s both the stopper (it halts forward writes) and the trigger (it kicks off reversion of what already landed).
It watches a handful of counters per run:
- Entity count — how many rows this run has mutated.
- Service span — how many distinct services it’s touched.
- Velocity — writes per unit time.
- Anomaly — repeated validation rejections, or near-identical looping writes.
And it borrows its states from the classic circuit-breaker pattern, itself named after the electrical part:
- CLOSED — normal. Writes flow, counters tick up.
- OPEN — a threshold tripped. The gateway rejects further writes for that run, with a structured error. The run is paused mid-swing.
- HALF-OPEN — after an explicit human OK, a few writes are let through to test the water before returning to CLOSED.
When it trips OPEN, two things fire simultaneously. It halts all further forward writes for the run. And it triggers reversion — flipping any of the run’s still-quarantined writes from “auto-release soon” into a manual, alerted hold, and kicking off compensation for anything that already escaped the window. Lower thresholds keep more of the damage inside the window, where it’s free to discard. That’s the dial you tune.
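The state machine might be sketched like this. It tracks only the entity-count and service-span counters (velocity and anomaly are left out for brevity), and it makes one assumption the text does not specify: an operator's OK resets the counters before the probe writes begin. `RunBreaker`, `try_write`, and `operator_ok` are hypothetical names.

```python
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half-open"


class RunBreaker:
    """Hypothetical per-run breaker; thresholds are illustrative."""

    def __init__(self, run_id, max_entities, max_services,
                 probe_writes=2, on_trip=None):
        self.run_id = run_id
        self.max_entities = max_entities
        self.max_services = max_services
        self.probe_writes = probe_writes
        self.on_trip = on_trip  # the trigger: starts reversion for the run
        self.state = State.CLOSED
        self.entities = 0
        self.services = set()
        self.probes_left = 0

    def try_write(self, service):
        """Return True if the gateway may forward this write."""
        if self.state is State.OPEN:
            return False  # the stopper: reject with a structured error
        self.entities += 1
        self.services.add(service)
        if self.state is State.HALF_OPEN:
            self.probes_left -= 1
            if self.probes_left <= 0:
                self.state = State.CLOSED  # probes passed: back to normal
            return True
        if (self.entities > self.max_entities
                or len(self.services) > self.max_services):
            self.state = State.OPEN
            if self.on_trip:
                self.on_trip(self.run_id)
            return False
        return True

    def operator_ok(self):
        """Explicit human OK: reset counters (assumption) and allow probes."""
        if self.state is State.OPEN:
            self.entities = 0
            self.services.clear()
            self.probes_left = self.probe_writes
            self.state = State.HALF_OPEN


tripped = []
breaker = RunBreaker("run-1", max_entities=3, max_services=2,
                     on_trip=tripped.append)
results = [breaker.try_write("billing") for _ in range(5)]
```

The fourth write crosses the entity threshold, so it is rejected and the trip callback fires once; the fifth is rejected because the breaker is already OPEN. Passing a reversion hook as `on_trip` is what lets one component act as both stopper and trigger.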
Reversion: the gateway is the doorbell, not the demolition crew
When compensation does have to run, it runs over Kafka — not as a pile of synchronous calls from the gateway. The gateway only rings the bell. A durable coordinator drives the actual work, and each service applies its own inverse. That division of labor matters, so let’s be exact.
On the forward path, every canonical write emits a domain event carrying the run ID and the domain-level before and after values its inverse will need. A saga-log consumer projects those into a per-run ledger: the ordered record of everything the run did. The breaker enforces itself right here, on the way through.
On the reverse path:
- A trip, or an explicit “revert this run” command, emits a revert-requested event for the run.
- The coordinator consumes it, reads the ledger, and emits each step’s declared inverse in reverse order, keyed by entity.
- Each service consumes its compensate command, applies the idempotent inverse in a local transaction, and emits a compensated event.
- The coordinator collects the acks, retries stragglers, and, when they’re all in, marks the run reverted.
The coordinator is a durable-execution engine: the category that keeps workflow and step state in a database so it survives its own crash and resumes from the last completed step mid-reversion. (Temporal and DBOS are two in this family.) Crucially, it orchestrates the compensations; it doesn’t invent them, and it never sees a service’s internals.
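The reverse path can be walked in a few lines. This is an in-memory sketch, not a durable coordinator: plain lists stand in for the Kafka command and event topics, retries are omitted, and for brevity the ledger inlines the forward step's fields instead of holding a reference to the forward event. All function and field names are hypothetical.

```python
from collections import defaultdict

ledger = defaultdict(list)  # run_id -> ordered forward steps
command_log = []            # stands in for the Kafka command topic
event_log = []              # stands in for the Kafka event topic

# Declared inverses: the coordinator knows operation names, nothing more.
INVERSE = {"archived": "unarchive", "posted": "reverse", "linked": "unlink"}


def record_forward(run_id, step_id, service, operation, entity):
    """Saga-log consumer: project a forward event into the per-run ledger."""
    ledger[run_id].append({"step_id": step_id, "service": service,
                           "operation": operation, "entity": entity})


def coordinator_on_revert_requested(run_id):
    """Emit each step's declared inverse, newest step first."""
    for step in reversed(ledger[run_id]):
        command_log.append({
            "type": "compensate",
            "run_id": run_id,
            "step_id": step["step_id"],
            "service": step["service"],
            "inverse": INVERSE[step["operation"]],
            "key": step["entity"],  # partition key, so per-entity order holds
        })


def service_consume(command):
    """A service applies its own inverse locally, then acks with an event."""
    event_log.append({"type": "compensated",
                      "run_id": command["run_id"],
                      "step_id": command["step_id"]})


def run_is_reverted(run_id):
    """The run is reverted once every ledger step has been acked."""
    acked = {e["step_id"] for e in event_log
             if e["type"] == "compensated" and e["run_id"] == run_id}
    return acked == {s["step_id"] for s in ledger[run_id]}


record_forward("run-1", 1, "docs", "archived", "doc-9")
record_forward("run-1", 2, "ledger", "posted", "txn-4")
record_forward("run-1", 3, "crm", "linked", "contact-2")

event_log.append({"type": "revert-requested", "run_id": "run-1"})
coordinator_on_revert_requested("run-1")
for command in command_log:
    service_consume(command)
```

The commands come out as unlink, reverse, unarchive: the last forward step is undone first. A real coordinator would persist its progress after each ack so a crash resumes instead of restarting.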
Compensation is a domain inverse, owned by the service
This is the part most people get wrong, so it’s worth dwelling on. The tempting “free” approach is mechanical row-inversion: have every change journal a { table, primary key, before-image }, and to undo, just slam the old row back in. It looks elegant. It’s quietly wrong.
- It leaks every service’s schema into a shared journal and coordinator. Change a column and the ripple reaches everything.
- It can violate current invariants — the old row may not even be legal anymore.
- It’s often semantically wrong. The undo of a posted financial transaction is a reversing entry, not a row deletion. The history is supposed to survive.
So the inverse belongs to the service that owns the domain. Each forward operation has a declared inverse the service implements and exposes: archived → unarchive, posted → reversed, linked → unlink. The coordinator’s ledger holds only (run_id, step_id, a reference to the forward event) and dispatches each declared inverse in reverse order. It deals purely in the contract — no table, no primary key, no raw row image ever crosses the boundary. Where an undo needs a prior value, the service chooses to carry it as domain data in the event (renamed { old_name, new_name } → inverse rename to old_name), scoped to exactly what reversal needs.
The honest tradeoff: this is more code than the row-restore fantasy. Every service writes a compensating handler per forward operation. What you buy is correct semantic inverses, intact service boundaries, invariants enforced at compensation time, and a coordinator that’s blind to schema. The mechanical approach was only ever “free” because it was wrong.
And some cases stay irreducible, no matter how clean your design: the inverse of an event is usually a distinct compensating event, not a literal negation. Inverses must be idempotent so a replayed revert is a no-op. Egress has no inverse at all — gate it before commit. And lossy operations may only partially compensate. The architecture doesn’t pretend these away; it makes them the small, named exceptions instead of the silent default.
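A sketch of what a service-owned inverse could look like, using the rename example. `ContactService` and its methods are hypothetical, and the invariant check (refuse to compensate if the name changed again since the forward write) is one illustrative policy, not a prescription.

```python
class ContactService:
    """Hypothetical service that owns both rename and its inverse."""

    def __init__(self):
        self.names = {}            # contact_id -> current name
        self.events = []           # append-only domain events
        self._compensated = set()  # (run_id, step_id) pairs already reversed

    def rename(self, run_id, step_id, contact_id, new_name):
        old_name = self.names.get(contact_id)
        self.names[contact_id] = new_name
        # The service chooses to carry the prior value as domain data.
        event = {"type": "renamed", "run_id": run_id, "step_id": step_id,
                 "contact_id": contact_id,
                 "old_name": old_name, "new_name": new_name}
        self.events.append(event)
        return event

    def compensate_rename(self, forward_event):
        """Idempotent inverse: returns True only if it did work."""
        key = (forward_event["run_id"], forward_event["step_id"])
        if key in self._compensated:
            return False  # replayed revert: no-op
        contact_id = forward_event["contact_id"]
        # Check current state rather than blindly restoring an old row.
        if self.names.get(contact_id) != forward_event["new_name"]:
            raise ValueError("contact changed since the forward write")
        self.names[contact_id] = forward_event["old_name"]
        # The undo is a new event next to the original, not an erasure.
        self.events.append({"type": "rename_compensated",
                            "run_id": forward_event["run_id"],
                            "step_id": forward_event["step_id"],
                            "contact_id": contact_id})
        self._compensated.add(key)
        return True


svc = ContactService()
svc.names["c1"] = "Ada"
forward = svc.rename("run-1", 1, "c1", "Zzz")
first_attempt = svc.compensate_rename(forward)
second_attempt = svc.compensate_rename(forward)  # duplicate delivery
```

The name is back to its prior value, the history shows both the rename and its compensation, and the duplicate delivery changed nothing. The coordinator never saw a table or a row.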
The operator in the loop
Automatic detection, manual decision. When an agent goes rogue, a human gets pulled in — not to watch the firehose, but to make the one call that matters.
A trip, an anomaly, or any run flagged rogue pages a human operator. The alert is self-contained: which agent and entry point, what tripped (the count, the velocity, the service span), a summary of what the run touched, and the current state — what’s still held in quarantine versus what’s already propagated. No new dashboard to go hunting through. The decision comes to you.
And the decision surface is small on purpose. For the flagged run:
- Held, in-window writes → discard (the cheap pseudo-rollback) or release.
- Propagated writes → revert the run, which fires the compensation path above.
Because everything is keyed by run ID, this scales to bulk trivially. “Compensate a bunch of stuff” is just a set of run IDs: one run, several runs, every run from a given agent or entry point since time T, a whole time range. The coordinator expands the selection and drives all their compensations as one operator action. It’s idempotent and replay-safe — re-reverting an already-compensated run is a no-op, and overlapping bulk selections are fine. Egress is never in the compensable set, so a bulk revert can only undo recoverable state; it can’t un-send.
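Bulk revert reduces to set arithmetic over run IDs. In this sketch the `Run` record, `select_runs`, and the `Coordinator` class are hypothetical, and "driving compensations" is reduced to appending to a list so the idempotence is easy to see.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    run_id: str
    agent: str
    started_at: float


def select_runs(runs, agent=None, since=None):
    """Expand an operator's selection into a set of run IDs."""
    return {
        r.run_id for r in runs
        if (agent is None or r.agent == agent)
        and (since is None or r.started_at >= since)
    }


class Coordinator:
    """Hypothetical coordinator that reverts each run at most once."""

    def __init__(self):
        self.reverted = set()
        self.compensations_started = []

    def bulk_revert(self, run_ids):
        for run_id in sorted(run_ids):
            if run_id in self.reverted:
                continue  # already compensated: re-reverting is a no-op
            self.compensations_started.append(run_id)
            self.reverted.add(run_id)


runs = [
    Run("r1", "agent-7", started_at=100),
    Run("r2", "agent-7", started_at=200),
    Run("r3", "agent-9", started_at=250),
    Run("r4", "agent-7", started_at=300),
]

coordinator = Coordinator()
# Every run from agent-7 since T=200.
coordinator.bulk_revert(select_runs(runs, agent="agent-7", since=200))
# An overlapping selection: every run from any agent since T=250.
coordinator.bulk_revert(select_runs(runs, since=250))
```

Run r4 falls in both selections but is compensated once, which is what makes overlapping bulk selections safe for an operator under pressure.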
One default worth stating out loud: when the human is unreachable, hold. A tripped run’s quarantined writes stay held indefinitely — nothing propagates — and because the breaker is per run, holding one run blocks nothing else. If you want a timeout, make it default to discard, not release. Held-then-ignored beats held-then-released-unreviewed every single time.
Punching
Seven rounds of defense, and we still haven’t thrown a punch. Here’s the eighth — except the real offense was happening the whole time, and it doesn’t look like a punch at all. It looks like steering.
Here’s the reframe that ties the whole thing together: the architecture steers the agent into correct behavior, because there is no other behavior available. That’s the offense. Not a counterpunch you throw after the agent screws up — a ring shaped so the agent can only move where you want it to.
Think back to Goal 3, and to self-documenting code. Good code doesn’t make you read it correctly; it’s shaped so the correct reading is the only one that fits — the names, the types, the boundaries leave no room for the wrong interpretation. A Kinshasa architecture does the same thing to an agent. The wrong move hits a wall: a gate, a breaker, a rejected write. The right move flows downhill. So the agent does the right thing — not because a human is standing over it, and not because it’s especially well-behaved, but because the structure has quietly removed every other option. The path of least resistance is the correct path. That’s the strongest form of control there is: the kind nobody has to enforce.
When steering isn’t enough — when an agent finds a way to be wrong inside the rules — the same machinery becomes a literal punch. Correlation turned “ten thousand writes across six services” into a single thing you can knock out with one motion. The breaker is the jab that freezes the opponent; bulk compensation is the combination Ali threw in the eighth — one operator action that drops an entire rogue run, or every run from an agent since lunch, in a single swing.
And then the system learns the punch. This is the antifragile core, and it’s one principle stated five ways:
A failure that doesn’t change the system is a failure you’ll see again.
So every absorbed hit tightens the steering. The threshold that should have caught the run gets lowered. The inverse that was missing gets declared. The operation that escaped the window gets a narrower window, or a gate. Each correction is one more wall in the maze — one more wrong move the agent can no longer make. The system is harder to break tonight than it was this morning, and easier to steer, because an agent spent the morning trying to break it.
That’s rope-a-dope as architecture. You don’t out-power the thing you built, and you don’t babysit it either. You build a ring it can only win in — one that absorbs every blow, holds the whole fight in a queryable ledger, and reshapes itself each round so the only move left is the right one.
These ideas are not new. Append-only logs, sagas, circuit breakers, least privilege, idempotent compensation — our craft has reached for all of it for decades and kept falling short, because the discipline was expensive and the payoff was abstract. The agents make the payoff concrete. They throw the punches that make the old ideas finally worth the cost.
So we train for the beating. We take it up against the ropes. And we steer the agent into the only move it has left.
Float like a butterfly.