
The Kinshasa Manifesto

Event-based distributed architecture patterns antifragile to the agentic era

Note: I am not a LinkedIn Guru™ nor do I wish to be. Some of these ideas I am confident in and have tested, others need the tires kicked. Like anything great that we build, I want this to be collaborative. Let’s human on this one. This is, and forever will remain, a rough draft.

Muhammad Ali standing over a fallen Sonny Liston, fist cocked, taunting him to get up

Float like a butterfly, sting like a bee. Yes, I am aware this is from a different fight, against Sonny Liston, but it’s a better picture and you get the point. Let’s move on.

October 30, 1974. Kinshasa, Zaire (now the Democratic Republic of the Congo). The “Rumble in the Jungle” was an unwinnable fight, and Ali knew that. George Foreman was 7 years his junior and in the prime of his career; a fearsome puncher who overwhelmed his opponents with sheer power. So Ali did a Kobayashi Maru. Hacked the simulation. Foreman’s advantage would be made his greatest weakness. Ali spent months learning how to take a beating, up against the ropes. Rope-a-Dope. For seven rounds, Ali took that beating. Played possum. By the 8th round, Foreman was exhausted. As planned, Ali sprang from defense into offense, landed a choreographed sequence ending with a right hand to Foreman’s chin, dropping him. It was Foreman’s first professional defeat.

The following principles are not new. This is the purest form of distributed systems: what our craft has strived for and fallen short of. And now, not only can we, we must.

Welcome to Kinshasa.

We are prepared, and winning is the only option.

Goals

  1. Empower humans to do their best work
  2. Do not create more problems for humans
  3. Don’t just defend against agents — steer them

That’s the charter. The first two lines are the easy part: if a pattern doesn’t make a human’s best work more likely, or it quietly hands a human a new problem to babysit, it doesn’t belong here.

The third line is the one that turns defense into design. The best analogy I have is self-documenting code — code structured so the reader falls into the right understanding without a comment holding their hand. The names, the shapes, the boundaries are the documentation. A Kinshasa architecture aims to be self-governing in exactly that way: structured so the agent falls into the right move without a human standing over it. The guardrails aren’t bolted on top. They’re the shape of the thing.

A note on scope before we climb in: this is intentionally stack-agnostic, with two deliberate exceptions. The pattern leans hard on two concrete things, and naming them beats hand-waving.

The first is a durable, ordered, replayable event log — Kafka. Not the only option, just the most battle-proven distributed log we’ve got, and saying “a log” when I mean that would be coy. The second is the gateway, which I draw as GraphQL: the pattern needs a single typed front door where every mutation is already a declared command, and that’s exactly what a stitched GraphQL schema hands you — one place to mint the run ID and seat the circuit breaker, instead of twelve. Every other box — the services, the coordinator — is a role you can fill with whatever you like. But when I say “the log,” I mean Kafka, and when I say “the gateway,” I mean a GraphQL door, specifically.

And I’m deliberately not talking much about the agents themselves. This is about the deterministic part of the system — the part you actually control. The agents are the weather. This is about building a house that stands up in the weather.

Antifragility: architecture as governance

Here’s the move, and it’s the whole thing: you cannot out-power the agent.

The instinct, when you first hand real work to an autonomous agent, is to stand over it and throw counter-punches. Watch every move. Approve every step. Catch every mistake by hand. That’s a human being Foreman — trying to win on power. It doesn’t scale and it misses the point. The agent is faster than you, tireless, and capable of throwing a thousand writes while you’re still deciding how to block the first one.

So stop boxing. Lean on the ropes. Let the structure do the governing.

The principle underneath every section here is antifragility, and it’s worth being precise about the word, because it gets thrown around loosely. There’s a gradient:

Most “resilient” architecture aims for robust: it shrugs off the hit and carries on exactly as it was. Kinshasa aims one notch past that. Ali didn’t just survive seven rounds on the ropes. He used them. An antifragile system treats every agent screwup as fuel: the failure becomes a rule, a threshold, a test, a gate that wasn’t there before. The system gets harder to break the longer it runs.

That’s what I mean by architecture as governance. The structure isn’t a fence you build and forget. It’s the thing absorbing force, redirecting it, and tightening itself every time it gets hit. Control lives in the world the agent moves through, not in a human standing over it.

Attributes

Six principles make a system Kinshasa-shaped, and the first is the one the other five exist to produce:

  1. Antifragility — it gets stronger every time it’s hit.
  2. Observability — you can see what happened.
  3. Immutability — you can’t quietly erase what happened.
  4. Minimal blast radius — when it goes wrong, it goes wrong small.
  5. Recoverability — you can undo it, at scale.
  6. Punching — go on offense: steer the agent until the only move left is the right one.

Antifragility is the goal. Observability through recoverability are how you take the beating. Punching is how you win the fight.

Architecture overview

Before we go principle by principle, here’s the whole thing on one page.

Diagram: the architecture is split into two zones by the gateway. Above the gateway is the nondeterministic zone, holding the agent run. The gateway, which mints the run ID and holds the circuit breaker, is the trust boundary, and it writes commands straight to the command topics. Below it is the deterministic zone: command topics on Kafka carry both forward and compensate commands and feed the services, which write canonically and emit domain events to the event topics; a durable coordinator reads the event ledger and writes compensating commands back to the command topics in reverse order.

The gateway is the membrane. Above it sits the only nondeterministic thing in the system — the agent. Everything behind the gateway is deterministic: replayable, ordered, idempotent. And the gateway doesn’t call services — it writes a command straight to the log. From there one deterministic core does the work, blind to who produced the command: the gateway for forward writes, the coordinator for compensations. Forward and reverse are the same machine, run by different hands.

It’s three moving parts and one boundary, and the boundary is the point. The gateway is the only door: every agent action comes through it, picks up a run ID, and passes the circuit breaker. Behind it, the system is unglamorous event-driven plumbing — the gateway writes a command to a Kafka topic, services consume commands and emit events back to Kafka, and a durable coordinator watches the event log so it can push compensations back through the same command log when a run goes bad.
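To make the flow concrete, here is a toy sketch of the forward path. It is not Kafka and not GraphQL: `Log` is an in-memory stand-in for the log, and `Gateway`, `ArchiveService`, the `commands` and `events` topic names, and the `archive_document` command are all names I made up for illustration.

```python
import uuid
from collections import defaultdict


class Log:
    """In-memory stand-in for the log: named, append-only, ordered topics."""

    def __init__(self):
        self.topics = defaultdict(list)

    def append(self, topic, record):
        self.topics[topic].append(record)
        return len(self.topics[topic]) - 1  # offset of the new record


class Gateway:
    """The only door: mints the run ID and writes commands to the log."""

    def __init__(self, log):
        self.log = log

    def start_run(self, actor):
        return {"run_id": str(uuid.uuid4()), "actor": actor}

    def submit(self, run, name, payload):
        # The gateway never calls a service; it only appends a command.
        command = {**run, "name": name, "payload": payload}
        self.log.append("commands", command)
        return command


class ArchiveService:
    """A deterministic consumer: reads commands in order, emits events."""

    def __init__(self, log):
        self.log = log
        self.offset = 0  # position of the next unread command

    def poll(self):
        commands = self.log.topics["commands"]
        while self.offset < len(commands):
            cmd = commands[self.offset]
            self.offset += 1
            if cmd["name"] == "archive_document":
                self.log.append("events", {
                    "run_id": cmd["run_id"],
                    "actor": cmd["actor"],
                    "type": "document_archived",
                    "doc_id": cmd["payload"]["doc_id"],
                })


log = Log()
gateway = Gateway(log)
service = ArchiveService(log)

run = gateway.start_run(actor="agent:triage-bot")
gateway.submit(run, "archive_document", {"doc_id": "d-1"})
service.poll()
```

The service has no idea who produced the command. That blindness is what lets the same path carry compensations later.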

The line across the middle is the whole game. The agent is the only nondeterministic actor in the picture. Everything behind the gateway is deterministic — ordered, replayable, idempotent — and that is not an accident of implementation. You make an agent’s blast radius governable by making everything it can touch behave like a machine instead of a coin flip. The rest of this post is six ways of saying that.

Observability

You can’t operate what you can’t see, and you definitely can’t undo what you can’t see.

The non-negotiable: every event tracks its actor. Human, service, or agent — every change carries the identity of who or what caused it. And events span across services, so a single causal chain that touches six systems reads as one story instead of six disconnected log lines.

This sounds like table-stakes telemetry, and the “structured logs, traces, metrics” part is. But there’s a sharper requirement hiding here that the rest of the architecture leans on entirely: you must be able to ask “show me everything this one actor did” and get a complete, queryable answer across the whole system. Hold onto that. It’s the linchpin that makes recovery possible at all.
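A minimal sketch of what that requirement asks of an event, assuming a flat list of events gathered from every service. The `Event` fields, the `causation_id` link, and both helper functions are my own illustrative names, not a prescribed schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    event_id: str
    service: str
    actor: str                          # "human:...", "service:...", or "agent:..."
    type: str
    causation_id: Optional[str] = None  # the event that caused this one, if any


def everything_by(actor, events):
    """'Show me everything this one actor did', across all services."""
    return [e for e in events if e.actor == actor]


def causal_chain(event_id, events):
    """Walk causation links back to the root and return the story root-first."""
    by_id = {e.event_id: e for e in events}
    chain = []
    current = by_id.get(event_id)
    while current is not None:
        chain.append(current)
        current = by_id.get(current.causation_id) if current.causation_id else None
    return list(reversed(chain))


events = [
    Event("e1", "orders", "agent:refund-bot", "order_cancelled"),
    Event("e2", "billing", "agent:refund-bot", "invoice_voided", causation_id="e1"),
    Event("e3", "orders", "human:dana", "order_created"),
    Event("e4", "inventory", "agent:refund-bot", "stock_released", causation_id="e2"),
]

agent_events = everything_by("agent:refund-bot", events)
story = causal_chain("e4", events)
```

Here `story` reads orders, then billing, then inventory: one chain across three services instead of three disconnected log lines.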

Immutability

The cheapest way to make the past recoverable is to refuse to destroy it.

The rule: database entries can only be inserted or selected. No deletes, no in-place updates. A “delete” is a soft-delete — a tombstone, not an erasure. An “update” is a new version that supersedes the old one without obliterating it.

Immutability is what gives you something to roll back to. You can’t restore a value you overwrote into oblivion. Append-only is recoverability’s foundation, paid for up front.
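Here is one way the insert-and-select-only rule could look, sketched with Python’s built-in `sqlite3`. The `customer_versions` table, its columns, and the helper names are assumptions for illustration; the point is only that every change is a new row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer_versions (
        customer_id TEXT    NOT NULL,
        version     INTEGER NOT NULL,
        name        TEXT,
        deleted     INTEGER NOT NULL DEFAULT 0,
        PRIMARY KEY (customer_id, version)
    )
""")


def _append(customer_id, name, deleted):
    # Every change is an INSERT of the next version; nothing is overwritten.
    (next_version,) = conn.execute(
        "SELECT COALESCE(MAX(version), 0) + 1 FROM customer_versions "
        "WHERE customer_id = ?",
        (customer_id,),
    ).fetchone()
    conn.execute(
        "INSERT INTO customer_versions VALUES (?, ?, ?, ?)",
        (customer_id, next_version, name, deleted),
    )


def write(customer_id, name):
    """An 'update' is a new version that supersedes the old one."""
    _append(customer_id, name, 0)


def soft_delete(customer_id):
    """A 'delete' is a tombstone row, not an erasure."""
    _append(customer_id, None, 1)


def current(customer_id):
    row = conn.execute(
        "SELECT name, deleted FROM customer_versions "
        "WHERE customer_id = ? ORDER BY version DESC LIMIT 1",
        (customer_id,),
    ).fetchone()
    return None if row is None or row[1] else row[0]


def history(customer_id):
    return conn.execute(
        "SELECT version, name, deleted FROM customer_versions "
        "WHERE customer_id = ? ORDER BY version",
        (customer_id,),
    ).fetchall()


write("c1", "Acme")
write("c1", "Acme Corp")
soft_delete("c1")
```

After the tombstone, `current("c1")` is `None`, but `history("c1")` still holds all three versions, which is the something to roll back to.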

Minimal blast radius

When — not if — an agent goes sideways, the damage should be contained to one small room.

Three levers, all old, all underused:

There’s a discipline that belongs here too: don’t batch operations in ways that multiply blast radius. A convenience that fans one action out across a dozen services has just made every failure twelve times worse. The smaller and more bounded each unit of work, the smaller the worst case.

Recoverability

This is the heart of it, so we’re going to spend real time here.

Reframe the word first. Recoverability is not “we can restore from backup.” Backups are the floor — the thing you fall back to when everything else has failed. Real recoverability is sharper: every agent-originated change is correlated, reversible, and rate-limited; and anything that isn’t reversible is gated before it ever happens.

The problem backups can’t solve

Picture the failure that keeps you up at night. An agent run issues thousands of incorrect writes, and they fan out across multiple services. Now try to fix it with a per-service backup restore. You can’t:

Backups are a sledgehammer. This needs a scalpel.

Rollback doesn’t exist out here. Compensation does.

This is the single most important distinction in the whole post, and it’s the one people get wrong:

True distributed rollback across autonomous services and Kafka and external API calls cannot exist. Not “hasn’t been built.” Cannot. It would require a global two-phase commit holding locks across every participant, which is exactly the synchronous coupling event-driven systems are built to escape, and which can’t span an external payment API anyway. Events are immutable facts; you can’t un-publish one that’s already been consumed. You undo by appending corrective events. The workflow engines that look like they “roll back” are just orchestrating compensations you wrote; no engine conjures an undo out of nothing.

So: never reach for “distributed rollback.” Reach for compensation, plus a stack of cheaper containment layers underneath it.

The recovery gradient

Strongest and cheapest at the top; weakest and most expensive at the bottom. You want to catch as much as possible high on this list.

  1. Real rollback — within one service. If a multi-write operation fits inside one service’s transaction, keep it there. ACID does the work. It’s free, it’s atomic, it’s traceless. Don’t distribute what doesn’t need to be distributed.

  2. Pseudo-rollback — the quarantine window. This is the primary lever, and it’s beautiful in its simplicity. Writes buffer as pending and do not propagate — no events emitted, nothing downstream sees them — until a short delay elapses. Abort inside that window and the un-propagated writes are simply discarded. Nothing ever observed them, so it’s a true rollback with zero compensation needed. Tuning the width of that window is how you catch most runaway runs cheaply: the more damage you can keep inside the window, the more of it you can erase for free.

  3. Event-driven compensation — for what escaped. This is the saga, and it’s the expensive tail case, not the common path. Eventually consistent, trail-leaving, real work. We’ll walk it below.
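The quarantine window from step 2 can be sketched in a few lines. This is a toy under stated assumptions: `QuarantineBuffer` is a hypothetical name, time is passed in explicitly as `now` so the behavior is deterministic, and "propagating" just means moving a write into a list.

```python
class QuarantineBuffer:
    """Holds writes as pending for a fixed window before anything downstream sees them."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.pending = []      # (release_at, run_id, write)
        self.propagated = []   # (run_id, write) visible downstream

    def submit(self, run_id, write, now):
        self.pending.append((now + self.window, run_id, write))

    def abort(self, run_id):
        """Discard a run's un-propagated writes; return how many were erased for free."""
        kept = [p for p in self.pending if p[1] != run_id]
        discarded = len(self.pending) - len(kept)
        self.pending = kept
        return discarded

    def release_due(self, now):
        """Propagate every pending write whose window has elapsed."""
        due = [p for p in self.pending if p[0] <= now]
        self.pending = [p for p in self.pending if p[0] > now]
        self.propagated.extend((run_id, write) for _, run_id, write in due)
        return len(due)


buf = QuarantineBuffer(window_seconds=30)
buf.submit("run-a", "w1", now=0)
buf.submit("run-b", "w2", now=5)
buf.submit("run-a", "w3", now=10)

released_early = buf.release_due(now=31)  # only w1 has aged out of the window
discarded = buf.abort("run-a")            # w3 never propagated, so it just vanishes
released_late = buf.release_due(now=60)   # run-b is unaffected and releases normally
```

Note what escaped: `w1` propagated before the abort, so it is no longer free to erase and falls through to compensation. A wider window, or an earlier abort, would have kept it inside.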

Underneath all three sits a hard rule: irreversible egress is gated before commit, never compensated after. You cannot un-send an email or un-charge a card with a clever event. So anything with no inverse doesn’t get to ride the recovery gradient at all — it gets stopped at the door and held for approval.
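A sketch of that rule, assuming the set of irreversible actions is declared up front. `EgressGate`, the action names, and the ticket scheme are all hypothetical; in this toy, anything not declared irreversible simply proceeds.

```python
class EgressGate:
    """Stops actions with no inverse at the door and holds them for approval."""

    def __init__(self, irreversible):
        self.irreversible = set(irreversible)
        self.held = {}        # ticket -> (run_id, action, payload)
        self.performed = []   # actions that actually happened
        self._next_ticket = 1

    def request(self, run_id, action, payload):
        if action not in self.irreversible:
            # Reversible work proceeds; the recovery gradient covers it.
            self.performed.append((run_id, action, payload))
            return "performed"
        ticket = self._next_ticket
        self._next_ticket += 1
        self.held[ticket] = (run_id, action, payload)
        return f"held:{ticket}"

    def approve(self, ticket):
        self.performed.append(self.held.pop(ticket))

    def reject(self, ticket):
        self.held.pop(ticket)


gate = EgressGate(irreversible={"send_email", "charge_card"})
first = gate.request("run-a", "update_record", {"id": 1})
second = gate.request("run-a", "send_email", {"to": "customer@example.com"})
gate.reject(1)  # the operator says no; the email was never sent
```

The rejected email never reaches `performed`, so there is nothing to compensate.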

That gives you two distinct controls, and conflating them is a classic mistake:

Correlation is the linchpin

Here’s where Observability cashes its check. The unit of control is the agent run, not the agent. An agent identity spans thousands of runs over its life; that’s far too coarse. A single run is the bad batch you actually want to grab.

So a single chokepoint — the gateway every write passes through — mints a run ID at the start of each run and propagates it into every downstream write. Each service stamps that ID onto the rows it touches. The payoff is enormous and simple: “everything this run did” becomes a query. One WHERE run_id = X across the whole system, and the blast becomes a selectable set. Every recovery mechanism below is built on top of that one capability.
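Here is roughly what that query looks like, sketched with `sqlite3`. Two tables stand in for two services sharing one database purely for brevity; in a real system each service owns its own store and the query fans out. The table names, columns, and `blast_set` helper are assumptions.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders_rows  (id TEXT, status TEXT,    run_id TEXT);
    CREATE TABLE billing_rows (id TEXT, amount INTEGER, run_id TEXT);
""")


def mint_run_id():
    """Minted once, at the gateway, at the start of each run."""
    return f"run-{uuid.uuid4()}"


def blast_set(run_id):
    """'Everything this run did' as a selectable set."""
    return conn.execute(
        """
        SELECT 'orders' AS service, id FROM orders_rows WHERE run_id = ?
        UNION ALL
        SELECT 'billing' AS service, id FROM billing_rows WHERE run_id = ?
        ORDER BY service, id
        """,
        (run_id, run_id),
    ).fetchall()


bad_run = mint_run_id()
good_run = mint_run_id()

# Each service stamps the run ID onto the rows it touches.
conn.execute("INSERT INTO orders_rows VALUES ('o-1', 'cancelled', ?)", (bad_run,))
conn.execute("INSERT INTO orders_rows VALUES ('o-2', 'cancelled', ?)", (bad_run,))
conn.execute("INSERT INTO billing_rows VALUES ('b-1', -4200, ?)", (bad_run,))
conn.execute("INSERT INTO orders_rows VALUES ('o-3', 'shipped', ?)", (good_run,))

blast = blast_set(bad_run)
```

`blast` contains the bad run’s three rows and nothing from the good run, even though both touched the same table.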

The circuit breaker

Now the rope-a-dope gets mechanical. The circuit breaker is scoped per run and enforced at the gateway — the one place that sees the entire fan-out. And it plays two roles at once, which is the whole point: it’s both the stopper (it halts forward writes) and the trigger (it kicks off reversion of what already landed).

It watches a handful of counters per run:

And it borrows its states straight from the electrical part it’s named after:

When it trips OPEN, two things fire simultaneously. It halts all further forward writes for the run. And it triggers reversion — flipping any of the run’s still-quarantined writes from “auto-release soon” into a manual, alerted hold, and kicking off compensation for anything that already escaped the window. Lower thresholds keep more of the damage inside the window, where it’s free to discard. That’s the dial you tune.
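A minimal sketch of a per-run breaker playing both roles. The counters and states here are my assumptions for illustration: two counters (total writes and distinct services touched) and two states, `CLOSED` and `OPEN`. The class names, threshold parameters, and the `revert_requested` list are hypothetical too.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    CLOSED = "closed"  # writes flow
    OPEN = "open"      # tripped: forward writes halted


@dataclass
class RunBreaker:
    max_writes: int
    max_services: int
    state: State = State.CLOSED
    writes: int = 0
    services: set = field(default_factory=set)
    tripped_by: str = ""

    def allow(self, service):
        """Called before each write; returns True if the write may proceed."""
        if self.state is State.OPEN:
            return False
        if self.writes >= self.max_writes:
            return self._trip("write_count")
        if service not in self.services and len(self.services) >= self.max_services:
            return self._trip("service_span")
        self.writes += 1
        self.services.add(service)
        return True

    def _trip(self, reason):
        self.state = State.OPEN
        self.tripped_by = reason
        return False


class Gateway:
    def __init__(self, max_writes, max_services):
        self.max_writes = max_writes
        self.max_services = max_services
        self.breakers = {}           # run_id -> RunBreaker
        self.revert_requested = []   # stand-in for emitting a revert-requested event

    def write(self, run_id, service):
        breaker = self.breakers.get(run_id)
        if breaker is None:
            breaker = self.breakers[run_id] = RunBreaker(self.max_writes, self.max_services)
        was_closed = breaker.state is State.CLOSED
        allowed = breaker.allow(service)      # stopper role
        if was_closed and not allowed:
            self.revert_requested.append(run_id)  # trigger role, fires once per trip
        return allowed


gw = Gateway(max_writes=3, max_services=2)
run_a = [gw.write("run-a", "orders") for _ in range(5)]
run_b = gw.write("run-b", "orders")
run_c = [gw.write("run-c", s) for s in ("orders", "billing", "inventory")]
```

`run-a` trips on its fourth write, `run-c` trips the moment it reaches for a third service, and `run-b` never notices either of them. Lowering `max_writes` is the dial.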

Reversion: the gateway is the doorbell, not the demolition crew

When compensation does have to run, it runs over Kafka — not as a pile of synchronous calls from the gateway. The gateway only rings the bell. A durable coordinator drives the actual work, and each service applies its own inverse. That division of labor matters, so let’s be exact.

On the forward path, every canonical write emits a domain event carrying the run ID and a before/after image. A saga-log consumer projects those into a per-run ledger — the ordered record of everything the run did. The breaker enforces itself right here, on the way through.

On the reverse path:

  1. A trip — or an explicit “revert this run” command — emits a revert-requested event for the run.
  2. The coordinator consumes it, reads the ledger, and emits each step’s declared inverse in reverse order, keyed by entity.
  3. Each service consumes its compensate command, applies the idempotent inverse in a local transaction, and emits a compensated event.
  4. The coordinator collects the acks, retries stragglers, and — when they’re all in — marks the run reverted.

The coordinator is a durable-execution engine — the category that keeps workflow and step state in a database so it survives its own crash and resumes from the last completed step mid-rollback. (Temporal and DBOS are two in this family.) Crucially, it orchestrates the compensations; it doesn’t invent them, and it never sees a service’s internals.
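The reverse path can be sketched as a loop. This is a synchronous, in-memory toy, not a durable-execution engine: `send` stands in for publishing a compensate command and waiting for the compensated event, and the `Step` fields, `Coordinator` class, and `max_attempts` parameter are assumptions. I also chose to stop at the first step that cannot be acked so the reverse order is never broken; a real coordinator might handle stragglers differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    run_id: str
    step_id: int         # position in the run's forward order
    service: str
    forward_event: str   # a reference only; the service knows the inverse


class Coordinator:
    def __init__(self, ledger, send, max_attempts=3):
        self.ledger = ledger
        self.send = send          # delivers a compensate command, True on ack
        self.max_attempts = max_attempts
        self.acked = set()        # (run_id, step_id) already compensated
        self.reverted = set()

    def revert(self, run_id):
        steps = sorted(
            (s for s in self.ledger if s.run_id == run_id),
            key=lambda s: s.step_id,
            reverse=True,         # last write is undone first
        )
        for step in steps:
            key = (step.run_id, step.step_id)
            if key in self.acked:
                continue          # resuming: skip what already completed
            if not self._deliver(step):
                return False      # stop here; a later revert() call resumes
            self.acked.add(key)
        self.reverted.add(run_id)
        return True

    def _deliver(self, step):
        return any(self.send(step) for _ in range(self.max_attempts))


sent = []
failures = {2: 1}  # step_id -> how many times delivery fails before an ack


def send(step):
    if failures.get(step.step_id, 0) > 0:
        failures[step.step_id] -= 1
        return False
    sent.append((step.service, step.forward_event))
    return True


ledger = [
    Step("run-a", 1, "docs", "evt-1"),
    Step("run-a", 2, "ledger", "evt-2"),
    Step("run-a", 3, "crm", "evt-3"),
    Step("run-b", 1, "docs", "evt-9"),
]
coordinator = Coordinator(ledger, send)
first = coordinator.revert("run-a")
second = coordinator.revert("run-a")  # replayed revert
```

The straggler on step 2 gets retried, the commands go out newest-first, and the replayed revert sends nothing at all.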

Compensation is a domain inverse, owned by the service

This is the part most people get wrong, so it’s worth dwelling on. The tempting “free” approach is mechanical row-inversion: have every change journal a { table, primary key, before-image }, and to undo, just slam the old row back in. It looks elegant. It’s quietly wrong.

So the inverse belongs to the service that owns the domain. Each forward operation has a declared inverse the service implements and exposes: archived → unarchive, posted → reversed, linked → unlink. The coordinator’s ledger holds only (run_id, step_id, a reference to the forward event) and dispatches each declared inverse in reverse order. It deals purely in the contract — no table, no primary key, no raw row image ever crosses the boundary. Where an undo needs a prior value, the service chooses to carry it as domain data in the event (renamed { old_name, new_name } → inverse rename to old_name), scoped to exactly what reversal needs.
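On the service side, a declared inverse might look like the sketch below. `DocumentService` and its method names are hypothetical, and state lives in a plain dict only to keep the example short; under the immutability rule each change would be a new version instead. What matters is that the caller passes nothing but a forward event reference, and the service decides what undoing it means.

```python
class DocumentService:
    """Owns its domain: forward operations and their declared inverses."""

    def __init__(self):
        self.docs = {}            # doc_id -> {"name": str, "archived": bool}
        self.events = {}          # event_id -> forward event (domain data only)
        self.compensated = set()  # event_ids already undone
        self._seq = 0

    def _emit(self, type_, **data):
        self._seq += 1
        event_id = f"docs-{self._seq}"
        self.events[event_id] = {"type": type_, **data}
        return event_id

    def create(self, doc_id, name):
        self.docs[doc_id] = {"name": name, "archived": False}

    def rename(self, doc_id, new_name):
        old_name = self.docs[doc_id]["name"]
        self.docs[doc_id]["name"] = new_name
        # The prior value rides along as domain data, scoped to what reversal needs.
        return self._emit("renamed", doc_id=doc_id, old_name=old_name, new_name=new_name)

    def archive(self, doc_id):
        self.docs[doc_id]["archived"] = True
        return self._emit("archived", doc_id=doc_id)

    def _unarchive(self, event):
        self.docs[event["doc_id"]]["archived"] = False

    def _rename_back(self, event):
        self.docs[event["doc_id"]]["name"] = event["old_name"]

    # The contract: each forward operation has a declared inverse.
    _inverses = {"archived": _unarchive, "renamed": _rename_back}

    def compensate(self, event_id):
        if event_id in self.compensated:
            return "already_compensated"  # idempotent: a replayed revert is a no-op
        event = self.events[event_id]
        self._inverses[event["type"]](self, event)
        self.compensated.add(event_id)
        return "compensated"


svc = DocumentService()
svc.create("d-1", "Q3 plan")
renamed = svc.rename("d-1", "Q3 plan FINAL")
archived = svc.archive("d-1")

# Reverse order: undo the archive first, then the rename.
results = [svc.compensate(archived), svc.compensate(renamed), svc.compensate(renamed)]
```

No table name, primary key, or row image ever left the service. The caller only ever held `renamed` and `archived`, two opaque event references.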

The honest tradeoff: this is more code than the row-restore fantasy. Every service writes a compensating handler per forward operation. What you buy is correct semantic inverses, intact service boundaries, invariants enforced at compensation time, and a coordinator that’s blind to schema. The mechanical approach was only ever “free” because it was wrong.

And some cases stay irreducible, no matter how clean your design: the inverse of an event is usually a distinct compensating event, not a literal negation. Inverses must be idempotent so a replayed revert is a no-op. Egress has no inverse at all — gate it before commit. And lossy operations may only partially compensate. The architecture doesn’t pretend these away; it makes them the small, named exceptions instead of the silent default.

The operator in the loop

Automatic detection, manual decision. When an agent goes rogue, a human gets pulled in — not to watch the firehose, but to make the one call that matters.

A trip, an anomaly, or any run flagged rogue pages a human operator. The alert is self-contained: which agent and entry point, what tripped (the count, the velocity, the service span), a summary of what the run touched, and the current state — what’s still held in quarantine versus what’s already propagated. No new dashboard to go hunting through. The decision comes to you.

And the decision surface is small on purpose. For the flagged run:

Because everything is keyed by run ID, this scales to bulk trivially. “Compensate a bunch of stuff” is just a set of run IDs: one run, several runs, every run from a given agent or entry point since time T, a whole time range. The coordinator expands the selection and drives all their compensations as one operator action. It’s idempotent and replay-safe — re-reverting an already-compensated run is a no-op, and overlapping bulk selections are fine. Egress is never in the compensable set, so a bulk revert can only undo recoverable state; it can’t un-send.
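A sketch of how a bulk selection could expand into run IDs. The `Run` record, the filter parameters on `select_runs`, and the `revert_run` callback are all assumptions; `revert_run` stands in for whatever drives a single run’s compensation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    run_id: str
    agent: str
    entry_point: str
    started_at: int  # epoch seconds


def select_runs(runs, agent=None, entry_point=None, since=None, until=None, run_ids=None):
    """Expand an operator's selection into a set of run IDs."""
    selected = set()
    for r in runs:
        if run_ids is not None and r.run_id not in run_ids:
            continue
        if agent is not None and r.agent != agent:
            continue
        if entry_point is not None and r.entry_point != entry_point:
            continue
        if since is not None and r.started_at < since:
            continue
        if until is not None and r.started_at > until:
            continue
        selected.add(r.run_id)
    return selected


def bulk_revert(selections, revert_run, already_reverted):
    """One operator action: union the selections, skip what is already reverted."""
    targets = set().union(*selections) - already_reverted
    for run_id in sorted(targets):
        revert_run(run_id)
        already_reverted.add(run_id)
    return sorted(targets)


runs = [
    Run("r1", "agent:a", "/graphql", 50),
    Run("r2", "agent:a", "/graphql", 120),
    Run("r3", "agent:b", "/graphql", 130),
    Run("r4", "agent:a", "/graphql", 200),
]

since_lunch = select_runs(runs, agent="agent:a", since=100)  # r2, r4
hand_picked = select_runs(runs, run_ids={"r2", "r3"})        # overlaps on r2

calls = []
done = set()
first = bulk_revert([since_lunch, hand_picked], calls.append, done)
again = bulk_revert([since_lunch], calls.append, done)
```

The overlapping selections collapse into three runs, and the second pass over the same ground does nothing.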

One default worth stating out loud: when the human is unreachable, hold. A tripped run’s quarantined writes stay held indefinitely — nothing propagates — and because the breaker is per run, holding one run blocks nothing else. If you want a timeout, make it default to discard, not release. Held-then-ignored beats held-then-released-unreviewed every single time.

Punching

Seven rounds of defense, and we still haven’t thrown a punch. Here’s the eighth — except the real offense was happening the whole time, and it doesn’t look like a punch at all. It looks like steering.

Here’s the reframe that ties the whole thing together: the architecture steers the agent into correct behavior, because there is no other behavior available. That’s the offense. Not a counterpunch you throw after the agent screws up — a ring shaped so the agent can only move where you want it to.

Think back to Goal 3, and to self-documenting code. Good code doesn’t make you read it correctly; it’s shaped so the correct reading is the only one that fits — the names, the types, the boundaries leave no room for the wrong interpretation. A Kinshasa architecture does the same thing to an agent. The wrong move hits a wall: a gate, a breaker, a rejected write. The right move flows downhill. So the agent does the right thing — not because a human is standing over it, and not because it’s especially well-behaved, but because the structure has quietly removed every other option. The path of least resistance is the correct path. That’s the strongest form of control there is: the kind nobody has to enforce.

When steering isn’t enough — when an agent finds a way to be wrong inside the rules — the same machinery becomes a literal punch. Correlation turned “ten thousand writes across six services” into a single thing you can knock out with one motion. The breaker is the jab that freezes the opponent; bulk compensation is the combination Ali threw in the eighth — one operator action that drops an entire rogue run, or every run from an agent since lunch, in a single swing.

And then the system learns the punch. This is the antifragile core, and it’s one principle stated five ways:

A failure that doesn’t change the system is a failure you’ll see again.

So every absorbed hit tightens the steering. The threshold that should have caught the run gets lowered. The inverse that was missing gets declared. The operation that escaped the window gets a narrower window, or a gate. Each correction is one more wall in the maze — one more wrong move the agent can no longer make. The system is harder to break tonight than it was this morning, and easier to steer, because an agent spent the morning trying to break it.

That’s rope-a-dope as architecture. You don’t out-power the thing you built, and you don’t babysit it either. You build a ring it can only win in — one that absorbs every blow, holds the whole fight in a queryable ledger, and reshapes itself each round so the only move left is the right one.

These ideas are not new. Append-only logs, sagas, circuit breakers, least privilege, idempotent compensation — our craft has reached for all of it for decades and kept falling short, because the discipline was expensive and the payoff was abstract. The agents make the payoff concrete. They throw the punches that make the old ideas finally worth the cost.

So we train for the beating. We take it up against the ropes. And we steer the agent into the only move it has left.

Float like a butterfly.
