
Hello Distributed System, My Old Friend

Everything Old is New Again

[Comic: four panels titled 'Hello Distributed Systems, My Old Friend'. Panel 1: a stressed developer at a desk yelling about a monolith that won't scale, with 'WORKTREE COLLISION!' and 'TOKENS EXCEEDED!' speech bubbles. Panel 2: the developer hands the monolith to a red devil ('Hey Satan! Can you decompose this monolith?' / 'Certainly, David. For a price.'). Panel 3: the developer holds two glowing orbs before a sprawling microservices graph, marveling 'Wow! So scalable! But why isn't anything talking to each other?!' amid 'DATA INCONSISTENCY' and 'NETWORK FAILURE' labels. Panel 4: the developer sits cross-legged, screaming, surrounded by red 503 / TIMEOUT / CIRCUIT BREAKER / FAIL tiles, captioned 'I've traded a large fire for a thousand tiny, interconnected fires... Hello, distributed systems, my old friend!']
Only kidding. It was a painful migration, but it's the right shape and it's worked great so far.

This is the story of how my personal AI, Grace, started life as a flat-file monorepo and turned into a 200-agents-and-counting Kafka-based microservice architecture running on a Talos cluster in my basement.

Hello distributed system, my old friend.

Genesis

In the beginning, David created a monorepo named Grace. And he saw that it was good. Well, fine.

Markdown, prompts, skills, scripts. Growing pile of context that made the next task a little easier than the last.

Next, harnesses. Guardrails. Lifecycle hooks. More and more domains. Still mostly fine. But then, as many people discover with agentic work, the more I scaled it the worse it got.

I ran into several problems:

As I started to address these, a familiar architecture started to emerge: the event-driven system I’d spent the last nine years building at Vevo. I leaned in. Hello, my old friend.

Stack

Services

Each functional domain is its own FastAPI service. Almost all expose a REST API with an OpenAPI spec so agents can learn them, and schema changes ride Alembic migrations. Services can talk to each other, but network rules and Grace invariants require external calls to go through a GraphQL gateway.
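As a minimal sketch of the gateway invariant described above: the names here (`Origin`, `Call`, `GATEWAY`, `is_allowed`) are hypothetical, and the sketch assumes that "external calls" means calls arriving from outside the cluster. In practice a rule like this would more likely live in network policy than in application code; the snippet only illustrates the shape of the rule.

```python
from dataclasses import dataclass
from enum import Enum


class Origin(Enum):
    INTERNAL = "internal"  # another service inside the cluster
    EXTERNAL = "external"  # a client outside the cluster


# Hypothetical service name for the GraphQL gateway (an assumption).
GATEWAY = "graphql-gateway"


@dataclass(frozen=True)
class Call:
    origin: Origin
    target: str  # name of the service being called


def is_allowed(call: Call) -> bool:
    """Internal callers may reach any service; external callers only the gateway."""
    if call.origin is Origin.INTERNAL:
        return True
    return call.target == GATEWAY
```

Under these assumptions, `is_allowed(Call(Origin.EXTERNAL, "some-service"))` is rejected, while the same call from an internal origin is allowed.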

Grace is currently decomposed into 17 services. The notable ones are:

Workflows

The grace repo is now very lightweight. It’s just an entrypoint with some starter skills and a directory of services and domains. It quickly routes to the correct repo for the question or command. Agents, skills, and config for each domain live in that domain’s repo. Clients and iOS apps live there too, which makes it easy to test that the clients and API stay in sync. Plans and specs get published to Proof for review before any code lands.

Top-level skills worth highlighting are:

Running it in the basement

The whole thing runs on a Talos Linux cluster in my basement. Tailscale gives me split-horizon DNS, so the same hostnames reach the cluster from my devices without anything being exposed to the public internet. Traefik terminates TLS, so every service gets a legit HTTPS cert issued by Cloudflare.

Yes, I had to upgrade the RAM on the main node from 32 to 128 GB after this decomposition.

No, I don’t want to talk about how much that cost.

Result

Despite the increased complexity and resource overhead, it’s solved a lot of the problems I had been running into. Global skills enforce a consistent structure across services and can quickly roll changes out to all of them. Kafka schemas enforce conformity for new message producers and consumers.
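To make the schema point concrete, here is a minimal stdlib-only sketch of checking a message against a schema before it is produced. Everything in it is an assumption for illustration: the field names, `EXAMPLE_EVENT_SCHEMA`, `validate`, and `produce` are hypothetical, and a plain list stands in for a Kafka topic. A real setup would more likely use a schema registry with Avro, Protobuf, or JSON Schema.

```python
from typing import Any

# Hypothetical schema for an illustrative event; field names are assumptions.
EXAMPLE_EVENT_SCHEMA: dict[str, type] = {
    "event_id": str,
    "service": str,
    "occurred_at": str,  # ISO 8601 timestamp, kept as a string here
    "payload": dict,
}


class SchemaError(ValueError):
    """Raised when a message does not conform to its schema."""


def validate(message: dict[str, Any], schema: dict[str, type]) -> None:
    """Reject messages with missing, unknown, or mistyped fields."""
    missing = sorted(schema.keys() - message.keys())
    if missing:
        raise SchemaError(f"missing fields: {missing}")
    unknown = sorted(message.keys() - schema.keys())
    if unknown:
        raise SchemaError(f"unknown fields: {unknown}")
    for name, expected in schema.items():
        if not isinstance(message[name], expected):
            actual = type(message[name]).__name__
            raise SchemaError(f"{name}: expected {expected.__name__}, got {actual}")


def produce(topic: list, message: dict[str, Any], schema: dict[str, type]) -> None:
    """Validate first, then append to the stand-in topic."""
    validate(message, schema)
    topic.append(message)
```

The design choice being illustrated is that a nonconforming producer fails at the boundary, before any consumer sees the message.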

A few immediate benefits:

Open questions:

Memory and context management

Very much still a work in progress 😅. More to come. For now, I’m storing all sessions raw in a Garage bucket in the cluster, and a synthesized version of each session in a separate bucket.

Synthesis asks:

Once a day, a scheduled automation looks over the synthesized sessions and suggests improvements. I implement the ones that seem reasonable.
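The raw-plus-synthesized layout and the daily review pass could be sketched roughly like this. It is a sketch under stated assumptions, not the actual implementation: the bucket names, the date-prefixed key layout, and the `InMemoryStore` class are all hypothetical, with the in-memory store standing in for an S3-compatible bucket API so the example runs on its own. The `synthesize` callable is a placeholder for whatever produces the synthesized version.

```python
from datetime import date
from typing import Callable

RAW_BUCKET = "sessions-raw"      # hypothetical bucket name
SYNTH_BUCKET = "sessions-synth"  # hypothetical bucket name


class InMemoryStore:
    """Stand-in for an object store, keyed by (bucket, key)."""

    def __init__(self) -> None:
        self._objects: dict[tuple[str, str], str] = {}

    def put(self, bucket: str, key: str, body: str) -> None:
        self._objects[(bucket, key)] = body

    def get(self, bucket: str, key: str) -> str:
        return self._objects[(bucket, key)]

    def list_keys(self, bucket: str, prefix: str = "") -> list[str]:
        return sorted(
            k for (b, k) in self._objects if b == bucket and k.startswith(prefix)
        )


def session_key(day: date, session_id: str) -> str:
    # Date-prefixed keys (an assumed layout) make "everything from one day" a prefix listing.
    return f"{day.isoformat()}/{session_id}.md"


def archive_session(
    store: InMemoryStore,
    day: date,
    session_id: str,
    raw: str,
    synthesize: Callable[[str], str],
) -> str:
    """Store the raw session and its synthesized version under the same key."""
    key = session_key(day, session_id)
    store.put(RAW_BUCKET, key, raw)
    store.put(SYNTH_BUCKET, key, synthesize(raw))
    return key


def sessions_for_review(store: InMemoryStore, day: date) -> list[str]:
    """Collect one day's synthesized sessions for the scheduled review."""
    prefix = f"{day.isoformat()}/"
    return [store.get(SYNTH_BUCKET, k) for k in store.list_keys(SYNTH_BUCKET, prefix)]
```

Using the same key in both buckets keeps it easy to get from a synthesized session back to its raw source.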

What’s still in progress is the classic problem of cache invalidation. Memory and context are a cache, at the end of the day.

There are 2 hard problems in computer science: cache invalidation, naming things, and off-by-1 errors.

Deep dives to come

This post is the map. I’ll go one layer at a time in the posts that follow: the telemetry apps and event design, the agent harnesses, the control planes, the Talos setup, monitoring and alerting, and the peeps CRM schema I’m most excited to write about.

Everything old is new again; AI didn’t make queues, schemas, and observability obsolete. It made them matter more.
