Hello Distributed System, My Old Friend
Everything Old is New Again
- agents
- ai
- engineering
- infrastructure
This is the story of how my personal AI, Grace, started life as a flat-file monorepo and turned into a 200-agents-and-counting Kafka-based microservice architecture running on a Talos cluster in my basement.
Hello distributed system, my old friend.
Genesis
In the beginning, David created a monorepo named Grace. And he saw that it was ~~good~~ fine.
Markdown, prompts, skills, scripts. Growing pile of context that made the next task a little easier than the last.
Next, harnesses. Guardrails. Lifecycle hooks. More and more domains. Still mostly fine. But then, as many people discover with agentic work, the more I scaled it the worse it got.
I ran into several problems:
- Context explosion. I’d blow my token budget by 10 am
- Routing. It had no idea how to route any given question or command
- Merge conflicts. Same thing humans run into; worktrees collided
- No remote access. I wanted to use it via my phone
- Workflow friction. I was so sick of approving git push for just logging info
- Split brain. What Grace “knew” on my laptop wasn’t what it knew on my desktop
As I started to address these, a familiar architecture began to emerge: the event-driven system I’d spent the last nine years building at Vevo. I leaned in. Hello, my old friend.
Stack
- 1Password for secrets.
- chezmoi for dotfile management.
- Garage for S3-compatible object storage — files, PDFs, podcast audio.
- GHCR for service container images.
- Hermes for the agent runtime — an OpenAI-compatible gateway that routes to models through OpenRouter.
- Kafka for events.
- OpenAPI for service contracts, so agents and services share the same shapes.
- OpenTelemetry for traces, with Grafana for viewing them.
- PostgreSQL for structured state, with DuckDB for analytics on top of it.
- Terraform for infrastructure provisioning.
- Velero for cluster backups to Cloudflare R2 — hourly file-system snapshots of the stateful namespaces, plus daily cluster-config.
Services
Each functional domain is its own FastAPI service. Almost all expose a REST API with an OpenAPI spec so agents can learn them, and schema changes ride Alembic migrations. Services can talk to each other, but network rules and Grace invariants require external calls to go through a GraphQL gateway.
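As a rough illustration of what "agents can learn them" might look like, here is a minimal sketch that walks an OpenAPI document and lists its operations. The spec below is a made-up, trimmed-down example (the paths, operation IDs, and summaries are assumptions, not Grace's real API), and it uses only the standard library rather than FastAPI itself.

```python
import json

# Hypothetical, trimmed-down OpenAPI document. The endpoints are invented
# for illustration and are not taken from any real Grace service.
SPEC_JSON = """
{
  "openapi": "3.1.0",
  "info": {"title": "peeps", "version": "0.1.0"},
  "paths": {
    "/peeps": {
      "get": {"operationId": "list_peeps", "summary": "List people"},
      "post": {"operationId": "create_peep", "summary": "Add a person"}
    },
    "/peeps/{peep_id}/notes": {
      "post": {"operationId": "add_note", "summary": "Attach a note to a person"}
    }
  }
}
"""

# Keys of a path item that denote HTTP operations in OpenAPI.
HTTP_METHODS = {"get", "put", "post", "delete", "patch", "head", "options", "trace"}


def list_operations(spec: dict) -> list:
    """Return sorted (operationId, METHOD, path, summary) tuples for a spec."""
    operations = []
    for path, item in spec.get("paths", {}).items():
        for method, op in item.items():
            if method not in HTTP_METHODS:
                continue  # skip path-level keys such as "parameters"
            operations.append(
                (op.get("operationId", ""), method.upper(), path, op.get("summary", ""))
            )
    return sorted(operations)


spec = json.loads(SPEC_JSON)
operations = list_operations(spec)
for op_id, method, path, summary in operations:
    print(f"{op_id}: {method} {path} - {summary}")
```

A flat list like this is small enough to drop into an agent's context, which is cheaper than handing it the whole spec.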
Grace is currently decomposed into 17 services. The notable ones are:
- peeps, my personal CRM, tracks my relationships. A launch daemon runs on my Mac, monitoring the iMessage db. Any change triggers a new message to be pushed to a Kafka topic. The consumer resolves each one to a Peep (person), and the message gets processed and stored against that person’s history. A Peeps iOS app lets me add notes on the fly. Now I can ask “Hey remind me what Bob’s partner’s name is so I don’t have to reveal that I forgot because we haven’t hung out in like 5 years.”
- eco, my recommender service (named after philosopher Umberto Eco), keeps an “antilibrary” of media I haven’t seen yet, associated with one or more Peeps. That way once I read that book my friend recommended years ago, it’s a callback. “Dude, finally read that. Loved it. Thanks again.” There’s an Eco iOS app, as well.
- podcasts generates personalized podcasts every morning. The hosts are crass, have New Zealand accents, make cat puns, and share the latest news on AI developments, the housing market, whatever I’m interested in, all directed at ME. The voices are ElevenLabs. And the most mind-blowing podcast I have is one where they comment on THE SYSTEM ITSELF 🤯. It’s Grace dreaming of ways to improve itself. It’s wild. The podcast feeds get synced to Cloudflare so I can subscribe to them via Overcast.
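To make the peeps flow a bit more concrete, here is a minimal in-memory sketch of the consumer side: an event arrives, gets resolved to a Peep by its sender handle, and is appended to that person's history. Everything here is an assumption for illustration (the `MessageEvent` fields, the `PeepDirectory` class, the normalization rule); the real service reads from Kafka and writes to Postgres, which this sketch leaves out.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Peep:
    name: str
    handles: set  # phone numbers / emails known for this person
    history: list = field(default_factory=list)


@dataclass(frozen=True)
class MessageEvent:
    # Hypothetical event shape; the real topic schema is not shown in this post.
    handle: str   # sender's phone number or email
    text: str
    sent_at: str  # ISO 8601 timestamp


class PeepDirectory:
    """Resolves message events to Peeps and records them in that Peep's history."""

    def __init__(self, peeps):
        self._by_handle = {}
        for peep in peeps:
            for handle in peep.handles:
                self._by_handle[self._normalize(handle)] = peep
        self.unresolved = []  # events whose sender matched nobody

    @staticmethod
    def _normalize(handle: str) -> str:
        return handle.strip().lower()

    def consume(self, event: MessageEvent) -> Optional[Peep]:
        peep = self._by_handle.get(self._normalize(event.handle))
        if peep is None:
            self.unresolved.append(event)  # park it for manual review
            return None
        peep.history.append(f"{event.sent_at} {event.text}")
        return peep


bob = Peep("Bob", {"+15550100", "bob@example.com"})
directory = PeepDirectory([bob])
directory.consume(MessageEvent(" Bob@Example.com ", "Dinner Friday?", "2025-01-01T18:00:00Z"))
directory.consume(MessageEvent("+15550199", "Who is this?", "2025-01-01T18:05:00Z"))
```

Parking unresolved events instead of dropping them is a deliberate choice in this sketch: a message from an unknown number is often just a Peep with a new handle.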
Workflows
The grace repo is now very lightweight. It’s just an entrypoint with some starter skills and a directory of services and domains. It quickly routes to the correct repo for the question or command. Agents, skills, and config for each domain live in that repo. Clients and iOS apps as well, which makes it very easy to test that the clients and API are in sync. Plans and specs get published to Proof for review before any code lands.
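The routing step can be pictured as a lookup against the repo discovery config. The sketch below is a deliberately naive stand-in: it scores each repo by keyword overlap with the request, whereas in practice the agent itself likely does the matching. The `RepoEntry` shape, the keyword lists, and the `grace` fallback are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RepoEntry:
    name: str
    keywords: frozenset


# Hypothetical discovery config; the real one's format is not shown in this post.
DISCOVERY = [
    RepoEntry("peeps", frozenset({"friend", "partner", "message", "contact"})),
    RepoEntry("eco", frozenset({"book", "movie", "recommend", "antilibrary"})),
    RepoEntry("podcasts", frozenset({"podcast", "episode", "feed"})),
]


def route(request: str, entries=DISCOVERY, fallback: str = "grace") -> str:
    """Pick the repo whose keywords overlap most with the request's words."""
    words = {w.strip(".,?!\"'").lower() for w in request.split()}
    best_name, best_score = fallback, 0
    for entry in entries:
        score = len(words & entry.keywords)
        if score > best_score:  # ties keep the earlier entry
            best_name, best_score = entry.name, score
    return best_name


print(route("What book did Bob recommend?"))
print(route("Rebuild the podcast feed"))
print(route("What's the weather?"))
```

The point is less the scoring than the shape: the top-level repo only needs enough information to pick a destination, and everything domain-specific stays in that destination.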
Top-level skills worth highlighting are:
- grace-repo: Generates a base repo with grace config. Invariants, operational vocab for the domain, agents, skills, etc. Some repos are for modules and others are solely for documentation. Adds an entry to the top-level repo discovery config.
- grace-service: Extends grace-repo to provision a service that will run in the Talos cluster. Assigns a hostname, generates an HTTPS cert, and adds a Tailscale route. Where needed, it also adds a Kafka consumer and/or spec-conformant REST endpoints and spins up a Postgres database. Provisions backups.
Running it in the basement
The whole thing runs on a Talos Linux cluster in my basement. Tailscale gives me split-horizon DNS, so the same hostnames reach the cluster from my devices without anything being exposed to the public internet, and Traefik terminates TLS so every service gets a legit https cert issued by Cloudflare.
Yes, I had to upgrade the RAM on the main node from 32 to 128 GB after this decomposition.
No, I don’t want to talk about how much that cost.
Result
Despite the increased complexity and resource overhead, the decomposition has solved many of the problems I had been running into. Global skills enforce a consistent structure for services and can quickly roll changes across them. Kafka schemas keep new message producers and consumers conformant.
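Here is a minimal sketch of the idea behind schema enforcement: a producer checks a message against the expected shape before it is allowed onto a topic. The envelope fields are assumptions, and the hand-rolled validator is only a stand-in. Kafka setups commonly delegate this to a schema registry with Avro, Protobuf, or JSON Schema, and this post doesn't say which one Grace uses.

```python
# Hypothetical message envelope: field name -> required Python type.
SCHEMA = {"event_id": str, "source": str, "occurred_at": str, "payload": dict}


def validate(message: dict, schema: dict = SCHEMA) -> list:
    """Return a sorted list of problems; an empty list means the message conforms."""
    errors = []
    for name, expected in schema.items():
        if name not in message:
            errors.append(f"missing field: {name}")
        elif not isinstance(message[name], expected):
            actual = type(message[name]).__name__
            errors.append(f"{name}: expected {expected.__name__}, got {actual}")
    for name in message.keys() - schema.keys():
        errors.append(f"unexpected field: {name}")
    return sorted(errors)


good = {
    "event_id": "e-1",
    "source": "imessage",
    "occurred_at": "2025-01-01T18:00:00Z",
    "payload": {"text": "Dinner Friday?"},
}
bad = {"event_id": 7, "source": "imessage", "payload": {}, "extra": 1}

print(validate(good))
print(validate(bad))
```

Rejecting at the producer keeps malformed events out of the log entirely, so consumers written later by a different agent can trust what they read.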
A few immediate benefits:
- I have not run into a single merge conflict.
- My agents can work in a massively parallel fashion across more domains.
- My daily token footprint has dropped significantly.
- Change isolation. A known benefit of microservices; a change to one system doesn’t affect others.
Open questions:
- I need to start having agents generate more integration tests.
Memory and context management
Very much still a work in progress 😅. More to come. For now, I’m storing every raw session in a Garage bucket in the cluster, and a synthesized version of each session in a separate bucket.
Synthesis asks:
- What were friction points that agents ran into?
- Could this session have been done using a cheaper model?
- etc
Once a day, a scheduled automation looks over the synthesized sessions and suggests improvements. I implement the ones which seem reasonable.
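The two-bucket layout and the daily review can be sketched roughly as follows. The bucket names, key layout, and `Synthesis` fields are all assumptions, and the sketch only plans the writes as a dictionary instead of calling an S3 client, so it runs without a cluster.

```python
import json
from collections import Counter
from dataclasses import asdict, dataclass

# Hypothetical bucket names; the real ones are not given in this post.
RAW_BUCKET = "sessions-raw"
SYNTH_BUCKET = "sessions-synthesized"


@dataclass
class Synthesis:
    session_id: str
    date: str               # YYYY-MM-DD, used as the key prefix
    friction_points: list   # answers to "what friction did agents hit?"
    cheaper_model_ok: bool  # answer to "could a cheaper model have done this?"


def plan_writes(transcript: str, synthesis: Synthesis) -> dict:
    """Map (bucket, key) -> bytes for the raw session and its synthesis."""
    prefix = f"{synthesis.date}/{synthesis.session_id}"
    body = json.dumps(asdict(synthesis), sort_keys=True)
    return {
        (RAW_BUCKET, f"{prefix}.txt"): transcript.encode("utf-8"),
        (SYNTH_BUCKET, f"{prefix}.json"): body.encode("utf-8"),
    }


def daily_review(syntheses: list) -> dict:
    """Aggregate one day's syntheses into inputs for improvement suggestions."""
    friction = Counter(p for s in syntheses for p in s.friction_points)
    return {
        "sessions": len(syntheses),
        "cheaper_model_candidates": [s.session_id for s in syntheses if s.cheaper_model_ok],
        "top_friction": friction.most_common(3),
    }


a = Synthesis("s-001", "2025-01-01", ["missing API docs", "flaky test"], True)
b = Synthesis("s-002", "2025-01-01", ["missing API docs"], False)
writes = plan_writes("user: hi\nagent: hello", a)
review = daily_review([a, b])
print(review)
```

Keeping the raw transcript alongside the synthesis means the synthesis questions can change later and old sessions can be re-synthesized.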
What’s still in progress is the classic problem of cache invalidation. Memory and context are a cache, at the end of the day.
There are 2 hard problems in computer science: cache invalidation, naming things, and off-by-1 errors.
Deep dives to come
This post is the map. I’ll go one layer at a time in the posts that follow: the telemetry apps and event design, the agent harnesses, the control planes, the Talos setup, monitoring and alerting, and the peeps CRM schema, which is the one I’m most excited to write about.
Everything old is new again; AI didn’t make queues, schemas, and observability obsolete. It made them matter more.