Behind the BuildMay 15, 20265 min read

The Harness Builds the Harness: 14 PRs in 18 Hours

What happens when you fork an open-source workflow engine, layer a behavior policy on top, define five workflow templates, and dispatch a coordinated AI team against them in one weekend. The substrate is the durable asset.

If you’re running a small engineering shop with AI agents doing the bulk of the work, the question you wrestle with every week is throughput. One builder, working sequentially, ships five to ten work orders a week. That’s not enough when you have a paying customer install date six weeks out and five active product tracks to ship before then.

Last weekend we built the throughput multiplier. Eighteen hours, fourteen pull requests, one fork of an open-source workflow engine, and a coordinated team of three AI agents that handed work to each other through a shared substrate. This post is about what we actually did, and why the substrate — the documentation, work-order specs, manifests, and enforcement rules — outlived every individual tool decision we made along the way.

The setup

Blue Devil Collectibles is a comic-shop SaaS. We run a live-stream selling product (LiveSeller Pro), an inventory and pull-list ops layer (Shop Ops), a customer-relationship product (Cheers), and a multi-tenant storefront. We have one hard date on the calendar: our first paying brick-and-mortar tenant installs in mid-June. That’s the deadline that drives everything else.

The bottleneck isn’t ideas. It’s the rate at which finished, tested, validated code can land on main. Each work order needs a specification, an architecture review, an implementation, tests that exercise real behavior, a manifest documenting what was built, a code review across two different AI model families, and a final human sign-off. Doing that work one at a time isn’t going to land Mutant Market on schedule. So we built a parallel pipeline.

The plan in one weekend

The plan was simple to describe and harder to execute. Take an open-source workflow engine. Fork it. Patch a few bugs the upstream maintainer hadn’t gotten to. Wire it up to our model providers with automatic failover between them. Define five workflow templates covering the work classes we ship most often. Layer in a behavior policy that every agent inherits regardless of which model is running. And — the part that matters — make sure the substrate (specs, manifests, the wiki, the enforcement rules) survives every individual tool decision.

What we ended up shipping looked like this.

The behavior policy came first

Before any code, we wrote a four-principle behavior contract that every agent in our pipeline now loads at startup. Think before building. Simplicity first. Surgical changes only. Goal-driven execution. Vendor-neutral language, deliberately, so that if we route a task to a different model next quarter the policy still applies.

Why this came first: every recent failure we’d filed traced back to one of those four gaps. An agent silently guessing a database schema. An agent reformatting unrelated files. An agent declaring a task complete when the success criteria hadn’t been met. Documentation alone wasn’t catching them. We added an enforcement rule on top of the policy that blocks any significant tool call until the policy has been acknowledged in the session. Three copies of the policy file, distributed across three machines, kept in sync, hash-verified.

Forking the engine

The upstream engine — open-source, MIT-licensed, well-documented — had three minor bugs that bit us in our first round of tests. Variable substitution worked in one kind of workflow node but not another. Loop nodes defaulted to reusing context across iterations instead of starting fresh. Model selection defaulted to the most expensive tier when not declared.

Three small patches in the fork. About three hours of engineering. We pointed our production deployment at the fork build, re-tested, and the engine was solid. Total time investment: less than half a workday.

Five workflow templates

One template for new features. One for documentation and behavior-rule updates. One for tech debt and cleanup. One for bug fixes (with a “reproduce the bug in a test first” requirement). One for infrastructure work. Each template embeds the behavior policy, declares which model tier handles which kind of node, and produces a structured manifest the human reviewer can verify against the actual diff.

All five workflows registered on our server and shipped within the same day.

The control plane

The workflow engine is the dispatcher. The brain that decides which model handles which kind of work — and falls over to a backup model if the primary is rate-limited or unavailable — is a separate small service we wrote. About 200 lines of Python. It also handles a job that comes up a lot: reading a completed work order and grading whether the stop conditions were actually met. Half of the failure modes in agentic systems are “the agent declared done but the test didn’t actually pass.” A second pair of eyes on every completion catches them.

Code review across model families

The last piece we shipped that day was automatic code review on three of our customer-facing repositories. Every pull request now gets an automated review from a different model family than the one that wrote the code. We’ve written about this pattern before. The disagreement between two model families is more valuable than the agreement of two reviewers from the same family. Same blind spots are shared blind spots.

What the day actually felt like

Eighteen hours of work, fourteen pull requests landed, two doctrine misses caught and patched (one about where automated reviewers look for completion data, one about not reusing pull-request branches across multiple work orders), and a substrate snapshot worth keeping.

It wasn’t smooth. Every time we hit friction, the friction came from a place we hadn’t documented well enough. One agent got muscle-memory confused and SSH’d into the wrong production server. Another asked, in the middle of the day, how to even see the dashboard for the new system — because nothing in our discoverability layer made the URL obvious.

Each piece of friction either banked a new rule or built a new tool. By the end of the day the toolchain was visibly faster than it had been at the start.

What we’re not claiming

This is not autonomous AI building a SaaS product. The harness — that’s what we call this collection of substrate plus engine plus templates — doesn’t deploy to production. That authority sits with a human. It doesn’t make architecture decisions; those go to a designated review agent and ultimately to a human for sign-off on anything material. It doesn’t merge pull requests blindly. Customer-blast-radius work waits on a human reviewer before it lands.

What the harness does is take the mechanical, repetitive work — specification, implementation, testing, manifest construction, status flipping, cross-reviewer validation — and make it parallel. The human stays on the wheel. The harness paves the road.

Why it matters

Mutant Market installs in mid-June. Five active product tracks. One paying customer who needs every one of them to work on install day. Single sequential builder math says five to ten work orders a week. Three parallel pipelines through the harness says fifteen to thirty. Five parallel pipelines says twenty-five to fifty.

The harness isn’t the product. The product is the comic-shop SaaS. The harness is what lets one founder and one builder and a coordinated AI team ship enough finished, tested, validated product to land Mutant Market on schedule.

The next thing the harness has to do is build its own next layer — a substrate that survives across the laptop, the production server, and the AI dashboard, with work-order specifications mirrored to a place every agent can reach. That part starts next.

LiveSeller Pro is built on a multi-agent operating model with bounded authority for each agent and durable state in a shared substrate.

Strike Once. List Everything. | livesellerpro.app