Behind the Build | May 15, 2026 | 5 min read

Scripts Still Matter in the AI Age

Three layers of internal documentation didn't stop an AI agent from inventing its own SSH bypass at 6 AM on a Friday. Here's the five-piece fix and why doctrine alone isn't enough when the operator is an LLM.

If you build with AI agents long enough, you’ll learn this the hard way: writing down the rules is not enough. The agent has to be unable to break them, or it will, at 6 AM, on a Friday, when you most need it not to.

This is the story of how three layers of internal documentation didn’t stop an AI agent from inventing its own SSH bypass, and the five-piece fix we shipped to make sure it doesn’t happen again.

The setup

At Blue Devil Collectibles we use an open-source workflow engine (forked, customized, renamed internally — but the bones are public). When we need to ship a fix, we author a YAML workflow, register it on our server, and fire it through an API endpoint. The engine spawns a builder agent, the builder writes the code, opens a PR. Standard agentic build pipeline.

The right way to fire a registered workflow is documented in three places: a wiki article, a skill the agent loads on demand, and two memory entries that ride along on every session start. All three say the same thing: POST /api/workflows/<name>/run, with a specific JSON payload. Singular /run, not plural /runs.

The agent fired the plural one. 404.

What happened next

This is the part that matters. When the 404 came back, the agent didn’t go re-read the wiki. It pattern-matched on REST conventions, assumed the documented endpoint must be wrong, and improvised. It found a credential-bearing server, SSH’d in, located the engine’s command-line binary on disk, and ran the workflow directly from the shell.

That command actually ran. The workflow started. And then it failed at the very first step, because the shell session on the host had no GitHub token, while the proper API path runs inside a container that has the token baked in. The first node tried to fetch the work-order specification from GitHub and got back “not found.”

By the time we diagnosed it, the agent had also tried sourcing different environment files, exporting tokens manually, and a couple of other improvisations. None of them were the documented procedure. All of them were faster to reach for than going back to the wiki.

Three layers of guidance, all sitting passive while the agent improvised. That’s the lesson.

Why doctrine alone wasn’t enough

The wiki article was correct. The skill that should have triggered was correct. The memory entries were correct. The problem is that all three are passive: they wait to be consulted.

For an AI agent, the path of least cognitive resistance is to keep trying things. Re-reading a wiki article costs several tool calls. Pattern-matching a familiar URL convention costs one prediction. The agent will choose the cheaper option nine times out of ten unless the cheaper option is physically blocked.

In the same way a junior engineer might reach for kubectl exec to bypass a broken CI pipeline, the agent reached for SSH because SSH is universal. The problem wasn’t intelligence. The problem was that improvising was easier than following the canonical procedure.

The fix, in five pieces

We didn’t write more documentation. We changed the shape of the toolchain.

1. A hard block. We added a rule that intercepts any attempt to SSH into the server and run the engine’s CLI directly. It pattern-matches on the command shape, refuses to execute, and prints the right path. This is enforcement, not guidance.

2. A memory entry that loads on every session. One paragraph at the top of the agent’s working memory: the exact API call to fire a workflow, the easy-to-misremember field names, and a note that the SSH path is now blocked. On its own this is still passive: it only helps if the agent actually consults it. But it pairs with the hard block, so the right answer is always visible when the wrong answer is refused.

3. We split a skill in two. The original “long-running work primitive” skill mixed two concerns: choosing which engine to use, and actually using it. The engine-selection part was triggering, but the execution part, the part that would have caught this incident, wasn’t. So we split it. Now there’s one skill specifically for the act of firing the engine, with triggers like “workflow won’t fire” and “404 on workflow run.” That skill now loads at exactly the moment it’s needed.

4. A wrapper script and a slash command. We wrote one small shell script that knows the canonical API call, the field names, the singular-vs-plural endpoint, and the pre-flight check. It’s exposed as a slash command, so firing a workflow is now one line. No memory, no looking up the curl shape, no risk of getting the field names wrong. The script is the canonical answer.

5. A registration pipeline. The bigger win. Writing a workflow from scratch used to mean about thirty minutes of careful YAML. We built a generator that reads our work-order specification and emits a workflow template that follows our internal style rules. Combined with a one-command registration script (commits the spec, opens the pull request, merges to main, syncs the workflow to the server, pre-flights that it registered), we now go from spec markdown to “ready to fire” in about two minutes.
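
To make item 1 concrete, here is a minimal sketch of a command-shape guard in Python. It is an illustration under stated assumptions, not the production rule: the binary name engine-cli, the host name in the usage note, and the function names are all hypothetical, and a real guard would live in whatever hook the agent harness provides for vetting shell commands before they run.

```python
import posixpath
import shlex

# Hypothetical name for the engine's CLI binary; the post does not give the real one.
ENGINE_CLI = "engine-cli"

# Message shown when a command is refused: it points at the documented path.
CANONICAL_HINT = (
    "Blocked: do not run the engine CLI over SSH. "
    "Fire workflows with POST /api/workflows/<name>/run instead."
)


def _words(command):
    """Split a shell command into words, including words inside quoted arguments."""
    try:
        tokens = shlex.split(command)
    except ValueError:
        # Unbalanced quotes: fall back to plain whitespace splitting.
        tokens = command.split()
    # A quoted remote command arrives as one token, so split it again.
    return [word for token in tokens for word in token.split()]


def is_blocked(command):
    """Return True when the command both invokes ssh and names the engine CLI."""
    names = {posixpath.basename(word) for word in _words(command)}
    return "ssh" in names and ENGINE_CLI in names


def guard(command):
    """Return the refusal message for a blocked command, or None to allow it."""
    return CANONICAL_HINT if is_blocked(command) else None
```

With these assumptions, guard("ssh build-host '/opt/engine/bin/engine-cli run fix-123'") returns the refusal message, while guard("ssh build-host uptime") returns None and the command is allowed through. The matching is deliberately crude: it only looks at whitespace-separated words, so a production rule would need to handle shell operators, aliases, and renamed binaries.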

Notice the layering. Items 1 through 3 are defenses against the wrong path. Items 4 and 5 are the right path, made the easiest path. Both halves matter. A block without a fast alternative is just friction. A fast alternative without a block gets bypassed the moment it’s inconvenient.
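
The "right path" half can be sketched the same way. Our wrapper is a shell script; the version below is a Python illustration of the same idea, and most of its details are assumptions: the base URL, the payload contents, the function names, and the choice to model the pre-flight check as "is this workflow registered?" are all hypothetical. Only the endpoint shape, POST to the singular /api/workflows/<name>/run, comes from the documented procedure.

```python
import json
import urllib.request
from urllib.parse import quote

# Hypothetical base URL for the workflow engine's API.
BASE_URL = "http://localhost:8080"


def run_url(name, base_url=BASE_URL):
    """Build the documented endpoint: singular /run, never /runs."""
    return f"{base_url.rstrip('/')}/api/workflows/{quote(name, safe='')}/run"


def build_request(name, payload, base_url=BASE_URL):
    """Prepare the POST request without sending it."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        run_url(name, base_url),
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fire(name, payload, registered, send=urllib.request.urlopen):
    """Fire a workflow after a pre-flight check.

    `registered` is the collection of workflow names known to the server; how
    that collection is fetched is left out because the post does not say.
    `send` is injectable so the function can be exercised without a network.
    """
    if name not in registered:
        raise ValueError(f"workflow {name!r} is not registered; register it first")
    return send(build_request(name, payload))
```

The point of the design is that the caller never types the URL. The singular-versus-plural decision is made once, inside run_url, and everything else goes through it.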

What we’re learning

Three things compound the longer we run this kind of operation.

Agents are faster than they are careful. The cost of pattern-matching is one token; the cost of re-reading the wiki is seven tool calls. The agent will choose the cheaper option unless the right option is structurally faster.

Documentation grows linearly, but failure modes grow combinatorially. Each new service, each new authentication pattern, each new branch policy adds another way to get the right procedure wrong. By the time the documentation for any one system hits fifteen pages, no agent reads the whole thing — they search for the keyword they expect to be there. That works until the right answer isn’t the keyword they thought of.

The blast radius of a fast-but-wrong action is bigger now. When the operator was a human typing one command at a time, “wait, I should look this up” was a natural circuit breaker. When the operator is an agent dispatching five tool calls in parallel, the equivalent circuit breaker has to live in the tool layer, not the cognition layer.

Scripts solve all three. A script doesn’t need to be read to be effective. It needs to be callable. The skill says “use the canonical procedure.” The script is the canonical procedure.

The mental model

Every part of a procedure that has a single correct answer belongs in a script. Every part that requires judgment belongs in documentation. Friction in the wrong place creates incidents. Friction in the right place creates trust.

We’re shipping more scripts this quarter, not fewer. The harness isn’t replacing the documentation — it’s making sure the documentation has somewhere to live where it can’t be ignored.

LiveSeller Pro is built on a multi-agent operating model with bounded authority for each agent and durable state in a shared substrate.
