the last hop can't be CI
You built an agent whose front door is something
CI can't drive — a phone call, a voice provider, a webhook console you don't
host. Your unit tests are honest about the parts they reach: the tool
implementations, the auth check, the ledger. But the conversation
itself goes untested. Does the persona hold for a whole exchange? Does
the model call repo_state instead of confidently inventing the
project's status? Does the entire auth-marshalling-ledger chain actually
hold up when a live model — not a fixture — is the thing driving it?
The usual answer is "call it and see," which means a human on every check and a provider key in every environment. That's not a test, it's a chore — and it's exactly the chore that doesn't get done. The groundskeeper, this project's voice agent over its own repo, hit this wall directly: voice can't be CI because the founder isn't always on the call, and the conversation key may not even be provisioned. So the conversation layer — the most important layer — was the one nothing covered.
the DEFINITION
rehearsal (n.) 1. a text self-demo of a voice-agent conversation: scripted human lines drive a live model wearing the production persona, and every tool call runs through the real production router, with no external service and no human. The output is a committed transcript.
A rehearsal is three things at once — a test, a demo, and a document. It tests because a model really drives the conversation through real code. It demos because the side effects are real (more on that, honestly, below). And it documents because the result is a datestamped org file you can read the way you read any plan. "Persona held" stops being a vibe and becomes a line you can point at, diff, and revert.
cut at the last SEAM
The whole idea is one decision: where do you put the knife? Stack the conversation as a layer cake. At the top is the telephony hop — the provider's network, their voice synthesis, their webhook delivery. That's theirs; you don't own a line of it. Everything below it is yours: the persona, the model, the router's auth header, the five tool implementations, the task ledger. A rehearsal cuts at the highest seam it can reach and exercises everything underneath:
```mermaid
flowchart TD
  tele["telephony · voice synth · webhook delivery<br/>— the provider's, NOT yours —"]:::out
  persona["persona + live model<br/>does it hold? does it call tools?"]
  auth["router auth — x-gk-secret, fails closed"]
  tools["the five tool implementations"]
  ledger["the task ledger"]
  tele -. "the one hop a rehearsal can't reach" .-> persona
  persona --> auth --> tools --> ledger
  classDef out fill:#e7e4dc,stroke:#9a9a9a,stroke-dasharray:5 4,color:#6a6a6a
  style persona fill:#13d943,stroke:#121316,stroke-width:2.5px
  style auth fill:#9fc4e8,stroke:#121316
  style tools fill:#aee5c2,stroke:#121316
  style ledger fill:#f2ddb0,stroke:#121316
```
Read the picture top to bottom. The grayed, dashed box at the top is the only thing left untested — the provider's telephony, struck out because it's not your code. Below the cut, four solid layers light up: persona and model in green, then auth, then tool implementations, then the ledger. The knife goes right under the provider, so the untested surface is as small as it can possibly be — one hop you genuinely don't own, and nothing else.
one line, six ROUNDS
Mechanically, a rehearsal is a reduce over scripted human lines. You hand
Rehearsal.run/2 a list of "founder" lines; the agent side is a
live model. For each line, the model completes. If it wants tools, it
calls them, the results are appended, and it completes again, for up to
six rounds (@max_tool_rounds 6) before it's forced to answer in
plain words. One founder line, then, can chain several tool calls before the
reply lands:
```mermaid
sequenceDiagram
  participant F as founder line (scripted)
  participant M as live model (persona)
  participant R as the real router
  participant L as the ledger
  F->>M: a scripted human line
  M->>R: round 1 — repo_state(summary)
  R->>L: read state
  L-->>M: grounded answer
  M->>R: round 2 — dispatch(goal)
  R->>L: file task gk-201
  L-->>M: dispatched
  Note over M: ≤ 6 rounds, then forced to answer
  M-->>F: plain reply — written to the transcript
```
Walk the exchange. A scripted line goes to the model. The model asks the
real router for repo state and gets a grounded answer back from the ledger.
It dispatches a background goal, the ledger files a task, the router confirms.
After at most six such rounds, the model is forced to stop calling tools and
answer in words — and that reply, plus every tool call along the way, is
written into the transcript. One detail keeps it faithful to production: the
assistant message sent back to the model API is stripped to just its role,
content, and tool calls — the same echo convention the real agent loop uses.
And if the model errors, the rehearsal doesn't crash; the turn simply reads
(llm error: …), so a flaky model never costs you the whole run.
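The loop walked through above can be sketched in a few dozen lines. This is a minimal illustration, not the project's code: the module name `RehearsalSketch`, the injected `complete` and `call_tool` functions, and the message shapes are all hypothetical, and forcing a plain answer by withholding tools after the sixth round is an assumption about how the cap is enforced. Injecting the model and the tool caller lets the sketch run with no model key at all.

```elixir
defmodule RehearsalSketch do
  # Hypothetical sketch of the rehearsal loop. `complete` stands in for the
  # live model and `call_tool` for the router call; both are injected.
  @max_tool_rounds 6

  # Reduce over the scripted founder lines, threading the message history.
  def run(lines, complete, call_tool) do
    {turns, _history} =
      Enum.reduce(lines, {[], []}, fn line, {turns, history} ->
        history = history ++ [%{role: "user", content: line}]
        {reply, calls, history} = turn(history, complete, call_tool, 0, [])
        {turns ++ [%{founder: line, groundskeeper: reply, tool_calls: calls}], history}
      end)

    turns
  end

  # One founder line: complete, run any requested tools, complete again.
  defp turn(history, complete, call_tool, round, calls) do
    # Assumption: once the cap is reached, tools are withheld so the model
    # has to answer in plain words.
    allow_tools? = round < @max_tool_rounds

    case complete.(history, allow_tools?) do
      {:ok, %{tool_calls: [_ | _] = requested} = msg} when allow_tools? ->
        results =
          Enum.map(requested, fn call ->
            Map.put(call, :result, call_tool.(call.name, call.args))
          end)

        tool_msgs =
          Enum.map(results, fn r -> %{role: "tool", name: r.name, content: r.result} end)

        # Echo back only role, content, and tool calls.
        echo = %{role: "assistant", content: Map.get(msg, :content), tool_calls: requested}
        turn(history ++ [echo | tool_msgs], complete, call_tool, round + 1, calls ++ results)

      {:ok, msg} ->
        reply = Map.get(msg, :content) || ""
        {reply, calls, history ++ [%{role: "assistant", content: reply}]}

      {:error, reason} ->
        # A model error becomes the turn's text instead of crashing the run.
        reply = "(llm error: #{inspect(reason)})"
        {reply, calls, history ++ [%{role: "assistant", content: reply}]}
    end
  end
end

# Usage with a scripted stand-in model: one tool request, then an answer.
fake_model = fn history, _allow_tools? ->
  case List.last(history) do
    %{role: "user"} -> {:ok, %{content: nil, tool_calls: [%{name: "repo_state", args: %{}}]}}
    %{role: "tool", content: c} -> {:ok, %{content: "State: " <> c}}
  end
end

fake_tool = fn "repo_state", _args -> "3 open tasks" end

turns = RehearsalSketch.run(["Where are we?"], fake_model, fake_tool)
IO.inspect(turns)
```

A real rehearsal would pass a function that calls the live model and one that builds the in-process router request; the shape of the reduce stays the same.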
through the REAL router
depth rung · skippable — the thirteen lines that make it not a mock
Here is the seam itself. When the model calls a tool, the rehearsal does not reach into a fake — it builds an in-process HTTP request and hands it to the literal router module that serves production:
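```elixir
# Build the request in memory, shaped like the provider's webhook call.
conn =
  Plug.Test.conn(:post, "/tool/#{name}", Jason.encode!(args))
  |> Plug.Conn.put_req_header("content-type", "application/json")
  |> Plug.Conn.put_req_header("x-gk-secret", secret)
  # Hand it to the same router module that serves production.
  |> Workbooks.Groundskeeper.Router.call(Workbooks.Groundskeeper.Router.init([]))
```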
Now set that beside what the deployed provider actually sends. The
provisioned webhook tool posts to {base_url}/gk/tool/{name} with
the header x-gk-secret and the args as a JSON body. Same route (the
rehearsal's /tool/{name} is the path as the router module sees it; the /gk
prefix is presumably where production mounts that router), same header,
same body schema. Only the HTTP transport differs: the
rehearsal's request is constructed in memory instead of arriving over the
wire, and from there it is byte-for-byte the production path. So a single
rehearsal exercises, for free, everything that path touches: the
constant-time secret compare that fails closed with a 503 when the
secret is unset, the JSON marshalling, the five real tool implementations,
and the ledger writes. A mocked tool call skips all of it.
| | a mocked tool call | the real-router call |
|---|---|---|
| auth header check | skipped — no router | runs — constant-time compare, fails closed |
| JSON marshalling | bypassed — you pass a map | encode in, decode out, as in prod |
| tool implementation | stubbed return value | the real five, with real side effects |
| the ledger | untouched | real reads and writes |
| what's faked | everything below the call | only the wire transport |
The verdict of that table is the whole reason a rehearsal counts: a mock fakes everything below the call, while the real-router rehearsal fakes only the wire — auth, marshalling, the implementations, and the ledger all run exactly as production runs them.
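To make the first row of that table concrete, here is a minimal sketch of a fail-closed secret check. It is an illustration under stated assumptions, not the router's actual code: the module `AuthSketch` is hypothetical, the 401 for a wrong or missing header is an assumption (the text above only confirms the 503 for an unset secret), and the hand-rolled comparison stands in for whatever constant-time compare the real router uses.

```elixir
defmodule AuthSketch do
  import Bitwise

  # Fail closed: with no secret configured, nothing is allowed through.
  def check(_presented, configured) when configured in [nil, ""], do: {:error, 503}

  def check(presented, configured) when is_binary(presented) and is_binary(configured) do
    # Assumption: a mismatch maps to 401; the source does not say.
    if secure_compare(presented, configured), do: :ok, else: {:error, 401}
  end

  # A missing header is treated like a mismatch (also an assumption).
  def check(_presented, _configured), do: {:error, 401}

  # Length is checked first; after that every byte pair is XORed and folded,
  # so the comparison does not return early at the first differing byte.
  defp secure_compare(a, b) when byte_size(a) != byte_size(b), do: false

  defp secure_compare(a, b) do
    a
    |> :binary.bin_to_list()
    |> Enum.zip(:binary.bin_to_list(b))
    |> Enum.reduce(0, fn {x, y}, acc -> acc ||| bxor(x, y) end)
    |> Kernel.==(0)
  end
end

IO.inspect(AuthSketch.check("s3cret", nil))
IO.inspect(AuthSketch.check("s3cret", "s3cret"))
```

A mocked tool call never enters a function like this, which is why an unset secret only shows up when the request goes through the router.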
one def, two CONSUMERS
A demo that drifts from production isn't a demo of production. So the
persona isn't retyped for the rehearsal — it's read from the same
definition file the live provisioning reads. Both the rehearsal and the
provider-provisioning code extract the prompt from groundskeeper.org,
from under the ** System prompt heading, and both raise if
it's missing rather than running an agent with an empty brain:
```mermaid
flowchart TD
  def[["agent/groundskeeper.org<br/>** System prompt — the single source"]]
  def --> prov["provisioning → the live deployed agent"]
  def --> reh["Rehearsal.run → the self-demo"]
  style def fill:#9fc4e8,stroke:#121316,stroke-width:2.5px
  style prov fill:#ffffff,stroke:#121316
  style reh fill:#aee5c2,stroke:#121316
```
One file, two arrows. The deployed agent and the rehearsal both pull their
persona from the exact same heading in the exact same file — so the personality
you prove in a rehearsal is, definitionally, the one production runs. The
** System prompt heading is load-bearing here: a def whose prompt
doesn't live under it parses to nothing and the agent runs empty — a failure
this project has actually shipped before, which is why both consumers raise
loudly instead of falling through. The authoring lesson
is where that convention lives.
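The "read one heading, raise if it's missing" behavior is small enough to sketch. This is a hedged illustration, not the project's parser: the module `PromptSketch`, the rule that the prompt runs until the next org heading, and the sample prompt text are all assumptions made for the example.

```elixir
defmodule PromptSketch do
  @heading "** System prompt"

  # Return the text under the heading, up to the next org heading.
  # Raise instead of returning an empty prompt.
  def extract!(org_text) do
    body =
      org_text
      |> String.split("\n")
      |> Enum.drop_while(&(String.trim_trailing(&1) != @heading))
      |> Enum.drop(1)
      |> Enum.take_while(&(not heading?(&1)))
      |> Enum.join("\n")
      |> String.trim()

    if body == "" do
      raise ArgumentError, "no prompt found under #{inspect(@heading)}"
    end

    body
  end

  # An org heading is one or more stars followed by a space.
  defp heading?(line), do: Regex.match?(~r/^\*+ /, line)
end

# Hypothetical definition file contents, for illustration only.
org = """
* Agent
** System prompt
You are the groundskeeper.
Be brief.
** Tools
- repo_state
"""

IO.puts(PromptSketch.extract!(org))
```

If both consumers call one function like this, a definition whose prompt sits under the wrong heading stops the run at load time rather than producing an agent with nothing to say.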
the transcript is the ARTIFACT
A rehearsal writes itself down. The result lands in a datestamped org file
under rehearsals/ — a first-class directory of the method, sitting
beside the agent's sources and workflows. Each turn is a * founder
heading and a * groundskeeper heading, and any tool calls follow
as a list, each one showing the call and its result truncated to 200
characters. Here is one real turn, verbatim, from the first rehearsal that
proved the groundskeeper — the founder asks for background research, the agent
names task gk-201, then the tools fire:
```org
* groundskeeper
Sent it out. Task **gk-201** — researching E2B and Daytona pricing
from their official pages, docs, and recent real-world mentions.
I'll bring it back when it lands.
tool calls:
- =dispatch({"goal":"Research E2B (e2b.dev) and Daytona (daytona.io)
pricing: …"})=
→ {"dispatched":true,"task":"gk-201",
"workflow":"workflows/research-e2b-…-3.org"}
- =capture({"kind":"idea","text":"The founder wants to sharpen the
workbook-as-container pitch …"})=
→ {"captured":true,"file":"sources/captures/2026-06-11.org"}
```
Read what that proves. The first reply in the run was a grounded
repo answer — the model called repo_state(summary) before
asserting anything about the project, exactly as its tool description demands.
This turn shows a real dispatch: the goal went to the runtime, task
gk-201 was filed, and a workflow file was written —
workflows/research-e2b-…-3.org, a six-leaf ordered org outline
with :done-when: shell gates, which still exists in the
repo. The captures were silent — saved, never read back, per the
capture tool's instruction. And the run ends with honesty you can see in
text: a later tasks({}) call returns "finished":null,
and the agent says it's still working, no results back yet. No fabricated
completion. That readback is observable, in plain text, forever.
The dispatched workflow persisting in the repo isn't a leak — it's the point. A rehearsal's side effects are real artifacts, which is exactly what makes it a demo and not a mock. The closest sibling page, dispatch, is where that spawn lane is taught in full.
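The transcript shape described in this section (a founder heading, a groundskeeper heading, then a list of tool calls with results cut to 200 characters) could be rendered roughly like this. It is a sketch under assumptions: the module `TranscriptSketch`, the exact indentation, and the tool-call map with `name`, `args` (already-encoded JSON text), and `result` keys are hypothetical; only the turn keys `founder`, `groundskeeper`, and `tool_calls` come from the run output shown on this page.

```elixir
defmodule TranscriptSketch do
  # Results are truncated to this many characters in the transcript.
  @max_result 200

  # Render a list of turns as org text, one pair of headings per turn.
  def render(turns) do
    turns |> Enum.map(&render_turn/1) |> Enum.join("\n")
  end

  defp render_turn(%{founder: founder, groundskeeper: reply, tool_calls: calls}) do
    tool_section =
      case calls do
        [] -> ""
        _ -> "tool calls:\n" <> Enum.map_join(calls, "", &render_call/1)
      end

    "* founder\n#{founder}\n* groundskeeper\n#{reply}\n" <> tool_section
  end

  # Each call shows the invocation and its (possibly truncated) result.
  defp render_call(%{name: name, args: args_json, result: result}) do
    "- =#{name}(#{args_json})=\n  → #{String.slice(result, 0, @max_result)}\n"
  end
end

turn = %{
  founder: "What do you have running right now?",
  groundskeeper: "Still working, no results back yet.",
  tool_calls: [%{name: "tasks", args: "{}", result: String.duplicate("x", 250)}]
}

IO.puts(TranscriptSketch.render([turn]))
```

Because the output is plain text with stable headings, two runs can be compared with an ordinary diff.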
where it sits on the LADDER
A rehearsal isn't the only rung — it's the middle one, and knowing its neighbors is how you read its result honestly. Below it sits ordinary CI: unit tests where every model and author call is injected, so there's no network at all — they cover auth failing closed, the capture file's shape, a crash routing to BLOCKED, the post-call HMAC check with a forged-but-valid signature. Above the rehearsal sits a rung that does use the provider: a server-side conversation simulation, which runs the agent's real deployed prompt and tool config through the provider's own text simulator — and needs the conversation key the rehearsal deliberately does not. At the very top is a live call with a human on the line.
| rung | what's real | what's faked | passing proves |
|---|---|---|---|
| CI unit tests | the tool + auth code | model + author injected, no network | each piece works in isolation |
| rehearsal | live model + real router + real tools + ledger | only the telephony wire | the conversation layer holds, end to end |
| provider simulation | the deployed prompt + tool config, provider-side | the human voice | the provider drives the agent as configured |
| a live call | everything | nothing | it actually works on the phone |
The verdict that table delivers: the rehearsal is the rung where a live model meets the real router with no provider in the loop — the most coverage you can get for the least ceremony, needing an ordinary model key and no voice key at all. It's how the groundskeeper's go-live readiness was proven while still blocked on the conversation key: shipped and proven by rehearsal, no human, no provider. Persona held. The agent went live afterward.
what a green rehearsal DOESN'T prove
Honesty section — because a test you over-read is worse than no test. Four limits, stated plainly:
The telephony hop stays untested. A rehearsal passing is not voice working. The proof of that is in the history: the missing-permission block on the conversation key was discovered by trying to go live, not by any rehearsal. The one hop you don't own is the one a rehearsal can't reach, by construction.
The side effects are real, and there is no dry-run. Captures land in
sources/captures/, dispatches really run workflows and burn model
calls, issues really file. This is a feature — it's a demo of the real thing —
but it means a rehearsal dirties your repo and spends tokens every time. Run
it knowing that.
The tool specs are hand-mirrored. The rehearsal keeps its own copy of the five tool specs, mirroring the provider provisioning's source of truth. The code's own comment admits it. That mirror can drift — if the provisioned specs change and the rehearsal's copy doesn't, you'd be rehearsing a slightly different agent than you ship.
"Persona held" is a judgment, not an assertion. A rehearsal is pass-or-fail by reading. Nothing in the run asserts that the tone was right or the answers were grounded — a human, or a judge, has to read the transcript and say so. The asserted sibling of a read-by-human transcript is a judged eval; a rehearsal is the document you'd hand the judge.
questions people actually ASK
Is this just mocking the conversation?
No — and that's the entire distinction. Nothing below the cut is mocked. The model is live, the router is the literal production module, the tools are the real five, the ledger really writes. The only thing absent is the provider's wire transport. A mock fakes the code under test; a rehearsal fakes only the network you don't own.
Why a live model instead of canned replies?
Because tool selection is the thing under test. The question a
rehearsal answers is whether the model calls repo_state instead
of guessing, whether it dispatches when the founder commits to work, whether
it captures silently. Canned replies would assert the answers you're trying
to discover.
Can I run one with no keys at all?
You need a model key — not the voice key. That's the point of the rung: it proves the conversation layer using ordinary inference, with no conversation provider provisioned. An unset router secret doesn't stop the run either; it just produces 503s inside the transcript, because failing closed is part of what gets rehearsed.
Does it dirty my repo?
Yes — deliberately. Captures, dispatched workflows, and ledger entries are real artifacts, and there's no dry-run mode. That's what makes it a demo of the real bridge rather than a simulation of it. Run it where real side effects are acceptable.
Is "rehearsal" a generic Workbooks verb?
No. It's a method, shown on one agent — the groundskeeper — in roughly a hundred and forty lines, invoked host-side as a maintainer op rather than shipped as a user command. The value of this page isn't a button to press; it's the recipe to copy for your un-CI-able agent — voice, SMS, Slack, anything provider-fronted.
So how do I actually run it?
Host-side, with a router secret and a list of scripted lines — no voice key needed:
```shell
WB_GK_SECRET=… mix run --no-start -e '
  Workbooks.Groundskeeper.Rehearsal.run([
    "Hey. Where are we with the project right now?",
    "Can you look into E2B and Daytona pricing in the background?",
    "What do you have running right now?"
  ]) |> IO.inspect()'
# => %{turns: [%{founder: "…", groundskeeper: "…", tool_calls: […]}, …],
#      file: "examples/groundwork/rehearsals/2026-06-11-045937.org"}
```
The model is WB_GK_BRAIN_MODEL or the runtime's default; every
tool call hits the real router with the real secret.
keep GOING
Rehearsals are how you prove an agent — so the parent lesson is the place to start, and the neighbors fill in the pieces the recipe leans on.