rehearsals — proving a voice agent without the phone

the last hop can't be CI

You built an agent whose front door is something CI can't drive — a phone call, a voice provider, a webhook console you don't host. Your unit tests are honest about the parts they reach: the tool implementations, the auth check, the ledger. But the conversation itself goes untested. Does the persona hold for a whole exchange? Does the model call repo_state instead of confidently inventing the project's status? Does the entire auth-marshalling-ledger chain actually hold up when a live model — not a fixture — is the thing driving it?

The usual answer is "call it and see," which means a human on every check and a provider key in every environment. That's not a test, it's a chore — and it's exactly the chore that doesn't get done. The groundskeeper, this project's voice agent over its own repo, hit this wall directly: voice can't be CI because the founder isn't always on the call, and the conversation key may not even be provisioned. So the conversation layer — the most important layer — was the one nothing covered.

the DEFINITION

re·hears·al /rɪ·ˈhɜːr·səl/ noun

1. a text self-demo of a voice-agent conversation: scripted human lines drive a live model wearing the production persona, and every tool call runs through the real production router — no external service, no human. The output is a committed transcript.

A rehearsal is three things at once — a test, a demo, and a document. It tests because a model really drives the conversation through real code. It demos because the side effects are real (more on that, honestly, below). And it documents because the result is a datestamped org file you can read the way you read any plan. "Persona held" stops being a vibe and becomes a line you can point at, diff, and revert.

cut at the last SEAM

The whole idea is one decision: where do you put the knife? Stack the conversation as a layer cake. At the top is the telephony hop — the provider's network, their voice synthesis, their webhook delivery. That's theirs; you don't own a line of it. Everything below it is yours: the persona, the model, the router's auth header, the five tool implementations, the task ledger. A rehearsal cuts at the highest seam it can reach and exercises everything underneath:

flowchart TD
  tele["telephony · voice synth · webhook delivery
— the provider's, NOT yours —"]:::out
  persona["persona + live model
does it hold? does it call tools?"]
  auth["router auth — x-gk-secret, fails closed"]
  tools["the five tool implementations"]
  ledger["the task ledger"]
  tele -. "the one hop a rehearsal can't reach" .-> persona
  persona --> auth --> tools --> ledger
  classDef out fill:#e7e4dc,stroke:#9a9a9a,stroke-dasharray:5 4,color:#6a6a6a
  style persona fill:#13d943,stroke:#121316,stroke-width:2.5px
  style auth fill:#9fc4e8,stroke:#121316
  style tools fill:#aee5c2,stroke:#121316
  style ledger fill:#f2ddb0,stroke:#121316

Read the picture top to bottom. The grayed, dashed box at the top is the only thing left untested — the provider's telephony, struck out because it's not your code. Below the cut, four solid layers light up: persona and model in green, then auth, then tool implementations, then the ledger. The knife goes right under the provider, so the untested surface is as small as it can possibly be — one hop you genuinely don't own, and nothing else.

one line, six ROUNDS

Mechanically, a rehearsal is a reduce over scripted human lines. You hand Rehearsal.run/2 a list of "founder" lines; the agent side is a live model. For each line, the model completes. If it wants tools, it calls them, the results are appended, and it completes again, for up to six rounds (@max_tool_rounds) before it's forced to answer in plain words. One founder line, then, can chain several tool calls before the reply lands:

sequenceDiagram
  participant F as founder line (scripted)
  participant M as live model (persona)
  participant R as the real router
  participant L as the ledger
  F->>M: a scripted human line
  M->>R: round 1 — repo_state(summary)
  R->>L: read state
  L-->>M: grounded answer
  M->>R: round 2 — dispatch(goal)
  R->>L: file task gk-201
  L-->>M: dispatched
  Note over M: ≤ 6 rounds, then forced to answer
  M-->>F: plain reply — written to the transcript

Walk the exchange. A scripted line goes to the model. The model asks the real router for repo state and gets a grounded answer back from the ledger. It dispatches a background goal, the ledger files a task, the router confirms. After at most six such rounds, the model is forced to stop calling tools and answer in words — and that reply, plus every tool call along the way, is written into the transcript. One detail keeps it faithful to production: the assistant message sent back to the model API is stripped to just its role, content, and tool calls — the same echo convention the real agent loop uses. And if the model errors, the rehearsal doesn't crash; the turn simply reads (llm error: …), so a flaky model never costs you the whole run.
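The loop just walked through can be sketched in plain Elixir. This is a minimal sketch, not the project's Rehearsal module: the name RehearsalSketch, the shape of the injected complete and call_tool functions, the message maps, and the idea of forcing the final answer by disabling tools on the last completion are all assumptions made for illustration. Only the round cap of six, the role/content/tool-calls echo, and the (llm error: …) turn come from the text above.

```elixir
defmodule RehearsalSketch do
  @max_tool_rounds 6

  # complete:  fn messages, allow_tools? -> {:ok, assistant_msg} | {:error, reason}
  # call_tool: fn name, args -> result string
  # Both are injected here so the sketch runs without a network (assumption).
  def run(lines, complete, call_tool) do
    {turns, _messages} =
      Enum.reduce(lines, {[], []}, fn line, {turns, messages} ->
        messages = messages ++ [%{role: "user", content: line}]
        {reply, calls, messages} = turn(messages, complete, call_tool, 0, [])
        {[%{founder: line, groundskeeper: reply, tool_calls: calls} | turns], messages}
      end)

    Enum.reverse(turns)
  end

  defp turn(messages, complete, call_tool, round, calls) do
    # Once the round budget is spent, the model must answer in words.
    allow_tools? = round < @max_tool_rounds

    case complete.(messages, allow_tools?) do
      {:error, reason} ->
        # A model error becomes the text of the turn instead of a crash.
        {"(llm error: #{inspect(reason)})", Enum.reverse(calls), messages}

      {:ok, %{tool_calls: [_ | _] = requested} = msg} when allow_tools? ->
        # Echo only role, content and tool calls back to the model.
        echoed = Map.take(msg, [:role, :content, :tool_calls])

        results =
          Enum.map(requested, fn %{name: name, args: args} ->
            %{name: name, args: args, result: call_tool.(name, args)}
          end)

        tool_msgs =
          Enum.map(results, fn r -> %{role: "tool", name: r.name, content: r.result} end)

        turn(
          messages ++ [echoed] ++ tool_msgs,
          complete,
          call_tool,
          round + 1,
          Enum.reverse(results) ++ calls
        )

      {:ok, msg} ->
        echoed = Map.take(msg, [:role, :content, :tool_calls])
        {msg.content, Enum.reverse(calls), messages ++ [echoed]}
    end
  end
end

# Stub model: asks for repo_state once, then answers from the tool result.
stub_model = fn messages, _allow_tools? ->
  case List.last(messages) do
    %{role: "user"} ->
      call = %{name: "repo_state", args: %{"q" => "summary"}}
      {:ok, %{role: "assistant", content: nil, tool_calls: [call], id: "dropped-by-echo"}}

    %{role: "tool", content: result} ->
      {:ok, %{role: "assistant", content: "Grounded: " <> result, tool_calls: []}}
  end
end

# Stub router: stands in for the real router call (hypothetical result text).
stub_router = fn "repo_state", _args -> "3 open tasks" end

turns = RehearsalSketch.run(["Where are we?"], stub_model, stub_router)
```

In the real rehearsal the model is live and the tool call goes through the production router; the stubs here exist only so the control flow can be run and read on its own.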

through the REAL router

depth rung · skippable — the thirteen lines that make it not a mock

Here is the seam itself. When the model calls a tool, the rehearsal does not reach into a fake — it builds an in-process HTTP request and hands it to the literal router module that serves production:

conn =
  Plug.Test.conn(:post, "/tool/#{name}", Jason.encode!(args))
  |> Plug.Conn.put_req_header("content-type", "application/json")
  |> Plug.Conn.put_req_header("x-gk-secret", secret)
  |> Workbooks.Groundskeeper.Router.call(Workbooks.Groundskeeper.Router.init([]))

Now set that beside what the deployed provider actually sends. The provisioned webhook tool posts to {base_url}/gk/tool/{name} with the header x-gk-secret and the args as a JSON body. Same route as the router sees it (the rehearsal posts to /tool/{name} directly, so the /gk prefix is presumably consumed by whatever mounts the router in the deployed app), same header, same body schema. Only the HTTP transport differs: the rehearsal's request is constructed in memory instead of arriving over the wire, and from the router onward it is the production path. So a single rehearsal exercises, for free, everything that path touches: the constant-time secret compare that fails closed with a 503 when the secret is unset, the JSON marshalling, the five real tool implementations, and the ledger writes. A mocked tool call skips all of it.

| | a mocked tool call | the real-router call |
|---|---|---|
| auth header check | skipped, no router | runs: constant-time compare, fails closed |
| JSON marshalling | bypassed, you pass a map | encode in, decode out, as in prod |
| tool implementation | stubbed return value | the real five, with real side effects |
| the ledger | untouched | real reads and writes |
| what's faked | everything below the call | only the wire transport |

The verdict of that table is the whole reason a rehearsal counts: a mock fakes everything below the call, while the real-router rehearsal fakes only the wire — auth, marshalling, the implementations, and the ledger all run exactly as production runs them.
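The fail-closed secret check that the real-router call exercises can be sketched as follows. This is an illustration under stated assumptions, not the router's code: the module name AuthSketch is hypothetical, the 401 for a wrong or missing header is an assumption (the text only specifies the 503 for an unset secret), and the hand-rolled XOR fold stands in for whatever constant-time compare the router actually uses. In a Plug application, Plug.Crypto.secure_compare/2 is the usual choice.

```elixir
defmodule AuthSketch do
  # Returns :ok, or {:error, status} carrying the HTTP status to answer with.

  # No secret configured: refuse every request rather than accept any.
  def check(expected, _presented) when expected in [nil, ""], do: {:error, 503}

  def check(expected, presented) when is_binary(presented) do
    if secure_compare(expected, presented), do: :ok, else: {:error, 401}
  end

  # Header absent or not a string (401 is an assumption, see lead-in).
  def check(_expected, _presented), do: {:error, 401}

  # Compares every byte and folds the differences together, so the time taken
  # does not depend on where the first mismatch sits.
  defp secure_compare(a, b) when byte_size(a) == byte_size(b) do
    diff =
      Enum.zip(:binary.bin_to_list(a), :binary.bin_to_list(b))
      |> Enum.reduce(0, fn {x, y}, acc -> Bitwise.bor(acc, Bitwise.bxor(x, y)) end)

    diff == 0
  end

  defp secure_compare(_a, _b), do: false
end

unset = AuthSketch.check(nil, "anything")
match = AuthSketch.check("s3cret", "s3cret")
```

The ordering of the clauses is the design point: the unset-secret case is tested first, so a misconfigured deployment can never fall through to a comparison that happens to succeed.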

one def, two CONSUMERS

A demo that drifts from production isn't a demo of production. So the persona isn't retyped for the rehearsal — it's read from the same definition file the live provisioning reads. Both the rehearsal and the provider-provisioning code extract the prompt from groundskeeper.org, from under the ** System prompt heading, and both raise if it's missing rather than running an agent with an empty brain:

flowchart TD
  def[["agent/groundskeeper.org
** System prompt — the single source"]]
  def --> prov["provisioning → the live deployed agent"]
  def --> reh["Rehearsal.run → the self-demo"]
  style def fill:#9fc4e8,stroke:#121316,stroke-width:2.5px
  style prov fill:#ffffff,stroke:#121316
  style reh fill:#aee5c2,stroke:#121316

One file, two arrows. The deployed agent and the rehearsal both pull their persona from the exact same heading in the exact same file — so the personality you prove in a rehearsal is, definitionally, the one production runs. The ** System prompt heading is load-bearing here: a def whose prompt doesn't live under it parses to nothing and the agent runs empty — a failure this project has actually shipped before, which is why both consumers raise loudly instead of falling through. The authoring lesson is where that convention lives.
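The extract-or-raise behaviour both consumers share might look like the sketch below. It assumes the prompt is the plain text between the ** System prompt heading and the next one-star or two-star heading; the module name PromptSketch and that section-ending rule are assumptions, since the source shows neither the parser nor the file's full layout.

```elixir
defmodule PromptSketch do
  @heading "** System prompt"

  # Returns the text under the heading, or raises if it is missing or empty.
  def extract!(org_text) do
    body =
      org_text
      |> String.split("\n")
      # Skip everything up to the heading, then the heading line itself.
      |> Enum.drop_while(&(String.trim_trailing(&1) != @heading))
      |> Enum.drop(1)
      # Stop at the next heading of the same or a higher level (assumption).
      |> Enum.take_while(&(not Regex.match?(~r/^\*{1,2} /, &1)))
      |> Enum.join("\n")
      |> String.trim()

    if body == "" do
      raise ArgumentError, "no prompt found under #{inspect(@heading)}"
    end

    body
  end
end

sample = "* Agent\n** System prompt\nYou are the groundskeeper.\nStay grounded.\n** Tools\n- repo_state\n"
prompt = PromptSketch.extract!(sample)
```

A missing heading and an empty section take the same path to the raise, which is the property that matters: neither consumer can quietly run with an empty prompt.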

the transcript is the ARTIFACT

A rehearsal writes itself down. The result lands in a datestamped org file under rehearsals/ — a first-class directory of the method, sitting beside the agent's sources and workflows. Each turn is a * founder heading and a * groundskeeper heading, and any tool calls follow as a list, each one showing the call and its result truncated to 200 characters. Here is one real turn, verbatim, from the first rehearsal that proved the groundskeeper — the founder asks for background research, the agent names task gk-201, then the tools fire:

* groundskeeper
  Sent it out. Task **gk-201** — researching E2B and Daytona pricing
  from their official pages, docs, and recent real-world mentions.
  I'll bring it back when it lands.
  tool calls:
   - =dispatch({"goal":"Research E2B (e2b.dev) and Daytona (daytona.io)
     pricing: …"})=
     → {"dispatched":true,"task":"gk-201",
        "workflow":"workflows/research-e2b-…-3.org"}
   - =capture({"kind":"idea","text":"The founder wants to sharpen the
     workbook-as-container pitch …"})=
     → {"captured":true,"file":"sources/captures/2026-06-11.org"}

Read what that proves. The first reply in the run was a grounded repo answer — the model called repo_state(summary) before asserting anything about the project, exactly as its tool description demands. This turn shows a real dispatch: the goal went to the runtime, task gk-201 was filed, and a workflow file was written — workflows/research-e2b-…-3.org, a six-leaf ordered org outline with :done-when: shell gates, which still exists in the repo. The captures were silent — saved, never read back, per the capture tool's instruction. And the run ends with honesty you can see in text: a later tasks({}) call returns "finished":null, and the agent says it's still working, no results back yet. No fabricated completion. That readback is observable, in plain text, forever.

The dispatched workflow persisting in the repo isn't a leak — it's the point. A rehearsal's side effects are real artifacts, which is exactly what makes it a demo and not a mock. The closest sibling page, dispatch, is where that spawn lane is taught in full.
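The transcript layout described in this section, one heading per speaker and each tool result cut to 200 characters, can be sketched as a small renderer. The turn shape (founder, groundskeeper, tool_calls) follows the map the run command returns; the module name TranscriptSketch, the shape of each tool-call entry, and the exact indentation are assumptions inferred from the excerpt rather than taken from the project's code.

```elixir
defmodule TranscriptSketch do
  @max_result_chars 200

  # Renders a list of turns as org text, one heading pair per turn.
  def render(turns) do
    turns
    |> Enum.map(&render_turn/1)
    |> Enum.join("\n")
  end

  defp render_turn(%{founder: founder, groundskeeper: reply, tool_calls: calls}) do
    lines =
      ["* founder", "  " <> founder, "* groundskeeper", "  " <> reply] ++ render_calls(calls)

    Enum.join(lines, "\n") <> "\n"
  end

  defp render_calls([]), do: []

  defp render_calls(calls) do
    rendered =
      Enum.flat_map(calls, fn %{name: name, args: args, result: result} ->
        [
          # args is assumed to be the already-encoded JSON string.
          "   - =#{name}(#{args})=",
          "     → " <> String.slice(result, 0, @max_result_chars)
        ]
      end)

    ["  tool calls:" | rendered]
  end
end

org =
  TranscriptSketch.render([
    %{
      founder: "What do you have running right now?",
      groundskeeper: "Still working, no results back yet.",
      tool_calls: [%{name: "tasks", args: "{}", result: String.duplicate("x", 300)}]
    }
  ])
```

Truncating at render time keeps the committed file readable and diffable while the full result still reaches the model during the run.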

where it sits on the LADDER

A rehearsal isn't the only rung — it's the middle one, and knowing its neighbors is how you read its result honestly. Below it sits ordinary CI: unit tests where every model and author call is injected, so there's no network at all — they cover auth failing closed, the capture file's shape, a crash routing to BLOCKED, the post-call HMAC check with a forged-but-valid signature. Above the rehearsal sits a rung that does use the provider: a server-side conversation simulation, which runs the agent's real deployed prompt and tool config through the provider's own text simulator — and needs the conversation key the rehearsal deliberately does not. At the very top is a live call with a human on the line.

| rung | what's real | what's faked | passing proves |
|---|---|---|---|
| CI unit tests | the tool + auth code | model + author injected, no network | each piece works in isolation |
| rehearsal | live model + real router + real tools + ledger | only the telephony wire | the conversation layer holds, end to end |
| provider simulation | the deployed prompt + tool config, provider-side | the human voice | the provider drives the agent as configured |
| a live call | everything | nothing | it actually works on the phone |

The verdict that table delivers: the rehearsal is the rung where a live model meets the real router with no provider in the loop — the most coverage you can get for the least ceremony, needing an ordinary model key and no voice key at all. It's how the groundskeeper's go-live readiness was proven while still blocked on the conversation key: shipped and proven by rehearsal, no human, no provider. Persona held. The agent went live afterward.

what a green rehearsal DOESN'T prove

Honesty section — because a test you over-read is worse than no test. Four limits, stated plainly:

The telephony hop stays untested. A rehearsal passing is not voice working. The proof of that is in the history: the missing-permission block on the conversation key was discovered by trying to go live, not by any rehearsal. The one hop you don't own is the one a rehearsal can't reach, by construction.

The side effects are real, and there is no dry-run. Captures land in sources/captures/, dispatches really run workflows and burn model calls, issues really file. This is a feature — it's a demo of the real thing — but it means a rehearsal dirties your repo and spends tokens every time. Run it knowing that.

The tool specs are hand-mirrored. The rehearsal keeps its own copy of the five tool specs, mirroring the provider provisioning's source of truth. The code's own comment admits it. That mirror can drift — if the provisioned specs change and the rehearsal's copy doesn't, you'd be rehearsing a slightly different agent than you ship.

"Persona held" is a judgment, not an assertion. A rehearsal is pass-or-fail by reading. Nothing in the run asserts that the tone was right or the answers were grounded — a human, or a judge, has to read the transcript and say so. The asserted sibling of a read-by-human transcript is a judged eval; a rehearsal is the document you'd hand the judge.

questions people actually ASK

Is this just mocking the conversation?

No — and that's the entire distinction. Nothing below the cut is mocked. The model is live, the router is the literal production module, the tools are the real five, the ledger really writes. The only thing absent is the provider's wire transport. A mock fakes the code under test; a rehearsal fakes only the network you don't own.

Why a live model instead of canned replies?

Because tool selection is the thing under test. The question a rehearsal answers is whether the model calls repo_state instead of guessing, whether it dispatches when the founder commits to work, whether it captures silently. Canned replies would assert the answers you're trying to discover.

Can I run one with no keys at all?

You need a model key — not the voice key. That's the point of the rung: it proves the conversation layer using ordinary inference, with no conversation provider provisioned. An unset router secret doesn't stop the run either; it just produces 503s inside the transcript, because failing closed is part of what gets rehearsed.

Does it dirty my repo?

Yes — deliberately. Captures, dispatched workflows, and ledger entries are real artifacts, and there's no dry-run mode. That's what makes it a demo of the real bridge rather than a simulation of it. Run it where real side effects are acceptable.

Is "rehearsal" a generic Workbooks verb?

No. It's a method, shown on one agent — the groundskeeper — in roughly a hundred and forty lines, invoked host-side as a maintainer op rather than shipped as a user command. The value of this page isn't a button to press; it's the recipe to copy for your un-CI-able agent — voice, SMS, Slack, anything provider-fronted.

So how do I actually run it?

Host-side, with a router secret and a list of scripted lines — no voice key needed:

WB_GK_SECRET=… mix run --no-start -e '
  Workbooks.Groundskeeper.Rehearsal.run([
    "Hey. Where are we with the project right now?",
    "Can you look into E2B and Daytona pricing in the background?",
    "What do you have running right now?"
  ]) |> IO.inspect()'
# => %{turns: [%{founder: "…", groundskeeper: "…", tool_calls: […]}, …],
#      file: "examples/groundwork/rehearsals/2026-06-11-045937.org"}

The model is whichever one WB_GK_BRAIN_MODEL names, falling back to the runtime's default; every tool call hits the real router with the real secret.

keep GOING

Rehearsals are how you prove an agent — so the parent lesson is the place to start, and the neighbors fill in the pieces the recipe leans on.

Agents: the construct a rehearsal proves

Dispatch: the spawn lane the gk-201 turn fires

Evals: the judged sibling, asserted rather than read

Authoring: the ** System prompt convention