spawning — the run that outlives the request

the run that outlives the REQUEST

Here is where every agent demo breaks. You ask for something that takes ten minutes — audit a repo, rewrite a directory, work a long plan — and the request/response model has no answer for you. The HTTP connection times out somewhere around thirty seconds. So you hold the socket open and pray the proxy, the load balancer, and the laptop lid all cooperate for ten straight minutes. They don't.

The escape hatch everyone reaches for is the same one: a job queue, a worker pool, a status table, a websocket relay. You rebuild Sidekiq around every agent: the identical plumbing, every project, just to answer the question "did it finish yet?" The parent lesson sold agents as workers hired for the long run. This page answers the unglamorous operational question underneath that pitch: what process, exactly, is the agent, and who keeps it alive after you hang up?

spawning, DEFINED

spawn·ing /ˈspɔːn·ɪŋ/ noun

1. starting a run as a supervised, named process: one call returns an id in milliseconds; the run works for minutes under a supervisor, addressable by that id from anywhere — including from itself.

Nothing here is a job framework bolted on. The supervisor, the named registry, the process-per-run — these are the BEAM's own primitives, the same ones that have run phone networks for thirty years. An agent run isn't a row in a queue table. It's a process with a supervisor watching it.

one call, an id, a 202

You start a run with one HTTP call. POST /api/run takes a body, mints an id, and returns before the run does any work:

$ curl -s -X POST $WB_RUNTIME_URL/api/run \
    -d '{"system":"You are a careful, capable agent.",
         "task":"audit blog/ for dead links and write report.org",
         "max_steps":40}'
{"id":"run-1742","status":"running"}            ← HTTP 202, milliseconds later

The id is run-<integer> — minted from a process-unique counter, not a UUID. The 202 is the whole point: it means accepted, working on it, not done. Your caller is free the instant it has the id. Three knobs ride in the body:

max_steps — the loop's ceiling. At this seam the default is 40. The raw agent default underneath is 12; standing workers tick at 60. Different front doors, different budgets.
model — which model drives the loop. Optional; falls through to the engine default.
exec — not a parameter so much as a trust grant. It unlocks host-brokered git, publish, image, and OS-workdir tools — never raw bash; that hatch was deleted on purpose. It's honored only when the desktop says so or WB_AGENT_EXEC=1 is set in the environment. Ask for it without the grant and you simply don't get those tools.

The CLI is the same call wearing a friendlier face: wbx agent run "audit blog/ for dead links" --model openrouter/… posts to this endpoint and prints the id. There is no second code path — the CLI is a client of /api/run, exactly like your curl.
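For callers that are neither curl nor the CLI, the same call can be sketched in Python. This is a minimal sketch that assumes only what the example above shows: a body with system, task, max_steps and an optional model, and a 202 reply carrying id and status. The helper names (build_run_body, parse_accepted, start_run) are hypothetical, and like the curl example it sets no explicit Content-Type.

```python
import json
import urllib.request


def build_run_body(task, system, max_steps=40, model=None):
    """Encode the POST /api/run body. Omitting model lets it fall through to the engine default."""
    body = {"system": system, "task": task, "max_steps": max_steps}
    if model is not None:
        body["model"] = model
    return json.dumps(body).encode("utf-8")


def parse_accepted(status_code, raw_body):
    """Return the run id from a 202 reply; any other status is treated as a failure."""
    if status_code != 202:
        raise RuntimeError(f"run not accepted: HTTP {status_code}")
    return json.loads(raw_body)["id"]


def start_run(runtime_url, task, system, **knobs):
    """POST the run and return its id. runtime_url stands in for $WB_RUNTIME_URL."""
    request = urllib.request.Request(
        runtime_url.rstrip("/") + "/api/run",
        data=build_run_body(task, system, **knobs),
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return parse_accepted(response.status, response.read())
```

The short client timeout is safe here because the reply is only the acceptance; the run itself is never waited on in this call.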

sequenceDiagram
  participant C as caller (curl / wbx)
  participant W as web.ex
  participant S as AgentSession.Sup<br/>(DynamicSupervisor)
  participant R as the session process
  C->>W: POST /api/run {system, task, max_steps}
  W->>W: mint id run-1742
  W->>S: start_child(session for run-1742)
  S->>R: spawn + register by id
  W-->>C: 202 {"id":"run-1742","status":"running"}
  Note over R: the run hasn't started yet,<br/>the 202 is already on its way back
  R->>R: NOW the work begins

the reply leaves before the work begins — milliseconds, not minutes

the receptionist and the WORKER

Here is the load-bearing trick. A run is not one process — it's two, and the split is what makes everything else trivial.

The first process is a supervised GenServer, the AgentSession. Think of it as a receptionist. It's started under a DynamicSupervisor, registered by its id in a registry, and its entire job is to stay responsive. It answers "what's your status?", "subscribe me to updates", and "here's a human review" instantly, always, because it never does the slow work itself.

The slow work belongs to a second, unnamed process. When the session boots, its handle_continue spawns a plain child process that calls Agent.run and grinds through the model-and-tools loop for however many minutes it takes. The receptionist holds a handle to it and keeps taking calls. That's why GET /api/run/:id never blocks waiting on the model: you're talking to the receptionist, not the worker. The worker could be ten seconds into a slow tool call and the status answer still returns instantly.

Lookup is by registry. Hand any of these APIs an id that isn't registered and you get a clean :not_found back — no crash, no guessing. And both the registry and the supervisor are permanent children of the application's supervision tree, so the spawning machinery is up and waiting before any HTTP listener accepts a connection.
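The split can be illustrated outside the BEAM. The following is an analogy in Python threads, not the engine's implementation: a Session object plays the receptionist, a background thread plays the worker, and a plain dict stands in for the registry. All names here (Session, REGISTRY, spawn, lookup) are hypothetical.

```python
import threading


class Session:
    """Receptionist: holds state and answers instantly; never does the slow work itself."""

    def __init__(self, run_id, work):
        self.id = run_id
        self._lock = threading.Lock()
        self._status = "running"
        self._result = None
        # The slow loop runs on its own thread; the session only keeps a handle to it.
        self.worker = threading.Thread(target=self._run, args=(work,), daemon=True)
        self.worker.start()

    def _run(self, work):
        result = work()
        # The worker reports back once, at the end, like the done message.
        with self._lock:
            self._status, self._result = "done", result

    def status(self):
        # Only touches in-memory state, so it returns while the worker is still busy.
        with self._lock:
            return {"id": self.id, "status": self._status, "result": self._result}


REGISTRY = {}  # id -> Session, standing in for the named registry


def spawn(run_id, work):
    REGISTRY[run_id] = Session(run_id, work)
    return run_id


def lookup(run_id):
    session = REGISTRY.get(run_id)
    return session.status() if session else "not_found"
```

A status lookup in this sketch never waits on work(); it only reads state the worker updates when it finishes, which is the property the receptionist split buys.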

flowchart TD
  app[["Application supervisor"]]
  app --> reg["AgentSession.Registry<br/>id → pid"]
  app --> sup["AgentSession.Sup<br/>DynamicSupervisor"]
  sup --> s1["session run-1742<br/>(receptionist · GenServer)"]
  sup --> s2["session run-1743<br/>(receptionist · GenServer)"]
  s1 -. "spawn (unlinked)" .-> w1["worker: Agent.run loop<br/>(does the minutes of work)"]
  s2 -. "spawn (unlinked)" .-> w2["worker: Agent.run loop"]
  style app fill:#9fc4e8,stroke:#121316,stroke-width:2.5px
  style s1 fill:#13d943,stroke:#121316,stroke-width:2px
  style w1 fill:#ffffff,stroke:#121316

one session per run, registered by id; the worker hangs off it, doing the loop

When the worker finishes, it sends the session a done message. The session's status flips from :running to :done and every subscriber gets the result. Until then, the session's state carries everything anyone could ask for: its id, status, the run, the live event list, its subscribers, and any pending reviews.

every step, three SINKS

Each time the loop calls a tool, it produces a step event — one consistent shape, the same one the whole engine speaks:

%{step, agent, tool, args, output, exit_code, error, dur_ms, ts}

That single event fans out to three different places, each with its own job and its own truncation budget. The truncation isn't sloppiness. It's deliberate: the full event stays generous, and each sink keeps only the slice its job needs, from 500 chars for a live frame down to 200 for a ledger line:

sink	what it's for	output limit	survives engine restart?
on_step → WebSocket	live UI fanout to every subscriber	500 chars / frame	no — dies with the socket
`_steps.jsonl`	the append-only ledger, written regardless of any caller	200 chars / line	yes — it's a file
`events.org`	the readable run transcript, rendered on finish	300 chars / step	only if the VFS was file-backed

The middle row is the important one. _steps.jsonl is written on every step regardless of any caller-supplied on_step — so nothing escapes by construction. You can ignore the websocket entirely and the provenance trail still exists on disk. The full event holds up to 4000 chars of output; what each sink keeps is a deliberate slice of that. Generous at the source, tight at the edges.
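The slicing can be sketched as a small function. The limits come from the table and paragraph above; the sink keys and the function name fan_out are illustrative, and cutting by Python string length is an assumption about how "chars" are counted.

```python
FULL_EVENT_LIMIT = 4000  # what the full event holds
SINK_LIMITS = {
    "websocket": 500,    # live frame
    "events_org": 300,   # transcript step
    "steps_jsonl": 200,  # ledger line
}


def fan_out(output):
    """Return the slice of one step's output that each sink would keep."""
    full = output[:FULL_EVENT_LIMIT]
    return {sink: full[:limit] for sink, limit in SINK_LIMITS.items()}
```

Short outputs pass through every sink unchanged; only long ones diverge between the live frame, the transcript and the ledger.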

One more guard lives at this layer: every tool call is wall-clock bounded at 150 seconds. A tool that wedges gets shut down and becomes a tool error the model sees — never a stalled run. The loop keeps going; the model decides what to do about the failure.
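The idea of turning a wedged tool into an ordinary error can be sketched with a subprocess timeout. This is a stand-in, not the engine's code: it assumes the tool is an external command, borrows a few field names from the step event, and the error text is invented for illustration.

```python
import subprocess
import sys
import time

TOOL_TIMEOUT_S = 150  # the per-tool wall-clock bound


def run_tool(argv, timeout_s=TOOL_TIMEOUT_S):
    """Run one tool command; a hang becomes an error result instead of a stalled caller."""
    started = time.monotonic()
    try:
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout_s)
        output, exit_code, error = proc.stdout, proc.returncode, None
    except subprocess.TimeoutExpired:
        # subprocess.run has already killed the child by the time this is raised.
        output, exit_code, error = "", None, f"tool timed out after {timeout_s}s"
    dur_ms = int((time.monotonic() - started) * 1000)
    return {"output": output, "exit_code": exit_code, "error": error, "dur_ms": dur_ms}


if __name__ == "__main__":
    # A command that would sleep for 30s is cut off after half a second.
    print(run_tool([sys.executable, "-c", "import time; time.sleep(30)"], timeout_s=0.5))
```

The caller always gets a result dict back, so the loop can hand the failure to the model and keep going.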

poll or STREAM

Because the receptionist was accumulating every step all along, watching a run is just a registry lookup. Two ways to do it.

Poll with GET /api/run/:id. Mid-run you get a running snapshot; minutes later, a finished one:

$ curl -s $WB_RUNTIME_URL/api/run/run-1742      ← mid-run
{"status":"running","steps":7,"live":[…],"reviews":[]}

$ curl -s $WB_RUNTIME_URL/api/run/run-1742      ← minutes later
{"status":"done","steps":12,"result":"…",
 "tools":["shell","fetch","vfs_write","done"],
 "events_org":"* Agent run  :session:\n** step 0: shell  :tool_call:…",
 "reviews":[]}

The done payload carries the distinct tool names used, the result, and the whole events.org transcript inline. One honest quirk to script around: an unknown id here returns HTTP 200 with {"error":"no such run"} — not a 404. Check the body, not just the status code.
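A polling client has to read the body before trusting the status code. A minimal sketch, with the HTTP call injected as a fetch function so the logic stays independent of any client library; the function names, the 15-minute budget and the 5-second interval are assumptions, not engine defaults.

```python
import json
import time


def interpret_poll(raw_body):
    """Classify a GET /api/run/:id body. An unknown id arrives as HTTP 200 with an error body."""
    payload = json.loads(raw_body)
    if "error" in payload:
        return "unknown", payload
    return payload["status"], payload


def poll_until_done(fetch, run_id, budget_s=900, interval_s=5,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll until the run reports done, the id proves unknown, or the budget runs out."""
    deadline = clock() + budget_s
    while True:
        state, payload = interpret_poll(fetch(run_id))
        if state == "unknown":
            raise LookupError(payload["error"])
        if state == "done":
            return payload
        if clock() >= deadline:
            # Still "running" past the budget: report it rather than wait forever.
            raise TimeoutError(f"{run_id} still running after {budget_s}s")
        sleep(interval_s)
```

Here fetch(run_id) is expected to return the response body as text; the sleep and clock parameters exist only so the loop can be exercised without real waiting.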

Stream with GET /api/run/:id/stream, a WebSocket upgrade with a 10-minute idle timeout. On connect it subscribes you, then pushes a frame per step, then a final done frame:

{"type":"subscribed","id":"run-1742"}
{"type":"step","step":0,"tool":"shell","output":"blog/2026-05-01.org\nblog/…"}
{"type":"step","step":1,"tool":"fetch","output":"fetch failed: HTTP 404"}
{"type":"done","result":"3 dead links found; report.org written"}

Those output values are sliced to 500 chars per frame — enough to watch the run think, not the full payload. A bad id over the socket gets {"type":"error","error":"no such run"}.
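On the client side, the frames reduce to a small dispatch on type. A sketch that takes already-received text frames as an iterable, so it makes no assumption about which WebSocket library delivers them; consume_frames is a hypothetical name.

```python
import json


def consume_frames(frames):
    """Collect step frames until done; return (steps, result)."""
    steps = []
    for raw in frames:
        frame = json.loads(raw)
        kind = frame.get("type")
        if kind == "step":
            steps.append((frame["step"], frame["tool"], frame["output"]))
        elif kind == "done":
            return steps, frame["result"]
        elif kind == "error":
            raise LookupError(frame["error"])
        # "subscribed" and any unrecognized frame types are ignored.
    # The socket closed before a done frame arrived; the run may still be working.
    return steps, None
```

A None result is worth treating as "fall back to polling" rather than as failure, since a closed socket says nothing about the run itself.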

sequenceDiagram
  participant U as UI (WS client)
  participant A as AgentStream
  participant S as session run-1742
  U->>A: GET /api/run/run-1742/stream (upgrade)
  A->>S: subscribe
  A-->>U: {"type":"subscribed","id":"run-1742"}
  S-->>A: step 0 (shell)
  A-->>U: {"type":"step","step":0,"tool":"shell", …}
  S-->>A: step 1 (fetch)
  A-->>U: {"type":"step","step":1,"tool":"fetch", …}
  S-->>A: run done
  A-->>U: {"type":"done","result":"3 dead links found …"}

scripts poll; UIs stream — both read the same accumulating session

The rule of thumb: a script that wants a final answer polls; a UI that wants to show the run thinking streams. Same data, two doors.

the run that knows its NAME

Here's the move that turns spawning into a building block instead of a convenience. When the runtime spawns a run, it injects WB_RUN=<run id> into the agent's environment. The run knows its own name.

That sounds small until you see what it enables: a run can build a URL that routes back into its own mailbox. A connect URL like …/api/ctk/commit?run=$WB_RUN lets the run hand a human a place to send a decision, then call wb ctk await $WB_RUN and block until it arrives. The review lands via POST /api/ctk/commit?run=<id>, gets pushed live to subscribers and queued FIFO, and the agent polls for it — a 204 when there's nothing yet. Reviews even persist to a per-run JSONL file, so the decision record survives the process. This is the primitive under human-in-the-loop, and it has its own deep dive.
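From inside the run, self-addressing is little more than reading one environment variable. A minimal sketch: the base URL is a parameter because the host part is elided in the text above, and both helper names are hypothetical. The 204 handling mirrors the "nothing yet" reply described above; the shape of a non-empty review body is not specified here, so it is passed through untouched.

```python
import os
from urllib.parse import urlencode


def connect_url(base_url, env=None):
    """Build the commit URL that routes a human's decision back to this run."""
    env = os.environ if env is None else env
    query = urlencode({"run": env["WB_RUN"]})  # raises KeyError outside a spawned run
    return base_url.rstrip("/") + "/api/ctk/commit?" + query


def review_or_none(status_code, raw_body):
    """A 204 means no review has arrived yet; anything else is handed back as-is."""
    return None if status_code == 204 else raw_body
```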

→ The full review loop is the human-in-the-loop lesson. This is just the seam it stands on: a run that can address itself.

the other spawner: born at BOOT

Everything so far described runs born on demand — one POST /api/run, one process under the DynamicSupervisor. There's a second spawner with the same engine inside. Standing agents are born at boot from a declared manifest, under a static supervisor.

You declare them in a small org manifest. Headings are agent names; properties point at definitions and tune cadence:

#+TITLE: crew
* wren
  :PROPERTIES:
  :DEF: /data/agents/writer.org
  :LIFECYCLE: /data/lifecycles/writer.org
  :INTERVAL: 10m
  :END:
* moss
  :PROPERTIES:
  :DEF: /data/agents/editor.org
  :END:

Point WB_CREW_DEF at that file and the supervisor starts one worker per heading. Walk the math the manifest implies: wren first ticks at boot-grace plus zero; moss starts staggered by i × WB_CREW_STAGGER_MS — 30 seconds later by default — so two agents don't slam the engine in the same instant. Wren has a 10-minute interval; moss declared none, so it defaults to once an hour. Both runs queue on a global Gate that caps concurrency at 2; a worker acquires before its run and releases in an after block, so a wedged run can't starve its peers. Each run executes in a linked task killed at 15 minutes wall clock. And cadence survives restarts: a keeper-last-run-wren file under WB_DATA means a reboot doesn't reset the clock to zero.
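The schedule the manifest implies can be worked out in a few lines. This sketch parses only the shapes shown in the example (top-level headings and :KEY: value properties) and is not the engine's parser. The s/m/h interval suffixes are an assumption, since the source only shows 10m, and offsets are measured from the end of the boot grace period, whose length is not stated here.

```python
import re

DEFAULT_INTERVAL_S = 3600    # no :INTERVAL: means once an hour
DEFAULT_STAGGER_MS = 30_000  # WB_CREW_STAGGER_MS default
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}  # assumed suffixes

MANIFEST = """#+TITLE: crew
* wren
  :PROPERTIES:
  :DEF: /data/agents/writer.org
  :LIFECYCLE: /data/lifecycles/writer.org
  :INTERVAL: 10m
  :END:
* moss
  :PROPERTIES:
  :DEF: /data/agents/editor.org
  :END:
"""


def parse_interval(text):
    match = re.fullmatch(r"(\d+)([smh])", text.strip())
    if not match:
        raise ValueError(f"unrecognized interval: {text!r}")
    return int(match.group(1)) * UNIT_SECONDS[match.group(2)]


def parse_crew(manifest):
    """Return [(name, {PROPERTY: value})] in manifest order."""
    crew = []
    for line in manifest.splitlines():
        heading = re.match(r"\*\s+(\S+)", line)
        prop = re.match(r"\s*:([A-Z_]+):\s+(\S.*)", line)  # skips :PROPERTIES: and :END:
        if heading:
            crew.append((heading.group(1), {}))
        elif prop and crew:
            crew[-1][1][prop.group(1)] = prop.group(2).strip()
    return crew


def schedule(manifest, stagger_ms=DEFAULT_STAGGER_MS):
    """First-tick offset (seconds after boot grace) and interval for each agent."""
    plan = []
    for i, (name, props) in enumerate(parse_crew(manifest)):
        interval = parse_interval(props["INTERVAL"]) if "INTERVAL" in props else DEFAULT_INTERVAL_S
        plan.append({"name": name, "first_tick_offset_s": i * stagger_ms / 1000, "interval_s": interval})
    return plan
```

For the example manifest this puts wren at offset 0 with a 600-second interval and moss at offset 30 with a 3600-second interval, matching the walk-through above.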

	ad-hoc run	standing agent
spawner	DynamicSupervisor	static Supervisor (one per member)
born	on demand — a POST	at boot, staggered (i × 30s)
trigger	`POST /api/run` / `wbx agent run`	interval cadence from the manifest
concurrency	unbounded by design	Gate — max 2 at once
per-run clock	max_steps + 150s/tool	15-minute wall clock per tick
interior	the same `Agent.run` loop, either way

two front doors, one engine — the standing agent reaches the same loop your curl did

The punchline is the last row. The keeper tick goes through the same path as /api/run (same loop, same event shape, same gates), just with its workdir set to a tenant repo and the exec grant on. Idle agents back off too: a streak of no-work runs grows the gap exponentially, capped at 30 minutes, and a single completed run snaps it back. The orchestration and fleets lessons cover this manifest in full; this is the trailhead.
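The backoff rule reduces to two small functions. Only the exponential growth, the 30-minute cap and the reset on a completed run come from the text; the doubling factor and the base gap are assumptions chosen for illustration.

```python
BACKOFF_CAP_S = 30 * 60  # the 30-minute cap


def update_streak(idle_streak, did_work):
    """A completed run resets the streak; a no-work run extends it."""
    return 0 if did_work else idle_streak + 1


def next_gap_s(base_gap_s, idle_streak, factor=2, cap_s=BACKOFF_CAP_S):
    """Gap before the next tick: grows with the idle streak, never past the cap."""
    return min(base_gap_s * factor ** idle_streak, cap_s)
```

With an assumed base gap of 300 seconds, three idle runs in a row give gaps of 600, 1200 and 1800 seconds, and the gap stays at 1800 until a run does real work.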

bounded at every LAYER

A run could hang in a dozen places — a tool that never returns, a loop that never converges, a socket nobody closes. The design's answer is a ladder of time bounds, each one converting a hang into a signal the next layer up can act on:

bound	value	who enforces it	what a hang becomes
tool call	150 s	the agent loop	a tool error the model sees
loop length	max_steps (40 / 12 / 60)	the run itself	a clean finish, not an infinite loop
keeper tick	15 min	the standing worker	a killed task, peers freed
WS idle	10 min	the stream socket	a closed connection, run untouched
no-work backoff	→ 30 min cap	the standing worker	an idle agent that stops burning calls

Read the table's verdict in one line: every bound turns a stall into a fact something else can handle. A wedged tool doesn't stall the run — it becomes an error the model routes around. A streamed socket going idle doesn't keep the run hostage — it just closes, and the run, which never cared about your socket, keeps working. These aren't safety nets bolted on after the fact. They're the shape of the thing.

what dies, what SURVIVES

The honest limits, stated plainly, because they decide how you script against this.

By default a run's VFS leaves nothing behind. The spawned loop opens its own VFS, and the default is :memory:. The moduledoc's promise of a resumable, replicable run is true only when the caller passes a durable :vfs path. No path, no residue from the VFS.

Engine restart kills every in-flight run. Sessions are supervised children, not persisted across a BEAM restart. What survives a restart is the durable residue: _steps.jsonl, the per-run review JSONL, and events.org if the VFS was file-backed. The live sessions themselves are gone.

A hard-crashed worker leaves a zombie. The worker process is a plain spawn — unlinked, unmonitored. If it crashes outright (as opposed to an LLM error, which the loop converts into a clean error: … finish), the session never hears the done message and shows status: running indefinitely. We didn't find a sweeper that reaps these, so treat a run stuck on running past your timeout budget as suspect, not as still-working.

A session crash re-runs, it doesn't resume. The session GenServer uses OTP's default permanent restart. If the session itself crashes, the supervisor restarts it with the same id and task — and handle_continue runs the whole task again from step zero. Subscribers and live history are lost. This is inferred from the OTP defaults, not an explicit choice in the code, but it's the behavior to expect.

None of this is hidden. The truncation tiers are real, the 200-not-404 quirk is real, and the durable trail is exactly three files. Build on what's durable; don't assume the live process is.

questions people actually ASK

Does the run survive my laptop closing?

Yes — if the engine is somewhere else (your deployed runtime on Fly, say). The caller is disposable by design; closing your laptop just drops the client that held the id. Reconnect later and poll the id. If the engine is your laptop, then closing it stops the engine, and the run with it.

Can I cancel a run?

Honestly: we found no kill endpoint in the surface this lesson covers. The brakes are max_steps and the time bounds — 150s per tool, the loop ceiling, the 15-minute keeper clock. If you need a hard cancel, verify the current API before relying on one; don't assume it exists because it feels like it should.

How many runs can go at once?

Ad-hoc runs are unbounded by design — each POST /api/run is its own supervised process, and the DynamicSupervisor doesn't cap them. Standing agents are different: the Gate caps them at 2 concurrent by default (WB_CREW_MAX_CONCURRENT), precisely so a fleet of them doesn't overwhelm the engine.
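The Gate's acquire-then-release-in-an-after-block discipline maps onto a semaphore with try/finally. This is a Python analogy, not the engine's Gate; the class shape is hypothetical, and only the default of 2 and the WB_CREW_MAX_CONCURRENT variable name come from the text.

```python
import os
import threading


class Gate:
    """Caps how many runs execute at once; the permit is released even if the run raises."""

    def __init__(self, max_concurrent=None):
        if max_concurrent is None:
            max_concurrent = int(os.environ.get("WB_CREW_MAX_CONCURRENT", "2"))
        self.permits = threading.BoundedSemaphore(max_concurrent)

    def run(self, work):
        self.permits.acquire()  # blocks while the cap is reached
        try:
            return work()
        finally:
            # Mirrors releasing in an after block: a failed run cannot keep its permit.
            self.permits.release()
```

Releasing in finally is what keeps a crashed run from permanently shrinking the pool available to its peers.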

Why did my run say running forever?

Almost certainly the zombie case: the unlinked worker crashed hard, so the session never got its done message. An LLM error wouldn't do this — those get converted to a clean finish. A genuine process crash leaves the status stuck. Past your timeout budget, treat running as suspect.

Should I poll or stream?

Poll from a script that wants the final answer — one GET /api/run/:id when you expect it's done, and read the result and tools off the payload. Stream from a UI that wants to show the run thinking, frame by frame. Both read the same session; the choice is about your client, not the run.

Is this a job queue under the hood?

No — and the distinction matters. There's no queue table, no worker pool you provision, no relay you stand up. It's a supervised process per run, named in a registry, on the BEAM. The thing that runs phone networks runs your agent.

keep GOING

Spawning is the mechanics under the parent lesson. From here, go inward to what the process actually does, or outward to who calls spawn on a cadence.

Agents: the parent, what a run is for

Loops: inside the spawned process

Orchestration: who calls spawn on a cadence

Human-in-the-loop: the run that awaits you