twenty minutes of WHAT, exactly
"Agent" is the most hand-waved word in software. The parent lesson sold tenure — a worker with files, a schedule, memory, hired for outcomes. Fine. But when that worker "works for twenty minutes," what is actually executing? Why doesn't it hang forever? And when it does something strange at two in the morning, what can you actually read back?
In most agent frameworks the answer is a shrug wrapped in a diagram: an opaque loop somewhere inside a planner, a graph engine, a pile of abstractions you don't own. Here the answer is shorter and less flattering to the word "agent." A run is one recursive function — about fifty lines — and this page opens the cover on a single shift of it. No diagram for this section, because the honest version of the picture is just the next few sections, read in order.
the run, DEFINED
1. one trip around model → tools → append,
repeated until the model stops asking for tools, signals done, or
spends its step budget — returning
%{result, steps, events, log}.
One word in that definition is load-bearing: step. A step is one model turn that called tools. If the model asks for three tools in a single turn, that's three tool calls sharing one step number — the counter bumps once per trip around the loop, not once per tool. So "twelve steps" means twelve times the model was consulted, not twelve things that happened. Hold that distinction; the whole trace reads off it.
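To make the step-versus-tool-call distinction concrete, here is a minimal sketch that counts both from a list of trace events. The module name `TraceCount` and the event shape (a map with a `:step` key, one map per tool call) are assumptions for illustration, not the system's own code.

```elixir
defmodule TraceCount do
  # `events` is a list of maps with a :step key, one map per tool call.
  def summary(events) do
    steps = events |> Enum.map(& &1.step) |> Enum.uniq() |> length()
    %{steps: steps, tool_calls: length(events)}
  end
end

# Three tool calls share step 1; a fourth call belongs to step 2.
events = [
  %{step: 1, tool: "shell"},
  %{step: 1, tool: "fetch"},
  %{step: 1, tool: "search"},
  %{step: 2, tool: "vfs_write"}
]

IO.inspect(TraceCount.summary(events))
# prints %{steps: 2, tool_calls: 4}
```

Four things happened, but the model was consulted twice, and "steps" reports the second number.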
one trip around the LOOP
Here is the entire machine. The run builds its state, seeds the
transcript with two messages — the system prompt and the user's task — and
then calls loop/2 on it. Each pass does exactly one thing: ask
the model, and branch on the answer.
flowchart TD
  seed["seed transcript:<br/>[system, task]"] --> ask["ask the model<br/>(one LLM turn)"]
  ask --> q{"did it call<br/>tools?"}
  q -- "no tool calls" --> fin["finish: return the<br/>model's text as result"]
  q -- "tool calls" --> run["run every tool call<br/>(150s guillotine each)"]
  run --> done{"a done tool, or<br/>an error?"}
  done -- "done tool" --> fin2["finish: return<br/>done's result"]
  done -- "neither" --> app["append [assistant | tool results]<br/>step + 1"]
  app --> cap{"step ≥ max?"}
  cap -- "yes" --> stop["finish, stopped:<br/>reached max_steps"]
  cap -- "no" --> ask
  err["LLM error"] -. "any turn" .-> fin3["finish, error: ...<br/>(a string, never a crash)"]
  style ask fill:#9fc4e8,stroke:#121316,stroke-width:2.5px
  style fin fill:#13d943,stroke:#121316
  style fin2 fill:#13d943,stroke:#121316
  style stop fill:#f3c5a3,stroke:#121316
  style fin3 fill:#f3c5a3,stroke:#121316
Trace the exits. There are two clean ones and two with a string attached.
The model stops calling tools — its text content becomes the result,
the run is over. The model calls the done tool — its exec
returns a non-nil value that threads up through tool execution and
short-circuits the loop with that value. Those are the two ways a run
succeeds. The other two are guards: the step counter hits
max and the run finishes with the literal string
"stopped: reached max_steps (N)"; or the model call errors
and the run finishes with "error: ...". That last one is the
loop's whole philosophy in one clause — a run never crashes out; failure
becomes the result string. No planner, no graph engine, no framework. The
transcript is the state, and this is all of it.
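The four exits can be sketched as one recursive function. This is a minimal illustration under stated assumptions, not the real `loop/2`: the module name, the `ask` function standing in for the model call, the message shapes, and the "unknown tool" string are all hypothetical. Following the flowchart, the counter is bumped only on the append branch.

```elixir
defmodule LoopSketch do
  # `ask` is a hypothetical stand-in for the model call: it takes the
  # transcript and returns {:ok, message} or {:error, reason}.
  # `tools` is a map of tool name => one-argument function.
  def run(ask, tools, system, task, max_steps \\ 12) do
    state = %{
      messages: [%{role: "system", content: system}, %{role: "user", content: task}],
      ask: ask,
      tools: tools,
      max: max_steps
    }

    loop(state, 0)
  end

  # Guard exit: the step budget is spent.
  defp loop(%{max: max}, step) when step >= max do
    %{result: "stopped: reached max_steps (#{max})", steps: step}
  end

  defp loop(state, step) do
    case state.ask.(state.messages) do
      # Guard exit: a failed model call becomes the result string.
      {:error, reason} ->
        %{result: "error: #{inspect(reason)}", steps: step}

      # Clean exit: no tool calls, so the model's text is the result.
      {:ok, %{content: text, tool_calls: []}} ->
        %{result: text, steps: step}

      {:ok, %{tool_calls: calls} = msg} ->
        case run_tools(calls, state.tools) do
          # Clean exit: the done tool short-circuits the loop.
          {:done, result} ->
            %{result: result, steps: step}

          {:cont, tool_msgs} ->
            assistant = Map.put(msg, :role, "assistant")
            messages = state.messages ++ [assistant | tool_msgs]
            loop(%{state | messages: messages}, step + 1)
        end
    end
  end

  # Every call in one turn runs here; the caller bumps the step once.
  defp run_tools(calls, tools) do
    Enum.reduce_while(calls, {:cont, []}, fn %{name: name, args: args}, {:cont, acc} ->
      case name do
        "done" ->
          {:halt, {:done, args["result"]}}

        _ ->
          out =
            case Map.fetch(tools, name) do
              {:ok, fun} -> fun.(args)
              # Hypothetical wording for a tool that does not exist.
              :error -> "tool error: unknown tool #{name}"
            end

          {:cont, {:cont, acc ++ [%{role: "tool", name: name, content: out}]}}
      end
    end)
  end
end
```

Passing a scripted `ask` function makes every exit testable without a network: return tool calls forever to hit the budget, return `{:error, reason}` to see the error string, or return an empty `tool_calls` list to finish cleanly.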
what the model actually SEES
Depth rung — skip it if the cycle was enough. The reason there's no
hidden state is that the transcript is the state, and it's a flat
list of messages you could print. It starts as [system, task].
Every model turn appends an assistant message; every tool the model called
appends its output back as a role:"tool" message. The next turn
sees the whole history — that's how the model "remembers" what its last tool
call returned. It doesn't; the loop hands it back.
sequenceDiagram
  participant L as loop/2
  participant M as OpenRouter (the model)
  participant T as tools
  L->>M: messages + tool specs
  M-->>L: assistant msg + tool_calls
  L->>T: run each call
  T-->>L: outputs (≤4000 chars each)
  Note over L: append [assistant | tool results]<br/>step + 1
  L->>M: the longer transcript, again
  M-->>L: tool_calls: [] + text
  Note over L: no calls → finish
Two housekeeping facts keep that transcript honest. Assistant messages are
stripped to only role, content, and
tool_calls before they're kept — nothing else the provider
returned rides along. And tool outputs are truncated to 4000 characters
on the way in, so a chatty tool can't blow out the context. The model turn
itself is a single call to OpenRouter's OpenAI-compatible
chat/completions endpoint; the default model is xiaomi/mimo-v2.5,
overridable per-run or with WB_LLM_MODEL, at temperature 0.4.
The API key lives host-side in OPENROUTER_API_KEY — the agent
never sees it. Secrets are held by reference, not handed into the loop.
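The two housekeeping rules are small enough to sketch directly. The module and function names below are hypothetical, and the string-keyed message shape is an assumption; a real tool message on an OpenAI-compatible API would also carry the id of the call it answers, which is left out here.

```elixir
defmodule TranscriptSketch do
  @max_tool_chars 4000

  # Keep only the three fields the text names; anything else the
  # provider returned (usage counts, ids, and so on) is dropped.
  def keep_assistant(msg), do: Map.take(msg, ["role", "content", "tool_calls"])

  # Clip a tool's output before it is appended as a role:"tool" message.
  def tool_message(output) when is_binary(output) do
    %{"role" => "tool", "content" => String.slice(output, 0, @max_tool_chars)}
  end
end
```

`String.slice/3` counts graphemes rather than bytes, so this reading of "4000 characters" is itself an assumption about how the real limit is measured.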
nothing waits FOREVER
This is the section that earns the word "engine." Every tool call runs
inside a killable task: Task.async to start it, then
Task.yield(task, 150_000) || Task.shutdown(task, :brutal_kill).
In English: the tool gets 150 seconds, and if it doesn't return, it's killed.
And here's the part that makes it not just a timeout — when a tool is killed,
the model's next context contains, as an ordinary tool result:
tool error: git timed out after 150s (killed)
A timeout is not an exception that unwinds the run. It's a string the model reads on its next turn and reacts to — retry, work around, file an issue. The run never stalls; the loop keeps its rhythm even when a single tool dies.
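Here is a minimal sketch of that kill-bound, built on the same `Task.yield/2` and `Task.shutdown/2` idiom. The module name `ToolGuard`, the configurable timeout argument, and the wording of the crash message are assumptions; only the timeout string follows the text.

```elixir
defmodule ToolGuard do
  # `fun` is a zero-arity function that runs the tool and returns a string.
  # The default is the 150 s from the text; callers may pass a smaller
  # value (whole seconds keep the message tidy).
  def call(name, fun, timeout_ms \\ 150_000) do
    task = Task.async(fun)

    case Task.yield(task, timeout_ms) || Task.shutdown(task, :brutal_kill) do
      {:ok, output} ->
        output

      # Only reachable when the caller traps exits; wording is hypothetical.
      {:exit, reason} ->
        "tool error: #{name} crashed: #{inspect(reason)}"

      # yield timed out and shutdown killed the task.
      nil ->
        "tool error: #{name} timed out after #{div(timeout_ms, 1000)}s (killed)"
    end
  end
end
```

Either way the caller gets a string back, which is what lets the loop append it as an ordinary tool result and keep going.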
Then there's the war story. The LLM call itself was also observed
to hang: runs stalled for ten-plus minutes inside a single completion, and the
HTTP client's own 120-second timeout was seen not to fire.
So the model turn sits inside a second, outer kill-bound: per-request 120s,
up to two retries on the usual transient codes, and a hard outer deadline of
(retries + 1) × 120s + 15s = 375 seconds, after which the
whole completion task is killed and returns {:error, :llm_hard_timeout}.
Two guillotines, one inside the other.
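The outer deadline is plain arithmetic plus one more killable task. A sketch, assuming a hypothetical zero-arity `complete` function that performs the request (with its own per-request timeout and retries) and returns `{:ok, message}` or `{:error, reason}`:

```elixir
defmodule LlmDeadline do
  @per_request_ms 120_000
  @retries 2
  @slack_ms 15_000

  # (retries + 1) * 120 s + 15 s = 375 s
  def outer_ms, do: (@retries + 1) * @per_request_ms + @slack_ms

  # Run the whole completion, retries included, under one hard deadline.
  def call(complete, deadline_ms \\ outer_ms()) do
    task = Task.async(complete)

    case Task.yield(task, deadline_ms) || Task.shutdown(task, :brutal_kill) do
      {:ok, reply} -> reply
      {:exit, reason} -> {:error, reason}
      nil -> {:error, :llm_hard_timeout}
    end
  end
end
```

The inner timeout is allowed to fail silently; the outer one does not depend on the HTTP client behaving.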
The third bound is the step budget. Here is every bound in one place, with what the model (or the caller) sees when each is breached:
| bound | constant | what it limits | on breach |
|---|---|---|---|
| tool call | 150 s, brutal-kill | any single tool | a tool error: ... timed out result the model reads |
| LLM turn | 120 s/req · 2 retries · 375 s outer | one model completion | {:error, :llm_hard_timeout} → finishes the run with a string |
| max_steps | default 12 | trips around the loop | "stopped: reached max_steps (N)" as the result |
| fetch | 20 s | one GET | fetch returns its own error string |
| image | 2 per run | generations attempted | "image budget exhausted (2/run)..." |
On max_steps: the default is 12, but treat that as a number almost
nobody experiences. Every real surface sets its own: the HTTP /api/run
endpoint defaults to 40, the keeper worker to 60, todo-dispatch to 60, the
autopoet to 80, and brandnana's ask endpoint to 250, while a workflow
agent-component drops to a deliberately tight 6. The budget is a property of
the caller, not of the loop.
nine tools, and three you EARN
Every agent gets the same nine base tools. None of them is a way to run an
arbitrary command on the host — that hatch was removed on purpose (next
section). Three more tools — git, publish,
image — are host-brokered and granted only to trusted
(exec) agents: the agent supplies intent, the host runs a fixed
operation. done is a tool too, not magic; calling it is how the
model says "I'm finished, here's the result."
| tool | the agent supplies | what the host does | bound / note |
|---|---|---|---|
| shell | a pipeline string | runs it in the in-WASM pipe shell over coreutils + jq/grep | no OS process; wasmtime per stage |
| search | a query | semantic recall over the workdir's own files (top 5) | the files are the memory |
| fetch | a URL | GET, strips HTML to text | 20 s · truncated to 4000 chars |
| web_search | a query | host-brokered keyless SERP, ≤8 results | title · url · snippet |
| wb | CLI args | runs the wb CLI in-process | vars, toolkit list/show/run |
| file_issue | title · need · tried | files a metacognitive ticket, tells the agent to carry on | the wall-hit escape valve |
| vfs_read / vfs_write | a path · content | shared OS workdir (exec) or per-run in-memory VFS (non-exec) | path must stay inside the workdir |
| done | a result | short-circuits the loop with that result | the clean exit |
| exec grant below | | | |
| git | only a commit message | host runs commit_and_push(workdir, msg, tenant) | non-exec → "git not permitted" |
| publish | nothing — the intent | copies content/** + blog/** to the public web root | same permission gate |
| image | a prompt · dest path | host-held image lane, path-traversal guarded | 2/run, counted at attempt |
Two of those rows hide a real story. shell is the in-WASM pipe
shell: it speaks |, ;, && and || with
short-circuit, variables, and redirection confined to preopened dirs, but
there is no OS process behind it; each stage is a wasm instance. It also
treats 2>/dev/null, 2>&1, and any
/dev/* redirect as a silent no-op rather than an error, which
sounds trivial until you learn that these redirects were once the single
biggest source of a production agent's per-run thrash. The image budget of two is
spent on attempt, not on success: a failed generation still burns a
slot, and once both are spent the tool answers "image budget exhausted
(2/run) — plan banners, don't spray".
brokered, not BESTOWED
Depth rung. "The host runs git for you" sounds like native shell access
with extra steps. It isn't, and the difference is the whole trust model: the
agent can never choose a command line. For git it supplies
a commit message and nothing else — the host decides it's
commit_and_push against this workdir for this
tenant. There is no string the agent can write that becomes
rm -rf on the host, because there is no place to write it.
This is enforced, not promised. The old native-exec hatch (a generic
run tool, an OS sandbox wrapper) was deleted, and a test stands
guard so it can't quietly return:
test "no run tool on either surface" do
  refute "run" in tool_names(base_agent)
  refute "run" in tool_names(exec_agent)
  # Workbooks.Sandbox (native bwrap/seatbelt) is gone;
  # native-compiler fallbacks must error, not exec.
end
The other half of trust is containment. Even a trusted agent's
vfs_read and vfs_write paths must resolve strictly
inside its workdir; a path that escapes returns
"write blocked: path escapes your working dir". Without that guard
an exec agent could read /etc/passwd, the host's own
.ex files, or another tenant's repo — the hole that would make
any "confined to the config layer" claim a lie. Non-exec agents that try
git get "git not permitted (no exec capability)" —
and note that's a tool result the model reads, not an HTTP 403 thrown
at a caller. The permission is part of the conversation.
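A containment check of this kind can be sketched with `Path.expand/2`. This is an illustration under assumptions, not the host's actual guard: the module and function names are hypothetical, it assumes Unix-style paths, and it does not resolve symlinks, which a real guard would also have to consider.

```elixir
defmodule Workdir do
  # Resolve `path` against `workdir` and allow it only if the expanded
  # result is the workdir itself or lies beneath it.
  def check_write(workdir, path) do
    root = Path.expand(workdir)
    full = Path.expand(path, root)

    if full == root or String.starts_with?(full, root <> "/") do
      {:ok, full}
    else
      {:error, "write blocked: path escapes your working dir"}
    end
  end
end
```

Comparing against `root <> "/"` rather than `root` alone matters: without the separator, a sibling directory such as `/srv/work-evil` would pass a prefix check against `/srv/work`.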
the run narrates itself, THREE ways
You should never have to guess what a run did. By construction it leaves three records, at three latencies — and the middle one is written whether the caller asked for it or not.
flowchart TD
  step["one step event<br/>{step, agent, tool, args, output,<br/>exit_code, error, dur_ms, ts}"]
  step --> live["live: on_step callback →<br/>WS frames: {type:step,...} {type:done}"]
  step --> jsonl["always-on: _steps.jsonl<br/>one JSON line/step, output ≤200ch"]
  step --> org["at finish: events.org<br/>one headline/step, OQL-queryable"]
  jsonl --> ledger["the signed ledger<br/>(hash-chains the raw lines)"]
  jsonl --> dreams["the dream phase<br/>(digests it)"]
  jsonl --> wire["public activity wire<br/>(grouped by agent tag)"]
  style step fill:#9fc4e8,stroke:#121316,stroke-width:2.5px
  style jsonl fill:#aee5c2,stroke:#121316
Read the fan-out. Each step produces one event with real fields —
step, agent (which agent, when there is one),
tool, args, output,
exit_code, error, dur_ms (monotonic),
and ts. That event flows to three sinks. Live: an
on_step callback fans out to WebSocket subscribers, so
GET /api/run/:id/stream pushes {type:"step"} frames
then a final {type:"done"}. Always-on:
log_step appends one JSON line per step to
<workdir>/_steps.jsonl — lock-free, output clipped to 200
chars, "regardless of any caller-supplied callback, so nothing escapes by
construction"; even the write failing is swallowed rather than disrupting the
run. At finish: event_log renders the events into an
org-mode document, events.org.
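The fan-out and the "a failing sink never disrupts the run" rule can be sketched as follows. The module `FanOut` and the idea of sinks as plain one-argument functions are assumptions for illustration; JSON encoding and the actual file append are left out.

```elixir
defmodule FanOut do
  # Deliver one event to every sink. A sink that raises is ignored,
  # so a broken writer or subscriber cannot take the run down with it.
  def emit(event, sinks) do
    Enum.each(sinks, fn sink ->
      try do
        sink.(event)
      rescue
        _ -> :ok
      end
    end)

    event
  end

  # Clip the output field for a sink with a size limit (200 for the jsonl line).
  def clip(event, max), do: Map.update!(event, :output, &String.slice(&1, 0, max))
end
```

A caller might pass a list such as `[on_step, jsonl_writer]`, where the second function clips with `FanOut.clip(event, 200)` before writing; both names are hypothetical.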
That middle layer is load-bearing, not decorative. The
signed ledger hash-chains the raw jsonl
lines and signs the head with the tenant's key; the
dream phase digests the same file; the public
live-activity wire groups it by the per-step agent tag. The raw
trace is the substrate everything downstream reads. (It's also on the
never-leaves-home privacy list, alongside the ledger and telemetry db.)
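Hash-chaining raw lines is a standard construction, sketched here with SHA-256 from Erlang's `:crypto`. The chaining rule, the all-zero genesis value, and the module name are assumptions; the real ledger's format is not shown in this lesson, and signing the head with the tenant's key is left out.

```elixir
defmodule ChainSketch do
  # Fold each raw line into a running digest:
  #   head_n = sha256(head_(n-1) <> line_n)
  # Changing any earlier line changes every later head.
  def head(lines) do
    genesis = :binary.copy(<<0>>, 32)

    lines
    |> Enum.reduce(genesis, fn line, prev -> :crypto.hash(:sha256, prev <> line) end)
    |> Base.encode16(case: :lower)
  end
end
```

Because each head depends on everything before it, a verifier only needs the signed final head and the file to detect an edit anywhere in the history.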
Here's one real step, both ways. The _steps.jsonl line —
the during-the-run view:
{"step":3,"agent":"waldo","tool":"shell","args":{"pipeline":"cat data.json | jq .users[].name | sort -u"},"output":"ada\ngrace\n","exit_code":0,"error":null,"dur_ms":412,"ts":1765532191}
And the same step at finish, rendered into events.org — a
headline tagged :tool_call:, an :ARGS: drawer
carrying the JSON, then the output:
* Agent run :session:
** step 3: shell :tool_call:
:PROPERTIES:
:ARGS: {"pipeline":"cat data.json | jq .users[].name | sort -u"}
:END:
ada
grace
* Result
published the roster page
Note the truncations as you read across the layers: 200 chars in the
jsonl, 300 in the org body, 4000 in the transcript the model saw. The trace
is a faithful skeleton, not a byte-perfect recording — and the
agent field stamps every line with which agent did the
work.
every failure becomes TEXT
Depth rung — the loop's failure philosophy gathered in one place. Nothing here throws the run off the rails; each failure lands in the transcript as a string the model reads on its next turn and (later) you read in the trace. A run always terminates with a result.
| what went wrong | what the model gets back |
|---|---|
| malformed tool args | shell error: required arg `pipeline` missing or not a string |
| a tool ran past 150 s | tool error: <name> timed out after 150s (killed) |
| image budget spent | image budget exhausted (2/run) — plan banners, don't spray |
| no exec capability | git not permitted (no exec capability) |
| path escapes workdir | write blocked: path escapes your working dir |
| the LLM call errored | run finishes: error: ... |
| ran out of steps | run finishes: stopped: reached max_steps (N) |
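The first row of the table is the easiest to sketch: validation that returns a string rather than raising. The module name and the tagged-tuple return shape are assumptions; the message text is the one from the table.

```elixir
defmodule ShellArgs do
  # Accept only a map whose "pipeline" value is a string.
  def validate(%{"pipeline" => pipeline}) when is_binary(pipeline), do: {:ok, pipeline}

  # Anything else becomes text the model can read on its next turn.
  def validate(_args) do
    {:error, "shell error: required arg `pipeline` missing or not a string"}
  end
end
```

The error branch is an ordinary return value, so the caller can append it to the transcript exactly as it would a successful output.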
And the most interesting failure isn't a failure of the loop — it's the
agent admitting it hit a wall. file_issue is a tool: when an
agent's equipment falls short ("my toolkit lacks a verb for this"), it files
a metacognitive ticket and the reply tells it to note it and carry on rather
than stall. The wall becomes a tool call, the run keeps moving, and the
autopoet picks the ticket up later. Hitting a limit is,
itself, just another line in the transcript.
HONESTY
What this machine is, and what it deliberately isn't.
- A step budget is a budget, not judgment. A run that hits max_steps stops mid-thought with a literal "stopped: reached max_steps"; it doesn't summarize, doesn't wrap up gracefully. The number is a leash, not intelligence.
- 150 seconds kills slow-but-honest work too. A legitimately long-running tool gets the same guillotine as a hung one. The bound buys you a run that can't wedge; it costs you the occasional good tool that needed three minutes.
- Truncation is everywhere: 4000 chars in the transcript, ~500 in a live frame, 300 in events.org, 200 in _steps.jsonl. The trace records faithfully what happened, not every byte, and not why.
- events.org only exists at finish. While a run is live, the during-the-run view is _steps.jsonl (and the WS stream). The narrative document is written when the loop returns.
- Trust is binary, and host-granted. An agent is exec or it isn't; it never acquires the grant itself. The capability flows down from the host, never up from the agent.
And the framing this all serves: this is observability for people building with agents — a trace you can read, a run that can't run away from you. It is never a pitch about software that runs itself. The loop makes the work legible and bounded. It doesn't make the human optional.
questions people actually ASK
At max_steps, does it summarize first?
No. The guard clause fires the moment step ≥ max and the run
finishes with the literal string "stopped: reached max_steps (N)".
There's no wrap-up turn. If you want a clean ending, give the budget room or
prompt the agent to call done when it's near the edge.
Can I watch a run live?
Yes. POST /api/run returns 202 immediately with
{id, status:"running"} — the run continues in a supervised
process that outlives the caller. Then GET /api/run/:id/stream
is a WebSocket of {type:"step"} frames ending in
{type:"done"}, or you can poll GET /api/run/:id
for %{status, steps, result, tools, events_org}.
Can the agent lie in its own trace?
_steps.jsonl is written by the host, per step, not by the
model — the agent doesn't author it. Whether someone can tamper
with it after the fact is a different question, and the answer is the
ledger: it hash-chains the raw lines and signs the
head, so editing history is detectable.
What model runs the loop?
Any OpenRouter chat model. The default is xiaomi/mimo-v2.5;
a per-agent definition sets a :MODEL: property, and
WB_LLM_MODEL overrides globally. The loop is model-agnostic —
it only needs a turn that can emit tool calls.
Can the model call several tools in one step?
Yes — and they share one step number. Tool execution reduces over all the calls in a turn without bumping the counter; the bump happens once per trip around the loop. So a step can carry one tool call or five.
Is exec honored over HTTP?
Only on desktop, or when WB_AGENT_EXEC=1 is set — never for
arbitrary multi-tenant callers. A request asking for exec:true
against a shared endpoint gets a plain non-exec run.
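That rule reduces to a small predicate. A sketch, where `ExecGate`, the `desktop?` flag, and the injectable `env` map are hypothetical; only the `WB_AGENT_EXEC=1` check comes from the text.

```elixir
defmodule ExecGate do
  # A request gets exec only if it asked for it AND the host allows it:
  # either a desktop install, or WB_AGENT_EXEC=1 in the host environment.
  def effective_exec(requested?, desktop?, env \\ System.get_env()) do
    requested? and (desktop? or Map.get(env, "WB_AGENT_EXEC") == "1")
  end
end
```

When the predicate is false, the request is not rejected; it simply runs as a plain non-exec agent, as described above.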
keep GOING
This was one shift under the microscope. The parent frames the whole job; the live siblings show where the trace and the wall-hits go next.