loops — what one agent run actually is

twenty minutes of WHAT, exactly

"Agent" is the most hand-waved word in software. The parent lesson sold tenure — a worker with files, a schedule, memory, hired for outcomes. Fine. But when that worker "works for twenty minutes," what is actually executing? Why doesn't it hang forever? And when it does something strange at two in the morning, what can you actually read back?

In most agent frameworks the answer is a shrug wrapped in a diagram: an opaque loop somewhere inside a planner, a graph engine, a pile of abstractions you don't own. Here the answer is shorter and less flattering to the word "agent." A run is one recursive function — about fifty lines — and this page opens the cover on a single shift of it. No diagram for this section, because the honest version of the picture is just the next few sections, read in order.

the run, DEFINED

run /rʌn/ noun

1. one trip around model → tools → append, repeated until the model stops asking for tools, signals done, or spends its step budget — returning %{result, steps, events, log}.

One word in that definition is load-bearing: step. A step is one model turn that called tools. If the model asks for three tools in a single turn, that's three tool calls sharing one step number: the counter bumps once per trip around the loop, not once per tool. So "twelve steps" means twelve model turns that called tools, not twelve things that happened. Hold that distinction; the whole trace reads off it.

one trip around the LOOP

Here is the entire machine. The run builds its state, seeds the transcript with two messages — the system prompt and the user's task — and then calls loop/2 on it. Each pass does exactly one thing: ask the model, and branch on the answer.

flowchart TD
  seed["seed transcript:<br/>[system, task]"] --> ask["ask the model<br/>(one LLM turn)"]
  ask --> q{"did it call<br/>tools?"}
  q -- "no tool calls" --> fin["finish — return the<br/>model's text as result"]
  q -- "tool calls" --> run["run every tool call<br/>(150s guillotine each)"]
  run --> done{"a done tool, or<br/>an error?"}
  done -- "done tool" --> fin2["finish — return<br/>done's result"]
  done -- "neither" --> app["append [assistant | tool results]<br/>step + 1"]
  app --> cap{"step ≥ max?"}
  cap -- "yes" --> stop["finish — stopped:<br/>reached max_steps"]
  cap -- "no" --> ask
  err["LLM error"] -. "any turn" .-> fin3["finish — error: ...<br/>(a string, never a crash)"]
  style ask fill:#9fc4e8,stroke:#121316,stroke-width:2.5px
  style fin fill:#13d943,stroke:#121316
  style fin2 fill:#13d943,stroke:#121316
  style stop fill:#f3c5a3,stroke:#121316
  style fin3 fill:#f3c5a3,stroke:#121316

Trace the exits. There are two clean ones and two with a string attached. The model stops calling tools — its text content becomes the result, the run is over. The model calls the done tool — its exec returns a non-nil value that threads up through tool execution and short-circuits the loop with that value. Those are the two ways a run succeeds. The other two are guards: the step counter hits max and the run finishes with the literal string "stopped: reached max_steps (N)"; or the model call errors and the run finishes with "error: ...". That last one is the loop's whole philosophy in one clause — a run never crashes out; failure becomes the result string. No planner, no graph engine, no framework. The transcript is the state, and this is all of it.
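The four exits can be sketched as one small recursive function. This is a minimal sketch, not the real loop: `LoopSketch`, the `ask` function standing in for the model call, the `tools` map, and the shape of each tool call (`%{name: ..., args: ...}`) are all assumptions made for illustration, and the `log` field of the real return value is left out.

```elixir
defmodule LoopSketch do
  # Hypothetical stand-in for the real loop. `ask` plays the model,
  # `tools` is a map of tool name => one-argument function.
  def run(task, ask, tools, max_steps \\ 12) do
    messages = [
      %{role: "system", content: "you are a worker"},
      %{role: "user", content: task}
    ]

    loop(%{messages: messages, step: 0, max: max_steps, ask: ask, tools: tools}, [])
  end

  # Guard exit: the step budget is spent.
  defp loop(%{step: step, max: max}, events) when step >= max do
    finish("stopped: reached max_steps (#{max})", step, events)
  end

  defp loop(state, events) do
    case state.ask.(state.messages) do
      # Guard exit: the model call failed, so failure becomes the result string.
      {:error, reason} ->
        finish("error: #{inspect(reason)}", state.step, events)

      # Clean exit: no tool calls, the model's text is the result.
      {:ok, %{tool_calls: []} = msg} ->
        finish(msg.content, state.step, events)

      {:ok, %{tool_calls: calls} = msg} ->
        # One bump per trip around the loop, however many tools were called.
        step = state.step + 1
        {results, done} = run_calls(calls, state.tools)
        events = events ++ Enum.map(results, fn r -> Map.put(r, :step, step) end)

        if done != nil do
          # Clean exit: the done tool short-circuits with its own value.
          finish(done, step, events)
        else
          tool_msgs = Enum.map(results, fn r -> %{role: "tool", content: r.output} end)
          messages = state.messages ++ [msg | tool_msgs]
          loop(%{state | messages: messages, step: step}, events)
        end
    end
  end

  defp run_calls(calls, tools) do
    Enum.reduce(calls, {[], nil}, fn %{name: name, args: args}, {acc, done} ->
      tool = Map.get(tools, name, fn _ -> "tool error: unknown tool #{name}" end)
      output = tool.(args)
      done = if name == "done", do: output, else: done
      {acc ++ [%{tool: name, output: output}], done}
    end)
  end

  defp finish(result, steps, events), do: %{result: result, steps: steps, events: events}
end
```

Because the model is just a function argument here, each exit can be exercised by passing a scripted `ask`, with no network involved.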

what the model actually SEES

Depth rung — skip it if the cycle was enough. The reason there's no hidden state is that the transcript is the state, and it's a flat list of messages you could print. It starts as [system, task]. Every model turn appends an assistant message; every tool the model called appends its output back as a role:"tool" message. The next turn sees the whole history — that's how the model "remembers" what its last tool call returned. It doesn't; the loop hands it back.

sequenceDiagram
  participant L as loop/2
  participant M as OpenRouter (the model)
  participant T as tools
  L->>M: messages + tool specs
  M-->>L: assistant msg + tool_calls
  L->>T: run each call
  T-->>L: outputs (≤4000 chars each)
  Note over L: append [assistant | tool results]<br/>step + 1
  L->>M: the longer transcript, again
  M-->>L: tool_calls: [] + text
  Note over L: no calls → finish

Two housekeeping facts keep that transcript honest. Assistant messages are stripped to only role, content, and tool_calls before they're kept — nothing else the provider returned rides along. And tool outputs are truncated to 4000 characters on the way in, so a chatty tool can't blow out the context. The model turn itself is a single call to OpenRouter's OpenAI-compatible chat/completions endpoint; the default model is xiaomi/mimo-v2.5, overridable per-run or with WB_LLM_MODEL, at temperature 0.4. The API key lives host-side in OPENROUTER_API_KEY — the agent never sees it. Secrets are held by reference, not handed into the loop.
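Those two housekeeping rules are small enough to sketch directly. The module and function names below are hypothetical, and string-keyed maps are an assumption about how a decoded provider response looks; only the three kept fields and the 4000-character limit come from the text above.

```elixir
defmodule TranscriptSketch do
  @kept ["role", "content", "tool_calls"]
  @max_tool_chars 4000

  # Keep only the three fields named above; anything else the provider
  # returned (usage counters, ids, and so on) is dropped.
  def strip_assistant(msg), do: Map.take(msg, @kept)

  # Clip a tool's output before it becomes a role:"tool" message.
  def tool_message(output) when is_binary(output) do
    %{"role" => "tool", "content" => String.slice(output, 0, @max_tool_chars)}
  end
end
```

`String.slice/3` counts graphemes rather than bytes, so this sketch clips on character boundaries; whether the real loop clips by characters or bytes is not stated here.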

nothing waits FOREVER

This is the section that earns the word "engine." Every tool call runs inside a killable task: Task.async to start it, then Task.yield(task, 150_000) || Task.shutdown(task, :brutal_kill). In English: the tool gets 150 seconds, and if it doesn't return, it's killed. And here's the part that makes it not just a timeout — when a tool is killed, the model's next context contains, as an ordinary tool result:

tool error: git timed out after 150s (killed)

A timeout is not an exception that unwinds the run. It's a string the model reads on its next turn and reacts to — retry, work around, file an issue. The run never stalls; the loop keeps its rhythm even when a single tool dies.
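A minimal sketch of that kill-bound, built on the same `Task.async`, `Task.yield`, and `Task.shutdown` calls named above. `GuillotineSketch` and `call_tool/3` are hypothetical names, and the configurable timeout exists only so the behaviour can be tried without waiting 150 seconds.

```elixir
defmodule GuillotineSketch do
  # Run a zero-arity tool function under a hard deadline. A timeout comes
  # back as an ordinary string, never as a raised exception.
  def call_tool(name, fun, timeout_ms \\ 150_000) do
    task = Task.async(fun)

    # yield returns nil on timeout; shutdown then kills the task and
    # returns nil unless a reply slipped in at the last moment.
    case Task.yield(task, timeout_ms) || Task.shutdown(task, :brutal_kill) do
      {:ok, output} -> output
      _ -> "tool error: #{name} timed out after #{div(timeout_ms, 1000)}s (killed)"
    end
  end
end
```

Calling `GuillotineSketch.call_tool("git", fn -> Process.sleep(:infinity) end, 1_000)` returns the timeout string after one second, and the sleeping process is gone.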

Then there's the war story. The LLM call itself was also observed to hang: runs stalled for ten-plus minutes inside one completion, because the HTTP client's own 120-second timeout never fired. So the model turn sits inside a second, outer kill-bound: a 120 s per-request timeout, up to two retries on the usual transient codes, and a hard outer deadline of (retries + 1) × 120 s + 15 s = 375 seconds, after which the whole completion task is killed and returns {:error, :llm_hard_timeout}. Two guillotines, one inside the other.
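The outer deadline is plain arithmetic wrapped around a killable task. In this sketch the constants come from the paragraph above, while `LlmDeadlineSketch`, `hard_deadline_ms/0`, and `complete/2` are hypothetical names; the per-request timeout and retry logic inside `request_fun` are assumed to exist and are not shown.

```elixir
defmodule LlmDeadlineSketch do
  @per_request_ms 120_000
  @retries 2
  @slack_ms 15_000

  # (retries + 1) * 120 s + 15 s = 375 s
  def hard_deadline_ms, do: (@retries + 1) * @per_request_ms + @slack_ms

  # `request_fun` is assumed to do its own per-request timeout and retries.
  # This outer bound only exists for the case where that inner timeout
  # fails to fire.
  def complete(request_fun, deadline_ms \\ hard_deadline_ms()) do
    task = Task.async(request_fun)

    case Task.yield(task, deadline_ms) || Task.shutdown(task, :brutal_kill) do
      {:ok, reply} -> reply
      _ -> {:error, :llm_hard_timeout}
    end
  end
end
```

The 15 seconds of slack means the outer bound can only win when all three inner attempts have already overrun their own limits.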

The third bound is the step budget. Here is every bound in one place, with what the model sees when it breaches:

bound	constant	what it limits	on breach
tool call	150 s, brutal-kill	any single tool	a `tool error: ... timed out` result the model reads
LLM turn	120 s/req · 2 retries · 375 s outer	one model completion	`{:error, :llm_hard_timeout}` → finishes the run with a string
max_steps	default 12	trips around the loop	`"stopped: reached max_steps (N)"` as the result
fetch	20 s	one GET	fetch returns its own error string
image	2 per run	generations attempted	`"image budget exhausted (2/run)..."`

On max_steps: the default is 12, but almost nobody runs at the default. Every real surface sets its own budget: the HTTP /api/run endpoint defaults to 40, the keeper worker to 60, todo-dispatch to 60, the autopoet to 80, and brandnana's ask endpoint to 250, while a workflow agent-component drops to a deliberately tight 6. The budget is a property of the caller, not of the loop.

nine tools, and three you EARN

Every agent gets the same nine base tools. None of them is a way to run an arbitrary command on the host — that hatch was removed on purpose (next section). Three more tools — git, publish, image — are host-brokered and granted only to trusted (exec) agents: the agent supplies intent, the host runs a fixed operation. done is a tool too, not magic; calling it is how the model says "I'm finished, here's the result."

tool	the agent supplies	what the host does	bound / note
shell	a pipeline string	runs it in the in-WASM pipe shell over coreutils + jq/grep	no OS process; wasmtime per stage
search	a query	semantic recall over the workdir's own files (top 5)	the files are the memory
fetch	a URL	GET, strips HTML to text	20 s · truncated to 4000 chars
web_search	a query	host-brokered keyless SERP, ≤8 results	title · url · snippet
wb	CLI args	runs the `wb` CLI in-process	vars, toolkit list/show/run
file_issue	title · need · tried	files a metacognitive ticket, tells the agent to carry on	the wall-hit escape valve
vfs_read / vfs_write	a path · content	shared OS workdir (exec) or per-run in-memory VFS (non-exec)	path must stay inside the workdir
done	a result	short-circuits the loop with that result	the clean exit
— exec grant below —
git	only a commit message	host runs `commit_and_push(workdir, msg, tenant)`	non-exec → `"git not permitted"`
publish	nothing — the intent	copies `content/ + blog/` to the public web root	same permission gate
image	a prompt · dest path	host-held image lane, path-traversal guarded	2/run, counted at attempt

Two of those rows hide a real story. shell is the in-WASM pipe shell: it speaks |, ;, && and || with short-circuit, variables, and redirection confined to preopened dirs, but there is no OS process behind it; each stage is a wasm instance. It also treats 2>/dev/null, 2>&1, and any /dev/* redirect as a silent no-op rather than an error, which sounds trivial until you learn it was once the single biggest source of a production agent's per-run thrash. The image budget of two is spent on attempt, not on success: a failed generation still burns a slot, and once both are gone the tool answers with "image budget exhausted (2/run) — plan banners, don't spray".
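Counting at attempt rather than at success is a one-line ordering choice, sketched below. Only the limit of two and the exhausted message come from the text; `ImageBudgetSketch`, the `gen` function, and the "image saved" and "image error" strings are hypothetical.

```elixir
defmodule ImageBudgetSketch do
  @limit 2
  @exhausted "image budget exhausted (2/run) — plan banners, don't spray"

  # Returns {tool_result_string, attempts_so_far}.
  def generate(attempts, _gen) when attempts >= @limit, do: {@exhausted, attempts}

  def generate(attempts, gen) do
    # Bump the counter before looking at the outcome, so a failed
    # generation burns a slot exactly like a successful one.
    attempts = attempts + 1

    case gen.() do
      {:ok, path} -> {"image saved: #{path}", attempts}
      {:error, why} -> {"image error: #{why}", attempts}
    end
  end
end
```

Two failed generations in a row leave the counter at 2, so the third call never reaches `gen` at all.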

brokered, not BESTOWED

Depth rung. "The host runs git for you" sounds like native shell access with extra steps. It isn't, and the difference is the whole trust model: the agent can never choose a command line. For git it supplies a commit message and nothing else — the host decides it's commit_and_push against this workdir for this tenant. There is no string the agent can write that becomes rm -rf on the host, because there is no place to write it.

This is enforced, not promised. The old native-exec hatch (a generic run tool, an OS sandbox wrapper) was deleted, and a test stands guard so it can't quietly return:

test "no run tool on either surface" do
  refute "run" in tool_names(base_agent)
  refute "run" in tool_names(exec_agent)
  # Workbooks.Sandbox (native bwrap/seatbelt) is gone;
  # native-compiler fallbacks must error, not exec.
end

The other half of trust is containment. Even a trusted agent's vfs_read and vfs_write paths must resolve strictly inside its workdir; a path that escapes returns "write blocked: path escapes your working dir". Without that guard an exec agent could read /etc/passwd, the host's own .ex files, or another tenant's repo — the hole that would make any "confined to the config layer" claim a lie. Non-exec agents that try git get "git not permitted (no exec capability)" — and note that's a tool result the model reads, not an HTTP 403 thrown at a caller. The permission is part of the conversation.
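One common way to write such a containment check is to expand the requested path against the workdir and require the workdir as a prefix. This is a sketch of that general technique, not the real guard: `ContainSketch.resolve/2` is a hypothetical name, and it does not resolve symlinks, which a production check would also need to consider.

```elixir
defmodule ContainSketch do
  # Expand `rel` against the workdir and accept it only if the result is
  # strictly inside. The trailing "/" stops "/srv/work-evil" from passing
  # as a child of "/srv/work".
  def resolve(workdir, rel) do
    root = Path.expand(workdir)
    full = Path.expand(rel, root)

    if String.starts_with?(full, root <> "/") do
      {:ok, full}
    else
      {:error, "write blocked: path escapes your working dir"}
    end
  end
end
```

`Path.expand/2` collapses `..` segments and leaves absolute paths untouched, so both `../../etc/passwd` and `/etc/passwd` end up outside the root and are refused.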

the run narrates itself, THREE ways

You should never have to guess what a run did. By construction it leaves three records, at three latencies — and the middle one is written whether the caller asked for it or not.

flowchart TD
  step["one step event<br/>{step, agent, tool, args, output,<br/>exit_code, error, dur_ms, ts}"]
  step --> live["live — on_step callback →<br/>WS frames: {type:step,...} {type:done}"]
  step --> jsonl["always-on — _steps.jsonl<br/>one JSON line/step, output ≤200ch"]
  step --> org["at finish — events.org<br/>one headline/step, OQL-queryable"]
  jsonl --> ledger["the signed ledger<br/>(hash-chains the raw lines)"]
  jsonl --> dreams["the dream phase<br/>(digests it)"]
  jsonl --> wire["public activity wire<br/>(grouped by agent tag)"]
  style step fill:#9fc4e8,stroke:#121316,stroke-width:2.5px
  style jsonl fill:#aee5c2,stroke:#121316

Read the fan-out. Each step produces one event with real fields — step, agent (which agent, when there is one), tool, args, output, exit_code, error, dur_ms (monotonic), and ts. That event flows to three sinks. Live: an on_step callback fans out to WebSocket subscribers, so GET /api/run/:id/stream pushes {type:"step"} frames then a final {type:"done"}. Always-on: log_step appends one JSON line per step to <workdir>/_steps.jsonl — lock-free, output clipped to 200 chars, "regardless of any caller-supplied callback, so nothing escapes by construction"; even the write failing is swallowed rather than disrupting the run. At finish: event_log renders the events into an org-mode document, events.org.

That middle layer is load-bearing, not decorative. The signed ledger hash-chains the raw jsonl lines and signs the head with the tenant's key; the dream phase digests the same file; the public live-activity wire groups it by the per-step agent tag. The raw trace is the substrate everything downstream reads. (It's also on the never-leaves-home privacy list, alongside the ledger and telemetry db.)
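Hash-chaining raw lines means each line's hash covers the previous hash, so changing any earlier line changes the head. The sketch below shows the idea with SHA-256 and an all-zero starting value; both are assumptions, as are the names `ChainSketch`, `chain/1`, and `head/1`. Signing the head with the tenant's key is left out.

```elixir
defmodule ChainSketch do
  # Assumed starting value for the chain (64 hex zeros).
  @genesis String.duplicate("0", 64)

  # One hash per line; each hash covers the previous hash plus the raw line.
  def chain(lines) do
    Enum.scan(lines, @genesis, fn line, prev ->
      :crypto.hash(:sha256, prev <> line) |> Base.encode16(case: :lower)
    end)
  end

  # The head is the last link. It is the only value that needs signing.
  def head(lines), do: lines |> chain() |> List.last(@genesis)
end
```

Editing line 1 of a thousand-line trace changes every hash after it, which is why signing only the head is enough to make tampering detectable.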

Here's one real step, both ways. The _steps.jsonl line — the during-the-run view:

{"step":3,"agent":"waldo","tool":"shell","args":{"pipeline":"cat data.json | jq .users[].name | sort -u"},"output":"ada\ngrace\n","exit_code":0,"error":null,"dur_ms":412,"ts":1765532191}

And the same step at finish, rendered into events.org — a headline tagged :tool_call:, an :ARGS: drawer carrying the JSON, then the output:

* Agent run                                                  :session:
** step 3: shell                                  :tool_call:
   :PROPERTIES:
   :ARGS: {"pipeline":"cat data.json | jq .users[].name | sort -u"}
   :END:
   ada grace
* Result
  published the roster page

Note the truncations as you read across the layers: 200 chars in the jsonl, 300 in the org body, 4000 in the transcript the model saw. The trace is a faithful skeleton, not a byte-perfect recording — and the agent field stamps every line with which agent did the work.
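The per-layer limits and the org rendering can be sketched as two small functions. The three limits come from the text; `TraceViewsSketch`, `clip/2`, `org_headline/1`, and the `args_json` field (the args already encoded as a JSON string) are hypothetical, and the tag column alignment of the real events.org is not reproduced.

```elixir
defmodule TraceViewsSketch do
  @limits %{jsonl: 200, org: 300, transcript: 4000}

  # Clip one tool output to the limit of the layer it is headed for.
  def clip(output, layer), do: String.slice(output, 0, Map.fetch!(@limits, layer))

  # Render one step event as an org headline with an :ARGS: drawer.
  def org_headline(%{step: n, tool: tool, args_json: args, output: out}) do
    body = out |> clip(:org) |> String.replace("\n", " ") |> String.trim()

    """
    ** step #{n}: #{tool} :tool_call:
       :PROPERTIES:
       :ARGS: #{args}
       :END:
       #{body}
    """
  end
end
```

The same event therefore yields three different lengths of output depending on which record you read, which is worth remembering when a jsonl line looks cut short.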

every failure becomes TEXT

Depth rung — the loop's failure philosophy gathered in one place. Nothing here throws the run off the rails; each failure lands in the transcript as a string the model reads on its next turn and (later) you read in the trace. A run always terminates with a result.

what went wrong	what the model gets back
malformed tool args	`shell error: required arg 'pipeline' missing or not a string`
a tool ran past 150 s	`tool error: <name> timed out after 150s (killed)`
image budget spent	`image budget exhausted (2/run) — plan banners, don't spray`
no exec capability	`git not permitted (no exec capability)`
path escapes workdir	`write blocked: path escapes your working dir`
the LLM call errored	run finishes: `error: ...`
ran out of steps	run finishes: `stopped: reached max_steps (N)`
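The pattern behind the first rows of that table is that a tool function always returns a string, whatever it was handed. A minimal sketch, with hypothetical names (`FailureTextSketch`, `shell/1`, `git/2`) and placeholder success strings; only the two error strings are taken from the table.

```elixir
defmodule FailureTextSketch do
  # Well-formed args: do the work (stubbed here as an echo).
  def shell(%{"pipeline" => p}) when is_binary(p), do: "ran: " <> p

  # Anything else becomes a result the model can read and correct.
  def shell(_args), do: "shell error: required arg `pipeline` missing or not a string"

  # The permission check is also just a branch that returns text.
  def git(%{"message" => msg}, true) when is_binary(msg), do: "committed: " <> msg
  def git(_args, false), do: "git not permitted (no exec capability)"
  def git(_args, true), do: "git error: required arg `message` missing or not a string"
end
```

Since every clause returns a binary, the loop can append the value to the transcript without inspecting it first. The last `git` error string is an assumption modelled on the `shell` one.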

And the most interesting failure isn't a failure of the loop — it's the agent admitting it hit a wall. file_issue is a tool: when an agent's equipment falls short ("my toolkit lacks a verb for this"), it files a metacognitive ticket and the reply tells it to note it and carry on rather than stall. The wall becomes a tool call, the run keeps moving, and the autopoet picks the ticket up later. Hitting a limit is, itself, just another line in the transcript.

HONESTY

What this machine is, and what it deliberately isn't.

A step budget is a budget, not judgment. A run that hits max_steps stops mid-thought with a literal "stopped: reached max_steps" — it doesn't summarize, doesn't wrap up gracefully. The number is a leash, not intelligence.
150 seconds kills slow-but-honest work too. A legitimately long-running tool gets the same guillotine as a hung one. The bound buys you a run that can't wedge; it costs you the occasional good tool that needed three minutes.
Truncation is everywhere — 4000 chars in the transcript, ~500 in a live frame, 300 in events.org, 200 in _steps.jsonl. The trace records faithfully what happened, not every byte, and not why.
events.org only exists at finish. While a run is live, the during-the-run view is _steps.jsonl (and the WS stream). The narrative document is written when the loop returns.
Trust is binary, and host-granted. An agent is exec or it isn't; it never acquires the grant itself. The capability flows down from the host, never up from the agent.

And the framing this all serves: this is observability for people building with agents — a trace you can read, a run that can't run away from you. It is never a pitch about software that runs itself. The loop makes the work legible and bounded. It doesn't make the human optional.

questions people actually ASK

At max_steps, does it summarize first?

No. The guard clause fires the moment step ≥ max and the run finishes with the literal string "stopped: reached max_steps (N)". There's no wrap-up turn. If you want a clean ending, give the budget room or prompt the agent to call done when it's near the edge.

Can I watch a run live?

Yes. POST /api/run returns 202 immediately with {id, status:"running"} — the run continues in a supervised process that outlives the caller. Then GET /api/run/:id/stream is a WebSocket of {type:"step"} frames ending in {type:"done"}, or you can poll GET /api/run/:id for %{status, steps, result, tools, events_org}.

Can the agent lie in its own trace?

_steps.jsonl is written by the host, per step, not by the model — the agent doesn't author it. Whether someone can tamper with it after the fact is a different question, and the answer is the ledger: it hash-chains the raw lines and signs the head, so editing history is detectable.

What model runs the loop?

Any OpenRouter chat model. The default is xiaomi/mimo-v2.5; a per-agent definition sets a :MODEL: property, and WB_LLM_MODEL overrides globally. The loop is model-agnostic — it only needs a turn that can emit tool calls.

Can the model call several tools in one step?

Yes — and they share one step number. Tool execution reduces over all the calls in a turn without bumping the counter; the bump happens once per trip around the loop. So a step can carry one tool call or five.

Is exec honored over HTTP?

Only on desktop, or when WB_AGENT_EXEC=1 is set — never for arbitrary multi-tenant callers. A request asking for exec:true against a shared endpoint gets a plain non-exec run.
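As a sketch, that rule is a single boolean expression. `ExecGateSketch.effective_exec?/3` and the `desktop?` flag are hypothetical; only the `WB_AGENT_EXEC=1` variable comes from the answer above.

```elixir
defmodule ExecGateSketch do
  # A request only gets exec if it asked for it AND the host allows it.
  # The env value is a parameter so the rule can be checked without
  # touching the real environment.
  def effective_exec?(requested?, desktop?, env \\ System.get_env("WB_AGENT_EXEC")) do
    requested? and (desktop? or env == "1")
  end
end
```

A caller that sends exec:true to a shared endpoint lands in the `false` branch and simply gets a non-exec run.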

keep GOING

This was one shift under the microscope. The parent frames the whole job; the live siblings show where the trace and the wall-hits go next.

Agents: the worker this loop is one shift of

The Autopoet: where file_issue tickets get answered

Workflows: agent components call this loop with max_steps 6

Org: the grammar events.org is written in