software that works UNWATCHED
You run things that work while you're not looking — keepers overnight, agents on a schedule, a workflow that fires at nine. The next morning the only question that matters is small and unanswerable in most stacks: what did it actually do?
The industry's answer to that question is a zoo. Stdout in one place, app logs in another, traces in a third, metrics in a fourth, a dashboard bolted over all of it, and an APM bill at the end of the month. Each tool has its own format; each integration is its own little project. You don't have an answer — you have five partial answers in five shapes, and the work of reconciling them is yours.
For agent systems it's worse, because the interesting unit isn't a
request — it's a tool call. The thing you want to know is the
sequence of moves: read this, ran that, fetched the other, hit a wall here.
And half of those moves happen inside a sandbox you
can't printf from. The most observability-hungry collaborator you've
ever run is the one you can see the least.
the DEFINITION
1. one event shape (a tool-call record of nine fields), written at one chokepoint and appended to one file per run; every observer is a reader of those lines, never a second logger.
The whole design is in that last clause. There is one grammar of signal — the tool-call event — and many literacies: the summary command reads it, the website wire reads it, the dream digest reads it, the ledger seals it. Nothing exports a second format because nothing needs to. Here are the nine fields, with their real limits:
| field | what it holds |
|---|---|
| step | monotonic counter for this run (the Dock lane uses an atomic counter; WASM spans leave it null) |
| agent | which agent took the step: the agent's name, or null for a singleton. This is how the activity wire groups by worker. |
| tool | the tool name: shell, read, or a prefixed origin (command:<name>, wasm:<name>) |
| args | the call's arguments: path, cmd, query, url |
| output | the result, sliced to 4000 chars in memory, 200 chars in the file line. A record, not a transcript. |
| exit_code | 0 for success, non-zero for failure |
| error | the error string, or null |
| dur_ms | wall time from the monotonic clock |
| ts | wall-clock seconds, system_time(:second) |
That file is _steps.jsonl — one JSON line per tool call,
appended for the life of the run. Everything else on this page is a way of
reading it.
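To make the shape concrete, here is a minimal Python sketch of the nine-field record and its one-line serialization. It is an illustration only, not the engine's code: the class name `StepEvent`, the helper `to_line`, and the constant `FILE_OUTPUT_LIMIT` are hypothetical, and the only facts taken from the table are the field names and the 200-character slice in the file line.

```python
import json
from dataclasses import dataclass, asdict
from typing import Any, Optional

FILE_OUTPUT_LIMIT = 200  # chars of output kept in the file line (from the table)


@dataclass
class StepEvent:
    """Hypothetical mirror of the nine-field tool-call record."""
    step: Optional[int]    # None where the lane leaves it null (WASM spans)
    agent: Optional[str]   # None for a singleton
    tool: str              # "shell", "read", "command:<name>", "wasm:<name>"
    args: dict[str, Any]   # path, cmd, query, url ...
    output: str
    exit_code: int         # 0 for success, non-zero for failure
    error: Optional[str]
    dur_ms: int
    ts: int                # wall-clock seconds


def to_line(event: StepEvent) -> str:
    """Serialize one event as a single JSON line, slicing output for the file."""
    record = asdict(event)
    record["output"] = record["output"][:FILE_OUTPUT_LIMIT]
    return json.dumps(record, ensure_ascii=False) + "\n"
```

Because `json.dumps` escapes newlines inside strings, each event stays on exactly one line, which is what lets every reader treat the file as a plain list of records.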
the CHOKEPOINT
The reason nothing escapes is structural, and it's worth being precise
about. The event isn't built by the caller and isn't optional. It's
assembled and appended at a single function — Agent.log_step —
that sits inside the tool-call loop, fired for every step regardless
of any caller-supplied on_step hook. A caller can subscribe to
the live feed; a caller cannot opt a step out of the record. The phrase in
the source is exact: nothing escapes by construction.
sequenceDiagram
    participant M as the model
    participant L as the tool loop
    participant F as _steps.jsonl
    participant S as on_step subscriber (optional)
    M->>L: call a tool
    Note over L: exec_bounded, 150s wall-clock ceiling
    L->>F: append one event (lock-free)
    L-->>S: same event, if anyone is watching
    Note over F: the append always happens<br/>the subscriber is a bonus
Two details in that picture carry weight. First, the append is lock-free — it's the cheap, common path, so logging never becomes the bottleneck the loop is trying to observe. Second, every tool call is wrapped in a 150-second wall-clock bound. A wedged tool doesn't stall the run forever and vanish from the record — it times out, and the timeout becomes a tool-error event like any other. A hang is data, not a black hole. That single property is why the summary you read in the morning can be trusted to be complete even when last night went badly.
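The ordering described above (bound the call, always append, then notify an optional subscriber) can be sketched in a few lines of Python. This is an assumption-laden illustration, not `Agent.log_step` itself: `run_step` and its parameters are hypothetical names, the append here is an ordinary file append rather than the lock-free path, and a Python thread cannot be killed, so the sketch only stops waiting for a wedged tool.

```python
import json
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

TOOL_TIMEOUT_S = 150.0  # the wall-clock ceiling named in the text


def run_step(step, tool_name, tool_fn, args, log_path,
             on_step=None, timeout_s=TOOL_TIMEOUT_S):
    """Run one tool call under a time bound and ALWAYS append one event."""
    started = time.monotonic()
    output, exit_code, error = "", 0, None
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(tool_fn, **args)
    try:
        output = str(future.result(timeout=timeout_s))
    except FutureTimeout:
        # A hang becomes an ordinary failed step instead of a missing one.
        exit_code, error = 1, "tool timeout"
    except Exception as exc:
        exit_code, error = 1, str(exc)
    finally:
        pool.shutdown(wait=False)  # stop waiting; the worker may still be running

    event = {
        "step": step, "agent": None, "tool": tool_name, "args": args,
        "output": output, "exit_code": exit_code, "error": error,
        "dur_ms": int((time.monotonic() - started) * 1000),
        "ts": int(time.time()),
    }
    # The chokepoint: the append happens whether or not anyone subscribed.
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")
    if on_step is not None:
        on_step(event)  # the subscriber is a bonus, never a gate
    return event
```

The design point is the position of the append: it sits after every outcome branch and before the subscriber, so a caller can listen but cannot remove a step from the file.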
three writers, one GRAMMAR
Depth rung — skippable, but it's the part that makes the rest free. An agent doesn't only call native tools. It calls toolkit commands across the Dock membrane, and it runs work inside the WASM sandbox. Those are different worlds with different boundaries — and all three write the same file in the same shape:
flowchart TD
    n["native agent tools<br/>tool: shell · read · fetch"]
    d["Dock command calls<br/>tool: command:<name>"]
    w["WASM spans<br/>tool: wasm:<name>"]
    f[["_steps.jsonl: one shape, one file"]]
    n --> f
    d --> f
    w --> f
    f --> sum["summary / index"]
    f --> wire["the /_activity wire"]
    f --> dream["the dream digest"]
    f --> seal["the signed ledger"]
    style f fill:#aee5c2,stroke:#121316,stroke-width:2.5px
    style n fill:#ffffff,stroke:#121316
    style d fill:#9fc4e8,stroke:#121316
    style w fill:#f3c5a3,stroke:#121316
The convergence is the whole trick. A Dock command call appends
tool: "command:<name>" with its own exit code and timing;
its step counter is an atomic so concurrent commands can't collide. The
workdir it writes to is held in host context — the component
inside the sandbox never sees the path, so a guest can't aim a write at the
log. A WASM span — the host side of an instrument-enter /
instrument-exit import pair — writes
tool: "wasm:<name>" on exit, with the span's duration
computed from a public span-stack because enter and exit are two separate
crossings back into the host.
Because all three land the same nine fields in the same file, the reader that rolls up a run needs zero new query code to count a sandboxed command the same as a native one. One grammar in; one read out. That's not a convenience — it's the reason the read stack below is small.
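A small Python sketch shows why one grammar keeps the reader small: the same loop counts native, Dock, and WASM steps, and the only lane-specific logic is a prefix check on `tool`. The function names and the lane labels `native`, `dock`, and `wasm` are hypothetical; the `command:` and `wasm:` prefixes are the ones described above.

```python
import json
from collections import Counter


def origin(tool: str) -> str:
    """Classify a step by its tool prefix (labels are illustrative)."""
    if tool.startswith("command:"):
        return "dock"
    if tool.startswith("wasm:"):
        return "wasm"
    return "native"


def count_by_origin(lines):
    """Count steps per lane from raw JSONL lines, skipping blank lines."""
    counts = Counter()
    for line in lines:
        if line.strip():
            counts[origin(json.loads(line)["tool"])] += 1
    return counts
```

Adding a fourth writer under this scheme would mean one more prefix, not a second parser.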
One honest note on this lane: the WASM-span path is a complete host
sink, but it isn't wired end-to-end yet. No telemetry
capability exists in the policy profiles for a guest to call it through, and
no test exercises it. The feasibility spike confirmed nested spans roll up
correctly — a two-call run summing to forty milliseconds — but the guest-side
transform tooling is an external blocker. We'd rather mark that clearly than
imply the sandbox is already narrating itself.
the READ stack
Once the lines exist, observing is just reading them at different altitudes. The same grammar answers different questions:
| reader | question it answers | surface | liveness |
|---|---|---|---|
| summary/1 | what happened in this one run? | wb telemetry <slug> · /api/telemetry/:slug | live: works mid-flight, no db needed |
| index/2 | what ran lately, across sessions? | wb telemetry · /api/telemetry | live: a pure scan, no extra writes |
| persist/3 | the durable per-run query db | _telemetry.db (SQLite) | at run end: single writer, no contention |
| /_activity | what is it doing right now? (public) | the public plane, anonymous read-only | live: last 8 lines, slimmed |
| AgentStream | watch this run unfold, step by step | /api/run/:id/stream WebSocket | live: per-step frames |
Two of the live readers earn the most. summary/1 is
universal and live: it rolls a run up into stage, task count, tool
calls, total milliseconds, errors, and the last fifteen steps — by reading
the file directly, so any run is observable even mid-flight and even with no
persisted database. index/2 is the cross-session view, newest
first: it's a pure scan of the runs directory, which means it costs
nothing and can't drift from the per-run truth, because it has no truth of
its own to drift from.
Here's the loop you'll actually live in. The index, then a single run, then the proof:
$ wb telemetry
SLUG          STAGE  CALLS  ERRORS  MS
wulu-refresh  done   42     0       183204
brand-run     error  17     3       96110
$ wb telemetry brand-run
stage=error calls=17 errors=3 total_ms=96110
! step 9 shell: tool timeout
! step 11 fetch: exit 1
…
$ wb ledger wulu-refresh
tamper-evident=ok attributable=ok count=42 did=did:key:z6Mk…
That second block is the morning answer to what did my agent actually do — the failing run named, the two bad steps quoted with their exit shape, the timeout from the 150s bound showing up exactly where the chokepoint promised it would. No dashboard, no integration. One command reading one file.
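For a feel of how little a roll-up needs, here is a hedged Python sketch of a summary-style reader. It is not the real summary/1: the function name `summarize` is hypothetical, the stage and task count are left out because they would come from elsewhere (the stage lives in `_status.json`), and treating a step as an error when it has a non-zero `exit_code` or a non-null `error` is an assumption.

```python
import json


def summarize(lines, last_n=15):
    """Roll raw JSONL step lines up into counts, total time, and a tail."""
    events = [json.loads(line) for line in lines if line.strip()]
    failed = [e for e in events
              if e.get("exit_code", 0) != 0 or e.get("error")]
    return {
        "calls": len(events),
        "errors": len(failed),
        "total_ms": sum(e.get("dur_ms") or 0 for e in events),
        "last_steps": events[-last_n:],  # the last fifteen by default
    }
```

Since the function only reads lines, it can run against a file that is still being appended to, which is the property that makes a mid-flight summary possible.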
Two softer readers ride on the same feed. The
/_activity wire is the anonymous, read-only public view —
it tails the last few lines of the tenant's _steps.jsonl and
slims each to a tool, a target, a timestamp, and an agent, so a stranger can
watch a public workbook work without seeing its outputs. And
Thoughts writes the eight-word, debounced narration of the live
feed you see on a board — generated lazily, only when someone is actually
watching, and never otherwise.
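The slimming step of the public wire can be sketched as a projection that simply never copies `output`. This is a guess at the shape, not the wire's code: `slim` and `activity` are hypothetical names, and deriving the target from the first of `path`, `cmd`, `query`, or `url` found in `args` is an assumption.

```python
import json

TARGET_KEYS = ("path", "cmd", "query", "url")  # assumed lookup order


def slim(event):
    """Keep only tool, target, timestamp, and agent; drop outputs entirely."""
    args = event.get("args") or {}
    target = next((str(args[k]) for k in TARGET_KEYS if k in args), None)
    return {
        "tool": event["tool"],
        "target": target,
        "ts": event["ts"],
        "agent": event.get("agent"),
    }


def activity(lines, tail=8):
    """Tail the last few raw lines and slim each one for public display."""
    recent = [line for line in lines if line.strip()][-tail:]
    return [slim(json.loads(line)) for line in recent]
```

Building the public record from an allow-list of fields, rather than deleting fields from the private one, means a field added to the event later stays private unless someone chooses to expose it.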
record → MEMORY
Here the record stops being logging and becomes something logs never are: memory. After a run, a sleep phase digests the recent telemetry, the git log, and the backlog into a single journal entry. The agent reads its newest entry when it next wakes — so the trace of last night isn't a graveyard of lines, it's the thing the next run orients against.
sequenceDiagram
    participant R as run ends
    participant G as gather(steps + git log + plan)
    participant M as a small model
    participant J as rem/*.org
    participant N as the next waking run
    R->>G: last 25 steps, reformatted
    G->>M: digest it
    M->>J: one org entry, six fixed headings
    N->>J: read newest at orient time
    Note over N: resume from carry,<br/>don't re-read the world
The transformation is concrete. The full dream takes the last
25 steps and reformats each line to a terse move —
shell wb toolkit verify rss (exit 0) — then feeds that, the git
log, and the plan to a small model (inception/mercury-2 by
default). Out comes an entry like rem/2026-06-12-0415.org under
six fixed headings: * tale, * goals,
* blue sky, * fears, * verdicts,
* carry. The * verdicts lines —
pick up:, put down:, cancel: — are
applied mechanically to the plan board, and * carry is
the resume-state the next run reads instead of re-reading
everything from scratch.
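The reformatting step is simple enough to sketch. The Python below turns an event into a terse move of the form quoted above, `shell wb toolkit verify rss (exit 0)`. The function names are hypothetical, and choosing the target from `cmd`, `path`, `query`, or `url` in that order is an assumption; only the output format and the 25-step window come from the text.

```python
def terse_move(event):
    """Format one step as 'tool target (exit N)'."""
    args = event.get("args") or {}
    target = next(
        (str(args[k]) for k in ("cmd", "path", "query", "url") if k in args),
        "",
    )
    parts = [event["tool"]] + ([target] if target else [])
    return " ".join(parts) + f" (exit {event['exit_code']})"


def dream_input(events, last_n=25):
    """Reformat the most recent steps into the terse lines fed to the model."""
    return [terse_move(e) for e in events[-last_n:]]
```

A usage example: an event with `tool` set to `shell`, `args` of `{"cmd": "wb toolkit verify rss"}`, and `exit_code` 0 formats to the same line quoted in the paragraph above.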
The cadence is deliberate. A full dream only fires after an
audit: commit and at least fifty minutes since the last one;
it commits as rem: <first line>. A lighter
daydream — forty words, never committed — fires every twelve
minutes or so from just the last six tool names. The agent reads its newest
dream at orient time; it never writes one. Sleeping and waking are separate
jobs, and telemetry is the bridge between them. The full story lives in the
dreaming lesson.
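The cadence rules reduce to two small predicates, sketched here in Python. The names are hypothetical, and one reading is assumed: "since the last one" is taken to mean since the last full dream. How the audit commit is detected is left outside the sketch.

```python
FULL_DREAM_MIN_GAP_S = 50 * 60  # at least fifty minutes between full dreams


def should_full_dream(now, last_dream_ts, audit_commit_seen):
    """A full dream needs an audit commit AND enough time since the last dream."""
    if not audit_commit_seen:
        return False
    if last_dream_ts is None:  # no dream yet, so the gap condition is met
        return True
    return now - last_dream_ts >= FULL_DREAM_MIN_GAP_S


def daydream_input(events, last_n=6):
    """The lighter daydream looks only at the last six tool names."""
    return [e["tool"] for e in events[-last_n:]]
```

Keeping the gate as a pure function of timestamps and one flag makes the schedule easy to test without running an agent.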
record → PROOF
Depth rung. The same file that feeds memory can be sealed into proof. The ledger doesn't write a second log — it computes a seal over the one the telemetry already wrote.
flowchart LR
    s[["_steps.jsonl (raw bytes)"]]
    s -- "h_i = sha256(h_i-1 ‖ line_i)" --> chain["hash chain<br/>genesis: workbooks-ledger-v1"]
    chain -- "sign head with did:key" --> seal["_ledger.json<br/>{v, did, count, head, sig, ts}"]
    seal -- "anchor: commit into the repo" --> git["the tenant repo"]
    style s fill:#aee5c2,stroke:#121316,stroke-width:2.5px
    style seal fill:#13d943,stroke:#121316,stroke-width:2.5px
The chain hashes each raw line into the next, genesis string
workbooks-ledger-v1; the head is signed with the tenant's
Ed25519 did:key. Verification returns two facts,
tamper-evident (no line was changed) and attributable (this
agent, this key, signed it), plus the count and head. It's the same
wb ledger <slug> line from the read stack above; the seal itself is
applied automatically at the end of a workflow run. The seal lives over the log, not
beside it; the ledger lesson owns the full
story.
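The chain itself fits in a few lines. The Python sketch below follows the recurrence h_i = sha256(h_i-1 ‖ line_i) with the genesis string `workbooks-ledger-v1`, under two labelled assumptions: the chain is seeded with the SHA-256 of the genesis string, and each line is hashed as raw bytes including its trailing newline. The real ledger's byte-level encoding may differ, and the Ed25519 signature over the head is omitted because it needs a key and a crypto library.

```python
import hashlib
import hmac

GENESIS = b"workbooks-ledger-v1"


def chain_head(raw_lines):
    """Fold raw log lines into one head hash: h_i = sha256(h_{i-1} || line_i)."""
    head = hashlib.sha256(GENESIS).digest()  # assumption: seed from the genesis string
    for line in raw_lines:
        head = hashlib.sha256(head + line).digest()
    return head.hex()


def is_untampered(raw_lines, sealed_head):
    """Recompute the head and compare it with the sealed one in constant time."""
    return hmac.compare_digest(chain_head(raw_lines), sealed_head)
```

Because every hash depends on the one before it, editing, dropping, or reordering any line changes the head, which is the tamper-evident half of verification; the attributable half comes from the signature this sketch leaves out.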
record → SELF-EXTENSION
The last transformation closes a loop. When an agent hits a capability
wall — its toolkit can't do the thing — it doesn't stall and it doesn't
fake success. It files an issue with one field that matters:
tried, the evidence of the wall, which is precisely a
telemetry-shaped trace of what failed. The recorded failure becomes a
request for a new capability.
flowchart LR
    wall["an agent hits a wall<br/>tried: the failing trace"] --> fi["file_issue"]
    fi --> bl["the autopoet backlog<br/>org files · SEEN dedup"]
    bl --> run["the autopoet works it<br/>agent: autopoet"]
    run --> verify{"wb toolkit verify"}
    verify -- ok --> done["DONE"]
    verify -- unverified --> open["downgrade to OPEN"]
    style wall fill:#f3c5a3,stroke:#121316
    style done fill:#13d943,stroke:#121316,stroke-width:2.5px
    style open fill:#ffffff,stroke:#121316
The reply to the agent tells it to carry on — its job isn't to fix its
own tools mid-run. The issue lands as an org file with a kind and a status;
a duplicate from the same tenant bumps a SEEN count instead of
re-filing, so the backlog is liberal to write but triaged by frequency. The
autopoet picks the most-seen issue first and works
it — and here's the part that matters for this page: the autopoet's own run
goes through the same agent path, so its steps land in the same
_steps.jsonl grammar, stamped agent: "autopoet".
The system that extends the system is observable by the same telemetry it
observes.
One guard is non-negotiable. A self-reported DONE is
independently re-verified with wb toolkit verify; an
unverified DONE is downgraded back to OPEN. The agent's word is a claim, not
a proof — and the same honesty that runs through the record runs through the
fix. The full account is the autopoet lesson's; this
page only owes you the seam.
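The backlog rules (dedup by bumping SEEN, work the most-seen issue first, never trust a self-reported DONE) can be sketched as a tiny in-memory model. Everything named here is hypothetical: the `Issue` and `Backlog` classes, keying duplicates on tenant plus title, and the `verify` callback standing in for `wb toolkit verify`. The real backlog is a set of org files, not a dictionary.

```python
from dataclasses import dataclass


@dataclass
class Issue:
    tenant: str
    title: str
    tried: str           # the failing trace that proves the wall
    status: str = "OPEN"
    seen: int = 1


class Backlog:
    def __init__(self):
        self.issues = {}

    def file_issue(self, tenant, title, tried):
        """A duplicate from the same tenant bumps SEEN instead of re-filing."""
        key = (tenant, title)  # assumption: what counts as a duplicate
        if key in self.issues:
            self.issues[key].seen += 1
        else:
            self.issues[key] = Issue(tenant, title, tried)
        return self.issues[key]

    def next_issue(self):
        """Pick the most-seen OPEN issue, or None when nothing is open."""
        open_issues = [i for i in self.issues.values() if i.status == "OPEN"]
        return max(open_issues, key=lambda i: i.seen, default=None)

    def close(self, issue, verify):
        """A claimed DONE only sticks if independent verification passes."""
        issue.status = "DONE" if verify(issue) else "OPEN"
        return issue.status
```

The `close` method is the guard in miniature: the worker's claim never sets the status directly, only the verifier's result does.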
the SECOND lane
Depth rung. There's a second stream of signal that is deliberately not the step grammar, because it answers a different question: ops metering. Every broker decision — every allow or deny of a capability request — increments an atomic counter keyed by broker and outcome, and denials land in a small forensics ring (the last 128), with the guest-controlled target truncated to 512 bytes so a hostile component can't exhaust memory through the audit itself.
The important move is at the boundary: alongside its own counters, the
broker audit emits a standard Erlang :telemetry event —
[:workbooks, :broker, outcome] with the broker, reason, and
target. That's the well-known observability contract the whole BEAM
ecosystem speaks, so Prometheus, a SIEM, or an APM like AppSignal can attach
to the engine's security signal without coupling to any internal ETS
layout. The step grammar is for you, reading your own runs;
this lane is where the engine's signal meets the outside monitoring world on
the world's terms.
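The bounded-audit idea (counters keyed by broker and outcome, a ring of the last 128 denials, targets truncated to 512 bytes) is easy to model. The Python below is a sketch of that data structure only; the class name, the outcome strings `allow` and `deny`, and the record layout are assumptions, and the emission of the `:telemetry` event is not shown.

```python
from collections import Counter, deque

RING_SIZE = 128      # denials kept for forensics
TARGET_LIMIT = 512   # bytes of guest-controlled target kept per denial


class BrokerAudit:
    def __init__(self):
        self.counts = Counter()                 # keyed by (broker, outcome)
        self.denials = deque(maxlen=RING_SIZE)  # oldest entries fall off

    def record(self, broker, outcome, reason, target: bytes):
        """Count every decision; keep a truncated record of each denial."""
        self.counts[(broker, outcome)] += 1
        if outcome == "deny":
            self.denials.append({
                "broker": broker,
                "reason": reason,
                "target": target[:TARGET_LIMIT],
            })
```

Both bounds do the same job: memory use stays fixed no matter how many requests a hostile guest makes or how large a target it supplies.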
private by DEFAULT
All of this telemetry is intensely personal — it's the minute-by-minute record of how your agent thinks. So the rule is one sentence: sharing exposes work, never the session that produced it. One module owns that boundary, and every egress path — git, bundle, library — consults it before anything leaves the machine.
| sidecar file | what it is | written at | ships when shared? |
|---|---|---|---|
| _steps.jsonl | the always-on step log | every tool call | never |
| _status.json | stage: running / done / error | at stage transitions | never |
| _trace.jsonl | a slim per-step trace (out ≤140) | per step, web runs | never |
| _telemetry.db | run-end SQLite query db | at run end | never |
| _ledger.json | the signed seal | at workflow end | never |
The boundary isn't a list someone has to maintain — it's a pattern. The
_* prefix paired with a .jsonl / .json /
.db suffix catches every sidecar here and any future
one the same way, so a new telemetry file is private the day it's invented.
The same module auto-writes a .gitignore, which makes
git add -A safe by default — you can't accidentally commit your
own session. When you share a workbook, the work goes
and the diary stays.
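The naming pattern can be expressed as a single check. The Python sketch below matches a leading underscore paired with a `.jsonl`, `.json`, or `.db` suffix on the file name; the regular expression and the function names are assumptions about how such a rule might be written, not the boundary module's actual code.

```python
import re
from pathlib import PurePosixPath

# Assumed form of the rule: "_" prefix plus one of three suffixes.
SIDECAR = re.compile(r"^_.*\.(jsonl|json|db)$")


def is_private_sidecar(path: str) -> bool:
    """True when the file name (not the directory) matches the sidecar pattern."""
    return SIDECAR.match(PurePosixPath(path).name) is not None


def shareable(paths):
    """Filter a file list down to what an egress path would be allowed to ship."""
    return [p for p in paths if not is_private_sidecar(p)]
```

A pattern rather than a list is what makes the default hold for files that do not exist yet: a hypothetical `_future.jsonl` is excluded without anyone updating the rule.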
what it ISN'T
Honesty section, in full. The WASM-span lane is a complete host sink
with no guest wiring yet — there is no telemetry
capability for a sandboxed component to call it through, and no test
exercises it. The spike confirmed it works; the guest-side transform tooling
is the blocker. Treat sandbox self-narration as confirmed-feasible, not
shipped.
The file is a record, not a transcript. Outputs are sliced to 200 characters in the jsonl line — enough to know what a step did and whether it worked, not enough to replay it verbatim. If you need the full output, you needed it at the moment it ran.
Workflow runs index under an ephemeral path (/tmp/bb).
The live summary and index are real and free, but they're reading a working
directory, not a warehouse — the durable copy is the run-end
_telemetry.db and the sealed ledger, not the scan.
The step grammar is not an OTel exporter. Only the broker lane
emits standard :telemetry events; the per-step record is its
own shape, designed to be read by one file's worth of code, not piped into a
vendor. And the log is editable until sealed — its tamper-evidence
comes from the ledger's hash chain, applied at run
end, not from the append itself.
Last and most important: none of this ever leaves your machine. This is your telemetry — the record you read to understand your own software — not product analytics, not a phone-home, not a metric we collect. The privacy section above isn't a setting. It's the default the whole egress path enforces.
questions people actually ASK
Can I watch a run live?
Yes, two ways. wb telemetry <slug> rolls up a run
even mid-flight, because the summary reads the file directly and needs no
finished database. And /api/run/:id/stream is a WebSocket that
pushes a frame per step as it happens — read this, ran that — then a done
frame. On a public workbook, /_activity shows the slimmed,
anonymous version of the same feed.
Does my telemetry leave my machine?
No. The step log and its sidecars are private by construction — one
module gates every egress path, and the _* naming pattern
keeps them out of git, bundles, and the library automatically. Sharing a
workbook ships the work, never the session that produced it. There is no
collection, no phone-home, no analytics endpoint.
How do I hook up Prometheus or a SIEM?
Through the second lane. The broker audit emits standard Erlang
:telemetry events — [:workbooks, :broker, outcome]
with broker, reason, and target — which is the contract the BEAM
observability ecosystem already speaks. Attach there and you get the
engine's security signal without coupling to any internal layout. The
per-step record is a different shape, meant for reading your own runs, not
for scraping.
Can the agent fake its own log?
The append is written at one chokepoint inside the loop, fired for every step regardless of the caller — so a step can't quietly skip itself. The log is editable after the fact, though, which is exactly why the ledger exists: a hash chain over the raw lines, signed with the tenant's key, makes any later edit detectable and the run attributable. Trust the seal, not the raw file.
Where do the files go when I share a workbook?
Nowhere — they stay. _steps.jsonl, _status.json,
_trace.jsonl, _telemetry.db, and
_ledger.json all match the private-by-default pattern, so the
egress path leaves them behind. The recipient gets your work and your
files; they don't get your run's diary.
Why one file instead of proper logs, traces, and metrics?
Because for agent work the interesting unit is the tool call, and one event shape captures it whole. Splitting it across three subsystems buys you three formats to reconcile and three integrations to maintain. One append-only file means the summary, the website wire, the dream, and the ledger are all just readers — no new query code per reader, no drift between copies, because there's only one copy.
keep GOING
Telemetry is the nexus watching itself — and it feeds three of the most interesting ideas downstream. Start with the parent, then follow the transformations.