DONE is a CLAIM
The parent lesson sold you the good part — the plan is the state, agents
work the ready list, the file is the meeting. Then comes the dread that
follows every agent everywhere: an agent just wrote DONE.
Says who?
By default, says the agent. The same collaborator who hallucinates a filename, declares victory on a half-built artifact, and writes confident prose over an empty directory — that collaborator is the one editing your state. A workflow is only as honest as its least honest writer, and the least honest writer is usually the one with the most to do.
Three different lies hide under one keyword. The plan can be
incoherent — a task consumes an input nothing produces. A task can be
unfinished — the work isn't done, but the headline says it is. The
artifact can be malformed — it compiled, but it doesn't fit the slot
it has to dock into. One word, DONE, covers all three. The
question this page answers is how each one stops being a claim and becomes a
checkable fact.
the DEFINITION
validation (n.) 1. a check that converts a claim in the
plan into a named diagnostic, {level, scope, message},
and is fail-closed: anything it cannot positively confirm reads as a
failure, never as probably fine.
Validation isn't one checker. It's a ladder of three, and the rungs are keyed to time — each fires at a different moment in a plan's life, and each answers a different lie:
| rung | when it fires | the question it answers | where it lives |
|---|---|---|---|
| 1 · plan | before anything runs | does the plan even cohere? | validate in oql.wasm |
| 2 · acceptance | after each task | is DONE actually earned? | Workflow.Todo.validate |
| 3 · conformance | before anything docks | does the artifact fit the world? | Conformance.engine? |
The shared trick is the diagnostic. A claim is unfalsifiable — you either believe the writer or you don't. A diagnostic has a scope (the named thing that's wrong) and a message (what's wrong with it). That single shape is the whole difference between drift-as-shrug and drift-as-bug-report, and it's the promise the parent lesson made that this page mechanizes.
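To make the shape concrete, here is a minimal Python sketch of a diagnostic and a fail-closed wrapper. The names `Diagnostic` and `fail_closed` are hypothetical and chosen for illustration; the real checkers live in the kernel and runtime, not in Python.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Diagnostic:
    level: str    # e.g. "error"
    scope: str    # the named thing that is wrong
    message: str  # what is wrong with it


def fail_closed(scope: str, check: Callable[[], object]) -> list[Diagnostic]:
    """Run `check`; only a literal True counts as confirmation."""
    try:
        result = check()
    except Exception as exc:
        # An exception is a failure, never a pass.
        return [Diagnostic("error", scope, f"check raised: {exc!r}")]
    if result is True:
        return []
    # Anything short of True (None, "maybe", 1) is not confirmed.
    return [Diagnostic("error", scope, f"not confirmed (got {result!r})")]


confirmed = fail_closed("Summarize", lambda: True)
unsure = fail_closed("Summarize", lambda: None)
broken = fail_closed("Summarize", lambda: 1 / 0)
```

Only `confirmed` comes back empty. The other two each produce one named diagnostic, which is the whole point: uncertainty has a scope and a message instead of a shrug.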
rung 1 — validating the PLAN
The first rung runs before a single task executes. It lives in the
OQL kernel — the same pure string→string function that
ships both as a native library inside the wbx CLI and as the
workbooks:oql WASM component. There is no “wasm version”
of the logic and a separate native one; it is one function, run everywhere.
validate only looks at subtrees tagged :workflow:.
Inside one, it collects every :component: descendant, builds the
set of every value any component produces (its :out), and
then makes exactly two checks per component:
flowchart LR
org["org text"] --> parse["parse headlines"]
parse --> wf{":workflow:?"}
wf -- "no" --> skip["ignored"]
wf -- "yes" --> comps["collect :component: descendants"]
comps --> prod["build produced-set<br/>(every :out)"]
prod --> c1{"has source<br/>block / lang?"}
prod --> c2{"every :in has<br/>a producer?"}
c1 -- "no" --> d1["error: component has<br/>no source block / language"]
c2 -- "no" --> d2["error: input X has<br/>no upstream producer"]
d1 --> out[["diagnostics JSON"]]
d2 --> out
c1 -- "yes" --> out
c2 -- "yes" --> out
style org fill:#f2ddb0,stroke:#121316
style out fill:#13d943,stroke:#121316,stroke-width:2.5px
style d1 fill:#f3c5a3,stroke:#121316
style d2 fill:#f3c5a3,stroke:#121316
Walk the graph: org text comes in, headlines are parsed, and anything that
isn't a :workflow: falls out the side — ignored. What remains is
the components. The kernel gathers everything they produce into one
produced-set, then asks each component two questions. Does it have a source
block, so it can actually run? And does every input it names appear in the
produced-set, so something upstream actually makes it? A no to either
becomes a diagnostic. That's the whole of rung one — two checks, no more.
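The two checks are small enough to sketch. The Python below is an illustration under assumptions, not the kernel's code: it takes components as already-parsed dicts (the `name`, `lang`, `ins`, and `outs` keys are hypothetical) and skips org parsing entirely.

```python
def validate(components):
    """Two checks per component: has a source language, and every input has a producer."""
    # The produced-set: every value any component emits via :out.
    produced = {value for c in components for value in c["outs"]}
    diagnostics = []
    for c in components:
        if not c["lang"]:
            diagnostics.append({
                "level": "error",
                "scope": c["name"],
                "message": "component has no source block / language",
            })
        for value in c["ins"]:
            if value not in produced:
                diagnostics.append({
                    "level": "error",
                    "scope": c["name"],
                    "message": f"input {value} has no upstream producer",
                })
    return diagnostics


plan = [
    {"name": "Summarize", "lang": "js", "ins": ["events:list"], "outs": ["summary:string"]},
    {"name": "Orphan task", "lang": None, "ins": [], "outs": []},
]
findings = validate(plan)

# Adding a producer for events:list and dropping the empty component clears both findings.
fixed = [
    {"name": "Collect", "lang": "js", "ins": [], "outs": ["events:list"]},
    plan[0],
]
clean = validate(fixed)
```

Note that the produced-set is built once, before any component is judged, so the order in which components appear in the file does not matter.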
This is deliberately a small, fast net, not a theorem prover. It catches the two structural lies that make a plan un-buildable, and it catches them with the same kernel function whether you run it from the CLI, a WebSocket, or an HTTP call.
the shape of a FINDING
Here is the promise made literal. A broken plan — verified against the real
wbx binary — names a producer-less input and a component with no
source block:
* Broken workflow :workflow:
** Summarize :component:
#+begin_src js :in events:list :out summary:string
export default (e) => e;
#+end_src
** Orphan task :component:
$ wbx lint broken.org
[{"level":"error","message":"input `events:list` has no upstream producer","scope":"Summarize"},
{"level":"error","message":"component has no source block / language","scope":"Orphan task"}]
Two errors, each with a scope — and the scope is the name.
The parent lesson promised drift would get “a name and a line number
instead of a shrug”; the code is precise about the first half and honest
about the second. Drift is named by scope — the component's name,
Summarize and Orphan task — not addressed by line.
A name is enough to act on; we'd rather tell you it's a name than imply a line
cursor that isn't there.
The same diagnostic shape surfaces through four doors:
| surface | the verb |
|---|---|
| wbx lint <file.org> | runs validate, prints the JSON |
| wbx bundle | refuses to assemble on any diagnostic |
| runtime WebSocket "validate" | same wasm, over a socket |
| POST /w/:id/call fn=validate | same wasm, over HTTP |
That second row is a gate, not a courtesy. Bundling
is where the plan becomes a shippable workbook, and it will not assemble a
source that doesn't lint clean. The error reads “source has diagnostics
— fix them first (wb lint)”, followed by the very JSON above.
wbx init even lints its own scaffold on the way out, so a template
bug fails the template, not you.
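Because the diagnostics are plain JSON, a caller can act on them without parsing prose. The following Python sketch assumes only the output shown above; `group_by_scope` and `may_bundle` are hypothetical helpers, not part of wbx.

```python
import json

# The lint output from the broken plan above, verbatim.
LINT_OUTPUT = '''[{"level":"error","message":"input `events:list` has no upstream producer","scope":"Summarize"},
 {"level":"error","message":"component has no source block / language","scope":"Orphan task"}]'''


def group_by_scope(lint_json):
    """Map each named scope to the list of messages raised against it."""
    grouped = {}
    for diag in json.loads(lint_json):
        grouped.setdefault(diag["scope"], []).append(diag["message"])
    return grouped


def may_bundle(lint_json):
    """Gate in the spirit of the bundle step: any diagnostic at all blocks it."""
    return len(json.loads(lint_json)) == 0


grouped = group_by_scope(LINT_OUTPUT)
for scope, messages in grouped.items():
    print(f"{scope}: {'; '.join(messages)}")
```

The gate here treats every diagnostic as blocking, matching the table's "refuses to assemble on any diagnostic". Whether the real bundler distinguishes levels is not something this sketch claims to know.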
rung 2 — DONE has a PRICE
Rung one proves the plan could run. Rung two proves a task
did. The module is Workbooks.Workflow.Todo, and its own
docs name the idea well: a task only reaches DONE when its
validation passes — unit tests for org mode. A task can carry its own
acceptance criterion, and the runner honors a strict precedence:
First, a :done-when: property — a shell command that must
exit clean. Second, failing that, the first #+begin_src <lang>
... :check block in the body — its command is run. Third, failing both,
the answer is simply true: no check means the agent's own
completion stands. Pass → the headline's state becomes
DONE; fail → FAILED.
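The precedence reads naturally as a short function. This Python sketch assumes a task has already been parsed into a dict; the keys `done_when`, `src_blocks`, `check`, and `body` are hypothetical, and `run_check` stands in for whatever actually executes the command.

```python
def pick_check(task):
    """Return (kind, command) following the stated precedence."""
    done_when = task.get("done_when")
    if done_when:
        return ("done-when", done_when)       # 1. the :done-when: property wins
    for block in task.get("src_blocks", []):
        if block.get("check"):
            return ("check-block", block["body"])  # 2. first :check block
    return ("none", None)                     # 3. no gate at all


def final_state(task, run_check):
    kind, command = pick_check(task)
    if kind == "none":
        return "DONE"  # no check: the agent's own completion stands
    return "DONE" if run_check(command) is True else "FAILED"


# The second command below is made up for illustration.
task = {
    "done_when": "test -s scratch/e2b_pricing.html",
    "src_blocks": [{"check": True, "body": "test -d scratch"}],
}
kind, command = pick_check(task)
```

With this task, `kind` is `"done-when"`: the block is never consulted once the property exists. A task with neither reaches `DONE` no matter what `run_check` would have said.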
Here's a real gate, from a workflow a model actually authored — a research plan that downloads a pricing page and refuses to call the step done until the file is non-empty:
* TODO Gather E2B Pricing Information
:PROPERTIES:
:ORDERED: t
:END:
- Use `curl` to download the official E2B pricing page ... save it to `scratch/e2b_pricing.html`.
:done-when: test -s scratch/e2b_pricing.html
sequenceDiagram
participant A as the agent
participant R as the runner
participant S as the WASM shell
participant F as the plan file
A->>R: I'm finished
R->>S: run check + sentinel<br/>(test -s … && echo OK)
S-->>R: stdout
R->>R: sentinel present?
alt sentinel found
R->>F: state = DONE
else missing / error / unknown command
R->>F: state = FAILED
end
Note over F: the verdict is written<br/>INTO the headline's state
Read the exchange as a story. The agent says it's finished. The runner
doesn't take its word: it appends a sentinel to the check, runs it in a
sandboxed shell, and watches the output. If the sentinel comes back, the
task's record becomes DONE. If the file was missing, the command
errored, or the command wasn't one the shell even knows — the record becomes
FAILED. The verdict lands in the headline, so the file itself
carries the truth. And if the author wrote no check at all, the runner trusts
the agent — the gate is opt-in, per task.
the shell the check runs IN
Depth rung — skippable, but it's where acceptance earns the word fail-closed. A check command is author-supplied. That makes it a potential exec vector, so it must never touch a native shell. Every check runs inside the in-WASM shell, over the CommandRegistry — never native sh (this is the no-bash-outside-WASM rule the whole runtime is built around).
The mechanism is a sentinel trick. The runner doesn't ask the shell
“did this succeed?” — it appends && echo
__WB_CHECK_OK__ and then checks whether that exact string appears in
the output. Shell semantics do the rest: the sentinel prints only if the
command before it exited zero. The result reads as a pass only on a literal match,
and nothing else reads as a pass:
flowchart TD
cmd["author check, e.g.<br/>test -s scratch/REPORT.org"] --> add["append && echo __WB_CHECK_OK__"]
add --> run["run in WASM shell<br/>workdir preopened, nothing else"]
run --> q{"output contains<br/>__WB_CHECK_OK__ ?"}
q -- "yes" --> pass["PASS → DONE"]
q -- "no" --> fail["FAILED"]
run --> err{"error tuple / raised<br/>exception / unknown cmd"}
err -- "any" --> fail
style cmd fill:#f2ddb0,stroke:#121316
style pass fill:#13d943,stroke:#121316,stroke-width:2.5px
style fail fill:#f3c5a3,stroke:#121316,stroke-width:2px
Trace every path: the author's command gets the sentinel bolted on, runs in
a shell with only the working directory preopened, and the output either
contains the sentinel or it doesn't. Contains it → pass. Doesn't →
fail. And the side branch matters as much as the main one — an error tuple, a
raised exception, or a command the registry doesn't carry all collapse to the
same FAILED. There is no path where uncertainty becomes a pass.
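The sentinel trick can be modeled in a few lines. The Python below is a toy under loud assumptions: `REGISTRY` is a stand-in for the WASM shell's command set with just two pretend commands, the "filesystem" is a dict, and only `&&` is supported. It shows the control flow, not the real shell.

```python
import shlex

SENTINEL = "__WB_CHECK_OK__"


def _test(args, files):
    # Toy `test -s PATH`: exit 0 when the file exists and is non-empty.
    if len(args) == 2 and args[0] == "-s":
        return (0 if files.get(args[1]) else 1), ""
    return 2, ""


def _echo(args, files):
    return 0, " ".join(args) + "\n"


# Hypothetical registry: anything not listed here is an unknown command.
REGISTRY = {"test": _test, "echo": _echo}


def run_shell(line, files):
    """Run `a && b && ...`: stop at the first non-zero exit, collect stdout."""
    out = []
    for stage in line.split("&&"):
        argv = shlex.split(stage)
        code, stdout = REGISTRY[argv[0]](argv[1:], files)  # KeyError if unknown
        out.append(stdout)
        if code != 0:
            break
    return "".join(out)


def check_passes(command, files):
    try:
        output = run_shell(f"{command} && echo {SENTINEL}", files)
    except Exception:
        return False  # unknown command or any other error: fail closed
    return SENTINEL in output


files = {"scratch/REPORT.org": "findings", "scratch/empty.html": ""}
present = check_passes("test -s scratch/REPORT.org", files)        # True
empty = check_passes("test -s scratch/empty.html", files)          # False
unknown = check_passes("grep findings scratch/REPORT.org", files)  # False
```

The third case is the instructive one. The file does contain the word, so the work is arguably fine, but `grep` is not in the toy registry, so the check fails rather than escaping somewhere it could run.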
The shell itself is real: Workbooks.Shell runs pipelines of
registered WASM commands — one wasmtime instance per stage, stdout piped to the
next stage's stdin in memory — with |, ;,
&&, ||, variable assignment and expansion, and
> >> < redirection confined to preopened dirs. A real
coreutils pipeline runs entirely in WASM. The honest consequence: a correct
check that happens to use a command the WASM shell can't run does not
pass — the task stays un-DONE rather than escaping to native
sh. Fail-closed cuts both ways, on purpose.
what a run leaves BEHIND
Depth rung — skippable. Acceptance isn't only a gate at the moment of finishing; it leaves an audit trail, so “did the check ever pass?” is a query, not a memory. A run produces three durable things:
| artifact | what's in it | question it answers |
|---|---|---|
| the headline's state | DONE / FAILED / PARTIAL | what's the current truth? |
| _telemetry.db | task_events(run_id, task_id, idx, title, state, output, ts) | did this check ever pass, in SQL? |
| ledger-sealed _steps.jsonl | hash-chained, signed step log | can I prove it wasn't tampered with? |
The verdict of that table: the state is the live answer, the telemetry db is the searchable history, and the ledger is the tamper-evident, attributable receipt. The telemetry is always on; the ledger seal is best-effort and never blocks a run.
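As an illustration of "a query, not a memory", here is a Python sketch against an in-memory SQLite table. The column names come from the table above; the column types, the sample rows, and the `ever_passed` helper are all assumptions made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Column names as listed above; the types are guesses.
conn.execute(
    """CREATE TABLE task_events (
           run_id TEXT, task_id TEXT, idx INTEGER, title TEXT,
           state TEXT, output TEXT, ts TEXT)"""
)
# Invented rows: one task that failed and later passed, one that only failed.
conn.executemany(
    "INSERT INTO task_events VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("run-1", "t1", 0, "Gather pricing", "FAILED", "", "2024-01-01T00:00:00Z"),
        ("run-2", "t1", 0, "Gather pricing", "DONE", "", "2024-01-02T00:00:00Z"),
        ("run-2", "t2", 1, "Write report", "FAILED", "", "2024-01-02T00:05:00Z"),
    ],
)


def ever_passed(conn, task_id):
    """True if any recorded event for this task reached DONE."""
    row = conn.execute(
        "SELECT COUNT(*) FROM task_events WHERE task_id = ? AND state = 'DONE'",
        (task_id,),
    ).fetchone()
    return row[0] > 0


t1_passed = ever_passed(conn, "t1")
t2_passed = ever_passed(conn, "t2")
```

Against a real run you would point `sqlite3.connect` at the `_telemetry.db` file instead of `:memory:`, assuming it is an ordinary SQLite database.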
Two execution facts fall out of the same machinery. Already-DONE
tasks are skipped on a re-run, so runs are resumable — re-running
a half-finished plan picks up exactly where it failed. And a composite parent
is DONE only when all its children are; otherwise it's
PARTIAL. Unordered siblings run in parallel; an
:ORDERED: property forces them into a sequence.
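Both facts reduce to one-liners. This Python sketch uses hypothetical function names and a flat list of task dicts, and it ignores ordering and parallelism entirely.

```python
def rollup(children_states):
    """A composite parent is DONE only when every child is DONE; otherwise PARTIAL.

    Assumes at least one child; an empty list would trivially read as DONE.
    """
    return "DONE" if all(state == "DONE" for state in children_states) else "PARTIAL"


def tasks_to_run(tasks):
    """Resumable re-run: skip anything already marked DONE."""
    return [t["title"] for t in tasks if t["state"] != "DONE"]


# Invented example plan, midway through a run.
leaves = [
    {"title": "Gather pricing", "state": "DONE"},
    {"title": "Compare tiers", "state": "FAILED"},
    {"title": "Write report", "state": "TODO"},
]
parent_state = rollup([t["state"] for t in leaves])
remaining = tasks_to_run(leaves)
```

Here the parent reads `PARTIAL` and the re-run picks up the failed task and everything after it, leaving the finished download alone.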
rung 3 — does it DOCK?
The plan cohered. The task earned its DONE. One claim remains:
the compiled artifact actually fits the slot it has to dock into. That's
Workbooks.Conformance, and its question is narrow and exact —
does this built component match the workbooks:engine world?
It extracts the component's interface with wasm-tools component wit,
then makes two checks. The component must export run. And
every world-level function it imports must be one of the three engine
Dock funcs — session-info, vfs-query,
run-command — with WASI imports always permitted:
flowchart LR
wasm[".wasm"] --> wit["wasm-tools component wit"]
wit --> nc{"is it a<br/>component?"}
nc -- "no" --> e0["error: not_a_component"]
nc -- "yes" --> ex{"exports<br/>run: func ?"}
ex -- "no" --> e1["missing required export run"]
wit --> im{"imports only<br/>session-info / vfs-query /<br/>run-command (+ wasi:*) ?"}
im -- "no" --> e2["undeclared Dock import X<br/>(not in workbooks:engine)"]
ex -- "yes" --> ok[":ok"]
im -- "yes" --> ok
style wasm fill:#f2ddb0,stroke:#121316
style ok fill:#13d943,stroke:#121316,stroke-width:2.5px
style e1 fill:#f3c5a3,stroke:#121316
style e2 fill:#f3c5a3,stroke:#121316
The teaching moment is the demo's two verdicts, checked against the real module:
Workbooks.Conformance.engine?("build/fixtures/engine-probe.wasm")
# => :ok
# (exports run, imports only Dock funcs)
Workbooks.Conformance.engine?("build/oql.wasm")
# => {:error, ["missing required export `run`"]}
# (a different world: rejected)
The probe passes. The kernel itself — oql.wasm, perfectly valid
wasm — is rejected, because it's a different world: it has no
run export. That's the whole point of rung three. Conformance
isn't asking “is this good wasm?” It's asking “is this wasm
built for this world?” A thing can be flawless and still not fit
the slot, and a docking check that can't tell the difference is no check at all.
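Stripped of WIT parsing, the conformance rule is a set-membership test. The Python sketch below assumes the export and import names have already been extracted; `engine_conformance` is a hypothetical name, the return values only loosely mirror the Elixir tuples, and the example name lists are illustrative guesses rather than the real components' interfaces.

```python
DOCK_FUNCS = {"session-info", "vfs-query", "run-command"}


def engine_conformance(exports, imports):
    """Check the two rules: exports `run`, imports only Dock funcs or WASI."""
    errors = []
    if "run" not in exports:
        errors.append("missing required export `run`")
    for name in imports:
        if name.startswith("wasi:") or name in DOCK_FUNCS:
            continue  # WASI is always permitted; Dock funcs are the allowed world
        errors.append(f"undeclared Dock import `{name}` (not in workbooks:engine)")
    return "ok" if not errors else ("error", errors)


# Illustrative inputs only.
probe = engine_conformance(
    exports=["run"],
    imports=["session-info", "wasi:io/streams"],
)
kernel = engine_conformance(exports=["validate"], imports=[])
rogue = engine_conformance(exports=["run"], imports=["open-socket"])
```

The `kernel` case has nothing wrong with it as a module. It simply answers to a different contract, which is exactly the rejection the demo shows.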
There's an adjacent gate worth one sentence: check_upgrade
diffs a deployed world against a new one — exports may grow but never shrink,
imports may shrink but never grow — and gates breaking changes the way WIT and
Candid subtyping do. That belongs to a future upgrades lesson; here
it's just a pointer.
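The direction of the rule is easy to get backwards, so here is a small Python sketch of it. This is not `check_upgrade` itself: `upgrade_problems` is a hypothetical name, and worlds are reduced to plain sets of names with no type-level subtyping.

```python
def upgrade_problems(old, new):
    """Exports may grow but never shrink; imports may shrink but never grow."""
    problems = []
    for name in sorted(old["exports"] - new["exports"]):
        problems.append(f"export removed: {name}")  # callers may depend on it
    for name in sorted(new["imports"] - old["imports"]):
        problems.append(f"import added: {name}")    # the host may not provide it
    return problems


deployed = {"exports": {"run"}, "imports": {"session-info", "vfs-query"}}

# Invented candidate worlds for illustration.
safe = {"exports": {"run", "describe"}, "imports": {"session-info"}}
breaking = {"exports": set(), "imports": {"session-info", "vfs-query", "run-command"}}

safe_result = upgrade_problems(deployed, safe)
breaking_result = upgrade_problems(deployed, breaking)
```

The asymmetry follows from who depends on what: existing callers rely on exports staying put, while the existing host can only satisfy imports it already offered.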
where it BITES
Honesty section. Validation is real, but it is not omniscient, and a few of its edges are load-bearing for your judgment.
No check means trust. Acceptance is opt-in. A task with neither a
:done-when: nor a :check block reaches
DONE on the agent's word — exactly the claim this page started by
distrusting. The gate is only as strong as the authoring discipline that
writes gates.
Fail-closed produces false negatives. A correct piece of work whose
check uses a command the WASM shell can't run will FAIL honestly.
We chose that over the alternative — a check that silently escapes to native
sh — but the cost is real: sometimes the work is fine and the
check isn't.
Diagnostics carry scope, not line numbers. Rung one names the component that's wrong; it does not point a cursor at the offending line. The parent lesson's phrasing over-reached there, and the precise truth is: a name, and a message.
The kernel validates two things, not twenty. No cycle detection, no
“DONE resting on a TODO”, no :DEPENDS:-points-at-nothing.
Some of that is by design rather than omission: the runtime's execution model
is a tree — nesting plus :ORDERED: — which cannot
express a cycle at all. That's correctness by construction, not a
checker you can lean on. Where the structure can't hold a bug, there's nothing
to detect.
Conformance is a demoed seam, not a universal gate. The check is real and exercised by its demo; presenting it as wired into every build path would overclaim. Treat it as the contract the build should enforce, demoed end-to-end, with universal wiring still in front of it.
One CLI's lint is a stub. The Elixir CLI's
wb lint maps to a kernel lint that returns an empty
list today; the real checker is validate, reached by the Rust
wbx lint. Use wbx lint when you want the diagnostics
this page describes.
questions people actually ASK
What happens if I write no check at all?
The task reaches DONE on the agent's word. Acceptance is
opt-in per task: a :done-when: command or a
#+begin_src sh :check block gates a leaf; with neither, the
runner trusts the worker. The check is the difference between a claim and a
fact — no check, just a claim.
Can a check run arbitrary commands?
Only commands the WASM shell carries in its registry. Author-supplied
check strings never touch native sh — they'd be an exec vector
if they did. That's a deliberate constraint, not a limitation to route
around: the sandbox is the safety.
Why did my task FAIL when the work looks done?
Almost always fail-closed in action: the check command isn't one the WASM
shell knows, or it errored, or it didn't print its success sentinel. An
unknown command, a raised exception, a missing file — all read as
FAILED, never as “probably fine”. Check that your
:done-when: uses a registered command and a path under the
preopened workdir.
Does validation catch dependency cycles?
No — and it doesn't need to. The kernel makes two plan checks (dangling input, missing source block), not a cycle detector. The execution model is a tree, so a cycle can't be expressed in the first place. It's correctness by construction rather than by check — which means the absence of a cycle detector is a design answer, not a gap.
Is wbx bundle failing on diagnostics a bug?
No — it's the gate. Bundling refuses to assemble a workbook that doesn't
lint clean; the error source has diagnostics — fix them first
is the feature working. Run wbx lint, read the named scopes, fix
them, and the bundle proceeds.
Where do FAILED results go?
Two places. The headline's state in the plan file becomes
FAILED, and a row lands in _telemetry.db's
task_events table: run id, task id, index, title, state, output, and
timestamp. So “did this ever pass?” is a SQL query, not an
archaeology project.