validations — who says DONE is true

DONE is a CLAIM

The parent lesson sold you the good part — the plan is the state, agents work the ready list, the file is the meeting. Then comes the dread that follows every agent everywhere: an agent just wrote DONE. Says who?

By default, says the agent. The same collaborator who hallucinates a filename, declares victory on a half-built artifact, and writes confident prose over an empty directory — that collaborator is the one editing your state. A workflow is only as honest as its least honest writer, and the least honest writer is usually the one with the most to do.

Three different lies hide under one keyword. The plan can be incoherent — a task consumes an input nothing produces. A task can be unfinished — the work isn't done, but the headline says it is. The artifact can be malformed — it compiled, but it doesn't fit the slot it has to dock into. One word, DONE, covers all three. The question this page answers is how each one stops being a claim and becomes a checkable fact.

the DEFINITION

val·i·da·tion /ˌvæl·ɪˈdeɪ·ʃən/ noun

1. a check that converts a claim in the plan into a named diagnostic — {level, scope, message} — and is fail-closed: anything it cannot positively confirm reads as a failure, never as probably fine.

Validation isn't one checker. It's a ladder of three, and the rungs are keyed to time — each fires at a different moment in a plan's life, and each answers a different lie:

rung	when it fires	the question it answers	where it lives
1 · plan	before anything runs	does the plan even cohere?	`validate` in `oql.wasm`
2 · acceptance	after each task	is DONE actually earned?	`Workflow.Todo.validate`
3 · conformance	before anything docks	does the artifact fit the world?	`Conformance.engine?`

The shared trick is the diagnostic. A claim is unfalsifiable — you either believe the writer or you don't. A diagnostic has a scope (the named thing that's wrong) and a message (what's wrong with it). That single shape is the whole difference between drift-as-shrug and drift-as-bug-report, and it's the promise the parent lesson made that this page mechanizes.
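To make the shape concrete, here is a minimal sketch of a diagnostic and a fail-closed wrapper around a check. The names `Diagnostic` and `fail_closed`, and both message strings, are illustrative assumptions; they are not taken from the kernel.

```python
from dataclasses import dataclass, asdict
from typing import Callable


@dataclass(frozen=True)
class Diagnostic:
    level: str    # e.g. "error"
    scope: str    # the named thing that is wrong
    message: str  # what is wrong with it


def fail_closed(scope: str, check: Callable[[], bool]) -> list[Diagnostic]:
    """Return [] only when `check` positively confirms the claim."""
    try:
        # Anything other than a literal True is not a pass.
        confirmed = check() is True
    except Exception as exc:
        # A crashed check is a failure, never "probably fine".
        return [Diagnostic("error", scope, f"check raised: {exc!r}")]
    if confirmed:
        return []
    return [Diagnostic("error", scope, "claim could not be confirmed")]


if __name__ == "__main__":
    for found in fail_closed("Summarize", lambda: None):
        print(asdict(found))
```

The point of the sketch is the asymmetry: there is exactly one way to pass and every other outcome, including an exception or an ambiguous return value, produces a named diagnostic.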

rung 1 — validating the PLAN

The first rung runs before a single task executes. It lives in the OQL kernel: the same pure string→string function that ships both as a native library inside the wbx CLI and as the workbooks:oql WASM component. There is no separate “wasm version” and “native version” of the logic; it is one function, run everywhere.

validate only looks at subtrees tagged :workflow:. Inside one, it collects every :component: descendant, builds the set of every value any component produces (its :out), and then makes exactly two checks per component:

flowchart LR
  org["org text"] --> parse["parse headlines"]
  parse --> wf{":workflow:?"}
  wf -- "no" --> skip["ignored"]
  wf -- "yes" --> comps["collect :component: descendants"]
  comps --> prod["build produced-set<br/>(every :out)"]
  prod --> c1{"has source<br/>block / lang?"}
  prod --> c2{"every :in has<br/>a producer?"}
  c1 -- "no" --> d1["error: component has<br/>no source block / language"]
  c2 -- "no" --> d2["error: input X has<br/>no upstream producer"]
  d1 --> out[["diagnostics JSON"]]
  d2 --> out
  c1 -- "yes" --> out
  c2 -- "yes" --> out
  style org fill:#f2ddb0,stroke:#121316
  style out fill:#13d943,stroke:#121316,stroke-width:2.5px
  style d1 fill:#f3c5a3,stroke:#121316
  style d2 fill:#f3c5a3,stroke:#121316

Walk the graph: org text comes in, headlines are parsed, and anything that isn't a :workflow: falls out the side — ignored. What remains is the components. The kernel gathers everything they produce into one produced-set, then asks each component two questions. Does it have a source block, so it can actually run? And does every input it names appear in the produced-set, so something upstream actually makes it? A no to either becomes a diagnostic. That's the whole of rung one — two checks, no more.

This is deliberately a small, fast net, not a theorem prover. It catches the two structural lies that make a plan un-buildable, and it catches them with the same kernel function whether you run it from the CLI, a WebSocket, or an HTTP call.

the shape of a FINDING

Here is the promise made literal. Linting a broken plan (output verified against the real wbx binary) names a producer-less input and a component with no source block:

* Broken workflow                                   :workflow:
** Summarize                                        :component:
   #+begin_src js :in events:list :out summary:string
   export default (e) => e;
   #+end_src
** Orphan task                                      :component:

$ wbx lint broken.org
[{"level":"error","message":"input `events:list` has no upstream producer","scope":"Summarize"},
 {"level":"error","message":"component has no source block / language","scope":"Orphan task"}]

Two errors, each with a scope — and the scope is the name. The parent lesson promised drift would get “a name and a line number instead of a shrug”; the code is precise about the first half and honest about the second. Drift is named by scope — the component's name, Summarize and Orphan task — not addressed by line. A name is enough to act on; we'd rather tell you it's a name than imply a line cursor that isn't there.
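The two rung-one checks can be sketched in a few lines and run against the broken plan above. This is a simplified illustration, not the kernel's code: the regexes, `parse_components`, and `header_args` are assumptions, and the parser only understands enough org to read tags and `#+begin_src` header arguments.

```python
import json
import re

# Assumption: simplified patterns, not the kernel's real org parser.
HEADLINE = re.compile(r"^(\*+)\s+(.*?)\s*((?::[\w@]+)+:)?\s*$")
SRC_OPEN = re.compile(r"^\s*#\+begin_src\s+(\S+)(.*)$", re.IGNORECASE)


def header_args(header, key):
    """Values of a header argument such as :in or :out."""
    return re.findall(rf"(?:^|\s):{key}\s+(\S+)", header)


def parse_components(org):
    """Collect :component: headlines nested under a :workflow: headline."""
    components, workflow_depth, current = [], None, None
    for line in org.splitlines():
        h = HEADLINE.match(line)
        if h:
            depth, title = len(h.group(1)), h.group(2)
            tags = set((h.group(3) or "").strip(":").split(":")) - {""}
            current = None
            if workflow_depth is not None and depth <= workflow_depth:
                workflow_depth = None  # left the workflow subtree
            if "workflow" in tags and workflow_depth is None:
                workflow_depth = depth
            elif workflow_depth is not None and "component" in tags:
                current = {"scope": title, "lang": None, "in": [], "out": []}
                components.append(current)
            continue
        s = SRC_OPEN.match(line)
        if s and current is not None and current["lang"] is None:
            current["lang"] = s.group(1)
            current["in"] = header_args(s.group(2), "in")
            current["out"] = header_args(s.group(2), "out")
    return components


def validate(org):
    components = parse_components(org)
    produced = {out for c in components for out in c["out"]}
    diagnostics = []
    for c in components:
        if c["lang"] is None:  # check 1: can it run at all?
            diagnostics.append({"level": "error",
                                "message": "component has no source block / language",
                                "scope": c["scope"]})
        for name in c["in"]:   # check 2: does something upstream make it?
            if name not in produced:
                diagnostics.append({"level": "error",
                                    "message": f"input `{name}` has no upstream producer",
                                    "scope": c["scope"]})
    return diagnostics


BROKEN = """\
* Broken workflow :workflow:
** Summarize :component:
   #+begin_src js :in events:list :out summary:string
   export default (e) => e;
   #+end_src
** Orphan task :component:
"""

if __name__ == "__main__":
    print(json.dumps(validate(BROKEN)))
```

Note that the produced-set is built once, across all components, before any component is checked; that is why ordering inside the workflow does not matter to this rung.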

The same diagnostic shape surfaces through four doors:

surface	the verb
`wbx lint <file.org>`	runs `validate`, prints the JSON
`wbx bundle`	refuses to assemble on any diagnostic
runtime WebSocket `"validate"`	same wasm, over a socket
`POST /w/:id/call` `fn=validate`	same wasm, over HTTP

That second row is a gate, not a courtesy. Bundling is where the plan becomes a shippable workbook, and it will not assemble a source that doesn't lint clean — the error reads source has diagnostics — fix them first (wb lint), followed by the very JSON above. wbx init even lints its own scaffold on the way out, so a template bug fails the template, not you.

rung 2 — DONE has a PRICE

Rung one proves the plan could run. Rung two proves a task did. The module is Workbooks.Workflow.Todo, and its own docs name the idea well: a task only reaches DONE when its validation passes — unit tests for org mode. A task can carry its own acceptance criterion, and the runner honors a strict precedence:

First, a :done-when: property — a shell command that must exit clean. Second, failing that, the first #+begin_src <lang> ... :check block in the body — its command is run. Third, failing both, the answer is simply true: no check means the agent's own completion stands. Pass → the headline's state becomes DONE; fail → FAILED.
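That precedence reads naturally as a small lookup. The sketch below assumes the task's properties arrive as a dict and that a `:check` block's body is the command to run; `acceptance_command`, `task_verdict`, and the regex are hypothetical names and shapes, not the runner's API.

```python
import re
from typing import Callable, Optional

# Assumption: a :check block is any src block whose header line carries :check.
CHECK_BLOCK = re.compile(
    r"^[ \t]*#\+begin_src[ \t]+\S+[^\n]*:check[^\n]*\n(.*?)^[ \t]*#\+end_src",
    re.IGNORECASE | re.MULTILINE | re.DOTALL,
)


def acceptance_command(properties: dict[str, str], body: str) -> Optional[str]:
    """Pick the check for a task, or None when the author wrote no check."""
    done_when = properties.get("done-when", "").strip()
    if done_when:                      # 1. the :done-when: property wins
        return done_when
    block = CHECK_BLOCK.search(body)   # 2. else the first :check block
    if block:
        return block.group(1).strip()
    return None                        # 3. else nothing to run


def task_verdict(command: Optional[str], run_check: Callable[[str], bool]) -> str:
    """No check means the agent's word stands; otherwise the check decides."""
    if command is None:
        return "DONE"
    return "DONE" if run_check(command) else "FAILED"


BODY = """\
Some prose about the task.
#+begin_src sh :check
test -s scratch/REPORT.org
#+end_src
"""

if __name__ == "__main__":
    print(acceptance_command({"done-when": "test -s scratch/e2b_pricing.html"}, BODY))
    print(acceptance_command({}, BODY))
    print(acceptance_command({}, "prose only"))
```

The third branch is the one to keep in mind while authoring: a task with no check is not rejected, it is simply believed.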

Here's a real gate, from a workflow a model actually authored — a research plan that downloads a pricing page and refuses to call the step done until the file is non-empty:

* TODO Gather E2B Pricing Information
  :PROPERTIES:
  :ORDERED: t
  :END:
  - Use `curl` to download the official E2B pricing page ... save it to `scratch/e2b_pricing.html`.
  :done-when: test -s scratch/e2b_pricing.html

sequenceDiagram
  participant A as the agent
  participant R as the runner
  participant S as the WASM shell
  participant F as the plan file
  A->>R: I'm finished
  R->>S: run check + sentinel<br/>(test -s … && echo OK)
  S-->>R: stdout
  R->>R: sentinel present?
  alt sentinel found
    R->>F: state = DONE
  else missing / error / unknown command
    R->>F: state = FAILED
  end
  Note over F: the verdict is written<br/>INTO the headline's state

Read the exchange as a story. The agent says it's finished. The runner doesn't take its word: it runs the check in a sandboxed shell, appends a sentinel, and watches the output. If the sentinel comes back, the task's record becomes DONE. If the file was missing, the command errored, or the command wasn't one the shell even knows, the record becomes FAILED. The verdict lands in the headline, so the file itself carries the truth. And if the author wrote no check at all, the runner trusts the agent; the gate is opt-in, per task.

the shell the check runs IN

Depth rung — skippable, but it's where acceptance earns the word fail-closed. A check command is author-supplied. That makes it a potential exec vector, so it must never touch a native shell. Every check runs inside the in-WASM shell, over the CommandRegistry — never native sh (this is the no-bash-outside-WASM rule the whole runtime is built around).

The mechanism is a sentinel trick. The runner doesn't ask the shell “did this succeed?” — it appends && echo __WB_CHECK_OK__ and then checks whether that exact string appears in the output. Shell semantics do the rest: the sentinel only prints if the command before it exited zero. Then it reads as pass only on a literal match — and nothing else reads as pass:

flowchart TD
  cmd["author check, e.g.<br/>test -s scratch/REPORT.org"] --> add["append && echo __WB_CHECK_OK__"]
  add --> run["run in WASM shell<br/>workdir preopened, nothing else"]
  run --> q{"output contains<br/>__WB_CHECK_OK__ ?"}
  q -- "yes" --> pass["PASS → DONE"]
  q -- "no" --> fail["FAILED"]
  run --> err{"error tuple / raised<br/>exception / unknown cmd"}
  err -- "any" --> fail
  style cmd fill:#f2ddb0,stroke:#121316
  style pass fill:#13d943,stroke:#121316,stroke-width:2.5px
  style fail fill:#f3c5a3,stroke:#121316,stroke-width:2px

Trace every path: the author's command gets the sentinel bolted on, runs in a shell with only the working directory preopened, and the output either contains the sentinel or it doesn't. Contains it → pass. Doesn't → fail. And the side branch matters as much as the main one — an error tuple, a raised exception, or a command the registry doesn't carry all collapse to the same FAILED. There is no path where uncertainty becomes a pass.
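The sentinel trick is small enough to sketch end to end. The toy shell below is a stand-in invented for this example (it only understands `&&` chains over a three-command registry) so the sketch runs without touching a native shell; only the sentinel string and the pass/fail rule come from the text.

```python
SENTINEL = "__WB_CHECK_OK__"


class UnknownCommand(Exception):
    """Raised when a stage names a command the registry does not carry."""


# Hypothetical registry: each command takes argv and returns (exit_code, stdout).
REGISTRY = {
    "true": lambda args: (0, ""),
    "false": lambda args: (1, ""),
    "echo": lambda args: (0, " ".join(args) + "\n"),
}


def toy_shell(line, registry=REGISTRY):
    """Run an `&&` chain and return the combined stdout."""
    out = []
    for stage in line.split("&&"):
        argv = stage.split()
        if not argv or argv[0] not in registry:
            raise UnknownCommand(stage.strip())
        code, stdout = registry[argv[0]](argv[1:])
        out.append(stdout)
        if code != 0:
            break  # && short-circuits: later stages never run
    return "".join(out)


def verdict(check, run=toy_shell):
    """DONE only on a literal sentinel match; everything else is FAILED."""
    try:
        output = run(f"{check} && echo {SENTINEL}")
    except Exception:
        return "FAILED"  # errors and unknown commands collapse to FAILED
    return "DONE" if SENTINEL in output else "FAILED"


if __name__ == "__main__":
    for check in ("true", "false", "no-such-command"):
        print(check, "->", verdict(check))
```

The runner never inspects an exit code directly. It relies on `&&` to decide whether the sentinel is printed at all, which is why a missing string and a crashed shell are indistinguishable to it, by design.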

The shell itself is real: Workbooks.Shell runs pipelines of registered WASM commands — one wasmtime instance per stage, stdout piped to the next stage's stdin in memory — with |, ;, &&, ||, variable assignment and expansion, and > >> < redirection confined to preopened dirs. A real coreutils pipeline runs entirely in WASM. The honest consequence: a correct check that happens to use a command the WASM shell can't run does not pass — the task stays un-DONE rather than escaping to native sh. Fail-closed cuts both ways, on purpose.

what a run leaves BEHIND

Depth rung — skippable. Acceptance isn't only a gate at the moment of finishing; it leaves an audit trail, so “did the check ever pass?” is a query, not a memory. A run produces three durable things:

artifact	what's in it	question it answers
the headline's state	`DONE` / `FAILED` / `PARTIAL`	what's the current truth?
`_telemetry.db`	`task_events(run_id, task_id, idx, title, state, output, ts)`	did this check ever pass — in SQL?
ledger-sealed `_steps.jsonl`	hash-chained, signed step log	can I prove it wasn't tampered with?

The verdict of that table: the state is the live answer, the telemetry db is the searchable history, and the ledger is the tamper-evident, attributable receipt. The telemetry is always on; the ledger seal is best-effort and never blocks a run.
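Here is what “a query, not a memory” could look like. The sketch assumes `_telemetry.db` is a SQLite file and guesses the column types; the rows, task ids, and helper names are invented for illustration. Only the table name and column names come from the table above.

```python
import sqlite3

# Assumptions: SQLite storage, these column types, and the sample rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE task_events ("
    "run_id TEXT, task_id TEXT, idx INTEGER, title TEXT, "
    "state TEXT, output TEXT, ts TEXT)"
)
conn.executemany(
    "INSERT INTO task_events VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("run-1", "gather", 0, "Gather pricing", "FAILED", "", "2024-01-01T10:00:00Z"),
        ("run-2", "gather", 0, "Gather pricing", "DONE", "", "2024-01-01T10:05:00Z"),
        ("run-2", "report", 1, "Write report", "FAILED", "", "2024-01-01T10:06:00Z"),
    ],
)


def ever_passed(conn, task_id):
    """Did this task's check ever pass, in any run?"""
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM task_events WHERE task_id = ? AND state = 'DONE'",
        (task_id,),
    ).fetchone()
    return count > 0


def history(conn, task_id):
    """Every recorded verdict for a task, oldest first."""
    return conn.execute(
        "SELECT run_id, state FROM task_events WHERE task_id = ? ORDER BY ts",
        (task_id,),
    ).fetchall()


if __name__ == "__main__":
    print(ever_passed(conn, "gather"), history(conn, "gather"))
```

Against a real run you would open the run's `_telemetry.db` instead of an in-memory database, after confirming the actual schema.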

Two execution facts fall out of the same machinery. Already-DONE tasks are skipped on a re-run, so runs are resumable — re-running a half-finished plan picks up exactly where it failed. And a composite parent is DONE only when all its children are; otherwise it's PARTIAL. Unordered siblings run in parallel; an :ORDERED: property forces them into a sequence.
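Both facts fit in a short recursive sketch. `Task`, `rollup`, and `pending_leaves` are hypothetical names; the rules they encode (a composite is DONE only when every child is, otherwise PARTIAL, and DONE work is skipped on a re-run) are the ones stated above.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    title: str
    state: str = "TODO"
    children: list["Task"] = field(default_factory=list)


def rollup(task: Task) -> str:
    """A leaf keeps its own state; a composite is DONE only if every child is."""
    if not task.children:
        return task.state
    states = [rollup(child) for child in task.children]
    return "DONE" if all(s == "DONE" for s in states) else "PARTIAL"


def pending_leaves(task: Task) -> list[Task]:
    """Leaves a re-run would still execute: everything not already DONE."""
    if not task.children:
        return [] if task.state == "DONE" else [task]
    return [leaf for child in task.children for leaf in pending_leaves(child)]


if __name__ == "__main__":
    plan = Task("Research", children=[
        Task("Gather pricing", "DONE"),
        Task("Write report", "FAILED"),
    ])
    print(rollup(plan), [t.title for t in pending_leaves(plan)])
```

Resumability falls out of the second function: the re-run's work list is computed from the file's recorded states, so nothing has to remember where the last run stopped.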

validation closes the LOOP

So far validation is a gate at the exit. But the most interesting place it appears is at the entrance — when an agent writes the plan itself. The groundskeeper's author model emits an org outline, and before that outline is allowed to run, validate_outline requires at least one heading in an active task state. If there's none, it fails — and the failure becomes the next prompt:

sequenceDiagram
  participant U as a goal
  participant L as the author model
  participant V as validate_outline
  participant R as the runner
  U->>L: build me this
  L->>V: org outline
  V->>V: any TODO/NEXT/... heading?
  alt none
    V-->>L: Your previous outline was invalid:<br/>… Emit a corrected org outline only.
    L->>V: corrected outline
  end
  V->>R: outline runs
  Note over L,R: the validation error<br/>IS the retry instruction

Walk it: a goal comes in, the author model writes an outline, and the validator asks one question — is there at least one real task heading here? If not, the validator's own error text is handed straight back to the model as the retry instruction: your previous outline was invalid, emit a corrected one only. One retry, then it runs. The error isn't just a rejection — it's a teaching signal.
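The loop is easy to sketch with a stand-in for the model. Several things here are assumptions: the text names only TODO and NEXT among the active states, so the sketch checks just those two; the error wording, the retry prompt's exact layout, and the choice to hand back the second outline together with its validation result are all illustrative.

```python
import re
from typing import Callable, Optional

# Assumption: only the two states the text names; the real set may be larger.
ACTIVE_STATES = ("TODO", "NEXT")
ACTIVE_HEADING = re.compile(
    r"^\*+\s+(?:%s)\b" % "|".join(ACTIVE_STATES), re.MULTILINE
)


def validate_outline(outline: str) -> Optional[str]:
    """Return None when the outline is runnable, else an error message."""
    if ACTIVE_HEADING.search(outline):
        return None
    return "no heading in an active task state"


def author_with_retry(
    goal: str, author: Callable[[str], str]
) -> tuple[str, Optional[str]]:
    """Ask for an outline; on failure, feed the error back exactly once."""
    outline = author(goal)
    error = validate_outline(outline)
    if error is not None:
        retry_prompt = (
            f"{goal}\n\nYour previous outline was invalid: {error}. "
            "Emit a corrected org outline only."
        )
        outline = author(retry_prompt)
        error = validate_outline(outline)
    return outline, error


if __name__ == "__main__":
    replies = iter(["* Notes\nsome prose", "* TODO Write report"])
    outline, error = author_with_retry("build me this", lambda p: next(replies))
    print(repr(outline), error)
```

The validator returns a message rather than a boolean precisely so that the same string can serve as both the rejection and the next prompt.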

The same posture runs downstream. The default agent runner's system prompt tells workers: if a task needs acceptance criteria, write them as a #+begin_src sh :check block in scratch and run it yourself before finishing. Validation isn't a thing done to the agents from outside — it's a discipline the agents are taught to author into their own work.

rung 3 — does it DOCK?

The plan cohered. The task earned its DONE. One claim remains: the compiled artifact actually fits the slot it has to dock into. That's Workbooks.Conformance, and its question is narrow and exact — does this built component match the workbooks:engine world?

It extracts the component's interface with wasm-tools component wit, then makes two checks. The component must export run. And every world-level function it imports must be one of the three engine Dock funcs — session-info, vfs-query, run-command — with WASI imports always permitted:

flowchart LR
  wasm[".wasm"] --> wit["wasm-tools component wit"]
  wit --> nc{"is it a<br/>component?"}
  nc -- "no" --> e0["error: not_a_component"]
  nc -- "yes" --> ex{"exports<br/>run: func ?"}
  ex -- "no" --> e1["missing required export run"]
  wit --> im{"imports only<br/>session-info / vfs-query /<br/>run-command (+ wasi:*) ?"}
  im -- "no" --> e2["undeclared Dock import X<br/>(not in workbooks:engine)"]
  ex -- "yes" --> ok[":ok"]
  im -- "yes" --> ok
  style wasm fill:#f2ddb0,stroke:#121316
  style ok fill:#13d943,stroke:#121316,stroke-width:2.5px
  style e1 fill:#f3c5a3,stroke:#121316
  style e2 fill:#f3c5a3,stroke:#121316

The teaching moment is the demo's two verdicts, checked against the real module:

Workbooks.Conformance.engine?("build/fixtures/engine-probe.wasm")
# => :ok                                  exports run, imports only Dock funcs

Workbooks.Conformance.engine?("build/oql.wasm")
# => {:error, ["missing required export `run`"]}   a DIFFERENT world — rejected

The probe passes. The kernel itself — oql.wasm, perfectly valid wasm — is rejected, because it's a different world: it has no run export. That's the whole point of rung three. Conformance isn't asking “is this good wasm?” It's asking “is this this world's wasm?” A thing can be flawless and still not fit the slot, and a docking check that can't tell the difference is no check at all.
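Stripped of WIT extraction, the two conformance checks are set membership. The sketch below takes export and import names as plain strings, which is an assumption about what the extracted interface boils down to; `engine_conformance` is a hypothetical helper, and the example export and import names other than `run` and the three Dock funcs are invented.

```python
ALLOWED_DOCK_IMPORTS = {"session-info", "vfs-query", "run-command"}


def engine_conformance(exports, imports):
    """Return a list of errors; an empty list means the artifact fits."""
    errors = []
    if "run" not in exports:
        errors.append("missing required export `run`")
    for name in imports:
        # WASI imports are always permitted; Dock imports must be known.
        if name.startswith("wasi:") or name in ALLOWED_DOCK_IMPORTS:
            continue
        errors.append(f"undeclared Dock import {name} (not in workbooks:engine)")
    return errors


if __name__ == "__main__":
    # Hypothetical interfaces, for illustration only.
    probe = engine_conformance({"run"}, ["session-info", "wasi:io/streams"])
    other_world = engine_conformance({"validate"}, [])
    print(probe, other_world)
```

Nothing in the function asks whether the module is well-formed. A component from another world fails on the first check alone, which is the distinction the two verdicts above demonstrate.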

There's an adjacent gate worth one sentence: check_upgrade diffs a deployed world against a new one — exports may grow but never shrink, imports may shrink but never grow — and gates breaking changes the way WIT and Candid subtyping do. That belongs to a future upgrades lesson; here it's just a pointer.
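The upgrade rule itself is two set differences. This is only a sketch of the stated rule, assuming interfaces reduce to sets of names; `upgrade_breaks` and its messages are hypothetical and say nothing about how check_upgrade is actually implemented.

```python
def upgrade_breaks(old_exports, old_imports, new_exports, new_imports):
    """List breaking changes: exports may grow, imports may shrink."""
    problems = []
    for name in sorted(set(old_exports) - set(new_exports)):
        problems.append(f"export removed: {name}")  # callers may depend on it
    for name in sorted(set(new_imports) - set(old_imports)):
        problems.append(f"import added: {name}")    # host may not provide it
    return problems


if __name__ == "__main__":
    print(upgrade_breaks({"run"}, {"vfs-query"}, {"run", "describe"}, set()))
    print(upgrade_breaks({"run"}, {"vfs-query"}, set(), {"vfs-query", "run-command"}))
```

The asymmetry is the subtyping intuition: a new version may offer more and ask for less, never the reverse.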

where it BITES

Honesty section. Validation is real, but it is not omniscient, and a few of its edges are load-bearing for your judgment.

No check means trust. Acceptance is opt-in. A task with neither a :done-when: nor a :check block reaches DONE on the agent's word — exactly the claim this page started by distrusting. The gate is only as strong as the authoring discipline that writes gates.

Fail-closed produces false negatives. A correct piece of work whose check uses a command the WASM shell can't run will FAIL honestly. We chose that over the alternative — a check that silently escapes to native sh — but the cost is real: sometimes the work is fine and the check isn't.

Diagnostics carry scope, not line numbers. Rung one names the component that's wrong; it does not point a cursor at the offending line. The parent lesson's phrasing over-reached there, and the precise truth is: a name, and a message.

The kernel validates two things, not twenty. No cycle detection, no “DONE resting on a TODO”, no :DEPENDS:-points-at-nothing. Some of that is by design rather than omission: the runtime's execution model is a tree — nesting plus :ORDERED: — which cannot express a cycle at all. That's correctness by construction, not a checker you can lean on. Where the structure can't hold a bug, there's nothing to detect.

Conformance is a demoed seam, not a universal gate. The check is real and exercised by its demo; presenting it as wired into every build path would overclaim. Treat it as the contract the build should enforce, demoed end-to-end, with universal wiring still in front of it.

One CLI's lint is a stub. The Elixir CLI's wb lint maps to a kernel lint that returns an empty list today; the real checker is validate, reached by the Rust wbx lint. Use wbx lint when you want the diagnostics this page describes.

questions people actually ASK

What happens if I write no check at all?

The task reaches DONE on the agent's word. Acceptance is opt-in per task: a :done-when: command or a #+begin_src sh :check block gates a leaf; with neither, the runner trusts the worker. The check is the difference between a claim and a fact — no check, just a claim.

Can a check run arbitrary commands?

Only commands the WASM shell carries in its registry. Author-supplied check strings never touch native sh — they'd be an exec vector if they did. That's a deliberate constraint, not a limitation to route around: the sandbox is the safety.

Why did my task FAIL when the work looks done?

Almost always fail-closed in action: the check command isn't one the WASM shell knows, or it errored, or it didn't print its success sentinel. An unknown command, a raised exception, a missing file — all read as FAILED, never as “probably fine”. Check that your :done-when: uses a registered command and a path under the preopened workdir.

Does validation catch dependency cycles?

No — and it doesn't need to. The kernel makes two plan checks (dangling input, missing source block), not a cycle detector. The execution model is a tree, so a cycle can't be expressed in the first place. It's correctness by construction rather than by check — which means the absence of a cycle detector is a design answer, not a gap.

Is wbx bundle failing on diagnostics a bug?

No — it's the gate. Bundling refuses to assemble a workbook that doesn't lint clean; the error source has diagnostics — fix them first is the feature working. Run wbx lint, read the named scopes, fix them, and the bundle proceeds.

Where do FAILED results go?

Two places. The headline's state in the plan file becomes FAILED, and a row lands in _telemetry.db's task_events table: run_id, task_id, idx, title, state, output, and ts. So “did this ever pass?” is a SQL query, not an archaeology project.

keep GOING

Validation is the honesty layer under three ideas — start with its parent.

Workflows, the plans: the promise this page mechanizes

Org, the grammar: structure is checkable, prose is not

Agents, the agents: who writes DONE, and why it's gated

The Autopoet: a standing workflow living by these gates