
two gates EVERY toolkit passes

A toolkit ships with its own exam. Verify asks whether the contract is satisfiable — files present, exec mode real, declared capabilities grantable, signature valid. Eval asks whether the thing actually works — it runs an agent that has only the manual, and a judge scores the result. The author proves the claim; a consumer reruns the same files before trusting it.

the gates · 12 min read
[Illustration: a small inspector at the foot of two towering luminous gateways set in sequence; the first is stamped with a checklist of glowing marks, the second is a great judging eye over an arena where a tiny figure performs a task.]

works on my MANIFEST

You just built a toolkit, or imported a stranger's. The parent lesson told you a stranger's toolkit is safe to install: sandboxed, capability-gated, signed. That answers exactly one question: "can this hurt me?" It does not answer "does it work?", and it certainly doesn't answer the question that actually decides a toolkit's worth: "can an agent use it from the manual alone?"

Classic software answers all of this with a test suite that travels in the repo and runs in CI. But a toolkit's consumer has no CI, and isn't a programmer who reads your test code. Worse, a toolkit's primary user isn't a person at all — it's an agent reading prose. The manual is the interface. So a passing unit test of the binary proves the wrong thing: it proves the muscle works, not that the instructions teach an agent to flex it.

The missing piece is a quality check shaped like the thing it's checking — one that travels inside the toolkit, runs without a build server, and tests the manual as hard as the machine. That's what this lesson is about.

two GATES

ver·i·fi·ca·tion /ˌver·ə·fə·ˈkeɪ·ʃən/ noun

1. the two gates a toolkit must pass: verify — a structural check that the manifest's promises are satisfiable right now; and eval — a behavioral check that runs the toolkit and judges the outcome. Both gates travel inside the toolkit directory, so the author proves them and any consumer can rerun them.

Quality is two different questions, so there are two commands. wbx toolkit verify asks "is the contract satisfiable?", that is, does it load. wbx toolkit eval asks "does it do the right thing?". The split is canon: eval is the third leg beside the skills and the CLI. The author proves the toolkit does what its skills claim, and a consumer reruns those same proofs against their own runtime before trusting it.

| gate | question asked | command | what runs | what can fail |
|---|---|---|---|---|
| verify | is the contract satisfiable? | wbx toolkit verify <id> | a ✓/✗ board, one line per check; no agent, no model | missing files · unbuildable exec · ungrantable caps · bad signature |
| eval | does it do the right thing? | wbx toolkit eval <id> | one case per evals/*.org; an agent runs, a judge scores | no LLM key · wrong result · misleading manual · no task and no eval block |

the structural BOARD

Verify reads the toolkit's manifest.org, parses the descriptor out of its #+ keywords, and runs four families of check against it: files present, the exec mode, the declared capabilities, and — for third-party toolkits — the signature. The output is a literal board, one ✓ or ✗ per line. Here's a real one:

$ wbx toolkit verify glyphs
✓ manifest.org present
✓ skills/overview.org present
✓ exec: command — glyphs not yet registered, build descriptor present (run `wb toolkit build`)
✓ caps declared, all grantable: vfs net
✓ granted by profile(s): network, posix
✓ pre checks DISABLED (2 block(s); native :role bash execution removed — wb-9ja)

Read it top to bottom. The first two lines are presence: a toolkit with no manifest and no skills/overview.org isn't a toolkit, it's a folder. The third is the exec check, and it branches on what the manifest declares its EXEC mode to be — and this is where most red lines come from.

flowchart TD
  m["manifest.org<br/>#+EXEC · #+CAPS · #+BUILD_SRC · #+TRUST"]
  m --> files{"files present?"}
  files -->|no| xf["✗ manifest / overview missing"]
  files -->|yes| exec{"exec mode?"}
  exec -->|command| cmd{"CLI_BIN registered<br/>or BUILD_SRC buildable?"}
  exec -->|posix| px{"bin on PATH?"}
  exec -->|none| disc["✓ discovery-only toolkit"]
  cmd -->|yes| okc["✓ runnable / build-ready"]
  cmd -->|no| xc["✗ not registered, no buildable source"]
  px -->|yes| okp["✓ found on PATH"]
  px -->|no| xp["✗ not on PATH"]
  exec --> caps{"every cap grantable<br/>AND one profile grants all?"}
  caps -->|yes| okcap["✓ granted by profile(s)"]
  caps -->|no| xcap["✗ caps not grantable / no profile covers the set"]
  caps --> trust{"#+TRUST: third-party?"}
  trust -->|first-party| fp["✓ location-trust — skip provenance"]
  trust -->|third-party| sig{"AUTHOR_DID + valid SIGNATURE?"}
  sig -->|yes| oks["✓ provenance verified"]
  sig -->|no| xs["✗ no_signature / bad_signature"]
  style m fill:#f3c5a3,stroke:#121316,stroke-width:2.5px
  style okc fill:#13d943,stroke:#121316
  style okp fill:#13d943,stroke:#121316
  style okcap fill:#13d943,stroke:#121316
  style oks fill:#13d943,stroke:#121316
  style disc fill:#13d943,stroke:#121316
  style fp fill:#13d943,stroke:#121316
  style xf fill:#f3c5a3,stroke:#121316
  style xc fill:#f3c5a3,stroke:#121316
  style xp fill:#f3c5a3,stroke:#121316
  style xcap fill:#f3c5a3,stroke:#121316
  style xs fill:#f3c5a3,stroke:#121316

Walk the exec branch, because the manual reads it line by line. A toolkit with no EXEC is discovery-only — it ships skills, not a command — and verify says so and passes. A command toolkit passes if its CLI_BIN is already in the command registry, or if it isn't yet but its #+BUILD_SRC is buildable (a crate or a path the build lane can compile) — in which case the line tells you the fix: run `wb toolkit build`. A posix toolkit just checks the binary is on PATH. The failures are verbatim and self-correcting:

✗ exec: command — <bin> not registered and no buildable #+BUILD_SRC
✗ caps NOT grantable by any profile: …
✗ no single Policy profile grants all declared caps (…)
✗ third-party provenance: no_signature (sign with `wb toolkit sign <id>`)

The last board line is the honest one, and it gets its own section later: pre checks DISABLED. A toolkit may carry :role pre setup blocks; verify counts them but never runs them, because native bash execution was removed from the whole runtime. It says so out loud rather than printing a green check it didn't earn.

grantable, and grantable TOGETHER

depth rung · skippable — the capability cross-check

The cap check is stricter than it looks, and the strictness is the point. It isn't enough that every capability a toolkit declares is known — it must be grantable, and the whole declared set must be grantable by a single profile. A toolkit is instantiated under one Policy profile; if no single profile grants every capability it imports, the toolkit can never start — it would reach for a capability no profile provides. Verify catches that at the contract stage instead of at a confusing runtime crash, and on success it names the profile or profiles that cover the set.

There are four real profiles, and they're a ladder of memory, time, and reach:

| profile | memory | wall clock | capabilities granted |
|---|---|---|---|
| minimal | 64 MiB | 5 s | vfs commands exec kv secrets queue tcp udp tls |
| network | 128 MiB | 30 s | …minimal + net llm browse |
| posix | 256 MiB | 60 s | …network + posix parallel |
| compute | 64 MiB | 5 s | vfs only |

Read the table as nested reach, not four arbitrary buckets. A toolkit that declares vfs net can't run under minimal — no network there — but network and posix both grant the whole set, which is exactly what the green granted by profile(s): network, posix line reported above. A toolkit declaring a capability no profile lists fails the grantable half; one declaring two capabilities that no single profile holds together fails the grantable together half. Both are caught before a single instruction runs.
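
A minimal sketch of the cross-check, assuming the profile table above is read as "network is minimal plus three caps, posix is network plus two". The dictionary and the function name are illustrative, not the runtime's data structures; the message prefixes are taken from the board lines quoted earlier.

```python
MINIMAL = frozenset("vfs commands exec kv secrets queue tcp udp tls".split())
NETWORK = MINIMAL | frozenset("net llm browse".split())
POSIX = NETWORK | frozenset("posix parallel".split())
COMPUTE = frozenset(["vfs"])

PROFILES = {"minimal": MINIMAL, "network": NETWORK, "posix": POSIX, "compute": COMPUTE}

def check_caps(declared):
    """Return (ok, message): every cap must be known, and one profile must hold them all."""
    declared = set(declared)
    known = set().union(*PROFILES.values())
    unknown = sorted(declared - known)
    if unknown:  # the "grantable" half
        return (False, "caps NOT grantable by any profile: " + " ".join(unknown))
    covering = sorted(name for name, caps in PROFILES.items() if declared <= caps)
    if not covering:  # the "grantable together" half
        return (False, "no single Policy profile grants all declared caps")
    return (True, "granted by profile(s): " + ", ".join(covering))

print(check_caps(["vfs", "net"]))
```

Under this reading of the table, posix is a superset of every other profile, so the "together" failure would only appear if the profile set stopped being nested.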

the signature LINE

For a first-party toolkit — one living in your own repo — verify skips provenance entirely: location-trust is fine for code you control. The moment a manifest declares #+TRUST: third-party, the rules change. It must carry an #+AUTHOR_DID and an #+SIGNATURE: an Ed25519 signature over the manifest with its signature lines stripped and trimmed. Verify recomputes that canonical body and checks the signature against the author's key, reporting any of no_author_did, no_signature, bad_signature, or bad_signature_encoding.
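
The canonical-body step can be sketched in a few lines. Two assumptions are baked in: that "signature lines" means lines beginning with `#+SIGNATURE:` only, and that "trimmed" means stripping surrounding whitespace from the whole body. The DID value is made up. The Ed25519 check itself is left out because it needs a third-party library; the sketch only shows why signing and verifying see the same bytes.

```python
def canonical_body(manifest_text: str) -> bytes:
    """Drop signature lines, trim, and return the bytes a signature would cover."""
    kept = [
        line for line in manifest_text.splitlines()
        if not line.lstrip().upper().startswith("#+SIGNATURE:")  # assumed prefix
    ]
    return "\n".join(kept).strip().encode("utf-8")

unsigned = "#+TITLE: demo\n#+TRUST: third-party\n#+AUTHOR_DID: did:example:alice\n"
signed = unsigned + "#+SIGNATURE: AAAA\n"       # placeholder, not a real signature
tampered = signed.replace("demo", "demo!")

# Appending the signature does not change the body it signs...
print(canonical_body(signed) == canonical_body(unsigned))
# ...but editing any other character does.
print(canonical_body(tampered) == canonical_body(unsigned))
```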

But provenance isn't only a verify line — it's a build gate. The build command refuses outright to build an unsigned or tampered third-party toolkit: REFUSED — third-party toolkit with invalid provenance. The supply-chain boundary is real here: the toolkits root is an unauthenticated, writable directory, so a toolkit dropped into it is untrusted input until its signature proves who wrote it.

sequenceDiagram
  participant A as author
  participant T as the toolkit dir
  participant C as consumer
  A->>T: wb toolkit sign — appends AUTHOR_DID + SIGNATURE
  Note over T: manifest now carries Ed25519 over its own body
  C->>T: wbx toolkit verify — recompute canonical body, check sig
  alt signature valid
    C->>T: wb toolkit build — proceeds
  else missing / tampered
    C--xT: REFUSED — invalid provenance
  end
  

Read the exchange as a chain of custody. The author signs once; the consumer verifies the same bytes and only then is allowed to build. Tamper with one character of the manifest and the canonical body no longer matches the signature, so the build refuses — not by policy you could toggle, but by arithmetic. The full signature story lives in the planned trust sibling; here it's enough that verify reports it and build enforces it.

the toolkit ships its own EXAM

Verify proves the contract is satisfiable. It says nothing about whether the toolkit does anything. That's eval — and the shape of an eval is the whole idea. Each case is a single org file under evals/*.org in the toolkit directory, glob-sorted, with symlink escapes filtered out. One file, one case. No suite means a plain message — no eval suite (add evals/*.org) — not a false pass.
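
Case discovery as described could look roughly like the following. The function name is hypothetical, and "symlink escape" is interpreted here as a file whose resolved path lands outside the toolkit directory.

```python
from pathlib import Path

def discover_cases(toolkit_dir):
    """Return evals/*.org in sorted order, skipping files that resolve outside the toolkit."""
    root = Path(toolkit_dir).resolve()
    evals = root / "evals"
    if not evals.is_dir():
        return []
    cases = []
    for path in sorted(evals.glob("*.org")):
        if path.resolve().is_relative_to(root):  # Path.is_relative_to needs Python 3.9+
            cases.append(path)
    return cases

def report(toolkit_dir):
    cases = discover_cases(toolkit_dir)
    if not cases:
        return "no eval suite (add evals/*.org)"
    return f"{len(cases)} case(s)"
```

An empty list maps to the plain "no eval suite" message rather than to a pass.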

The loop this enables is the point of the lesson. The author writes the cases and proves them green. The consumer reruns the exact same files against their own runtime before trusting the toolkit. The exam travels with the artifact, the way a test suite should — except this one doesn't need you to read code, because the cases are data, not code.

The git toolkit ships the real exemplar suite — two cases, one of each tier. The deterministic one is honestly switched off today (more on that in the honesty section); the agent-and-judge one is live. Here's the live case, whole — this is the actual shipped file:

#+TITLE: git toolkit — agent explains git status (Tier 2: agent + judge)

* the agent explains what git status shows                          :eval:
  :PROPERTIES:
  :TASK:    In one sentence, explain what the command `git status` shows.
  :RUBRIC:  The answer says git status reports the state of the working tree
            — mentioning at least staged/unstaged changes and/or untracked files.
  :MAX_STEPS: 2
  :END:

That's the entire test. A headline tagged :eval:, a :TASK: the agent must complete, a :RUBRIC: the judge scores against, and a step budget. No assertions, no harness code — the case dispatches on what it carries: a :TASK: makes it a Tier 2 agent-and-judge case; a :role eval block instead makes it Tier 1 deterministic; neither, and it fails with no :role eval block and no :TASK:.
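
The dispatch rule reads naturally as a three-way classifier. The sketch below assumes a `:role eval` marker appears on the `#+begin_src` header line, and that a case carrying both triggers is treated as Tier 2; neither detail is confirmed by the text above.

```python
import re

TASK_RE = re.compile(r"^\s*:TASK:\s*\S", re.MULTILINE)
EVAL_BLOCK_RE = re.compile(r"^\s*#\+begin_src\b[^\n]*:role\s+eval\b",
                           re.MULTILINE | re.IGNORECASE)

def classify_case(org_text: str) -> str:
    """Return "tier2", "tier1", or the failure message for a case file."""
    if TASK_RE.search(org_text):
        return "tier2"  # agent-and-judge
    if EVAL_BLOCK_RE.search(org_text):
        return "tier1"  # deterministic (currently disabled)
    return "no :role eval block and no :TASK:"

case = """* the agent explains what git status shows :eval:
  :PROPERTIES:
  :TASK:    In one sentence, explain what the command `git status` shows.
  :MAX_STEPS: 2
  :END:
"""
print(classify_case(case))
```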

agent and JUDGE

Here is the centerpiece. The behavioral test for a toolkit whose interface is its manual is not a unit test of the binary; it's an agent that has only the manual. When eval runs a Tier 2 case, the harness spawns a fresh agent, and the first thing it does is inject the toolkit's own skills/overview.org into the agent's system prompt. The agent is told, in effect: use this toolkit to complete the task; state your final result clearly. Then the harness hands over the :TASK: and lets the agent run.

sequenceDiagram
  participant H as eval harness
  participant A as a fresh agent
  participant Tk as toolkit tools
  participant J as judge model
  H->>A: system = default + injected overview.org
  H->>A: task = the case's :TASK:  (max_steps default 6)
  A->>Tk: reaches for the toolkit's verbs
  Tk-->>A: outputs · exit codes · errors
  A->>A: done(result) — the only escape hatch
  Note over H: harvest telemetry — steps, uniq tools, error count
  H->>J: TASK · RUBRIC · TELEMETRY · result (first 4000 chars)
  J-->>H: verdict — first line PASS or FAIL · one line reason
  Note over H: ✓ name [steps:N tools:a,b errs:N] — reason
  

Follow the run as a story. The agent gets the task and the injected manual, and nothing else; its only way to finish is to call done(result). Its step budget defaults to 6 (the case can override with :MAX_STEPS:), and its model comes from WB_EVAL_MODEL. As it works, the harness records every step: which tools it touched, and how many steps errored. That becomes the telemetry: the step count, the unique tools used, and an error count. It is handed to the judge alongside the result, so the judge factors in execution, not just text. An agent that produced the right sentence but thrashed through six failed tool calls reads differently to the judge than one that answered cleanly.
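
A sketch of the telemetry harvest, assuming each recorded step is a small record with a tool name and an error flag. The record shape and function names are illustrative; the summary format mirrors the bracketed segment in the harness's result line.

```python
def harvest(steps):
    """Collapse a list of step records into steps / unique tools / error count."""
    tools = []
    for step in steps:
        if step["tool"] not in tools:  # keep first-seen order, drop repeats
            tools.append(step["tool"])
    errors = sum(1 for step in steps if step.get("error"))
    return {"steps": len(steps), "tools": tools, "errors": errors}

def summary(telemetry):
    return "[steps:{} tools:{} errs:{}]".format(
        telemetry["steps"], ",".join(telemetry["tools"]), telemetry["errors"]
    )

clean = harvest([{"tool": "done", "error": False}])
thrash = harvest([
    {"tool": "git", "error": True},
    {"tool": "git", "error": True},
    {"tool": "done", "error": False},
])
print(summary(clean))
print(summary(thrash))
```

Both runs might end in the same sentence, but the second summary shows two errored steps, which is the difference the judge gets to see.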

The toolkit ships its own exam, and the exam tests the manual as much as the muscle. A toolkit whose overview.org misleads the agent — names a verb that doesn't exist, omits the one that does — fails its own evals, because the only thing the agent knows about the toolkit is what that manual told it. That's the property no unit test has: it grades your documentation.

what the judge SEES

depth rung · skippable — the judge's exact view, so you can write rubrics that work

The judge is a second model with one job. Its system prompt is verbatim: You are a strict evaluator. Given a TASK, a RUBRIC, and an agent's RESULT, decide if the result satisfies the rubric. Respond with a verdict whose FIRST line is exactly PASS or FAIL, then one short line of reasoning. The user message it receives is a fixed template — and seeing it is what lets you write rubrics that actually hold:

TASK:
<the case's :TASK:>

RUBRIC:
<the case's :RUBRIC:>

AGENT TELEMETRY: steps=1, tools=[done], errors=0

AGENT RESULT:
<the agent's result, first 4000 chars>

The parse is deliberately blunt: the verdict's first line, upcased, must start with PASS — anything else is a fail. The reason is the next line, truncated to 200 characters. If the judge model errors in transport, the case fails closed. The full property reference for a case:

| property | default | effect |
|---|---|---|
| :TASK: | none (required) | the instruction the agent must complete; its presence makes the case Tier 2 |
| :RUBRIC: | "The result correctly and completely satisfies the task." | the checkable sentence the judge scores against |
| :MAX_STEPS: | 6 | the agent's step budget for this case |
| :EXEC: | false | true/yes/1 grants host-brokered tools (git, publish); never native exec |
| :SYSTEM: | the default eval prompt | an optional system-prompt override |
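
The defaults in the property reference can be written down as one resolver. This is a sketch over an already-parsed property dictionary; the key names, the case-insensitive truthiness test, and the use of `None` for "use the default eval prompt" are assumptions.

```python
DEFAULT_RUBRIC = "The result correctly and completely satisfies the task."
TRUTHY = {"true", "yes", "1"}

def case_settings(props: dict) -> dict:
    """Apply the documented defaults to a Tier 2 case's properties."""
    return {
        "task": props["TASK"],  # required; a KeyError here means the case is not Tier 2
        "rubric": props.get("RUBRIC", DEFAULT_RUBRIC),
        "max_steps": int(props.get("MAX_STEPS", 6)),
        "exec": str(props.get("EXEC", "false")).strip().lower() in TRUTHY,
        "system": props.get("SYSTEM"),  # None stands for the default eval prompt
    }

print(case_settings({"TASK": "Explain git status.", "MAX_STEPS": "2"}))
```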

The rubric-writing rule falls straight out of that template. A rubric is a checkable sentence about the result and the trace, not a vibe. Because the telemetry line lists the tools the agent actually used, a rubric that names the verbs the agent should reach for is enforceable: the judge reads the tool names right there in AGENT TELEMETRY. Write "mentions staged and untracked state" and the judge can check it; write "is a good answer" and you've asked a model for a coin flip.
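
The judge's message template and the blunt verdict parse can be sketched together. Function names are hypothetical; the tool separator inside the brackets and the stripping of leading whitespace before reading the first line are assumptions, since the template shows only a single tool.

```python
def judge_message(task, rubric, telemetry, result):
    """Fill the fixed user-message template; the result is cut to 4000 characters."""
    tools = ", ".join(telemetry["tools"])
    return (
        f"TASK:\n{task}\n\n"
        f"RUBRIC:\n{rubric}\n\n"
        f"AGENT TELEMETRY: steps={telemetry['steps']}, tools=[{tools}], "
        f"errors={telemetry['errors']}\n\n"
        f"AGENT RESULT:\n{result[:4000]}"
    )

def parse_verdict(verdict: str):
    """First line, upcased, must start with PASS; the reason is the next line, max 200 chars."""
    lines = verdict.strip().splitlines()
    first = lines[0] if lines else ""
    passed = first.upper().startswith("PASS")
    reason = lines[1][:200] if len(lines) > 1 else ""
    return passed, reason

print(parse_verdict("PASS\nnames staged/unstaged and untracked state."))
print(parse_verdict("The answer is PASS"))
```

A verdict that buries PASS mid-sentence fails, which is the blunt behaviour the parse is meant to have; an empty verdict also fails.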

the authoring DONE-bar

depth rung · skippable — the check you run while writing, before the runtime gate

Verify and eval are the runtime gates. There's an earlier, authoring-time done-bar too — the toolkit-forge's verify-toolkit skill, the last phase of building a toolkit. It's discipline, not a model, and it checks the things a runtime can't: that every org source block is balanced (each #+begin_src has its end, with comma-escaped examples not miscounted), that every [[file:…]] see-also link resolves, that the manifest indexes every skill exactly once with an overview.org present, that the drawer's :ID:/:CLI_BIN:/:STATUS: match the #+ keywords, and that no file exceeds 800 lines — an over-budget skill splits into a thick SKILL.org plus references/.
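
Two of the done-bar checks are mechanical enough to sketch. The forge skill is described as discipline rather than code, so the functions below are purely illustrative; they rely on the fact that a comma-escaped line starts with a comma and so never matches the block delimiters.

```python
def src_blocks_balanced(org_text: str) -> bool:
    """True if every #+begin_src has a matching #+end_src, ignoring comma-escaped lines."""
    inside = False
    for line in org_text.splitlines():
        s = line.strip().lower()
        if s.startswith("#+begin_src"):
            if inside:       # a second open before a close
                return False
            inside = True
        elif s.startswith("#+end_src"):
            if not inside:   # a close with nothing open
                return False
            inside = False
    return not inside

def within_budget(org_text: str, limit: int = 800) -> bool:
    """True if the file stays within the line budget."""
    return len(org_text.splitlines()) <= limit

example = "#+begin_src org\n,#+begin_src sh\necho hi\n,#+end_src\n#+end_src\n"
print(src_blocks_balanced(example))
```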

The forge's governing rule is the one this whole page is built on: don't claim verification you didn't do. A recipe that can only be tested where the binary actually installs gets marked #+STATUS: experimental rather than asserted as verified. And its companion pitfall — fixing nothing, reporting everything — says cheap fixes get fixed in place, not logged as findings. Honesty over a clean-looking report, at authoring time exactly as at runtime.

the disabled TIER

Honesty section — and on this page it's the differentiator, so it leads rather than hides. There are two tiers of eval, and one of them is switched off on purpose.

Tier 1 is the deterministic tier: a case with a :role eval bash block and an #+EXPECT: string, which would pass on exit 0 with the expected substring in stdout. It is disabled. Native bash execution was removed from the entire runtime, so a Tier 1 case reports as skipped — DISABLED (native :role bash eval removed — wb-9ja) — and verify's pre checks DISABLED line is the same removal showing its face on the structural side. The shipped git toolkit's version.org case is exactly this: a real, correct test that today prints a skipped dot rather than a fabricated green. The lane returns as sandboxed WASM commands, not native bash — that's the honest future, and the code is the truth, not the older docs that still describe Tier 1 as live.
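
For reference, the Tier 1 pass rule as stated is a two-part predicate. The sketch below shows only the rule; it does not execute anything, in keeping with the tier being disabled, and the sample output string is made up.

```python
def tier1_passes(exit_code: int, stdout: str, expect: str) -> bool:
    """Pass only on exit 0 with the #+EXPECT: substring present in stdout."""
    return exit_code == 0 and expect in stdout

# Hypothetical captured output; nothing is run here.
print(tier1_passes(0, "git version 2.x", "git version"))
print(tier1_passes(1, "git version 2.x", "git version"))
```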

So a full run today looks like this — one tier skipped by design, one tier doing the real work:

$ wbx toolkit eval git
git evals: 1/1 passed (1 skipped)
  · evals/version.org: DISABLED (native :role bash eval removed — wb-9ja)
  ✓ evals/explain.org [steps:1 tools:done errs:0] — PASS — names staged/unstaged and untracked state.
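
The summary line suggests skipped cases are left out of the denominator: two cases, one skipped, reported as 1/1. A sketch of that tally, with the caveat that the status names are hypothetical and the format for a run with zero skips is a guess:

```python
def tally(toolkit, results):
    """results: list of (path, status) with status in {"pass", "fail", "skip"}."""
    ran = [r for r in results if r[1] != "skip"]
    passed = sum(1 for r in ran if r[1] == "pass")
    skipped = len(results) - len(ran)
    return f"{toolkit} evals: {passed}/{len(ran)} passed ({skipped} skipped)"

print(tally("git", [("evals/version.org", "skip"), ("evals/explain.org", "pass")]))
```

Counting a skip as neither pass nor fail keeps the disabled tier visible without letting it inflate the score.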

The other honest limits. Tier 2 judging is probabilistic — a strict prompt, a PASS/FAIL-first-line protocol, and telemetry grounding reduce judge flakiness but don't eliminate it; a borderline result can swing. It needs an LLM key — no OPENROUTER_API_KEY (or WB_LLM_KEY) and the case skips rather than lying. The default models are cheap — agent and judge fall through to xiaomi/mimo-v2.5 unless you set WB_EVAL_MODEL or WB_LLM_MODEL. And the parts bin is young: the shipped exemplar suite is the git toolkit's two cases. None of that is hidden behind a green check — the gate would rather say DISABLED or SKIPPED out loud than green-wash a result it didn't earn.
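
The key and model fallbacks can be sketched as a lookup over the environment. The precedence shown (WB_EVAL_MODEL for the agent first, then WB_LLM_MODEL, then the default; OPENROUTER_API_KEY before WB_LLM_KEY) is an assumption drawn from the wording here, not a confirmed order.

```python
import os

DEFAULT_MODEL = "xiaomi/mimo-v2.5"

def resolve_eval_env(env=None):
    """Return the key, the agent and judge models, and whether Tier 2 must skip."""
    env = os.environ if env is None else env
    key = env.get("OPENROUTER_API_KEY") or env.get("WB_LLM_KEY")
    agent = env.get("WB_EVAL_MODEL") or env.get("WB_LLM_MODEL") or DEFAULT_MODEL
    judge = env.get("WB_LLM_MODEL") or DEFAULT_MODEL
    return {
        "key": key,
        "agent_model": agent,
        "judge_model": judge,
        "skip_tier2": not key,  # no key: skip rather than report a result
    }

print(resolve_eval_env({}))
```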

questions people actually ASK

Do evals run automatically when I install a toolkit?

No — they're a command you run, wbx toolkit eval <id> (or wb dev eval). That's deliberate: the eval is yours to rerun against your own runtime, on your own key, when you're deciding whether to trust the toolkit. The author proved it; you re-administer the same files.

Who pays for the judge tokens, and which model?

You do — eval runs on your LLM key, so no key means the Tier 2 cases skip rather than run. Agent and judge default to xiaomi/mimo-v2.5, chosen to be cheap; set WB_EVAL_MODEL for the agent or WB_LLM_MODEL for both. A handful of one-sentence tasks costs almost nothing.

Can I eval someone else's toolkit?

Yes — that's the entire point. The cases ship inside the toolkit directory, so you rerun the author's own exam against your runtime before you trust it. A toolkit you can't independently re-verify is a toolkit you have to take on faith, and this design refuses to make you.

What if my toolkit actually needs to run commands?

Set :EXEC: true on the case and the agent is granted host-brokered tools — git, publish, and the like. What it never gets is native OS exec: that escape hatch was removed runtime-wide. The agent's only way out is still done(result), and the host brokers everything else.

Why did my case fail with "no :role eval block and no :TASK:"?

Because the case carries neither trigger. A case needs a :TASK: to be a Tier 2 agent-and-judge case, or a :role eval block to be a Tier 1 deterministic one. With neither, the harness can't tell what you're testing, so it fails the case rather than guessing.

Is a PASS from the judge actually trustworthy?

Trustworthy enough to gate on, not infallible. The strict system prompt, the first-line PASS/FAIL protocol, and the telemetry the judge reads alongside the result all narrow the room for a sloppy verdict — and because the cases are plain text you can read, you can audit a surprising PASS by reading the rubric and tightening it. It's a strong probabilistic check, presented as exactly that.

keep GOING

Verification is the gate on all three toolkit layers — start with the parent, then the siblings it leans on.