works on my MANIFEST
You just built a toolkit, or imported a stranger's. The parent lesson told you a stranger's toolkit is safe to install: sandboxed, capability-gated, signed. That answers exactly one question: can this hurt me? It does not answer "does it work?", and it certainly doesn't answer the question that actually decides a toolkit's worth: can an agent use it from the manual alone?
Classic software answers all of this with a test suite that travels in the repo and runs in CI. But a toolkit's consumer has no CI, and isn't a programmer who reads your test code. Worse, a toolkit's primary user isn't a person at all — it's an agent reading prose. The manual is the interface. So a passing unit test of the binary proves the wrong thing: it proves the muscle works, not that the instructions teach an agent to flex it.
The missing piece is a quality check shaped like the thing it's checking — one that travels inside the toolkit, runs without a build server, and tests the manual as hard as the machine. That's what this lesson is about.
two GATES
A toolkit must pass two gates: verify, a structural check that the manifest's promises are satisfiable right now; and eval, a behavioral check that runs the toolkit and judges the outcome. Both gates travel inside the toolkit directory, so the author proves them and any consumer can rerun them.
Quality is two different questions, so there are two commands.
wbx toolkit verify asks "is the contract satisfiable?", in short:
does it load. wbx toolkit eval asks "does it do the right
thing?". The split is canon: eval is the third leg beside the skills and
the CLI. The author proves the toolkit does what its skills claim, and a
consumer reruns those same proofs against their own runtime before trusting
it.
| gate | question asked | command | what runs | what can fail |
|---|---|---|---|---|
| verify | is the contract satisfiable? | wbx toolkit verify <id> | a ✓/✗ board, one line per check — no agent, no model | missing files · unbuildable exec · ungrantable caps · bad signature |
| eval | does it do the right thing? | wbx toolkit eval <id> | one case per evals/*.org — an agent runs, a judge scores | no LLM key · wrong result · misleading manual · no task and no eval block |
the structural BOARD
Verify reads the toolkit's manifest.org, parses the
descriptor out of its #+ keywords, and runs four families of
check against it: files present, the exec mode, the declared capabilities,
and — for third-party toolkits — the signature. The output is a literal
board, one ✓ or ✗ per line. Here's a real one:
    $ wbx toolkit verify glyphs
    ✓ manifest.org present
    ✓ skills/overview.org present
    ✓ exec: command — glyphs not yet registered, build descriptor present (run `wb toolkit build`)
    ✓ caps declared, all grantable: vfs net
    ✓ granted by profile(s): network, posix
    ✓ pre checks DISABLED (2 block(s); native :role bash execution removed — wb-9ja)
Read it top to bottom. The first two lines are presence: a toolkit with
no manifest and no skills/overview.org isn't a toolkit, it's a
folder. The third is the exec check, and it branches on what the
manifest declares its EXEC mode to be — and this is where most red lines
come from.
flowchart TD
m["manifest.org<br/>#+EXEC · #+CAPS · #+BUILD_SRC · #+TRUST"]
m --> files{"files present?"}
files -->|no| xf["✗ manifest / overview missing"]
files -->|yes| exec{"exec mode?"}
exec -->|command| cmd{"CLI_BIN registered<br/>or BUILD_SRC buildable?"}
exec -->|posix| px{"bin on PATH?"}
exec -->|none| disc["✓ discovery-only toolkit"]
cmd -->|yes| okc["✓ runnable / build-ready"]
cmd -->|no| xc["✗ not registered, no buildable source"]
px -->|yes| okp["✓ found on PATH"]
px -->|no| xp["✗ not on PATH"]
exec --> caps{"every cap grantable<br/>AND one profile grants all?"}
caps -->|yes| okcap["✓ granted by profile(s)"]
caps -->|no| xcap["✗ caps not grantable / no profile covers the set"]
caps --> trust{"#+TRUST: third-party?"}
trust -->|first-party| fp["✓ location-trust — skip provenance"]
trust -->|third-party| sig{"AUTHOR_DID + valid SIGNATURE?"}
sig -->|yes| oks["✓ provenance verified"]
sig -->|no| xs["✗ no_signature / bad_signature"]
style m fill:#f3c5a3,stroke:#121316,stroke-width:2.5px
style okc fill:#13d943,stroke:#121316
style okp fill:#13d943,stroke:#121316
style okcap fill:#13d943,stroke:#121316
style oks fill:#13d943,stroke:#121316
style disc fill:#13d943,stroke:#121316
style fp fill:#13d943,stroke:#121316
style xf fill:#f3c5a3,stroke:#121316
style xc fill:#f3c5a3,stroke:#121316
style xp fill:#f3c5a3,stroke:#121316
style xcap fill:#f3c5a3,stroke:#121316
style xs fill:#f3c5a3,stroke:#121316
Walk the exec branch, because the manual reads it line by line. A
toolkit with no EXEC is discovery-only — it ships skills, not a
command — and verify says so and passes. A command toolkit passes if
its CLI_BIN is already in the command registry, or if
it isn't yet but its #+BUILD_SRC is buildable (a crate or a
path the build lane can compile) — in which case the line tells you the
fix: run `wb toolkit build`. A posix toolkit just
checks the binary is on PATH. The failures are verbatim and self-correcting:
    ✗ exec: command — <bin> not registered and no buildable #+BUILD_SRC
    ✗ caps NOT grantable by any profile: …
    ✗ no single Policy profile grants all declared caps (…)
    ✗ third-party provenance: no_signature (sign with `wb toolkit sign <id>`)
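The exec branch walked through above can be sketched as a small decision function. This is a minimal illustration, not the tool's implementation: the `Manifest` dataclass, the `registry` set, the `buildable` callback, and the exact message wording are all hypothetical stand-ins, and it assumes the posix branch looks up the same binary name that `CLI_BIN` declares.

```python
import shutil
from dataclasses import dataclass
from typing import Callable, Optional, Set, Tuple


@dataclass
class Manifest:
    # Hypothetical, simplified view of the descriptor parsed from #+ keywords.
    exec_mode: Optional[str]          # "command", "posix", or None
    cli_bin: Optional[str] = None
    build_src: Optional[str] = None   # a crate or a path


def check_exec(
    m: Manifest,
    registry: Set[str],
    buildable: Callable[[str], bool],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Tuple[bool, str]:
    """Return (ok, message) for the exec family of checks."""
    if m.exec_mode is None:
        # No EXEC: the toolkit ships skills only, and that is a pass.
        return True, "discovery-only toolkit"
    if m.exec_mode == "command":
        if m.cli_bin in registry:
            return True, f"{m.cli_bin} registered"
        if m.build_src and buildable(m.build_src):
            # Not registered yet, but the build lane could compile it.
            return True, f"{m.cli_bin} not yet registered, build descriptor present"
        return False, f"{m.cli_bin} not registered and no buildable #+BUILD_SRC"
    if m.exec_mode == "posix":
        found = m.cli_bin is not None and which(m.cli_bin) is not None
        return (True, "found on PATH") if found else (False, "not on PATH")
    return False, f"unknown exec mode {m.exec_mode!r}"
```

Passing `registry` and `buildable` in as arguments keeps the sketch testable without a real command registry or build lane.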
The last board line is the honest one, and it gets its own section
later: pre checks DISABLED. A toolkit may carry :role
pre setup blocks; verify counts them but never runs them,
because native bash execution was removed from the whole runtime. It says
so out loud rather than printing a green check it didn't earn.
grantable, and grantable TOGETHER
depth rung · skippable — the capability cross-check
The cap check is stricter than it looks, and the strictness is the point. It isn't enough that every capability a toolkit declares is known — it must be grantable, and the whole declared set must be grantable by a single profile. A toolkit is instantiated under one Policy profile; if no single profile grants every capability it imports, the toolkit can never start — it would reach for a capability no profile provides. Verify catches that at the contract stage instead of at a confusing runtime crash, and on success it names the profile or profiles that cover the set.
There are four real profiles. Three form a ladder of memory, time, and reach; the fourth, compute, grants vfs only:
| profile | memory | wall clock | capabilities granted |
|---|---|---|---|
| minimal | 64 MiB | 5 s | vfs commands exec kv secrets queue tcp udp tls |
| network | 128 MiB | 30 s | …minimal + net llm browse |
| posix | 256 MiB | 60 s | …network + posix parallel |
| compute | 64 MiB | 5 s | vfs only |
Read the first three rows as nested reach, not arbitrary buckets; compute
sits apart, granting vfs only. A toolkit that declares vfs net can't run
under minimal, which lacks the net capability, but network and posix both
grant the whole set, which is exactly what the green
granted by profile(s): network, posix line reported above. A
toolkit declaring a capability no profile lists fails the
grantable half; one declaring two capabilities that no single
profile holds together fails the grantable together half. Both are
caught before a single instruction runs.
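The two halves of the cap check can be sketched with plain set arithmetic. The capability sets below are copied from the table above; the function name, the `profiles` parameter, and the returned message strings are illustrative assumptions rather than the tool's actual code.

```python
# Capability sets transcribed from the profile table.
MINIMAL = {"vfs", "commands", "exec", "kv", "secrets", "queue", "tcp", "udp", "tls"}
NETWORK = MINIMAL | {"net", "llm", "browse"}
POSIX = NETWORK | {"posix", "parallel"}
COMPUTE = {"vfs"}

PROFILES = {
    "minimal": MINIMAL,
    "network": NETWORK,
    "posix": POSIX,
    "compute": COMPUTE,
}


def check_caps(declared, profiles=PROFILES):
    """Return (ok, message): every cap grantable AND one profile grants all."""
    declared = set(declared)
    known = set().union(*profiles.values())

    # Half one: each declared cap must appear in at least one profile.
    ungrantable = sorted(declared - known)
    if ungrantable:
        return False, "caps NOT grantable by any profile: " + " ".join(ungrantable)

    # Half two: a single profile must hold the whole declared set.
    covering = [name for name, caps in profiles.items() if declared <= caps]
    if not covering:
        return False, "no single Policy profile grants all declared caps"
    return True, "granted by profile(s): " + ", ".join(covering)


print(check_caps({"vfs", "net"})[1])  # granted by profile(s): network, posix
```

With the table as printed, posix is a superset of the other three rows, so the second failure only shows up when the profile set is not fully nested; the `profiles` parameter makes that case easy to try.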
the signature LINE
For a first-party toolkit — one living in your own repo — verify
skips provenance entirely: location-trust is fine for code you control. The
moment a manifest declares #+TRUST: third-party, the rules
change. It must carry an #+AUTHOR_DID and an
#+SIGNATURE: an Ed25519 signature over the manifest with its
signature lines stripped and trimmed. Verify recomputes that canonical body
and checks the signature against the author's key, reporting any of
no_author_did, no_signature,
bad_signature, or bad_signature_encoding.
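The canonical-body step and the four reported statuses can be sketched as follows. Several details here are assumptions, labeled as such: that only `#+SIGNATURE:` lines are stripped, that the signature value is base64-encoded, and that checks run in this order. The Ed25519 check itself is passed in as a callable (`verify_sig`, a hypothetical hook) so the sketch stays within the standard library.

```python
import base64
import binascii


def keyword(manifest_text, name):
    """Return the value of the first '#+NAME:' keyword line, or None."""
    prefix = f"#+{name}:"
    for line in manifest_text.splitlines():
        stripped = line.strip()
        if stripped.upper().startswith(prefix):
            return stripped[len(prefix):].strip()
    return None


def canonical_body(manifest_text):
    """Manifest with signature lines stripped, then trimmed (assumed rule)."""
    kept = [
        line for line in manifest_text.splitlines()
        if not line.strip().upper().startswith("#+SIGNATURE:")
    ]
    return "\n".join(kept).strip().encode("utf-8")


def provenance_status(manifest_text, verify_sig):
    """verify_sig(author_did, signature_bytes, body_bytes) -> bool (hypothetical)."""
    did = keyword(manifest_text, "AUTHOR_DID")
    if not did:
        return "no_author_did"
    sig_text = keyword(manifest_text, "SIGNATURE")
    if not sig_text:
        return "no_signature"
    try:
        # Assumption: the signature is stored as base64 text.
        sig = base64.b64decode(sig_text, validate=True)
    except (binascii.Error, ValueError):
        return "bad_signature_encoding"
    if not verify_sig(did, sig, canonical_body(manifest_text)):
        return "bad_signature"
    return "ok"
```

Because the body is recomputed from the manifest text on every check, changing a single character outside the signature line changes the bytes handed to `verify_sig`.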
But provenance isn't only a verify line — it's a build gate. The
build command refuses outright to build an unsigned or tampered third-party
toolkit: REFUSED — third-party toolkit with invalid provenance.
The supply-chain boundary is real here: the toolkits root is an
unauthenticated, writable directory, so a toolkit dropped into it is
untrusted input until its signature proves who wrote it.
sequenceDiagram
participant A as author
participant T as the toolkit dir
participant C as consumer
A->>T: wb toolkit sign — appends AUTHOR_DID + SIGNATURE
Note over T: manifest now carries Ed25519 over its own body
C->>T: wbx toolkit verify — recompute canonical body, check sig
alt signature valid
C->>T: wb toolkit build — proceeds
else missing / tampered
C--xT: REFUSED — invalid provenance
end
Read the exchange as a chain of custody. The author signs once; the consumer verifies the same bytes and only then is allowed to build. Tamper with one character of the manifest and the canonical body no longer matches the signature, so the build refuses — not by policy you could toggle, but by arithmetic. The full signature story lives in the planned trust sibling; here it's enough that verify reports it and build enforces it.
the toolkit ships its own EXAM
Verify proves the contract is satisfiable. It says nothing about whether
the toolkit does anything. That's eval — and the shape of an eval
is the whole idea. Each case is a single org file under
evals/*.org in the toolkit directory, glob-sorted, with
symlink escapes filtered out. One file, one case. No suite means a plain
message — no eval suite (add evals/*.org) — not a false
pass.
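Case discovery as described above (glob, sort, drop symlink escapes) might look like the sketch below. It assumes "escape" means a path whose resolved target lies outside the toolkit directory; the function name is hypothetical and `Path.is_relative_to` needs Python 3.9 or later.

```python
from pathlib import Path


def discover_cases(toolkit_dir):
    """Return evals/*.org paths in sorted order, skipping symlink escapes."""
    root = Path(toolkit_dir).resolve()
    evals = root / "evals"
    if not evals.is_dir():
        return []
    cases = []
    for path in sorted(evals.glob("*.org")):
        real = path.resolve()
        # Keep only regular files whose real location is inside the toolkit.
        if real.is_file() and real.is_relative_to(root):
            cases.append(path)
    return cases


def describe_suite(toolkit_dir):
    cases = discover_cases(toolkit_dir)
    if not cases:
        # An empty suite is reported as such, never counted as a pass.
        return "no eval suite (add evals/*.org)"
    return f"{len(cases)} case(s): " + ", ".join(p.name for p in cases)
```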
The loop this enables is the point of the lesson. The author writes the cases and proves them green. The consumer reruns the exact same files against their own runtime before trusting the toolkit. The exam travels with the artifact, the way a test suite should — except this one doesn't need you to read code, because the cases are data, not code.
The git toolkit ships the real exemplar suite — two cases, one of each tier. The deterministic one is honestly switched off today (more on that in the honesty section); the agent-and-judge one is live. Here's the live case, whole — this is the actual shipped file:
#+TITLE: git toolkit — agent explains git status (Tier 2: agent + judge)
* the agent explains what git status shows :eval:
:PROPERTIES:
:TASK: In one sentence, explain what the command `git status` shows.
:RUBRIC: The answer says git status reports the state of the working tree
— mentioning at least staged/unstaged changes and/or untracked files.
:MAX_STEPS: 2
:END:
That's the entire test. A headline tagged :eval:, a
:TASK: the agent must complete, a :RUBRIC: the
judge scores against, and a step budget. No assertions, no harness code —
the case dispatches on what it carries: a :TASK: makes it a
Tier 2 agent-and-judge case; a :role eval block instead makes
it Tier 1 deterministic; neither, and it fails with no :role eval
block and no :TASK:.
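The dispatch rule above can be sketched as a tiny parser plus a classifier. This is an illustration under assumptions: the property drawer is read with a simple regex, a wrapped value line is treated as a continuation of the previous property, and a Tier 1 block is recognized by a `#+begin_src` header carrying `:role eval`. The function names are hypothetical.

```python
import re

PROP = re.compile(r"^:([A-Z_]+):\s*(.*)$")
ROLE_EVAL = re.compile(r"^\s*#\+begin_src\b.*:role\s+eval\b", re.I | re.M)


def parse_properties(case_text):
    """Collect :KEY: value pairs from the :PROPERTIES: drawer."""
    props, in_drawer, last = {}, False, None
    for raw in case_text.splitlines():
        line = raw.strip()
        if line == ":PROPERTIES:":
            in_drawer = True
            continue
        if line == ":END:":
            in_drawer = False
            continue
        if not in_drawer:
            continue
        match = PROP.match(line)
        if match:
            last = match.group(1)
            props[last] = match.group(2)
        elif last and line:
            # Assumed: a wrapped line continues the previous value.
            props[last] += " " + line
    return props


def classify(case_text):
    """Dispatch on what the case carries."""
    if parse_properties(case_text).get("TASK"):
        return "tier2"   # agent and judge
    if ROLE_EVAL.search(case_text):
        return "tier1"   # deterministic
    return "error: no :role eval block and no :TASK:"
```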
agent and JUDGE
Here is the centerpiece. The behavioral test for a toolkit whose
interface is its manual is not a unit test of the binary — it's an agent
that has only the manual. When eval runs a Tier 2 case, it spawns a
fresh agent, and the first thing it does is inject the toolkit's own
skills/overview.org into the agent's system prompt. The agent
is told, in effect: use this toolkit to complete the task; state your
final result clearly. Then it hands over the :TASK: and
lets the agent run.
sequenceDiagram
participant H as eval harness
participant A as a fresh agent
participant Tk as toolkit tools
participant J as judge model
H->>A: system = default + injected overview.org
H->>A: task = the case's :TASK: (max_steps default 6)
A->>Tk: reaches for the toolkit's verbs
Tk-->>A: outputs · exit codes · errors
A->>A: done(result) — the only escape hatch
Note over H: harvest telemetry — steps, uniq tools, error count
H->>J: TASK · RUBRIC · TELEMETRY · result (first 4000 chars)
J-->>H: verdict — first line PASS or FAIL · one line reason
Note over H: ✓ name [steps:N tools:a,b errs:N] — reason
Follow the run as a story. The agent gets the task and the injected
manual, and nothing else — its only way to finish is to call
done(result). Its step budget defaults to 6 (the case
can override with :MAX_STEPS:), and its model comes from
WB_EVAL_MODEL. As it works, the harness records every step:
which tools it touched, and how many steps errored. That becomes the
telemetry — steps, the unique tools used,
and an errors count — and it's handed to the judge alongside
the result, so the judge factors execution, not just text. An
agent that produced the right sentence but thrashed through six failed tool
calls reads differently to the judge than one that answered cleanly.
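The telemetry harvest described above reduces a run to three numbers. A minimal sketch, assuming a hypothetical `Step` record with a tool name and an error flag; the separator used when several tools are listed is also an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Step:
    # Hypothetical record of one agent step.
    tool: str
    errored: bool = False


def harvest(steps: List[Step]) -> Dict[str, object]:
    """Summarize a run: step count, unique tools in first-use order, errors."""
    tools: List[str] = []
    for step in steps:
        if step.tool not in tools:
            tools.append(step.tool)
    return {
        "steps": len(steps),
        "tools": tools,
        "errors": sum(1 for step in steps if step.errored),
    }


def telemetry_line(t: Dict[str, object]) -> str:
    """Format the summary the way the judge receives it."""
    return f"steps={t['steps']}, tools=[{', '.join(t['tools'])}], errors={t['errors']}"


clean = harvest([Step("done")])
print(telemetry_line(clean))  # steps=1, tools=[done], errors=0
```

A run that reaches the same answer after several failed calls produces a visibly different line, which is what lets the judge weigh execution alongside the text.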
The toolkit ships its own exam, and the exam tests the
manual as much as the muscle. A toolkit whose
overview.org misleads the agent — names a verb that doesn't
exist, omits the one that does — fails its own evals, because the only
thing the agent knows about the toolkit is what that manual told it. That's
the property no unit test has: it grades your documentation.
what the judge SEES
depth rung · skippable — the judge's exact view, so you can write rubrics that work
The judge is a second model with one job. Its system prompt is verbatim: You are a strict evaluator. Given a TASK, a RUBRIC, and an agent's RESULT, decide if the result satisfies the rubric. Respond with a verdict whose FIRST line is exactly PASS or FAIL, then one short line of reasoning. The user message it receives is a fixed template — and seeing it is what lets you write rubrics that actually hold:
    TASK: <the case's :TASK:>
    RUBRIC: <the case's :RUBRIC:>
    AGENT TELEMETRY: steps=1, tools=[done], errors=0
    AGENT RESULT: <the agent's result, first 4000 chars>
The parse is deliberately blunt: the verdict's first line, upcased, must
start with PASS — anything else is a fail. The reason is the
next line, truncated to 200 characters. If the judge model errors in
transport, the case fails closed. The full property reference for a case:
| property | default | effect |
|---|---|---|
| :TASK: | — (required) | the instruction the agent must complete; its presence makes the case Tier 2 |
| :RUBRIC: | "The result correctly and completely satisfies the task." | the checkable sentence the judge scores against |
| :MAX_STEPS: | 6 | the agent's step budget for this case |
| :EXEC: | false | true/yes/1 grants host-brokered tools (git, publish) — never native exec |
| :SYSTEM: | the default eval prompt | an optional system-prompt override |
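The judge's fixed template and the blunt first-line parse can be sketched together. The function names are hypothetical, and the one-field-per-line layout of the message is an assumption about how the template is laid out; the default rubric, the 4000-character cut, and the 200-character reason limit come from the text above.

```python
DEFAULT_RUBRIC = "The result correctly and completely satisfies the task."


def judge_message(task, rubric, telemetry, result):
    """Build the user message handed to the judge model."""
    return "\n".join([
        f"TASK: {task}",
        f"RUBRIC: {rubric or DEFAULT_RUBRIC}",
        f"AGENT TELEMETRY: {telemetry}",
        f"AGENT RESULT: {result[:4000]}",   # only the first 4000 chars
    ])


def parse_verdict(reply):
    """First line, upcased, must start with PASS; anything else fails."""
    lines = reply.strip().splitlines()
    passed = bool(lines) and lines[0].strip().upper().startswith("PASS")
    # The reason is the next line, truncated to 200 characters.
    reason = lines[1].strip()[:200] if len(lines) > 1 else ""
    return passed, reason


print(parse_verdict("PASS\nNames staged and untracked state."))
# (True, 'Names staged and untracked state.')
```

An empty or off-protocol reply, such as one that opens with "Verdict: PASS", parses as a fail, which matches the fail-closed stance described above.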
The rubric-writing rule falls straight out of that template. A rubric is
a checkable sentence about the result and the trace, not a vibe.
Because the telemetry line lists the tools the agent actually used, a rubric
that names the verbs the agent should reach for is enforceable — the judge
reads the tool names right there in AGENT TELEMETRY. Write
mentions staged and untracked state and the judge can check it;
write is a good answer and you've asked a model for a coin
flip.
the authoring DONE-bar
depth rung · skippable — the check you run while writing, before the runtime gate
Verify and eval are the runtime gates. There's an earlier,
authoring-time done-bar too — the toolkit-forge's verify-toolkit skill, the
last phase of building a toolkit. It's discipline, not a model, and it
checks the things a runtime can't: that every org source block is balanced
(each #+begin_src has its end, with comma-escaped examples not
miscounted), that every [[file:…]] see-also link resolves,
that the manifest indexes every skill exactly once with an
overview.org present, that the drawer's
:ID:/:CLI_BIN:/:STATUS: match the
#+ keywords, and that no file exceeds 800 lines — an
over-budget skill splits into a thick SKILL.org plus
references/.
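Two of those authoring checks, balanced source blocks and the 800-line budget, are mechanical enough to sketch. This is an illustration, not the forge skill itself: the function name and the problem messages are hypothetical. Comma-escaped examples are not miscounted because a line such as `,#+begin_src` starts with a comma, so it never matches the block markers.

```python
def check_skill_file(text, max_lines=800):
    """Return a list of problems; an empty list means both checks pass."""
    problems = []
    lines = text.splitlines()
    if len(lines) > max_lines:
        problems.append(f"{len(lines)} lines exceeds the {max_lines}-line budget")

    open_at = None  # line number of the currently open #+begin_src, if any
    for number, raw in enumerate(lines, start=1):
        line = raw.strip().lower()
        if line.startswith("#+begin_src"):
            if open_at is not None:
                problems.append(
                    f"line {number}: #+begin_src inside block opened at line {open_at}"
                )
            open_at = number
        elif line.startswith("#+end_src"):
            if open_at is None:
                problems.append(f"line {number}: #+end_src without #+begin_src")
            open_at = None
    if open_at is not None:
        problems.append(f"line {open_at}: #+begin_src never closed")
    return problems
```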
The forge's governing rule is the one this whole page is built on:
don't claim verification you didn't do. A recipe that can only be
tested where the binary actually installs gets marked
#+STATUS: experimental rather than asserted as verified. And
its companion pitfall — fixing nothing, reporting everything —
says cheap fixes get fixed in place, not logged as findings. Honesty over a
clean-looking report, at authoring time exactly as at runtime.
the disabled TIER
Honesty section — and on this page it's the differentiator, so it leads rather than hides. There are two tiers of eval, and one of them is switched off on purpose.
Tier 1 is the deterministic tier: a case with a :role
eval bash block and an #+EXPECT: string, which would
pass on exit 0 with the expected substring in stdout. It is disabled.
Native bash execution was removed from the entire runtime, so a Tier 1 case
reports as skipped — DISABLED (native :role bash eval removed —
wb-9ja) — and verify's pre checks DISABLED line is the
same removal showing its face on the structural side. The shipped git
toolkit's version.org case is exactly this: a real, correct
test that today prints a skipped dot rather than a fabricated green. The
lane is meant to return as sandboxed WASM commands, not native
bash. That's the honest future, and the code is the truth, not the older
docs that still describe Tier 1 as live.
So a full run today looks like this — one tier skipped by design, one tier doing the real work:
    $ wbx toolkit eval git
    git evals: 1/1 passed (1 skipped)
    · evals/version.org: DISABLED (native :role bash eval removed — wb-9ja)
    ✓ evals/explain.org [steps:1 tools:done errs:0] — PASS — names staged/unstaged and untracked state.
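The summary line in that run reads as if skipped cases are left out of both the numerator and the denominator: two cases, one skipped, reported as 1/1. A minimal sketch of a tally with that behavior follows; the outcome labels and the wording when nothing is skipped are assumptions.

```python
def summarize(toolkit_id, outcomes):
    """outcomes: one of "pass", "fail", or "skip" per case."""
    passed = sum(1 for outcome in outcomes if outcome == "pass")
    skipped = sum(1 for outcome in outcomes if outcome == "skip")
    ran = len(outcomes) - skipped   # skipped cases are not counted as run
    line = f"{toolkit_id} evals: {passed}/{ran} passed"
    if skipped:
        line += f" ({skipped} skipped)"
    return line


print(summarize("git", ["skip", "pass"]))  # git evals: 1/1 passed (1 skipped)
```

Keeping skips out of the denominator means a disabled tier can neither inflate nor deflate the pass rate.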
The other honest limits. Tier 2 judging is probabilistic — a
strict prompt, a PASS/FAIL-first-line protocol, and telemetry grounding
reduce judge flakiness but don't eliminate it; a borderline result can swing.
It needs an LLM key — no OPENROUTER_API_KEY (or
WB_LLM_KEY) and the case skips rather than lying. The default
models are cheap — agent and judge fall through to
xiaomi/mimo-v2.5 unless you set WB_EVAL_MODEL or
WB_LLM_MODEL. And the parts bin is young: the shipped exemplar
suite is the git toolkit's two cases. None of that is hidden behind a green
check — the gate would rather say DISABLED or SKIPPED out loud than
green-wash a result it didn't earn.
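The key and model fallthrough can be sketched as a lookup over environment variables. The variable names and the default model come from the text above; the precedence between the two key variables, and the reading that `WB_EVAL_MODEL` affects only the agent while `WB_LLM_MODEL` affects both, are assumptions. The function name is hypothetical.

```python
import os

DEFAULT_MODEL = "xiaomi/mimo-v2.5"


def resolve_eval_settings(env=None):
    """Work out the key and models a Tier 2 run would use."""
    env = os.environ if env is None else env
    # Assumed precedence: OPENROUTER_API_KEY first, then WB_LLM_KEY.
    key = env.get("OPENROUTER_API_KEY") or env.get("WB_LLM_KEY") or None
    shared = env.get("WB_LLM_MODEL") or None
    return {
        "key": key,
        "agent_model": env.get("WB_EVAL_MODEL") or shared or DEFAULT_MODEL,
        "judge_model": shared or DEFAULT_MODEL,
        # No key: the case is skipped rather than reported as passed or failed.
        "skip_tier2": key is None,
    }
```

Passing a plain dict as `env` makes it easy to check a configuration without touching the real environment.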
questions people actually ASK
Do evals run automatically when I install a toolkit?
No — they're a command you run, wbx toolkit eval <id>
(or wb dev eval). That's deliberate: the eval is yours to
rerun against your own runtime, on your own key, when you're deciding
whether to trust the toolkit. The author proved it; you re-administer the
same files.
Who pays for the judge tokens, and which model?
You do — eval runs on your LLM key, so no key means the Tier 2 cases
skip rather than run. Agent and judge default to
xiaomi/mimo-v2.5, chosen to be cheap; set
WB_EVAL_MODEL for the agent or WB_LLM_MODEL for
both. A handful of one-sentence tasks costs almost nothing.
Can I eval someone else's toolkit?
Yes — that's the entire point. The cases ship inside the toolkit directory, so you rerun the author's own exam against your runtime before you trust it. A toolkit you can't independently re-verify is a toolkit you have to take on faith, and this design refuses to make you.
What if my toolkit actually needs to run commands?
Set :EXEC: true on the case and the agent is granted
host-brokered tools — git, publish, and the like. What it never gets is
native OS exec: that escape hatch was removed runtime-wide. The agent's
only way out is still done(result), and the host brokers
everything else.
Why did my case fail with "no :role eval block and no :TASK:"?
Because the case carries neither trigger. A case needs a
:TASK: to be a Tier 2 agent-and-judge case, or a :role
eval block to be a Tier 1 deterministic one. With neither, the
harness can't tell what you're testing, so it fails the case rather than
guessing.
Is a PASS from the judge actually trustworthy?
Trustworthy enough to gate on, not infallible. The strict system prompt, the first-line PASS/FAIL protocol, and the telemetry the judge reads alongside the result all narrow the room for a sloppy verdict — and because the cases are plain text you can read, you can audit a surprising PASS by reading the rubric and tightening it. It's a strong probabilistic check, presented as exactly that.
keep GOING
Verification is the gate on all three toolkit layers — start with the parent, then the siblings it leans on.