the string that RUNS
Every agent stack eventually arrives at the same dirty moment. The model
decided to run jq, then grep, then something with the
user's filename in it — and somewhere underneath, all of that gets glued into a
single string and handed to a shell. The shell is the most powerful interpreter
on the machine, and you just let a language model write programs for it out of
text it read on the internet.
The industry's answer has been the blocklist: scan the string for
;, for &&, for rm -rf, for
backticks, and refuse the ones that look dangerous. This is an apology, not a
defense. A regex over a shell grammar is a regex over a grammar built to be
re-quoted, base64'd, $IFS-smuggled and nested twelve ways. You're
not winning that fight; you're announcing which round you lost.
So this page asks a different question. Not how do we sanitize the command line — but what if the command line were never a string at all?
the DEFINITION
command (n.) 1. A CLI converted to a runnable WASM module
(stdin in, stdout out) that a workbook or agent invokes
by name through the run-command Dock import. One name, one
content-addressed artifact, one argument convention.
A toolkit is the whole bundle: a progressive-disclosure skill doc packaged with the CLI it documents. The command is what that CLI becomes, its runnable form. The manual tells an agent which commands exist; the commands are the verbs the manual is teaching. A capability need (network, a directory, more memory) isn't a separate layer underneath; it's a property of the command, declared and granted, nothing more.
a PATH with no operating system
There is no /usr/bin here, and no $PATH to scan.
There's a registry — an Elixir module, CommandRegistry — and looking
up a name walks two layers. First a dynamic map (everything agents and toolkits
have registered this session, kept in :persistent_term). Then the
static built-ins, merged last so they always win the lookup.
That merge order is a security decision, not an accident. The built-ins —
upper, jq, grep, wbox — are
reserved and unshadowable. A hostile dynamic registration cannot rebind
jq to its own bytes for every Instance on the engine; the trusted
name resolves to the trusted artifact, always. It's a supply-chain defense
expressed as a sort order.
flowchart LR
call["run-command jq"]
subgraph reg["CommandRegistry — lookup"]
direction TB
dyn["dynamic map
session registrations
:persistent_term · cap 4096"]
bi["built-ins
upper · jq · grep · wbox
RESERVED · merged last"]
end
art["build/commands/<sha>.wasm
content-addressed artifact"]
call --> dyn
dyn -- "overlaid by" --> bi
bi -- "trusted name wins" --> art
style reg fill:#fbfaf6,stroke:#121316
style dyn fill:#f3c5a3,stroke:#121316
style bi fill:#13d943,stroke:#121316,stroke-width:2.5px
style art fill:#f2ddb0,stroke:#121316
style call fill:#ffffff,stroke:#121316
Read the graph as one trip. A call for jq hits the dynamic map
first, but the built-ins are laid over the top, so the reserved jq
is the one that answers — and what it answers with is a path into the
content-addressed store, build/commands/<sha>.wasm, where the
artifact is named by the hash of its own bytes. A few guardrails ring the
registry: names must match ^[A-Za-z0-9_.-]+$, the dynamic map is
capped at 4096 entries so a flood can't exhaust persistent-term memory,
and any registered path must canonicalize strictly inside that one store
directory — a path that points anywhere else is refused before it can run.
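A minimal sketch of that lookup and its guardrails, written in Python for illustration rather than in the engine's Elixir. The name rule, the 4096 cap and the store directory come from the text; the error strings, the built-in file names and the choice to accept a shadowing registration and simply ignore it at lookup time are assumptions made for the example.

```python
import os
import re

NAME_RE = re.compile(r"^[A-Za-z0-9_.-]+$")   # name rule quoted in the text
MAX_DYNAMIC = 4096                            # dynamic-map cap quoted in the text
STORE = os.path.realpath("build/commands")    # the one content-addressed store

# Hypothetical built-in table; the file names are placeholders, not real hashes.
BUILTINS = {name: os.path.join(STORE, f"{name}-builtin.wasm")
            for name in ("upper", "jq", "grep", "wbox")}

dynamic = {}  # session registrations (a stand-in for :persistent_term)


def register(name, path):
    """Add or hot-swap a dynamic binding after the guardrails pass."""
    if not NAME_RE.fullmatch(name):
        raise ValueError("bad_name")
    if name not in dynamic and len(dynamic) >= MAX_DYNAMIC:
        raise ValueError("registry_full")
    real = os.path.realpath(path)
    # The path must canonicalize strictly inside the store directory.
    if real == STORE or os.path.commonpath([real, STORE]) != STORE:
        raise ValueError("outside_store")
    dynamic[name] = real  # re-registering a name replaces the binding in place


def lookup(name):
    """Dynamic map first, built-ins merged last so reserved names always win."""
    merged = {**dynamic, **BUILTINS}
    return merged.get(name)
```

Because the built-ins are the right-hand side of the merge, `register("jq", ...)` with attacker bytes changes nothing that `lookup("jq")` returns; the defense is the merge order itself, not a separate check.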
And names hot-swap. Re-register a name and the binding is replaced in place — no restart, nothing reboots. Calls already in flight finish on the old bytes (the old sha file is never deleted out from under them); the next call observes the new binding. The registry is a living thing, edited while it runs, which is exactly the property the autopoet needs to fix a tool mid-session.
what one invocation COSTS
Invoking a command is not a function call into a shared process. It is a
fresh wasmtime subprocess, born for this one run and gone after it. That is the
:os_process isolation tier, and it comes from how every command is launched,
not from a setting someone could turn off. The envelope it runs inside is
fixed and small:
| limit | constant | what it stops |
|---|---|---|
| wall clock | -W timeout=30000 | a command that hangs is trapped at 30s, not forever |
| fuel | -W fuel=5_000_000_000 | a tight infinite loop burns fuel and traps even if it never blocks |
| stdin size | @max_input_bytes 64MB | an oversized pipe is a clear error, not an OOM |
| argv size | @max_argv_bytes 256KB | used to blow ARG_MAX and fail silently — now a stated error |
| file access | preopened dirs only (--dir host::guest) | no path resolves outside what was explicitly handed in |
The table's verdict in one line: a command can fail, loop, or overflow, and
every one of those is a bounded, named outcome instead of a way out. Two
run-time behaviours are worth saying aloud. First, a non-zero exit is usually
just the guest CLI's normal failure (grep found nothing, jq hit a parse
error), so the output returns verbatim and the status is preserved,
which is what lets a shell && or || downstream
make sense. That's the universal-CLI contract, kept honest. Second, the run uses a
shared wasmtime compile cache, so the second invocation of python
doesn't pay for compilation again.
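The envelope can be pictured as a few checks followed by building an argument list. The sketch below is Python, not the engine's Elixir. The flag spellings are copied from the table above (with the fuel value written without underscores); the function name, the error strings and the reading of "MB"/"KB" as powers of 1024 are assumptions.

```python
MAX_INPUT_BYTES = 64 * 1024 * 1024   # @max_input_bytes in the table (assumed MiB)
MAX_ARGV_BYTES = 256 * 1024          # @max_argv_bytes in the table (assumed KiB)


def build_invocation(wasm_path, args, stdin, preopens):
    """Return the argv list for one fresh wasmtime subprocess, or raise."""
    if len(stdin) > MAX_INPUT_BYTES:
        raise ValueError("input_too_large")      # a stated error, not an OOM
    if sum(len(a.encode()) for a in args) > MAX_ARGV_BYTES:
        raise ValueError("argv_too_large")       # a stated error, not a silent failure
    cmd = ["wasmtime", "run", "-W", "timeout=30000", "-W", "fuel=5000000000"]
    for host, guest in preopens:
        cmd += ["--dir", f"{host}::{guest}"]     # only preopened dirs are reachable
    # A list, never a joined string: each argument stays one discrete element.
    return cmd + [wasm_path, *args]
```

The returned list is what a caller would hand to something like `subprocess.run(cmd, input=stdin)`; since no shell is involved, nothing inside `args` is ever re-parsed.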
the six NO's
Between an agent saying run-command and any wasm starting stands
the ExecBroker — and its first principle is the one most stacks skip:
there is no real OS exec here. Only registered wasm commands run. Every
request walks a deny ladder, in this exact order, and every denial is audited:
sequenceDiagram
participant G as guest (the agent)
participant B as ExecBroker
participant R as CommandRegistry
participant W as wasmtime subprocess
G->>B: host_exec(name, argv, stdin)
Note over B: 1 · revoked? → :revoked
Note over B: 2 · over quota? → :rate_limited (2k/sec floor)
Note over B: 3 · allow == false? → :denied (DEFAULT-DENY)
Note over B: 4 · depth > 8? → :max_depth
Note over B: 5 · in allowlist? → :command_not_granted
B->>R: lookup name
Note over R: 6 · registered? → :unknown_command
R-->>B: artifact path
B->>W: argv + stdin (structural)
W-->>B: stdout · exit status
Note over B: output capped at 8MB
B-->>G: result (or audited deny)
Walk the ladder as a story. First the principal: if this tenant's been
revoked, stop — :revoked. Second the meter: the rate limiter's
default floor is two thousand broker calls per second per tenant; past that,
:rate_limited. Third the default: allow is
false until something grants it — default-deny, :denied.
Fourth the depth: a chain deeper than eight is :max_depth. Fifth
the allowlist: this Instance was granted a specific set of command names, and a
name outside it is :command_not_granted. Sixth and last, after all
that, the registry itself: a name nobody ever registered is
:unknown_command. Six refusals before bytes execute, each one
logged.
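The ladder reads naturally as an ordered list of (condition, reason) pairs where the first true condition wins. This is a Python sketch of that ordering only; the `Request` fields and the tuple return shape are invented for the example, and the audit logging the text describes is reduced to a comment.

```python
from dataclasses import dataclass, field

MAX_DEPTH = 8  # depth bound quoted in the text


@dataclass
class Request:
    tenant_revoked: bool = False
    over_quota: bool = False
    allow: bool = False                 # default-deny until something grants it
    depth: int = 0
    allowlist: set = field(default_factory=set)
    name: str = ""


def check(req, registry):
    """Walk the six refusals in order; return ('ok', path) or ('deny', reason)."""
    ladder = [
        (req.tenant_revoked, "revoked"),
        (req.over_quota, "rate_limited"),
        (not req.allow, "denied"),
        (req.depth > MAX_DEPTH, "max_depth"),
        (req.name not in req.allowlist, "command_not_granted"),
        (req.name not in registry, "unknown_command"),
    ]
    for refused, reason in ladder:
        if refused:
            return ("deny", reason)     # the real broker also audits each denial
    return ("ok", registry[req.name])
```

Order matters: a revoked tenant that is also over quota is reported as revoked, and an unregistered name is only ever reported as unknown to a caller that was allowed to ask for it.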
And one design choice underneath all six: exec is its own
capability, distinct from commands, vfs, or
networking. A profile can grant durable storage or the network without
ever granting the ability to spawn a command — because the ungranted
exec import is never even merged into the guest's import table. An
un-granted cap isn't un-allowed; it's un-importable, which makes it
un-callable.
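One way to picture "un-importable" is an import table assembled only from granted capabilities. The Python below is a toy model: the capability-to-import mapping, the `vfs_read` name and both placeholder functions are hypothetical, and a real WASM host would link imports rather than fill a dictionary.

```python
def host_exec(name, argv, stdin):
    """Placeholder for the broker entry point."""
    return ("ran", name)


def vfs_read(path):
    """Placeholder for a storage import (hypothetical name)."""
    return b""


# Hypothetical mapping from capability name to the host functions it contributes.
CAP_IMPORTS = {
    "exec": {"host_exec": host_exec},
    "vfs": {"vfs_read": vfs_read},
}


def build_import_table(granted):
    """Merge only the imports that belong to granted capabilities."""
    table = {}
    for cap in granted:
        table.update(CAP_IMPORTS.get(cap, {}))
    return table
```

With `build_import_table({"vfs"})` there is no `host_exec` key at all, so there is no code path on which a permission check could be forgotten.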
argv is a list, not a SENTENCE
depth rung · skippable — the wire format, byte by byte
Here is the whole trick, in one diagram. When an agent runs
grep "ada; rm -rf /", the broker never builds a string. It writes a
length-prefixed, little-endian structure and the worst the dangerous part can do
is fail to match:
agent runs:  grep "ada; rm -rf /" < users.txt
wire bytes:  [name_len:u32][name][argc:u32][(arg_len:u32)(arg)]*[stdin_len:u32][stdin]
─────────────────────────────────────────────────────────────────────
[4] g r e p [1] [13] a d a ; ␣ r m ␣ - r f ␣ / [N] <users.txt bytes…>
 │     │     │    │            └── one argument, 13 bytes (␣ marks a space). the ; is byte 0x3B, not syntax.
 │     │     │    └── arg_len, little-endian u32
 │     │     └── argc = 1 argument
 │     └── the command name
 └── name_len, little-endian u32
There is no shell to interpret that semicolon. It crosses the boundary as the
bytes of one discrete argument; wasmtime receives a clean argv list of
exactly two elements (grep and the literal needle); and the most
hostile thing ada; rm -rf / can accomplish is to not appear in
users.txt. The classic injection class isn't mitigated here. It's
unrepresentable — there's no field in this structure where a metacharacter
could mean anything but itself. No delimiter, so no delimiter ambiguity.
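The layout is small enough to encode and decode in a few lines. This Python sketch follows the field order shown in the wire-bytes line; the function names are invented, and it makes no claim about how the broker's Elixir code is organised.

```python
import struct


def encode(name, args, stdin):
    """[name_len][name][argc][(arg_len)(arg)]*[stdin_len][stdin], u32 little-endian."""
    out = struct.pack("<I", len(name)) + name
    out += struct.pack("<I", len(args))
    for arg in args:
        out += struct.pack("<I", len(arg)) + arg
    return out + struct.pack("<I", len(stdin)) + stdin


def decode(buf):
    """Inverse of encode; lengths alone decide where each field ends."""
    pos = 0

    def take_u32():
        nonlocal pos
        (n,) = struct.unpack_from("<I", buf, pos)
        pos += 4
        return n

    def take_bytes():
        nonlocal pos
        n = take_u32()
        chunk = buf[pos:pos + n]
        pos += n
        return chunk

    name = take_bytes()
    args = [take_bytes() for _ in range(take_u32())]
    return name, args, take_bytes()
```

Round-tripping `encode(b"grep", [b"ada; rm -rf /"], data)` through `decode` returns the needle as a single bytes value. The decoder never inspects the content of a field, which is why no byte inside an argument can end that argument early.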
Two argument conventions ride this same wire. Most commands use
:argv, meaning real wasmtime argv, the universal CLI ABI. A couple of
legacy built-ins (jq, grep) use
:stdin1, where the filter or pattern is folded into the first
line of stdin rather than passed as argv, an older protocol the registry
still honors so those tools behave as their authors expect. For those two, the
needle still crosses the wire as one length-delimited argument; it simply
reaches the guest on line one of stdin instead of in argv. Either way, the
argument never becomes shell text.
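The difference between the two conventions fits in one function. This is a Python sketch under a stated assumption: that :stdin1 takes the first argument, appends a newline, and prepends it to the input. The function name and return shape are invented for the example.

```python
def prepare(mode, args, stdin):
    """Return (argv_tail, stdin_bytes) for the given arg mode."""
    if mode == "argv":
        return list(args), stdin
    if mode == "stdin1":
        # Assumption: the first argument is the filter or pattern and becomes
        # line 1 of stdin; the guest reads it back before the real input.
        head = args[0] if args else b""
        return [], head + b"\n" + stdin
    raise ValueError("unknown mode")
```

One consequence of this framing is that a pattern containing a newline would be ambiguous in :stdin1 mode, which is a reasonable guess at why newer commands use :argv instead.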
the shell that ISN'T one
Agents still want to type jq '.users[].name' | grep ada | upper
and have it work. They can, and there is still no shell. Workbooks.Shell
is an Elixir pipeline evaluator: it parses the line, then runs each stage as a
registered WASM command, piping stdout into the next stage's stdin in memory.
No shell process, no shell fork, just one fresh wasmtime subprocess per stage.
flowchart LR
src["jq '.users[].name' | grep ada | upper"]
s1["jq
:stdin1, filter on line 1
fresh wasmtime · fuel 5e9 · 30s"]
s2["grep ada
:stdin1, pattern on line 1
fresh wasmtime"]
s3["upper
:argv, raw stdin
fresh wasmtime"]
out["stdout"]
src --> s1
s1 -- "bytes piped in memory · trim:false" --> s2
s2 -- "byte-exact · no shell fork" --> s3
s3 --> out
style s1 fill:#f3c5a3,stroke:#121316
style s2 fill:#f2ddb0,stroke:#121316
style s3 fill:#aee5c2,stroke:#121316
style out fill:#13d943,stroke:#121316,stroke-width:2.5px
style src fill:#ffffff,stroke:#121316
Follow that pipeline once, because it's the worked example that proves the
whole construct. The shell splits the line quote-aware, so jq's
inner | inside its quoted filter stays inside its stage and never
reads as a pipe. Stage one is jq in :stdin1 mode, so
the filter becomes the first line of stdin ahead of the JSON. It runs in a fresh
wasmtime process — fuel five billion, thirty-second wall clock — and its stdout
is piped on byte-exact, with trailing newlines untouched, because trimming
a \n here would make a downstream wc -l lie. Stage two
is grep, pattern on line one. Stage three is upper,
argless, raw stdin. Three wasmtime processes, zero shell forks, and the exit
status of the last stage is what drives any && that follows.
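The two mechanics in that walk-through, quote-aware splitting and byte-exact piping, can be sketched in Python. The stages here are plain Python callables standing in for wasmtime runs. The splitter handles only single and double quotes and a single `|` operator (no backslash escapes, no `||`), which is a simplification and not a description of Workbooks.Shell's parser.

```python
def split_pipeline(line):
    """Split on | only when it sits outside single or double quotes."""
    stages, current, quote = [], [], None
    for ch in line:
        if quote:
            if ch == quote:
                quote = None        # closing quote ends the quoted run
            current.append(ch)
        elif ch in "'\"":
            quote = ch              # opening quote: | is literal until it closes
            current.append(ch)
        elif ch == "|":
            stages.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    stages.append("".join(current).strip())
    return stages


def run_pipeline(stages, stdin):
    """Feed each stage's stdout to the next stage's stdin, bytes untouched."""
    data, status = stdin, 0
    for stage in stages:
        data, status = stage(data)  # each stage returns (stdout, exit_status)
    return data, status             # status of the last stage
```

Given `jq '.users[] | .name' | grep ada | upper`, the splitter yields three stages and leaves the pipe inside the quoted filter alone. `run_pipeline` passes bytes along without trimming, so a trailing newline reaches the next stage exactly as it was written.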
The shell speaks a deliberate subset: ;, &&,
|| with real short-circuiting on exit status; NAME=value
variables with $NAME / ${NAME} expansion (an unknown
name is left literal, so jq's own $var
survives); and <, >, >>
redirection confined to preopened directories — a path outside the sandbox is a
clean {:outside_sandbox, file} error. Two tokens are intentional
no-ops, not errors: 2>/dev/null and 2>&1 are
swallowed silently, because choking on them was once the single biggest source
of per-run thrash. And wbox, one multicall binary compiled from C,
supplies the everyday vocabulary as applets: cat echo seq head wc nl rev
basename dirname tr sort uniq tail true false, dispatched as
wbox <applet>. Bash ergonomics, zero shell forks.
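Two of those rules are easy to get subtly wrong, so here is a Python sketch of each: variable expansion that leaves unknown names literal, and short-circuiting on exit status. Both functions are illustrative; the regex, the `(op, thunk)` step shape and the decision to ignore quoting during expansion are assumptions, not the evaluator's actual design.

```python
import re

VAR_RE = re.compile(r"\$(?:\{([A-Za-z_][A-Za-z0-9_]*)\}|([A-Za-z_][A-Za-z0-9_]*))")


def expand(text, env):
    """Replace $NAME / ${NAME} when NAME is set; leave unknown names literal."""
    def repl(m):
        name = m.group(1) or m.group(2)
        return env[name] if name in env else m.group(0)
    return VAR_RE.sub(repl, text)


def run_chain(steps):
    """steps: [(op, thunk)], op is ';', '&&' or '||'; thunk() returns an exit status."""
    status = 0
    for op, thunk in steps:
        if (op == "&&" and status != 0) or (op == "||" and status == 0):
            continue                # short-circuit: skip the step, keep the status
        status = thunk()
    return status
```

With `WHO=ada` set, `expand("grep $WHO", env)` substitutes the value, while `expand("jq '.x == $v'", env)` returns its input unchanged because `v` is not a shell variable. The flip side is that a session variable that happens to share a name with a jq variable would be substituted in this sketch, since it does not look at quotes.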
sixty-some programs, each a row of DATA
None of this would matter if the command set were a toy. It isn't. Around sixty-four real programs are seeded and live, each one proved by actually running it under wasmtime — and each one is, fundamentally, just data: a name, a URL, a sha256 pin, a wasm path, an arg mode. The lane is the registry mechanism; the tool is the row.
| lane | trust gate | examples (a sample of ~64) |
|---|---|---|
| A — prebuilt fetch | sha256 pin (inert bytes until run) | python (CPython 3.12), ruby, php, pandoc, openssl, sqlite3, prolog, wabt → 12 tools (wat2wasm, wasm2wat, wasm-validate…) |
| B — C source built in-sandbox | sha-pinned tarball → clang.wasm | lua, zstd, sodium, pcre2, png, harfbuzz, freetype, qjs (27 names) |
| inline — self-authored | Instance Policy profile caps | Rust (wfreq, rgx), Python tools (yaml, tmpl), JS-npm (pdf, arrow, protobuf) |
The verdict of that table: trust scales with origin, not with effort. A
prebuilt is trusted exactly as far as its hash, which is why a sha pin is the
whole gate — a .wasm someone hands you is inert until it runs in the
sandbox. A C-source tool is trusted by the pinned tarball it was built from,
compiled by a wasm clang with no native toolchain anywhere. And an inline
command is trusted by the Instance profile that built it — it can hold no
capability its Instance didn't already have. Whole interpreters arrive as four
lines of data each; the WABT suite arrives as one tarball that fans out into a
dozen commands.
Here's one real row, verbatim — CPython, in the sandbox, as four lines:
%{name: "python", kind: :wasm,
url: "https://github.com/vmware-labs/.../python-3.12.0.wasm",
sha: "e5dc5a398b07b54ea8fdb503bf68fb583d533f10ec3f930963e02b9505f7a763",
mode: :argv}
Seeding happens at boot, inside a background Task, so fetching the catalog never blocks the engine from starting. By the time an agent asks, the rows are already in the registry.
bytes that prove THEMSELVES
depth rung · skippable — the supply chain, end to end
A name resolving to bytes is only safe if the bytes can't lie. So every lane funnels through the same content-addressing spine, and the artifact is re-checked at the last possible moment:
flowchart LR
src["fetch (Erlang TLS, no curl)
or build (clang.wasm)"]
pin["sha256 the bytes"]
ca["write build/commands/<sha>.wasm
idempotent · same bytes, same path"]
reg["register name → path
+ manifest registry.json"]
run["RUN: re-hash the file
mismatch → :artifact_integrity"]
src --> pin --> ca --> reg --> run
run -- "match" --> ok["wasmtime executes"]
style pin fill:#f3c5a3,stroke:#121316
style ca fill:#f2ddb0,stroke:#121316
style run fill:#13d943,stroke:#121316,stroke-width:2.5px
style ok fill:#aee5c2,stroke:#121316
Trace the spine. Bytes arrive — fetched over a pure-Erlang TLS GET that
verifies against the OS trust store (there is no curl binary
anywhere; a prebuilt fetch is Erlang and TLS, full stop), or built in-sandbox by
a wasm compiler. They're sha256'd, and that hash is the filename:
build/commands/<sha>.wasm. The same bytes always land on the
same path, so registration is idempotent. The name-to-path binding is recorded
in a manifest, registry.json, so a reboot re-registers everything
with no re-fetch and no rebuild. And then the closure that matters: at run
time, the content-addressed file is re-hashed and refused on mismatch —
{:artifact_integrity, path} — which closes the time-of-check /
time-of-use gap that a path-based store would leave wide open.
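The spine reduces to two operations: store bytes under their own hash, and re-hash before use. A Python sketch using only the standard library follows; the function names and the exception shape are invented, and the real store also writes a manifest that this example leaves out.

```python
import hashlib
import os


def store(store_dir, data):
    """Write bytes to <store_dir>/<sha256>.wasm; same bytes, same path."""
    os.makedirs(store_dir, exist_ok=True)
    digest = hashlib.sha256(data).hexdigest()
    path = os.path.join(store_dir, digest + ".wasm")
    if not os.path.exists(path):        # idempotent: a second store is a no-op
        with open(path, "wb") as f:
            f.write(data)
    return path


def load_verified(path):
    """Re-hash at run time; refuse if the bytes no longer match the file name."""
    with open(path, "rb") as f:
        data = f.read()
    expected = os.path.basename(path)[: -len(".wasm")]
    if hashlib.sha256(data).hexdigest() != expected:
        raise ValueError(("artifact_integrity", path))
    return data
```

In this sketch `load_verified` returns the very bytes it hashed, so a caller that executes those bytes (rather than reopening the path) leaves no window between the check and the use. Whether the engine does exactly that is not stated in the text.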
The same posture extends outward. A workbook bundle
carries its commands as <name>.wasm parts;
Library.install sha-pins the bundle and registers each one —
installing a bundle never widens capabilities. A third-party toolkit
with invalid provenance refuses to build at all. And a toolkit whose declared
CLI_BIN collides with a reserved name is rejected before
compile, because that name is attacker-controlled data and the reserved set wins
there too. The native escape hatches that other stacks lean on —
cargo install, go get, native zig cc —
return explicit :lane_unavailable errors here rather than shelling
out. wasmtime running a .wasm is the architecture. Native binaries
are banned.
the honest EDGES
Where this construct stops, plainly. The shell is a subset, on
purpose: no globbing, no subshells, no command substitution, and stderr isn't
real plumbing — 2>/dev/null is a swallowed no-op, not a
redirection. If you need POSIX shell theatre, this isn't it; if you need a safe
way to chain wasm commands, it is.
The banned native lanes are a choice, not a gap. cargo install,
go get, and native zig cc return
:lane_unavailable rather than quietly shelling out to a real
toolchain — we'd rather hand you an honest error than a hidden escape. The
same honesty applies to depth: the broker's eight-level recursion bound is
plumbed and tested (a depth-99 call returns {:error, :max_depth} in
the suite), but an ordinary plain-wasmtime command has no host_exec
import at all — only Dock-harness artifacts can nest. So we won't claim live deep
command chains beyond what actually ships; the bound is real, the nesting it
bounds is narrow today.
And cold builds are slow. A C-source tool compiled by wasm clang the first time is not instant — but it's content-addressed, so it's a one-time cost; the second run is a hash lookup. Some software, finally, just fights the crossing — not everything that compiles to a native binary compiles cleanly to WASI, and when it doesn't, we say so rather than ship a broken row.
questions people actually ASK
Is this bash?
No — and that's the entire point. There's no shell process and no command line; the "shell" is an Elixir pipeline evaluator and each stage is a separate wasmtime subprocess. Arguments cross as a length-prefixed structural list, so a metacharacter is just a byte in an argument. You get pipeline ergonomics without ever building a string a shell could interpret.
Why did my command not see my file?
A command can only reach preopened directories, mapped in
as host::guest with --dir. If you didn't preopen the
directory the file lives in, there's no path inside the sandbox that resolves
to it — by construction, not by a setting. Grant the directory and it appears.
Can a hostile command hijack jq?
No. The built-in names — jq, grep,
upper, wbox — are reserved and merged last in
the registry, so a dynamic registration can never shadow them. A toolkit that
even declares a CLI_BIN colliding with a reserved name is refused
before it compiles.
What if it infinite-loops?
Two independent traps catch it: a wall-clock timeout of 30 seconds, and a fuel budget of five billion units, spent roughly one per executed instruction, so a loop that never yields still runs out of fuel and traps. There's no version of "spins forever" that survives both.
Can I run my favorite CLI?
Probably — in order of effort: if there's a prebuilt WASI build, pin its
sha and it's four lines of data. If it's C source, point the C-source lane at
a pinned tarball and wasm clang builds it in-sandbox. If neither, write a small
command yourself with build-inline. What you can't do is shell out
to a native binary — that lane returns :lane_unavailable on
purpose.
What's the difference between the commands and exec caps?
commands is the broad ability to work with the command surface;
exec is a dedicated, least-privilege cap that specifically governs
spawning a command. They're separate so a profile can grant storage or
networking without granting the power to run commands — and when
exec isn't granted, its import isn't even bound into the guest, so
it's not a permission to refuse, it's a call that can't be made.
keep GOING
Commands are the muscle of a toolkit — the layers around them are where they get their reach and their limits.