pausing a computer is supposed to be HARD
Everyone selling agent infrastructure sells the same expensive miracle: we can freeze your microVM. Snapshot the guest memory, track the dirty pages, ship gigabytes to object storage, meter the hibernation, and charge for the thaw. The pitch works because the assumption underneath it feels like physics — if an agent's computer is a rented virtual machine, then pausing it really is a heavyweight systems operation, and resuming it really is slow, and somebody really does have to bill you for both.
You've probably internalized that assumption without noticing. "Pause a running sandbox" sounds like a hard problem the way "isolate untrusted code" sounds like a hard problem, and the Nexus lesson already dismantled the second one: isolation here is construction, not a rented perimeter. This page runs the identical move on state. If the thing you're pausing isn't a machine, if its entire durable state is one file and the process holding it is disposable by contract, then suspension stops being a systems feat and becomes what it always should have been. A filing decision.
the five STATES
A sandbox is the running shape of a docked workbook: one WebAssembly instance under one supervised process, whose entire durable state is one SQLite file, and whose whole life is five explicit states (created → active → suspended → frozen → archived), with deleted as the only exit.
Two facts before the diagram. First, the state lives in a registry
row, not in the process — a sandbox can be deep in cold storage and
still be a first-class citizen of the system, because being a citizen
costs one row. Second, transitions are explicit. The engine's
transition/2 validates every hop against a map and answers
{:ok, to} or {:error, :invalid} — there are no
implicit shortcuts. Here is the entire state machine, as shipped, from
runtime/host/lifecycle.ex:
@transitions %{
  created: [:active],
  active: [:suspended, :archived],
  suspended: [:active, :frozen],
  frozen: [:active, :archived],
  archived: [:active, :deleted]
}
Seven lines. That's the lifecycle product, in full. Drawn out:
stateDiagram-v2
    [*] --> created : registered
    created --> active : docked
    active --> suspended : goes idle
    active --> archived : filed away
    suspended --> active : a request lands
    suspended --> frozen : stays idle
    frozen --> active : thaws, straight back
    frozen --> archived : filed away
    archived --> active : revived
    archived --> deleted : the only exit
    deleted --> [*]
Two edges in that map are the ones people get wrong, and both are
deliberate. An active sandbox cannot freeze directly — there is no
active → frozen hop. Freezing is earned by neglect, not
commanded: a sandbox idles into suspended first, and only a suspended
sandbox can go cold. And a frozen sandbox thaws straight to
active — frozen: [:active, :archived] — no re-warming
ladder, no passing back through suspended. Going cold is gradual; coming
back is one step. Even archived is revivable —
archived → active is in the map. The only state you can't
leave is deleted.
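
A minimal sketch of how a validator like transition/2 could sit on top of that map. The module name LifecycleSketch and the function body are assumptions for illustration; only the map and the two return shapes come from the source.

```elixir
defmodule LifecycleSketch do
  # The transition map as quoted above.
  @transitions %{
    created: [:active],
    active: [:suspended, :archived],
    suspended: [:active, :frozen],
    frozen: [:active, :archived],
    archived: [:active, :deleted]
  }

  # Hypothetical body: a hop is valid only if `to` is listed under `from`.
  # :deleted has no key in the map, so every hop out of it is rejected.
  def transition(from, to) do
    if to in Map.get(@transitions, from, []) do
      {:ok, to}
    else
      {:error, :invalid}
    end
  end
end
```

Under this sketch, transition(:active, :frozen) is rejected while transition(:frozen, :active) succeeds, which is the asymmetry the paragraph above describes.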
warm and COLD
Underneath the five names there is really only one distinction, and the engine's own vocabulary admits it. A session is warm when its Instance is live in the BEAM; checking costs one Registry lookup. It is cold when only its VFS file persists. Every state other than active is just cold, filed in a different drawer:
| state | a live process? | where the file is | back to active by |
|---|---|---|---|
| created | no — a registry row only | wherever it was registered | first resume |
| active | yes — a supervised GenServer | open, in the live dir | already there |
| suspended | no | the live dir — the warm cache | one resume — the file is local |
| frozen | no | cold storage — cold/<id>.sqlite | copy back, clear tmp, start |
| archived | no | cold storage | revive — the map allows it |
Sessions.resume/3 reconciles the two readings in one
function: warm → {:warm, :already_active}, reuse the running
Instance and touch nothing. Cold → resolve the VFS, start the Instance
under the supervisor, flip the registry row to active, return
{:cold, vfs_path}. The VFS resolution is a three-rung
cascade: a local file is the warm cache and wins; a frozen
row with a cold dir restores from cold storage; otherwise the session
starts fresh.
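
The three-rung cascade can be sketched as a single cond. Everything here is an assumption for illustration: the module and function names, the argument list, the tagged return values, and the extra check that the cold file actually exists. The registry state is taken as a string because that is how the demo output prints it.

```elixir
defmodule ResolveSketch do
  # Hypothetical helper: returns which rung matched plus the live path.
  def resolve_vfs(session_id, state, live_dir, cold_dir) do
    live = Path.join(live_dir, "#{session_id}.sqlite")
    cold = cold_dir && Path.join(cold_dir, "#{session_id}.sqlite")

    cond do
      # Rung 1: a local file is the warm cache and wins.
      File.exists?(live) ->
        {:local, live}

      # Rung 2: a frozen row with a cold dir restores from cold storage.
      state == "frozen" and cold != nil and File.exists?(cold) ->
        File.mkdir_p!(live_dir)
        File.cp!(cold, live)
        {:restored, live}

      # Rung 3: nothing to restore, so the session starts fresh.
      true ->
        {:fresh, live}
    end
  end
end
```

Ordering the local check first means a restored session hits rung 1 on its next resume, with no second trip to cold storage.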
Here's the whole machine in one demo, end to end — this is the real
shape of demo_resume in
runtime/host/demos/runtime.ex. Register a session for tenant
acme; resume it twice:
ControlPlane.register("sess-841", "acme", ":memory:")
→ row: (sess-841, acme, 'created', :memory:)
Sessions.resume("sess-841", bytes, policy: :minimal)
→ {:cold, ":memory:"} — VFS resolved · Instance started · state → "active"
Sessions.resume("sess-841", bytes, policy: :minimal)
→ {:warm, :already_active} — the live Instance reused, nothing touched
Sessions.warm?("sess-841") → true
ControlPlane.get("sess-841").state → "active"
One more verb worth knowing: prefetch/2. Prefetch-on-auth
pulls a session's VFS local — restoring from cold storage if needed —
without starting the Instance. The login warms the cache, so the
first real request lands on a local file instead of a cold fetch. The
policy: option, incidentally, is the
capability profile the Instance starts under —
that's its own lesson; here it's just a keyword.
stop the process, KEEP the file
Now the trick itself. Elsewhere, freezing a sandbox is a memory-snapshot
API call you pay for — guest RAM serialized, device state captured,
restore measured in seconds and cents. Here is the shipped implementation
of freeze/3, essentially verbatim:
def freeze(session_id, vfs_path, cold_dir) do
  File.mkdir_p!(cold_dir)
  frozen = Path.join(cold_dir, "#{session_id}.sqlite")
  with :ok <- File.cp(vfs_path, frozen), do: {:ok, frozen}
end
Make a directory. Copy a file. The moduledoc states the philosophy
outright: the SQLite file IS the durable state — freeze is stop the
process, keep the file, no VM snapshot. Resume is the same trick
reversed, plus one act of hygiene — copy the file back to the live dir,
open it, clear the tmp volume, close it:
sequenceDiagram
    participant H as the host
    participant I as the Instance (a BEAM process)
    participant L as live/sess-841.sqlite
    participant C as cold/sess-841.sqlite
    rect rgb(251,250,246)
        Note over H,C: freeze (stop the process, keep the file)
        H->>I: terminate (the process is disposable)
        H->>C: File.cp live → cold
        Note over C: this file is the entire snapshot
    end
    rect rgb(232,245,236)
        Note over H,C: resume (the same trick, reversed)
        H->>L: File.cp cold → live
        H->>L: clear the tmp volume (scratch dies here)
        H->>I: start_instance (registry row → active)
    end
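
A hedged sketch of the round trip, to show how little the reverse direction adds. freeze/3 repeats the shipped shape quoted earlier; restore/4 is hypothetical, and its clear_tmp callback stands in for opening the SQLite file and clearing the tmp volume, which this sketch does not implement.

```elixir
defmodule ThawSketch do
  # Same shape as the freeze/3 quoted above: make a directory, copy a file.
  def freeze(session_id, vfs_path, cold_dir) do
    File.mkdir_p!(cold_dir)
    frozen = Path.join(cold_dir, "#{session_id}.sqlite")
    with :ok <- File.cp(vfs_path, frozen), do: {:ok, frozen}
  end

  # Hypothetical restore: copy cold -> live, then run the hygiene step.
  # `clear_tmp` is a caller-supplied function that must return :ok.
  def restore(session_id, cold_dir, live_dir, clear_tmp) when is_function(clear_tmp, 1) do
    File.mkdir_p!(live_dir)
    frozen = Path.join(cold_dir, "#{session_id}.sqlite")
    live = Path.join(live_dir, "#{session_id}.sqlite")

    with :ok <- File.cp(frozen, live),
         :ok <- clear_tmp.(live) do
      {:ok, live}
    end
  end
end
```

If the cold file is missing, File.cp returns an error tuple and the with expression passes it straight through, so the caller sees {:error, :enoent} instead of a crash.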
The obvious objection: what about everything in memory? The
answer is the contract that makes the whole page possible. Components
here are stateless between calls — re-instantiated at every
Instance start — so there is no precious linear memory to capture. A
component that wants durable state declares :persist, which
is a promise to checkpoint that state to the VFS, where the runtime
actually guarantees durability. The engine's
durable_components/1 walks the recursive plan — worlds and
nested sub-workflows alike — and returns exactly the
components that opted in. In the engine's own words: the VFS is the
orthogonal-persistence layer. A raw-memory snapshot isn't missing.
It doesn't fit a stateless component, and it isn't exposed — a non-goal,
not a gap.
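
To make the recursive walk concrete, here is a sketch over an invented plan shape. The source does not show the real plan structure, so the :components, :children, and :flags keys are assumptions, as is the module name; only the idea of collecting components that declared :persist, through nested plans, comes from the text.

```elixir
defmodule DurableSketch do
  # Hypothetical plan shape:
  #   %{components: [%{name: String.t(), flags: [atom]}], children: [plan]}
  def durable_components(plan) do
    # Components at this level that opted in with :persist.
    own =
      plan
      |> Map.get(:components, [])
      |> Enum.filter(fn c -> :persist in Map.get(c, :flags, []) end)
      |> Enum.map(& &1.name)

    # Recurse into nested plans and append whatever they opted in.
    nested =
      plan
      |> Map.get(:children, [])
      |> Enum.flat_map(&durable_components/1)

    own ++ nested
  end
end
```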
the demotion LADDER
Who decides when a sandbox goes cold? Policy, written as two constants and one pure function. Fifteen minutes idle demotes active to suspended. Twenty-four hours suspended demotes to frozen. Frozen never auto-archives — archiving is a decision, not a decay:
flowchart LR
    a["active<br/>someone, or some schedule, is here"]
    s["suspended<br/>process gone · file still local"]
    f["frozen<br/>file copied to cold storage"]
    ar["archived"]
    a -- "idle ≥ 15 min" --> s
    s -- "idle ≥ 24 h" --> f
    f -. "never automatic, a decision" .-> ar
    style a fill:#13d943,stroke:#121316,stroke-width:2.5px
    style s fill:#aee5c2,stroke:#121316
    style f fill:#a8d4f0,stroke:#121316
    style ar fill:#d9dbd3,stroke:#121316
The implementation is small enough to quote whole — and its demo output is the spec in miniature:
@idle_to_suspend 15 * 60
@idle_to_freeze 24 * 3600
def auto_next(:active, idle) when idle >= @idle_to_suspend, do: :suspended
def auto_next(:suspended, idle) when idle >= @idle_to_freeze, do: :frozen
def auto_next(_state, _idle), do: nil
demo_auto_transitions()
→ %{active_idle_16m: :suspended, active_fresh: nil,
suspended_idle_2d: :frozen, frozen_stays: nil}
What the ladder buys is the economics of the whole model: a thousand docked workbooks can exist while only a handful are warm, because an idle sandbox costs a registry row and a file, not a running VM. And note what counts as activity — a request landing, a schedule firing, an agent doing its rounds. Anything that touches the sandbox resets its idle clock.
One honest precision: auto_next/2 is a pure policy function. The source
says it plainly: the thresholds are the policy; the driver that applies
them on a tick is the session registry's job. The constants are real,
shipped, and demo-proven, but this page can't promise that a periodic
ticker is already walking your idle sessions through them. The honest
framing: demotion is what the system is built to do, at exactly these
thresholds.
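
For illustration only, here is what one step of such a driver might look like if it stayed pure. The due/2 function, the row shape, and the module name are assumptions, not the registry's code. The rows carry the state as an atom for brevity, whereas the registry stores it as text, and idle time is taken as now minus the row's updated stamp.

```elixir
defmodule TickSketch do
  @idle_to_suspend 15 * 60
  @idle_to_freeze 24 * 3600

  # The policy function as quoted above.
  def auto_next(:active, idle) when idle >= @idle_to_suspend, do: :suspended
  def auto_next(:suspended, idle) when idle >= @idle_to_freeze, do: :frozen
  def auto_next(_state, _idle), do: nil

  # Hypothetical driver step: given registry rows and the current epoch
  # second, return {id, next_state} for every row that is due a demotion.
  def due(rows, now) do
    Enum.flat_map(rows, fn %{id: id, state: state, updated: updated} ->
      case auto_next(state, now - updated) do
        nil -> []
        next -> [{id, next}]
      end
    end)
  end
end
```

Keeping the step pure leaves the side effects (stopping the process, copying the file, flipping the row) to whoever owns the tick.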
one table runs the BUILDING
depth rung · skippable — the control plane, for the curious
Something has to remember ten thousand sandboxes' states while most of them are cold. The engine's answer is in the control plane's first line of documentation: SQLite is the control plane; the data plane is each Instance's VFS. Postgres is only needed if we go multi-machine. The schema, verbatim:
CREATE TABLE instances (
  id       TEXT PRIMARY KEY,  -- the session
  tenant   TEXT,              -- whose it is
  state    TEXT,              -- created · active · suspended · frozen · archived
  vfs_path TEXT,              -- where the disk lives
  updated  INTEGER            -- epoch seconds; the idle clock reads this
)

one row: (sess-841, acme, 'active', '/live/sess-841.sqlite', 1765500000)
That row is the whole lifecycle bureaucracy for one sandbox.
register/3 inserts it at 'created';
set_state/2 flips the state and stamps updated;
get/1 and list/0 read it back — and
GET /instances on the control-plane web serves the list as
JSON. A second table, workbooks, stores each workbook's org
source — the deployable artifact itself. The registry is a GenServer in
the root supervision tree, next to the Instance registry and the Instance
supervisor.
Two honest notes. The registry's database path comes from the
WB_REGISTRY environment variable and defaults to
:memory: — out of the box the bookkeeping itself is
in-memory, and a durable registry is one env var away. And this is the
single-machine story by design: going multi-machine swaps the SQLite
registry for Postgres — the same flow, a different registry — which is an
engine-config concern, the CLI and deploy layer's
territory, not this page's.
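
To show how small that surface is, here is an in-memory stand-in built on an Agent instead of SQLite. It is a sketch, not the shipped registry: the module name and the map-based row are assumptions, and it mirrors only the four verbs and five columns named above.

```elixir
defmodule RegistrySketch do
  use Agent

  # One map of id => row plays the part of the instances table.
  def start_link(_opts \\ []), do: Agent.start_link(fn -> %{} end, name: __MODULE__)

  # New rows always start at "created".
  def register(id, tenant, vfs_path) do
    row = %{id: id, tenant: tenant, state: "created", vfs_path: vfs_path, updated: now()}
    Agent.update(__MODULE__, &Map.put(&1, id, row))
  end

  # Flip the state and stamp `updated`. An unknown id raises inside the
  # Agent in this sketch; a real registry would want a gentler answer.
  def set_state(id, state) do
    Agent.update(__MODULE__, fn rows ->
      Map.update!(rows, id, &%{&1 | state: state, updated: now()})
    end)
  end

  def get(id), do: Agent.get(__MODULE__, &Map.get(&1, id))
  def list, do: Agent.get(__MODULE__, &Map.values/1)

  defp now, do: System.os_time(:second)
end
```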
what survives the WINTER
The file being frozen is the VFS — one SQLite store, three named volumes, each with its own survival contract. workspace is the working tree. memory is agent long-term memory. tmp is scratch — and resume clears it, by name, every time:
| volume | freeze → resume | share / egress | what it holds |
|---|---|---|---|
| workspace | survives | ships — the one public volume | the working tree — the work itself |
| memory | survives | stripped | what the agent learned |
| tmp | cleared on resume | never ships | scratch — disposable by definition |
The proof is a round trip you can run. demo_volumes writes
the same path into all three volumes, freezes, resumes, reads back:
put workspace /note "the workbook files"
put memory /note "what the agent learned"
put tmp /note "scratch"
freeze → resume
→ workspace: {:ok, "the workbook files"}
→ memory: {:ok, "what the agent learned"}
→ tmp_after_resume: :error
Tmp dying is not a bug — Lifecycle.resume literally calls
VFS.clear(conn, "tmp") on restore. Scratch that survived
hibernation wouldn't be scratch. And notice the second column of the
table: freeze keeps more than sharing does. Cold storage is
yours, so the agent's memory rides along; egress is for
others, so only the workspace ships —
public_volumes() is exactly ["workspace"].
Freezing and sharing both move the same file, with
different strip rules, because they answer to different audiences.
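
The two strip rules fit in a few lines if the VFS is modeled as a plain map of volume name to contents. That model, the module name, and the helper functions are assumptions for illustration; the real store is SQLite, and only public_volumes() and the clear-tmp-on-resume rule come from the source.

```elixir
defmodule VolumeSketch do
  # The one volume that is allowed to leave the building.
  def public_volumes, do: ["workspace"]

  # Resume hygiene: workspace and memory survive, tmp comes back empty.
  def after_resume(volumes), do: Map.put(volumes, "tmp", %{})

  # Egress: only public volumes ship; memory and tmp are stripped.
  def for_egress(volumes), do: Map.take(volumes, public_volumes())

  # Read one path from one volume, answering {:ok, value} or :error.
  def get(volumes, volume, path) do
    volumes |> Map.get(volume, %{}) |> Map.fetch(path)
  end
end
```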
one template, many TENANTS
depth rung · skippable — the base-image model, for the curious
Once state-is-a-file has paid for freeze, it keeps paying. Multi-tenant
provisioning — elsewhere a golden-image pipeline — is the same file copy
pointed forwards. clone_for/3 copies a read-only base VFS, a
seeded Blueprint, into a fresh writable per-tenant file:
flowchart TD
    base["base.sqlite: the seeded template<br/>read-only · never mutated"]
    a["tenant-acme-512.sqlite"]
    b["tenant-globex-513.sqlite"]
    base -- "File.cp" --> a
    base -- "File.cp" --> b
    style base fill:#f2ddb0,stroke:#121316,stroke-width:2.5px
    style a fill:#aee5c2,stroke:#121316
    style b fill:#aee5c2,stroke:#121316
The demo seeds a base with /seed, clones it for tenants
acme and globex, and lets each write its own /own. The
checks come back exactly as isolation demands: both tenants inherit the
seed; neither sees the other's write; the base is untouched. Freeze is
cp pointed backwards — current state into cold storage.
Clone is cp pointed forwards — template into a tenant's
future. One primitive, both directions.
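
A sketch of the forward copy, assuming an argument order the source does not show. The name clone_for/3 is the engine's; the parameters (base path, tenant, destination directory), the simplified file naming, and the module name are guesses made for illustration.

```elixir
defmodule CloneSketch do
  # Hypothetical signature: copy the read-only base into a fresh,
  # writable per-tenant file and return its path.
  def clone_for(base_path, tenant, dest_dir) do
    File.mkdir_p!(dest_dir)
    dest = Path.join(dest_dir, "tenant-#{tenant}.sqlite")
    with :ok <- File.cp(base_path, dest), do: {:ok, dest}
  end
end
```

Because each tenant gets its own copy, a write to one tenant's file cannot reach the base or any sibling, which is the property the demo checks.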
two clocks of DURABILITY
depth rung · skippable — backups vs filing, for the curious
Freeze is a filing decision, not a backup — it captures the file at a lifecycle moment, and between moments a crash would still cost you the gap. So durability here runs on two clocks at once:
file://… local · s3://bucket/path prod: the same command, only the URL and creds differ.

Mechanically: the database runs in WAL mode (required, one pragma),
and an external litestream binary holds a long-lived port
per replicated database, streaming write-ahead-log changes to the replica
out of band. The engine never blocks on it, and the replication doesn't
care what lifecycle state the sandbox is in: a suspended sandbox's last
writes are as replicated as an active one's. Restore is a single one-shot
command from replica to file. Two clocks, one file, and neither mechanism
needed to know about the other.
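
As a rough sketch of the moving parts, the helpers below only build strings and argument lists; they do not start anything. The module and function names are hypothetical. The pragma is standard SQLite, and the argument lists follow litestream's documented replicate and restore subcommands as best understood here, so check them against the version you run.

```elixir
defmodule ReplicaSketch do
  # The one pragma the database needs before replication can work.
  def wal_pragma, do: "PRAGMA journal_mode=WAL;"

  # Arguments for the long-running replication process.
  def replicate_args(db_path, replica_url), do: ["replicate", db_path, replica_url]

  # Arguments for the one-shot restore from replica to file.
  def restore_args(db_path, replica_url), do: ["restore", "-o", db_path, replica_url]
end
```

Swapping a local replica for a production bucket changes only the replica_url argument, which matches the "same command, only the URL and creds differ" note above.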
what this ISN'T
Honesty section. There is no linear-memory snapshot, and not as
a missing feature. A component that keeps state in WebAssembly memory and
never declares :persist loses that state at suspend. That's
the contract: durable state belongs in the VFS, and the system is honest
enough to make the alternative impossible rather than unreliable.

The idle tick is policy-complete, driver-pending. The thresholds
are shipped constants and auto_next/2 is demo-proven, but
applying them on a schedule is the session registry's job. Don't read
this page as a promise that a ticker is demoting your sessions tonight.

The registry defaults to memory. WB_REGISTRY unset
means the control plane's own bookkeeping is in-memory. A durable
registry is one env var, but it's your env var.

This is the single-machine flow. Multi-machine swaps the SQLite
registry for Postgres (same flow, different registry) and isn't covered
here.

archived has a place in the map and the registry, and no
dedicated machinery beyond them: no archive/2 helper exists
yet.

And none of this applies to an undocked
workbook file at all: a file sitting in your
repo has no lifecycle to manage, because there is no process to stop.
The state machine begins at docking.
questions people actually ASK
If I close my laptop, is my agent's work gone?
Your laptop was never the question — a docked sandbox lives on the engine. If it idles, the policy's worst case is demotion down the ladder: suspended, then frozen. Both are just the file in a different drawer; workspace and memory ride along intact, and resume brings it straight back to active.
What's actually different between suspended and frozen?
In both, the process is gone and only the file remains. Suspended keeps the file in the live directory — the warm cache, so resume is immediate. Frozen has copied it to cold storage, so resume is one copy back plus a tmp clear. That copy is the entire mechanical difference the code shows.
Does resume lose anything?
Exactly one thing, on purpose: the tmp volume, cleared by name on
every restore. Workspace and memory survive — the demo proves the round
trip. And anything a component kept in raw memory without declaring
:persist was never durable to begin with — that's the
contract, not a casualty.
Is freeze a backup?
No — freeze is a filing decision, a point-in-time copy made at a lifecycle moment. The backup is litestream: continuous WAL streaming to a replica, running independently of lifecycle state, restorable in one command. Two clocks, deliberately separate.
Can an archived sandbox come back?
Yes — archived → active is in the transition map.
Archive is a long-term drawer, not a grave. The only terminal state is
deleted, and you have to ask for it explicitly: nothing in the map or
the idle policy ever deletes on its own.
Why can't I freeze an active sandbox directly?
Because the map says no — active can go to suspended or
archived, never straight to frozen. Freezing is earned by neglect:
fifteen idle minutes to suspend, twenty-four suspended hours to freeze.
The thaw, though, is one step — frozen goes directly back to active.
keep GOING
This page is the Nexus's stateful promise, run through time. Its neighbors fill in the rest.