sandboxes — suspension is a file operation

pausing a computer is supposed to be HARD

Everyone selling agent infrastructure sells the same expensive miracle: we can freeze your microVM. Snapshot the guest memory, track the dirty pages, ship gigabytes to object storage, meter the hibernation, and charge for the thaw. The pitch works because the assumption underneath it feels like physics — if an agent's computer is a rented virtual machine, then pausing it really is a heavyweight systems operation, and resuming it really is slow, and somebody really does have to bill you for both.

You've probably internalized that assumption without noticing. "Pause a running sandbox" sounds like a hard problem the way "isolate untrusted code" sounds like a hard problem, and the Nexus lesson already dismantled the second one: isolation here is construction, not a rented perimeter. This page runs the identical move on state. If the thing you're pausing isn't a machine, if its entire durable state is one file and the process holding it is disposable by contract, then suspension stops being a systems feat and becomes what it always should have been. A filing decision.

the five STATES

sand·box /ˈsænd·bɑks/ noun

1. the running shape of a docked workbook — one WebAssembly instance under one supervised process, whose entire durable state is one SQLite file, and whose whole life is five explicit states: created → active → suspended → frozen → archived, with deleted as the only exit.

Two facts before the diagram. First, the state lives in a registry row, not in the process — a sandbox can be deep in cold storage and still be a first-class citizen of the system, because being a citizen costs one row. Second, transitions are explicit. The engine's transition/2 validates every hop against a map and answers {:ok, to} or {:error, :invalid} — there are no implicit shortcuts. Here is the entire state machine, as shipped, from runtime/host/lifecycle.ex:

@transitions %{
  created:   [:active],
  active:    [:suspended, :archived],
  suspended: [:active, :frozen],
  frozen:    [:active, :archived],
  archived:  [:active, :deleted]
}

Five entries. That's the lifecycle product, in full. Drawn out:

stateDiagram-v2
  [*] --> created : registered
  created --> active : docked
  active --> suspended : goes idle
  active --> archived : filed away
  suspended --> active : a request lands
  suspended --> frozen : stays idle
  frozen --> active : thaws — straight back
  frozen --> archived : filed away
  archived --> active : revived
  archived --> deleted : the only exit
  deleted --> [*]

Two edges in that map are the ones people get wrong, and both are deliberate. An active sandbox cannot freeze directly — there is no active → frozen hop. Freezing is earned by neglect, not commanded: a sandbox idles into suspended first, and only a suspended sandbox can go cold. And a frozen sandbox thaws straight to active — frozen: [:active, :archived] — no re-warming ladder, no passing back through suspended. Going cold is gradual; coming back is one step. Even archived is revivable — archived → active is in the map. The only state you can't leave is deleted.
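
A minimal sketch of how a validator like transition/2 could sit on top of that map. The map is copied from the page; the module name LifecycleSketch and the body of the function are assumptions, not the shipped source. It only shows that both edges discussed above fall out of a plain lookup:

```elixir
defmodule LifecycleSketch do
  # The transition map, copied from the page.
  @transitions %{
    created: [:active],
    active: [:suspended, :archived],
    suspended: [:active, :frozen],
    frozen: [:active, :archived],
    archived: [:active, :deleted]
  }

  # Assumed body: look up the allowed targets for `from` and accept `to`
  # only if it is listed. States with no entry (such as :deleted) have no
  # allowed targets, so every hop out of them is invalid.
  def transition(from, to) do
    if to in Map.get(@transitions, from, []) do
      {:ok, to}
    else
      {:error, :invalid}
    end
  end
end

IO.inspect(LifecycleSketch.transition(:active, :frozen))  # {:error, :invalid}
IO.inspect(LifecycleSketch.transition(:frozen, :active))  # {:ok, :active}
IO.inspect(LifecycleSketch.transition(:deleted, :active)) # {:error, :invalid}
```

Because deleted has no key in the map, "the only state you can't leave" needs no special case in this sketch: the default empty list does the work.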

warm and COLD

Underneath the five names there is really only one distinction, and the engine's own vocabulary admits it. A session is warm when its Instance is live in the BEAM — checking costs one Registry lookup. It is cold when only its VFS file persists. Every state below active is just cold, filed in a different drawer:

state	a live process?	where the file is	back to active by
created	no — a registry row only	wherever it was registered	first resume
active	yes — a supervised GenServer	open, in the live dir	already there
suspended	no	the live dir — the warm cache	one resume — the file is local
frozen	no	cold storage — `cold/<id>.sqlite`	copy back, clear tmp, start
archived	no	cold storage	revive — the map allows it

Sessions.resume/3 reconciles the two readings in one function: warm → {:warm, :already_active}, reuse the running Instance and touch nothing. Cold → resolve the VFS, start the Instance under the supervisor, flip the registry row to active, return {:cold, vfs_path}. The VFS resolution is a three-rung cascade: a local file is the warm cache and wins; a frozen row with a cold dir restores from cold storage; otherwise the session starts fresh.
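
The three-rung cascade can be read as one cond. What follows is a sketch under stated assumptions: ResolveSketch, resolve/3, the row shape, and the returned tags are all hypothetical, and the registry state is treated as a string because the demo output on this page prints it that way. The real resolution lives inside Sessions.resume/3 and may be shaped differently.

```elixir
defmodule ResolveSketch do
  # Hypothetical helper: decide where a cold session's VFS comes from.
  # `row` mimics a registry row; `cold_dir` is nil when no cold storage
  # is configured.
  def resolve(row, live_path, cold_dir) do
    cond do
      # Rung 1: a local file is the warm cache and wins.
      File.exists?(live_path) ->
        {:local, live_path}

      # Rung 2: a frozen row with a cold dir restores from cold storage.
      row.state == "frozen" and is_binary(cold_dir) ->
        {:restore, Path.join(cold_dir, "#{row.id}.sqlite")}

      # Rung 3: otherwise the session starts fresh.
      true ->
        {:fresh, live_path}
    end
  end
end
```

The order of the rungs is the point: a frozen row whose file is already local never touches cold storage.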

Here's the whole machine in one demo, end to end — this is the real shape of demo_resume in runtime/host/demos/runtime.ex. Register a session for tenant acme; resume it twice:

ControlPlane.register("sess-841", "acme", ":memory:")
   → row: (sess-841, acme, 'created', :memory:)

Sessions.resume("sess-841", bytes, policy: :minimal)
   → {:cold, ":memory:"}     — VFS resolved · Instance started · state → "active"

Sessions.resume("sess-841", bytes, policy: :minimal)
   → {:warm, :already_active} — the live Instance reused, nothing touched

Sessions.warm?("sess-841")            → true
ControlPlane.get("sess-841").state    → "active"

One more verb worth knowing: prefetch/2. Prefetch-on-auth pulls a session's VFS local — restoring from cold storage if needed — without starting the Instance. The login warms the cache, so the first real request lands on a local file instead of a cold fetch. The policy: option, incidentally, is the capability profile the Instance starts under — that's its own lesson; here it's just a keyword.

stop the process, KEEP the file

Now the trick itself. Elsewhere, freezing a sandbox is a memory-snapshot API call you pay for — guest RAM serialized, device state captured, restore measured in seconds and cents. Here is the shipped implementation of freeze/3, essentially verbatim:

def freeze(session_id, vfs_path, cold_dir) do
  File.mkdir_p!(cold_dir)
  frozen = Path.join(cold_dir, "#{session_id}.sqlite")
  with :ok <- File.cp(vfs_path, frozen), do: {:ok, frozen}
end

Make a directory. Copy a file. The moduledoc states the philosophy outright: the SQLite file IS the durable state — freeze is stop the process, keep the file, no VM snapshot. Resume is the same trick reversed, plus one act of hygiene — copy the file back to the live dir, open it, clear the tmp volume, close it:

sequenceDiagram
  participant H as the host
  participant I as the Instance — a BEAM process
  participant L as live/sess-841.sqlite
  participant C as cold/sess-841.sqlite
  rect rgb(251,250,246)
  Note over H,C: freeze — stop the process, keep the file
  H->>I: terminate — the process is disposable
  H->>C: File.cp live → cold
  Note over C: this file is the entire snapshot
  end
  rect rgb(232,245,236)
  Note over H,C: resume — the same trick, reversed
  H->>L: File.cp cold → live
  H->>L: clear the tmp volume — scratch dies here
  H->>I: start_instance — registry row → active
  end
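
The round trip in the diagram can be sketched with nothing but File. freeze/3 below is the function quoted above; thaw/4 is a hypothetical stand-in for the restore half, and its clear_tmp callback only marks where the real code clears the tmp volume inside the SQLite file. A plain text file plays the part of the VFS.

```elixir
defmodule ThawSketch do
  # freeze/3 as quoted on the page.
  def freeze(session_id, vfs_path, cold_dir) do
    File.mkdir_p!(cold_dir)
    frozen = Path.join(cold_dir, "#{session_id}.sqlite")
    with :ok <- File.cp(vfs_path, frozen), do: {:ok, frozen}
  end

  # Hypothetical thaw/4: the reverse copy, then a caller-supplied hygiene
  # step standing in for the tmp-volume clear.
  def thaw(session_id, cold_dir, live_dir, clear_tmp) when is_function(clear_tmp, 1) do
    File.mkdir_p!(live_dir)
    frozen = Path.join(cold_dir, "#{session_id}.sqlite")
    live = Path.join(live_dir, "#{session_id}.sqlite")

    with :ok <- File.cp(frozen, live) do
      clear_tmp.(live)
      {:ok, live}
    end
  end
end

root = Path.join(System.tmp_dir!(), "thaw_sketch_#{System.unique_integer([:positive])}")
live_dir = Path.join(root, "live")
cold_dir = Path.join(root, "cold")
File.mkdir_p!(live_dir)
original = Path.join(live_dir, "sess-841.sqlite")
File.write!(original, "pretend this is a SQLite file")

{:ok, _frozen} = ThawSketch.freeze("sess-841", original, cold_dir)
File.rm!(original)
{:ok, live} = ThawSketch.thaw("sess-841", cold_dir, live_dir, fn _path -> :ok end)
IO.inspect(File.read!(live)) # "pretend this is a SQLite file"
```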

The obvious objection: what about everything in memory? The answer is the contract that makes the whole page possible. Components here are stateless between calls — re-instantiated at every Instance start — so there is no precious linear memory to capture. A component that wants durable state declares :persist, which is a promise to checkpoint that state to the VFS, where the runtime actually guarantees durability. The engine's durable_components/1 walks the recursive plan — worlds and nested sub-workflows alike — and returns exactly the components that opted in. In the engine's own words: the VFS is the orthogonal-persistence layer. A raw-memory snapshot isn't missing. It doesn't fit a stateless component, and it isn't exposed — a non-goal, not a gap.
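
To make the recursive walk concrete, here is a sketch over an invented plan shape. The page does not show the engine's plan structure, so the components and subplans keys, the flags list, and the module name are all assumptions; only the idea (collect the components that opted in, at every nesting level) comes from the text.

```elixir
defmodule DurableSketch do
  # Hypothetical plan shape: a map with a list of components and a list of
  # nested sub-plans. A component opts in by listing :persist in its flags.
  def durable_components(%{components: components, subplans: subplans}) do
    own = Enum.filter(components, fn c -> :persist in c.flags end)
    own ++ Enum.flat_map(subplans, &durable_components/1)
  end
end

plan = %{
  components: [
    %{name: "editor", flags: [:persist]},
    %{name: "formatter", flags: []}
  ],
  subplans: [
    %{components: [%{name: "indexer", flags: [:persist]}], subplans: []}
  ]
}

plan
|> DurableSketch.durable_components()
|> Enum.map(& &1.name)
|> IO.inspect() # ["editor", "indexer"]
```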

the demotion LADDER

Who decides when a sandbox goes cold? Policy, written as two constants and one pure function. Fifteen minutes idle demotes active to suspended. Twenty-four hours suspended demotes to frozen. Frozen never auto-archives — archiving is a decision, not a decay:

flowchart LR
  a["active
someone — or some schedule — is here"]
  s["suspended
process gone · file still local"]
  f["frozen
file copied to cold storage"]
  ar["archived"]
  a -- "idle ≥ 15 min" --> s
  s -- "idle ≥ 24 h" --> f
  f -. "never automatic — a decision" .-> ar
  style a fill:#13d943,stroke:#121316,stroke-width:2.5px
  style s fill:#aee5c2,stroke:#121316
  style f fill:#a8d4f0,stroke:#121316
  style ar fill:#d9dbd3,stroke:#121316

The implementation is small enough to quote whole — and its demo output is the spec in miniature:

@idle_to_suspend 15 * 60
@idle_to_freeze  24 * 3600

def auto_next(:active, idle)    when idle >= @idle_to_suspend, do: :suspended
def auto_next(:suspended, idle) when idle >= @idle_to_freeze,  do: :frozen
def auto_next(_state, _idle), do: nil

demo_auto_transitions()
   → %{active_idle_16m: :suspended,  active_fresh: nil,
        suspended_idle_2d: :frozen,   frozen_stays: nil}

What the ladder buys is the economics of the whole model: a thousand docked workbooks can exist while only a handful are warm, because an idle sandbox costs a registry row and a file, not a running VM. And note what counts as activity — a request landing, a schedule firing, an agent doing its rounds. Anything that touches the sandbox resets its idle clock.

One honest precision: auto_next/2 is a pure policy function. The source says it plainly — the thresholds are the policy; the driver that applies them on a tick is the session registry's job. The constants are real, shipped, and demo-proven; the periodic ticker that walks idle sessions through them is the registry's responsibility, not a clock this page can promise is already running against your sessions. The honest framing: demotion is what the system is built to do, at exactly these thresholds.
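
For a sense of what a driver might do with the policy, here is a sketch of a single tick. auto_next/2 and the two constants are copied from above; due/2, the module name, and the row shape are hypothetical, and states are atoms here for brevity even though the registry stores them as text. The function only reports which demotions are due. It changes nothing and schedules nothing.

```elixir
defmodule TickSketch do
  @idle_to_suspend 15 * 60
  @idle_to_freeze 24 * 3600

  # auto_next/2 as quoted on the page.
  def auto_next(:active, idle) when idle >= @idle_to_suspend, do: :suspended
  def auto_next(:suspended, idle) when idle >= @idle_to_freeze, do: :frozen
  def auto_next(_state, _idle), do: nil

  # Hypothetical driver step: idle time is `now - updated` (epoch seconds).
  # Rows where auto_next/2 returns nil are filtered out.
  def due(rows, now) do
    for %{id: id, state: state, updated: updated} <- rows,
        next = auto_next(state, now - updated),
        do: {id, state, next}
  end
end

now = 1_765_500_000

rows = [
  %{id: "a", state: :active, updated: now - 16 * 60},
  %{id: "b", state: :active, updated: now - 60},
  %{id: "c", state: :suspended, updated: now - 2 * 24 * 3600},
  %{id: "d", state: :frozen, updated: now - 30 * 24 * 3600}
]

IO.inspect(TickSketch.due(rows, now))
# [{"a", :active, :suspended}, {"c", :suspended, :frozen}]
```

Note that the frozen row never appears, however old it is: there is no automatic step past frozen.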

one table runs the BUILDING

depth rung · skippable — the control plane, for the curious

Something has to remember ten thousand sandboxes' states while most of them are cold. The engine's answer is in the control plane's first line of documentation: SQLite is the control plane; the data plane is each Instance's VFS. Postgres is only needed if we go multi-machine. The schema, verbatim:

CREATE TABLE instances (
  id TEXT PRIMARY KEY,   -- the session
  tenant TEXT,           -- whose it is
  state TEXT,            -- created · active · suspended · frozen · archived
  vfs_path TEXT,         -- where the disk lives
  updated INTEGER        -- epoch seconds — the idle clock reads this
)
   one row: (sess-841, acme, 'active', '/live/sess-841.sqlite', 1765500000)

That row is the whole lifecycle bureaucracy for one sandbox. register/3 inserts it at 'created'; set_state/2 flips the state and stamps updated; get/1 and list/0 read it back — and GET /instances on the control-plane web serves the list as JSON. A second table, workbooks, stores each workbook's org source — the deployable artifact itself. The registry is a GenServer in the root supervision tree, next to the Instance registry and the Instance supervisor.

Two honest notes. The registry's database path comes from the WB_REGISTRY environment variable and defaults to :memory: — out of the box the bookkeeping itself is in-memory, and a durable registry is one env var away. And this is the single-machine story by design: going multi-machine swaps the SQLite registry for Postgres — the same flow, a different registry — which is an engine-config concern, the CLI and deploy layer's territory, not this page's.
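
The default is easy to check for yourself. A small sketch, assuming nothing beyond the variable name and default stated above; the helper module and the durable? flag are hypothetical, and the example path is made up:

```elixir
defmodule RegistryPathSketch do
  # Hypothetical helper: read WB_REGISTRY from an environment map, falling
  # back to an in-memory database, and report whether the bookkeeping
  # would survive a restart.
  def resolve(env \\ System.get_env()) do
    path = Map.get(env, "WB_REGISTRY", ":memory:")
    %{path: path, durable?: path != ":memory:"}
  end
end

IO.inspect(RegistryPathSketch.resolve(%{}))
# %{path: ":memory:", durable?: false}

IO.inspect(RegistryPathSketch.resolve(%{"WB_REGISTRY" => "/var/lib/wb/registry.sqlite"}))
# durable? is true for any path other than ":memory:"
```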

what survives the WINTER

The file being frozen is the VFS — one SQLite store, three named volumes, each with its own survival contract. workspace is the working tree. memory is agent long-term memory. tmp is scratch — and resume clears it, by name, every time:

volume	freeze → resume	share / egress	what it holds
workspace	survives	ships — the one public volume	the working tree — the work itself
memory	survives	stripped	what the agent learned
tmp	cleared on resume	never ships	scratch — disposable by definition

The proof is a round trip you can run. demo_volumes writes the same path into all three volumes, freezes, resumes, reads back:

put workspace /note "the workbook files"
put memory    /note "what the agent learned"
put tmp       /note "scratch"
freeze → resume

   → workspace:        {:ok, "the workbook files"}
   → memory:           {:ok, "what the agent learned"}
   → tmp_after_resume: :error

Tmp dying is not a bug — Lifecycle.resume literally calls VFS.clear(conn, "tmp") on restore. Scratch that survived hibernation wouldn't be scratch. And notice the second column of the table: freeze keeps more than sharing does. Cold storage is yours, so the agent's memory rides along; egress is for others, so only the workspace ships — public_volumes() is exactly ["workspace"]. Freezing and sharing both move the same file, with different strip rules, because they answer to different audiences.
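
The two strip rules are small enough to model with plain maps. This is a toy, not the VFS: a map of volume name to contents stands in for the SQLite file, and after_resume/1 and for_egress/1 are hypothetical names. Only the list returned by public_volumes/0 is taken from the text.

```elixir
defmodule StripSketch do
  # Per the page, public_volumes() is exactly ["workspace"].
  def public_volumes, do: ["workspace"]

  # Freeze then resume: everything survives except tmp, which is emptied.
  def after_resume(vfs), do: Map.put(vfs, "tmp", %{})

  # Egress: only the public volumes leave.
  def for_egress(vfs), do: Map.take(vfs, public_volumes())
end

vfs = %{
  "workspace" => %{"/note" => "the workbook files"},
  "memory" => %{"/note" => "what the agent learned"},
  "tmp" => %{"/note" => "scratch"}
}

resumed = StripSketch.after_resume(vfs)
IO.inspect(Map.fetch(resumed["workspace"], "/note")) # {:ok, "the workbook files"}
IO.inspect(Map.fetch(resumed["memory"], "/note"))    # {:ok, "what the agent learned"}
IO.inspect(Map.fetch(resumed["tmp"], "/note"))       # :error
IO.inspect(Map.keys(StripSketch.for_egress(vfs)))    # ["workspace"]
```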

one template, many TENANTS

depth rung · skippable — the base-image model, for the curious

Once state-is-a-file has paid for freeze, it keeps paying. Multi-tenant provisioning — elsewhere a golden-image pipeline — is the same file copy pointed forwards. clone_for/3 copies a read-only base VFS, a seeded Blueprint, into a fresh writable per-tenant file:

flowchart TD
  base["base.sqlite — the seeded template
read-only · never mutated"]
  a["tenant-acme-512.sqlite"]
  b["tenant-globex-513.sqlite"]
  base -- "File.cp" --> a
  base -- "File.cp" --> b
  style base fill:#f2ddb0,stroke:#121316,stroke-width:2.5px
  style a fill:#aee5c2,stroke:#121316
  style b fill:#aee5c2,stroke:#121316

The demo seeds a base with /seed, clones it for tenants acme and globex, and lets each write its own /own. The checks come back exactly as isolation demands: both tenants inherit the seed; neither sees the other's write; the base is untouched. Freeze is cp pointed backwards — current state into cold storage. Clone is cp pointed forwards — template into a tenant's future. One primitive, both directions.
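
A sketch of the same check with ordinary files. The argument order of clone_for/3, the destination file naming, and the module name are assumptions (the page gives only the arity), and text files stand in for SQLite stores. The sketch also does not enforce the base being read-only; it just never writes to it.

```elixir
defmodule CloneSketch do
  # Hypothetical clone_for/3: copy the base file into a fresh per-tenant file.
  def clone_for(base_path, tenant, dest_dir) do
    File.mkdir_p!(dest_dir)
    dest = Path.join(dest_dir, "tenant-#{tenant}.sqlite")
    with :ok <- File.cp(base_path, dest), do: {:ok, dest}
  end
end

root = Path.join(System.tmp_dir!(), "clone_sketch_#{System.unique_integer([:positive])}")
File.mkdir_p!(root)
base = Path.join(root, "base.sqlite")
File.write!(base, "seed")

{:ok, acme} = CloneSketch.clone_for(base, "acme", root)
{:ok, globex} = CloneSketch.clone_for(base, "globex", root)

# One tenant writes; the other tenant and the base must not change.
File.write!(acme, "+acme", [:append])

IO.inspect({File.read!(base), File.read!(acme), File.read!(globex)})
# {"seed", "seed+acme", "seed"}
```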

two clocks of DURABILITY

depth rung · skippable — backups vs filing, for the curious

Freeze is a filing decision, not a backup — it captures the file at a lifecycle moment, and between moments a crash would still cost you the gap. So durability here runs on two clocks at once:

one VFS file · two durability mechanisms

freeze: point-in-time. A lifecycle event copies the file to cold storage; lifecycle-driven.

litestream: continuous. WAL changes stream to a replica as they happen; lifecycle-independent.

the replica URL: file://… local · s3://bucket/path prod. The same command; only the URL and creds differ.

restore: a one-shot command back to a fresh file. The disaster path is one verb.

freeze decides where the file is filed · litestream makes sure it can't be lost

Mechanically: the database runs in WAL mode — required, one pragma — and an external litestream binary holds a long-lived port per replicated database, streaming write-ahead-log changes to the replica out of band. The engine never blocks on it, and the replication doesn't care what lifecycle state the sandbox is in — a suspended sandbox's last writes are as replicated as an active one's. Restore is a single one-shot command from replica to file. Two clocks, one file, and neither mechanism needed to know about the other.
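
A rough sketch of the moving parts, under assumptions worth stating. The pragma is standard SQLite. The argument lists follow the Litestream command line as I understand it (replicate DB_PATH REPLICA_URL, and restore -o OUTPUT REPLICA_URL); check them against the version you run. The module and its function names are hypothetical and are not the engine's wrapper.

```elixir
defmodule LitestreamSketch do
  # The one pragma that puts a SQLite database into WAL mode.
  def wal_pragma, do: "PRAGMA journal_mode=WAL;"

  # Arguments for continuous replication of one database to one replica.
  def replicate_args(db_path, replica_url), do: ["replicate", db_path, replica_url]

  # Arguments for a one-shot restore from a replica into a fresh file.
  def restore_args(out_path, replica_url), do: ["restore", "-o", out_path, replica_url]

  # Hypothetical launcher: hold the replicate process as a long-lived port.
  # Returns an error tuple when the binary is not on PATH.
  def start(db_path, replica_url) do
    case System.find_executable("litestream") do
      nil ->
        {:error, :litestream_not_found}

      exe ->
        opts = [:binary, :exit_status, args: replicate_args(db_path, replica_url)]
        {:ok, Port.open({:spawn_executable, exe}, opts)}
    end
  end
end

IO.inspect(LitestreamSketch.replicate_args("/live/sess-841.sqlite", "s3://bucket/path"))
# ["replicate", "/live/sess-841.sqlite", "s3://bucket/path"]
```

Swapping the replica URL is the only difference between a local trial and production in this sketch, which matches the "same command" claim above.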

what this ISN'T

Honesty section. There is no linear-memory snapshot — and not as a missing feature. A component that keeps state in WebAssembly memory and never declares :persist loses that state at suspend. That's the contract: durable state belongs in the VFS, and the system is honest enough to make the alternative impossible rather than unreliable.

The idle tick is policy-complete, driver-pending. The thresholds are shipped constants and auto_next/2 is demo-proven, but applying them on a schedule is the session registry's job — don't read this page as a promise that a ticker is demoting your sessions tonight.

The registry defaults to memory. WB_REGISTRY unset means the control plane's own bookkeeping is in-memory — durable registry is one env var, but it's your env var.

This is the single-machine flow. Multi-machine swaps the SQLite registry for Postgres — same flow, different registry — and isn't covered here. archived has a place in the map and the registry, and no dedicated machinery beyond them — no archive/2 helper exists yet. And none of this applies to an undocked workbook file at all: a file sitting in your repo has no lifecycle to manage, because there is no process to stop. The state machine begins at docking.

questions people actually ASK

If I close my laptop, is my agent's work gone?

Your laptop was never the question — a docked sandbox lives on the engine. If it idles, the policy's worst case is demotion down the ladder: suspended, then frozen. Both are just the file in a different drawer; workspace and memory ride along intact, and resume brings it straight back to active.

What's actually different between suspended and frozen?

In both, the process is gone and only the file remains. Suspended keeps the file in the live directory — the warm cache, so resume is immediate. Frozen has copied it to cold storage, so resume is one copy back plus a tmp clear. That copy is the entire mechanical difference the code shows.

Does resume lose anything?

Exactly one thing, on purpose: the tmp volume, cleared by name on every restore. Workspace and memory survive — the demo proves the round trip. And anything a component kept in raw memory without declaring :persist was never durable to begin with — that's the contract, not a casualty.

Is freeze a backup?

No — freeze is a filing decision, a point-in-time copy made at a lifecycle moment. The backup is litestream: continuous WAL streaming to a replica, running independently of lifecycle state, restorable in one command. Two clocks, deliberately separate.

Can an archived sandbox come back?

Yes — archived → active is in the transition map. Archive is a long-term drawer, not a grave. The only terminal state is deleted, and you have to ask for it explicitly: nothing in the map or the idle policy ever deletes on its own.

Why can't I freeze an active sandbox directly?

Because the map says no — active can go to suspended or archived, never straight to frozen. Freezing is earned by neglect: fifteen idle minutes to suspend, twenty-four suspended hours to freeze. The thaw, though, is one step — frozen goes directly back to active.

keep GOING

This page is the Nexus's stateful promise, run through time. Its neighbors fill in the rest.

The Nexus: the parent (isolated, stateful, agentic)

Capabilities: what an active sandbox may reach

The VFS: the file this whole page copies

Bundles: the same file, leaving home