
one agent is a worker. A TEAM is a protocol

A fleet isn't a new construct — it's the same keeper engine, instantiated many times from one org manifest. Each member is a full keeper with its own def, its own cadence, its own namespaced state. The runtime adds exactly two things on top — a stagger and a gate — and then deliberately does nothing else. Who works what is a protocol the agents follow, on a shared board, in git.

fleets · 11 min read
[Image: a small spacesuited figure standing on a gantry before a fleet of four identical monumental cargo ships, each tagged with a glowing name and queued at a single launch tower, in a bright blue-and-green 1970s sci-fi style, the figure dwarfed by the ordered formation]

one worker, then a TEAM

This is a sub-lesson, and the placement is the first thing it teaches. You already have one standing agent working — the agents lesson built the keeper, and orchestration walked its tick. Now you want a team: a researcher feeding a writer feeding an editor, each on its own beat, handing work down a pipeline.

Every multi-agent framework you've seen answers "team" with a heavyweight orchestrator — a message bus, a role graph, a supervisor model deciding who speaks next. You suspect, correctly, that most of that is machinery for problems you don't have. What a team actually needs is smaller and stranger: N workers, a budget ceiling so they don't all hammer the model at once, and a way for them not to collide on the same task. That's the whole list. This lesson is those three things, and a careful account of the one thing the runtime refuses to do for you.

the DEFINITION

fleet /fliːt/ noun

1. a multi-agent keeper declared in one org manifest: one keeper worker per member, each with its own def, its own lifecycle, and its own namespaced state — to which the runtime adds a stagger and a gate, and deliberately nothing else.

Read the negative space. There's no orchestrator in that definition, no scheduler, no supervisor model. The fleet machinery is two source files — the crew supervisor and the gate — totalling around 270 lines. The intelligence isn't in the runtime. It's in the defs and on the board, which is exactly where the agent's non-deterministic reasoning already lives.

one file, one heading per AGENT

A fleet is turned on by pointing WB_CREW_DEF at one org file — the manifest. Each top-level heading is an agent's name; its :PROPERTIES: drawer is its config. Here is the real bit.ml newsroom manifest, whole — four agents, one file:

#+TITLE: crew — the bit.ml newsroom manifest

* desk
:PROPERTIES:
:DEF: /data/agents/desk.org
:INTERVAL: 45m
:END:
* moss
:PROPERTIES:
:DEF: /data/agents/researcher.org
:INTERVAL: 15m
:END:
* wren
:PROPERTIES:
:DEF: /data/agents/writer.org
:INTERVAL: 15m
:END:
* hale
:PROPERTIES:
:DEF: /data/agents/editor.org
:INTERVAL: 20m
:END:

Three properties do all the work, and only one is required:

| property | required? | default | what it means |
|---|---|---|---|
| :DEF: | yes | | the agent org def to run each tick. A member without a :DEF: is dropped and logged; it never becomes a worker. |
| :LIFECYCLE: | no | absent → plain interval ticks | an optional state-machine spec, for agents whose cadence has phases. Most members don't carry one. |
| :INTERVAL: | no | 1h | fallback cadence between ticks. Grammar is 10m / 2h / 90s / a bare millisecond count. |
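To make the table concrete, here is a minimal Python sketch of reading such a manifest. It is a model of the rules above, not the runtime's parser: the function names are hypothetical, and the fallback to the default on an unparseable interval is an assumption the source doesn't confirm.

```python
import re

DEFAULT_INTERVAL_MS = 60 * 60 * 1000  # documented default: 1h
UNITS_MS = {"s": 1_000, "m": 60_000, "h": 3_600_000}


def parse_interval(text):
    """Parse 10m / 2h / 90s / a bare millisecond count into milliseconds."""
    match = re.fullmatch(r"(\d+)([smh]?)", text.strip())
    if not match:
        return DEFAULT_INTERVAL_MS  # assumption: bad values fall back to 1h
    count, unit = match.groups()
    return int(count) * UNITS_MS.get(unit, 1)  # no unit means milliseconds


def parse_manifest(text):
    """Return (members, dropped). Only headings with a :DEF: become members."""
    entries, in_drawer = [], False
    for raw in text.splitlines():
        line = raw.strip()
        if raw.startswith("* "):          # top-level heading is the agent name
            entries.append({"name": line[2:].strip()})
            in_drawer = False
        elif line == ":PROPERTIES:":
            in_drawer = bool(entries)
        elif line == ":END:":
            in_drawer = False
        elif in_drawer:
            prop = re.fullmatch(r":([A-Z_]+):\s*(.*)", line)
            if prop:
                entries[-1][prop.group(1)] = prop.group(2)
    members = [
        {
            "name": e["name"],
            "def": e["DEF"],
            "lifecycle": e.get("LIFECYCLE"),  # absent means plain interval ticks
            "interval_ms": parse_interval(e.get("INTERVAL", "1h")),
        }
        for e in entries
        if e.get("DEF")
    ]
    dropped = [e["name"] for e in entries if not e.get("DEF")]
    return members, dropped
```

Fed the newsroom manifest, this would yield four members and an empty dropped list; a heading with no :DEF: lands in dropped instead of becoming a worker.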

The manifest and the singleton are mutually exclusive: at boot, WB_CREW_DEF set starts the crew supervisor; otherwise WB_KEEPER_DEF set starts the lone keeper; otherwise neither runs. You never get both — a fleet replaces the singleton, it doesn't run beside it. The crew counts as active only when the manifest parses to at least one member that actually carries a :DEF:.

N copies of one ENGINE

Here's the move that keeps the whole thing small. A fleet member is not a new kind of agent — it's the singleton keeper's tick engine, lifted out verbatim and parameterized. One implementation; two callers. The singleton is one instance of it; each fleet member is another. Everything you learned about a keeper run — the exec shell, the workdir on the tenant git repo, the sixty-step ceiling, the plan mode — is identical here, because it is literally the same code.

What varies between instances is a small config struct: the member's name, its def path, its interval, a key suffix for its files, and — for fleet members only — an acquire and release pair that reaches the gate. The crew supervisor builds one worker per manifest heading and starts them under a one_for_one strategy, with the gate as its very first child so every worker can borrow a slot from the moment it boots:

flowchart TD
  sup["Crew Supervisor  (one_for_one)"]
  gate["Gate (first child)<br/>counting semaphore · max 2"]
  desk["worker: desk<br/>def desk.org · 45m<br/>keeper-last-run-desk"]
  moss["worker: moss<br/>def researcher.org · 15m<br/>keeper-last-run-moss"]
  wren["worker: wren<br/>def writer.org · 15m<br/>keeper-last-run-wren"]
  hale["worker: hale<br/>def editor.org · 20m<br/>keeper-last-run-hale"]
  sup --> gate
  sup --> desk
  sup --> moss
  sup --> wren
  sup --> hale
  desk -. "acquire / release" .-> gate
  moss -. "acquire / release" .-> gate
  wren -. "acquire / release" .-> gate
  hale -. "acquire / release" .-> gate
  style sup fill:#9fc4e8,stroke:#121316,stroke-width:2.5px
  style gate fill:#13d943,stroke:#121316,stroke-width:2.5px
  style desk fill:#ffffff,stroke:#121316
  style moss fill:#ffffff,stroke:#121316
  style wren fill:#ffffff,stroke:#121316
  style hale fill:#ffffff,stroke:#121316

Four independent keepers, each pointed at its own def, each on its own clock, each writing its own state file — and all four reaching back to one shared gate. The supervisor doesn't sequence them. It starts them and restarts any that crash. The only thing in the picture that makes the four aware of each other at all is that single green box.
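The "one implementation, two callers" idea can be sketched in a few lines of Python. This is a model under stated assumptions, not the runtime's code: WorkerConfig, run_tick, and the field names are hypothetical stand-ins for the small config struct described above.

```python
from dataclasses import dataclass
from typing import Callable


def _noop():
    """The singleton's acquire and release: nothing to count, nothing to do."""
    return None


@dataclass(frozen=True)
class WorkerConfig:
    name: str
    def_path: str
    interval_ms: int
    key_suffix: str = ""                  # "" for the singleton, "-<name>" in a fleet
    acquire: Callable[[], None] = _noop   # fleet members pass a call that reaches the gate
    release: Callable[[], None] = _noop


def run_tick(cfg, run):
    """The shared engine: the same steps whichever config it is handed."""
    cfg.acquire()
    try:
        return run(cfg)
    finally:
        cfg.release()  # mirrors release-in-after: runs even if the run crashes
```

The design point is that the engine never asks whether it is in a fleet. A singleton config carries the no-op pair; a fleet config carries a real acquire and release, and nothing else in run_tick changes.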

booting without the HERD

Depth rung. If four workers all boot and tick at once, they fire four model calls in the same instant — a thundering herd at the worst possible moment, startup. So the supervisor staggers them: the i-th worker delays its first tick by i × WB_CREW_STAGGER_MS, default thirty seconds, on top of a sixty-second boot grace floor. The agents visibly overlap on the wire later, but they don't pile onto the LLM at boot.

For the four-member newsroom, with defaults, the first ticks land like this:

| member | index | first tick at |
|---|---|---|
| desk | 0 | t = 60s (grace + 0×30) |
| moss | 1 | t = 90s (grace + 1×30) |
| wren | 2 | t = 120s (grace + 2×30) |
| hale | 3 | t = 150s (grace + 3×30) |

After the first tick, each worker runs on its own interval. And the cadence clock is persisted, not reset on restart — each worker writes its last-run timestamp to a file and schedules catch-up from there. Restart the engine and a worker that was due doesn't wait a fresh full interval; it picks up where its clock left off. Stagger smooths the boot; persistence keeps the beat honest across restarts.
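Both halves of that paragraph are small arithmetic, sketched here in Python. The function names are hypothetical, and the catch-up rule (an overdue worker is due immediately) is an assumption about what "picks up where its clock left off" means; the source doesn't spell out how catch-up interacts with the boot stagger.

```python
BOOT_GRACE_MS = 60_000        # documented sixty-second boot grace
DEFAULT_STAGGER_MS = 30_000   # documented default for WB_CREW_STAGGER_MS


def first_tick_delay_ms(index, stagger_ms=DEFAULT_STAGGER_MS, grace_ms=BOOT_GRACE_MS):
    """Delay before the i-th worker's first tick: grace + i * stagger."""
    return grace_ms + index * stagger_ms


def catch_up_delay_ms(now_ms, last_run_ms, interval_ms):
    """Time left on a persisted cadence clock; never negative.

    Assumption: a worker whose interval already elapsed is due at once
    rather than waiting a fresh full interval.
    """
    return max(last_run_ms + interval_ms - now_ms, 0)


crew = ["desk", "moss", "wren", "hale"]
delays = {name: first_tick_delay_ms(i) for i, name in enumerate(crew)}
# delays == {"desk": 60000, "moss": 90000, "wren": 120000, "hale": 150000}
```

A worker on a 15m interval that last ran 10 minutes before a restart would have 5 minutes left under this model, not a fresh 15.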

the gate: a 66-line SEMAPHORE

The gate is the entire concurrency story, and it is a tiny GenServer — a counting semaphore, sixty-six lines, default maximum of two (WB_CREW_MAX_CONCURRENT). Every fleet worker wraps each run in acquire → try → after release. That's the whole contract, and three details make it correct:

  • Acquire blocks. It's a GenServer.call with an :infinity timeout. A free slot returns :ok immediately; otherwise the caller is parked in a FIFO queue — held, no reply yet, until a slot frees. FIFO means no worker starves: the one that waited longest goes next.
  • Release hands off directly. When a run finishes, release checks the queue. If someone's waiting, the slot is transferred straight to the queue head — the freed count never even touches the pool, so the slot can't be stolen by a newcomer. Only if nobody's queued does the slot return to the pool.
  • Release runs in an after. A crashed run, or one killed at the fifteen-minute wall-clock ceiling, still hits its after and returns its slot. A wedged worker can never deadlock the fleet.

Watch three workers contend for two slots:

sequenceDiagram
  participant W as wren
  participant M as moss
  participant H as hale
  participant G as Gate (max 2)
  W->>G: acquire
  G-->>W: :ok  (slot 1 of 2)
  M->>G: acquire
  G-->>M: :ok  (slot 2 of 2)
  H->>G: acquire
  Note over H,G: no slot, hale PARKED in FIFO queue<br/>(blocked call, :infinity timeout)
  Note over W: wren's run ends, release (in after)
  W->>G: release
  G-->>H: :ok  (slot handed straight to hale)
  Note over G: free count never touched the pool<br/>FIFO order preserved
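The gate's bookkeeping is easy to model as a pure state machine. The Python below is a sketch of the semantics described above, not the sixty-six-line GenServer itself: GateModel is a hypothetical name, and "parked" is represented by returning False instead of actually blocking the caller.

```python
from collections import deque


class GateModel:
    """Counting semaphore with a FIFO wait queue and direct hand-off."""

    def __init__(self, max_concurrent=2):
        self.free = max_concurrent
        self.waiting = deque()

    def acquire(self, caller):
        """Return True if a slot is granted now, False if the caller is parked."""
        if self.free > 0:
            self.free -= 1
            return True
        self.waiting.append(caller)  # held without a reply until a slot frees
        return False

    def release(self):
        """Give up a slot. Return the parked caller it was handed to, or None."""
        if self.waiting:
            # Direct hand-off: the free count is not incremented, so a
            # newcomer cannot take the slot ahead of the queue head.
            return self.waiting.popleft()
        self.free += 1
        return None


gate = GateModel(max_concurrent=2)
gate.acquire("wren")   # granted
gate.acquire("moss")   # granted
gate.acquire("hale")   # parked
handed_to = gate.release()  # wren's run ends; the slot goes to "hale"
```

After that hand-off the free count is still zero: two runs are in flight, exactly as the diagram shows.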

The singleton keeper, for contrast, does not use the gate at all — its acquire and release are no-ops. A lone worker has no one to contend with, so there's nothing to count. The gate exists precisely because a fleet does.

why two agents never share a CLOCK

Depth rung. Two keepers running off the same engine could clobber each other's state files if nothing kept them apart. What keeps them apart is a key_suffix of the form "-<name>", threaded through every piece of persistence. The singleton's files are unsuffixed; each fleet worker's are tagged with its name:

| what | singleton keeper | fleet worker (e.g. wren) |
|---|---|---|
| last-run timestamp | keeper-last-run | keeper-last-run-wren |
| lifecycle position | lifecycle-pos | lifecycle-pos-wren |
| status (persistent_term) | {Keeper, :status} | {Keeper, :status, "-wren"} |
| registered name | | Keeper.Crew.wren |

Because each worker owns its own last-run file and its own lifecycle position, two agents never share a cadence clock — wren advancing through its states can't nudge moss's. The tested behaviour is exactly this: two workers write their own last-run files and their own status keys; two agents advance their lifecycle positions without interfering; and the singleton's namespace stays unsuffixed throughout.
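The namespacing rule fits in one small function. This Python sketch mirrors the names in the table; state_names is a hypothetical helper, and the status keys are written as Python tuples standing in for the Elixir terms.

```python
def state_names(agent=None):
    """Names one keeper instance uses for its persisted state.

    agent=None models the singleton (unsuffixed); a fleet worker passes its name.
    """
    suffix = f"-{agent}" if agent else ""
    status = ("Keeper", "status", suffix) if suffix else ("Keeper", "status")
    return {
        "last_run": f"keeper-last-run{suffix}",
        "lifecycle_pos": f"lifecycle-pos{suffix}",
        "status": status,
    }


wren, moss, lone = state_names("wren"), state_names("moss"), state_names()
# No name is shared between any two instances, so no clock is shared either.
shared = set(wren.values()) & set(moss.values()) & set(lone.values())
```

Because every persisted name derives from the same suffix, adding a new piece of per-worker state only stays collision-free if it goes through the suffix too.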

One more deliberate choice: status is read from :persistent_term, never through a GenServer.call. Ticks run synchronously, so a status call would block for the entire length of a run — up to fifteen minutes. Reading a term is instant, which is what lets the public feed enumerate a busy fleet without ever waiting on a worker.

claims are a protocol, not a LOCK

This is the core of the page. The runtime gives you isolation (per-worker workdir and state) and throttling (the gate). It does not coordinate the agents. There is no lock on the board, no central allocator handing tasks to workers. Claiming is the agents' protocol — a board state change plus an :AGENT: property that the claiming agent commits to git before doing the work, visible to its peers. Coordination correctness lives in the def and on the board, the same place the non-deterministic interior already does.

Concretely, here is the protocol from bit.ml's shared laws file, made of git commits. A task sits on the board, unclaimed:

#+TODO: ASSIGNED RESEARCH WRITING EDIT | PUBLISHED KILLED
*** ASSIGNED hello world — one story through the whole pipeline

moss ticks. It finds the first task in its state — ASSIGNED — with no :AGENT: property, and claims it before touching the work: it sets the property and commits research: claim hello-world.

*** RESEARCH hello world — one story through the whole pipeline
:PROPERTIES:
:AGENT: moss
:END:

Then it works — writes the research skeleton — advances the task to WRITING, clears its :AGENT:, appends a log line, and commits research: hello-world — 9 facts, 4 sources, gaps: …. Here's the sequence, and the crucial part is the last actor:

sequenceDiagram
  participant B as the board (git)
  participant M as moss (researcher)
  participant W as wren (writer)
  M->>B: read — first unclaimed ASSIGNED task
  M->>B: set :AGENT: moss · commit "research: claim hello-world"
  Note over M: work — write content/research/hello-world.org
  M->>B: advance → WRITING · clear :AGENT: · commit "research: …"
  Note over W: wren's next tick
  W->>B: read — task is WRITING, unclaimed → wren claims it
  Note over B,W: a task carrying ANOTHER agent's :AGENT:<br/>is invisible to you

The runtime never read that board. The agents did. A task claimed by another agent is simply invisible to you — that's a rule in the def, enforced by every agent following it, recorded in git for anyone to audit. Why put correctness here instead of in a lock? Because this is where the agent's judgment already is. The same file that holds an agent's reasoning holds its claiming discipline, so you can read it, edit it, and watch it fail — in plain text, in the git log — rather than debugging a coordinator you can't see.
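Written out as code, the claiming discipline looks like the Python sketch below. It is only a model of the rule: in the real system the rule lives in each agent's def and is carried out by the agent, not by a library. The board is an in-memory list, commits is a stand-in for the git log, and every function name here is hypothetical.

```python
def visible_to(task, agent, state):
    """A task is workable if it is in your state and no other agent holds it."""
    return task["state"] == state and task.get("agent") in (None, agent)


def claim_first(board, agent, from_state, working_state, commits, prefix):
    """Claim before working: set :AGENT:, move the state, record a commit."""
    for task in board:
        if visible_to(task, agent, from_state):
            task["state"], task["agent"] = working_state, agent
            commits.append(f"{prefix}: claim {task['slug']}")
            return task
    return None  # nothing to claim in this state


def hand_off(task, next_state, commits, message):
    """Advance the task, clear :AGENT:, and commit, so the next role can see it."""
    task["state"], task["agent"] = next_state, None
    commits.append(message)


board = [{"slug": "hello-world", "state": "ASSIGNED", "agent": None}]
commits = []
task = claim_first(board, "moss", "ASSIGNED", "RESEARCH", commits, "research")
# While moss holds it, visible_to(task, "wren", "RESEARCH") is False.
```

Nothing in this sketch stops a caller from ignoring visible_to, which is the point: the protocol is advisory, and the commit list is where you would look to see who broke it.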

a real fleet: the NEWSROOM

The bit.ml newsroom is the worked example. Four agents, none carrying a lifecycle — all plain interval workers: desk (assignment only, never writes, 45m), moss (researcher, 15m), wren (writer, 15m), hale (editor, 20m). Every member reads one shared laws file first via #+SHARED:; each role def adds only its own territory and its hand-offs. The pipeline states double as the workflow — a story is a task, and agents claim by state:

flowchart LR
  A["ASSIGNED"] -->|moss claims| R["RESEARCH"]
  R -->|moss advances| Wr["WRITING"]
  Wr -->|wren writes| E["EDIT"]
  E -->|hale approves| P["PUBLISHED"]
  E -. "hale bounces back" .-> Wr
  desk["desk: opens 2–6 assignments<br/>leads, not facts"] -.-> A
  style A fill:#f2ddb0,stroke:#121316
  style R fill:#a8d4f0,stroke:#121316
  style Wr fill:#9fc4e8,stroke:#121316
  style E fill:#f3c5a3,stroke:#121316
  style P fill:#13d943,stroke:#121316,stroke-width:2.5px
  style desk fill:#fbfaf6,stroke:#121316

The hand-off chain is the pipeline. desk opens assignments but never writes facts. moss claims an ASSIGNED task, researches it into a fact skeleton with sources and a * gaps section, advances it to WRITING. wren picks up WRITING tasks under one law — the skeleton is your only universe of facts — and never invents beyond it. hale alone publishes: it gates on sources, and can bounce a story back to WRITING. No agent reaches into another's state; each advances and clears, and the next role finds the task waiting.
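The hand-offs can be written down as a transition table, which is handy if you want to audit history for a move no role should have made. This Python sketch is a hypothetical audit helper, not something the runtime runs. It covers only the transitions named above; KILLED appears in the board's #+TODO line, but the text doesn't say who may set it, so it is left out.

```python
# (current state, next state) -> the one role allowed to make that move
TRANSITIONS = {
    ("ASSIGNED", "RESEARCH"): "moss",   # moss claims
    ("RESEARCH", "WRITING"): "moss",    # moss advances
    ("WRITING", "EDIT"): "wren",        # wren writes
    ("EDIT", "PUBLISHED"): "hale",      # hale approves
    ("EDIT", "WRITING"): "hale",        # hale bounces back
}


def may_advance(agent, current, target):
    """True if this role owns the move from current to target."""
    return TRANSITIONS.get((current, target)) == agent


def audit(moves):
    """Return the (agent, current, target) moves that break the pipeline."""
    return [m for m in moves if not may_advance(*m)]


bad = audit([
    ("moss", "ASSIGNED", "RESEARCH"),
    ("wren", "EDIT", "PUBLISHED"),     # only hale publishes
])
```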

The public record is the git log itself, with typed commit prefixes — desk: / research: / write: / edit: / publish: — so the whole newsroom's activity reads as a changelog. And when an agent finds nothing in its state, it ends its run with done text beginning NO-WORK; the worker reads that signal and stretches its next tick, idle-backing-off from minutes toward a thirty-minute cap until there's work again. (Honest caveat: the bit.ml repo is a scaffold; the fleet machinery and its tests are real and shipped, but there's no live URL to point at yet.)
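The idle backoff can be sketched as follows. Only two things come from the text: the NO-WORK prefix on the done text, and the thirty-minute cap. The doubling schedule, the function names, and the choice to leave an interval that already exceeds the cap unchanged are all assumptions made for illustration.

```python
IDLE_CAP_MS = 30 * 60 * 1000  # documented thirty-minute cap


def is_no_work(done_text):
    """An agent signals an idle run with done text beginning NO-WORK."""
    return done_text.startswith("NO-WORK")


def next_interval_ms(base_ms, idle_streak, cap_ms=IDLE_CAP_MS):
    """Stretch the tick while idle; snap back to base when idle_streak is 0.

    Assumption: the interval doubles per consecutive idle run. An interval
    already above the cap is left as it is rather than shortened.
    """
    ceiling = max(cap_ms, base_ms)
    return min(base_ms * (2 ** idle_streak), ceiling)


streak = 0
for done_text in ["NO-WORK nothing in WRITING", "NO-WORK still empty", "wrote draft"]:
    streak = streak + 1 if is_no_work(done_text) else 0
# streak is back to 0 after the productive run, so the next tick uses the base interval.
```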

watching a FLEET

Depth rung. A fleet's activity is public and read-only at GET /_activity. For a fleet it returns a crew shape with three keys — per-agent entries, a merged wire, and one legacy block for crew-unaware frontends:

{
  "agents": [
    {"name": "moss", "running": true, "lifecycle": null,
     "steps": [{"tool": "fetch", "target": "https://arxiv.org/abs/…",
                "ts": 1760000000, "agent": "moss"}],
     "thought": "reading the primary source"},
    {"name": "wren", "running": false, "lifecycle": null,
     "steps": [], "thought": null}
  ],
  "wire":  [ /* last 10 steps across ALL agents, each tagged "agent": … */ ],
  "agent": { /* legacy: the busiest agent, for crew-unaware frontends */ }
}

Each agents entry carries that member's running flag, lifecycle, last few steps, and current thought. The wire is the last ten steps across everyone, merged and tagged. The agent block is a courtesy for older frontends — the busiest running agent, or the most recent if all are idle. And it all comes from one shared _steps.jsonl in the tenant repo: every event carries an agent tag, and the feed simply filters by it. The fleet doesn't keep four logs — it keeps one, agent-stamped, which is the same ledger the the-ledger lesson seals. The crew exposes member names and per-agent contexts precisely so the public plane can enumerate and read worker status without a single GenServer call.
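Here is a rough Python sketch of how one agent-stamped log can produce all three keys. It is not the real handler: crew_activity is a hypothetical name, lifecycle and thought are omitted, five is a placeholder for "last few steps", and "busiest" is taken to mean the running agent with the most steps, which the source doesn't define.

```python
import json


def crew_activity(jsonl_text, names, running, wire_size=10, per_agent=5):
    """Build the crew shape from one shared, append-ordered step log."""
    steps = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    agents = [
        {
            "name": name,
            "running": name in running,
            # one log, filtered by its agent tag
            "steps": [s for s in steps if s.get("agent") == name][-per_agent:],
        }
        for name in names
    ]
    wire = steps[-wire_size:]  # last steps across everyone, still agent-tagged
    live = [a for a in agents if a["running"]]
    if live:
        legacy = max(live, key=lambda a: len(a["steps"]))  # assumption: busiest = most steps
    else:
        last = steps[-1].get("agent") if steps else None
        legacy = next((a for a in agents if a["name"] == last), None)
    return {"agents": agents, "wire": wire, "agent": legacy}
```

Note what the function never does: it takes the running set as plain data and asks no worker anything, which matches the read-the-term-don't-call-the-process choice described earlier.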

what the runtime won't DO for you

Honesty section. The fleet is deliberately thin, and the thinness has edges you should know before you lean on it.

  • The gate throttles; it doesn't schedule. There are no priorities and no fairness beyond FIFO. If you need agent X to always go before agent Y, the gate won't give you that — it only counts.
  • Claims are advisory. A misbehaving def can grab a task another agent claimed. The defence isn't a lock — it's the def and the git record. You catch a bad claim by reading history, not by trusting the runtime to forbid it.
  • There's a race window. Within a single staggered tick, two agents could read the board and both try to claim the same task; commit order decides the winner. Claiming first shrinks the window — it doesn't eliminate it.
  • Adding a member means a restart. The supervisor's child list is fixed when it starts. The manifest is re-read from disk on each access, but editing it to add or remove a live worker most likely needs the crew supervisor to restart; don't count on hot-add.
  • NO-WORK honesty is def-level. Idle backoff only quiets an agent that admits it's idle. An agent whose def invents busywork will burn budget happily — the gate caps concurrency, not imagination.

And the values line, because it's the whole posture: this is never a newsroom that runs itself. Humans set the board's objectives — the leads, the assignments, the direction. The agents work the pipeline. A fleet makes the working cheap and legible; it doesn't make the judgment go away.

questions people actually ASK

Can two agents grab the same task?

In the common case, no — claiming-before-working plus FIFO state discipline makes a claimed task invisible to peers. But it's a protocol, not a lock: inside one staggered tick there's a narrow race where two agents read the board before either commits, and commit order decides it. The protocol shrinks that window; it doesn't close it. You audit it in git.

Why doesn't the runtime just lock the board?

Because coordination correctness belongs in the same layer as the agent's reasoning — the def — where you can read, edit, and review it. A runtime lock would move that logic somewhere you can't see or diff. The runtime isolates runs and throttles concurrency; it deliberately leaves coordination to the board, in plain text, in git.

How do I add a fifth agent?

Add a heading with a :DEF: to the manifest and restart the crew. The supervisor builds its worker list at startup, so a new member becomes a worker on the next boot — not live, mid-run. The manifest itself is cheap to re-read, but the worker set is fixed when the supervisor starts.

Does the singleton keeper change when I use a fleet?

You don't run both — they're mutually exclusive at boot. The fleet worker is the singleton's engine, lifted out and parameterized, so the run mechanics are identical. The only differences are namespaced state files and a real gate (the singleton's acquire and release are no-ops).

What does a fleet cost to run?

The gate is your budget ceiling: at most WB_CREW_MAX_CONCURRENT runs happen at once, default two, no matter how many members you have. And NO-WORK backoff quiets idle agents toward a thirty-minute tick, so a fleet with little to do costs little until work arrives. You scale members for coverage and cap spend with one number.

Is this like the Autopoet?

Opposite shape. The Autopoet is one standing agent tending the system. A fleet is many standing agents working a pipeline. Same engine underneath; different count, different subject.

keep GOING

A fleet is the agents lesson, multiplied — and it leans on three more.