orchestration — cadence for agents

the machinery of the SCHEDULE

This is a deep dive under Agents, and it picks up exactly one thread the parent left dangling. The agents lesson defined a standing agent as a worker that needs four things — files, tools, memory, and time. It spent its pages on the first three and handed you the fourth as a single word: a model on a schedule.

That word hides an engine. A schedule that's worth trusting has to survive crashes, redeploys, idle stretches, and the moment you run a second agent beside the first. This page is that engine — about four hundred lines of tick loop plus a state machine you declare in the grammar. If the parent lesson is new to you, read it first; everything here is the when around the what it already taught.

cron is not a COLLEAGUE

The naive version is one line: a cron job that calls a model. It works in the demo and fails in production, in five specific ways you've either hit or will:

The schedule dies with the run. The model call throws, the process that scheduled it goes down, and your standing agent is now a stopped agent — discovered hours later.
A redeploy resets the clock. You push a fix at 2pm; the hourly agent that was due at 2:05 now fires a fresh hour later. Every deploy silently skips a beat.
You pay for idle polls. Nothing to do at this tick, but the model still gets called to discover that — a full LLM bill every poll, forever, to be told nothing.
The days have no shape. The agent does "stuff" every tick with no structure — no sense that mornings add work, afternoons audit it, and there should be a quiet stretch in between.
Two agents grab the same task. The instant you run a second agent on the same board, both reach for the top ticket and you've paid twice for one result.

Orchestration is the answer to all five. And the thing worth sitting with: almost none of the answer is intelligence. It's a careful loop, a few files on disk, and a state machine in plain text.

the DEFINITION

or·ches·tra·tion /ˌɔr·kə·ˈstreɪ·ʃən/ noun

1. the deterministic skeleton around non-deterministic work — a tick engine that owns when an agent runs, a lifecycle declared in org that owns what kind of work each tick does, and a claims protocol that says who took which task. One worker is one agent is one cadence.

The whole design is a split, and it's written into the code as a comment: org owns the spec; this module just interprets and steps it. The runtime is rigid on purpose: it decides timing, position, gates, and retries, and it never improvises. The intelligence lives inside a state, where the agent decides what the work actually is. Determinism on the outside, judgment on the inside. Everything below is a consequence of that one line.

anatomy of a TICK

Start with one heartbeat. A worker — one per agent — gets a :tick message and runs a fixed sequence. It records the time so a restart can do the arithmetic later; optionally pulls the tenant's git origin so a push from anywhere becomes live within one tick; consults the lifecycle for which state it's in; then runs the agent definition inside a killable task under a fifteen-minute wall clock. Whatever comes back maps to one of four outcomes, and that outcome decides the next delay.

flowchart TD
  t([":tick"]) --> rec["record last-run time"]
  rec --> git{"WB_GITOPS?"}
  git -- yes --> pull["pull tenant origin · best-effort"]
  git -- no --> lc
  pull --> lc["consult lifecycle — which state?"]
  lc --> gate{"state gated?<br/>(MIN-INTERVAL)"}
  gate -- "not elapsed" --> hold["no-op · hold position"]
  gate -- "open" --> run["run def in Task.async<br/>wall clock 15 min"]
  run --> out{"outcome"}
  out -- ":done" --> adv["advance position"]
  out -- ":no_work" --> ff["fast-forward repeats"]
  out -- ":failed / :killed" --> keep["hold same state · retry"]
  out -- ":no_work streak" --> back["exponential backoff"]
  adv --> sched["Process.send_after(next)"]
  ff --> sched
  keep --> sched
  hold --> sched
  back --> sched
  style t fill:#9fc4e8,stroke:#121316,stroke-width:2.5px
  style run fill:#13d943,stroke:#121316,stroke-width:2.5px
  style sched fill:#f2ddb0,stroke:#121316

Two safety properties are doing quiet work in that picture. The run executes in a linked task that the worker can yield on or shut down with a brutal kill, so a run that hangs is bounded at fifteen minutes, and a run that crashes never takes the worker down with it. The schedule outlives the run: failure mode number one, solved. And the run is the same call path as the public /api/run endpoint, so the standing agent and a one-off request execute identically, which is why a run that works as a one-off works the same way on the schedule.
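
The last step of the tick, folding whatever a run does into one of the four outcomes, can be sketched in Python. This is an illustration under assumptions, not the engine's code: asyncio.wait_for stands in for the killable task, and the names run_once and agent_def are hypothetical.

```python
import asyncio

WALL_CLOCK_SECONDS = 15 * 60  # the fifteen-minute bound described above


async def run_once(agent_def, wall_clock=WALL_CLOCK_SECONDS):
    """Run one tick's work and fold the result into one of four outcomes.

    agent_def is a hypothetical zero-argument coroutine function that
    returns the run's text output.
    """
    try:
        # wait_for cancels the coroutine if it overruns, so a hung run is bounded.
        result = await asyncio.wait_for(agent_def(), timeout=wall_clock)
    except asyncio.TimeoutError:
        return "killed"
    except Exception:
        # A crashing run is contained here; the caller keeps its schedule.
        return "failed"
    if isinstance(result, str) and result.startswith("NO-WORK"):
        return "no_work"
    return "done"
```

Because every failure path returns a value instead of raising, the caller can always schedule the next tick, which is the property the paragraph above describes.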

a day declared in ORG

A schedule that just fires every hour has no shape. The lifecycle gives the day a shape — and it's a deterministic state machine declared in org, executed one transition per tick. The headings are states; the property drawers — the same drawer convention the workflow layer uses — carry the edges and gates:

#+START: wake_add
* wake_add
:PROPERTIES:
:KIND: wake
:REPEAT: 3
:NEXT: wake_audit
:END:
* wake_audit
:PROPERTIES:
:KIND: wake
:NEXT: rem
:END:
* rem
:PROPERTIES:
:KIND: rem
:NEXT: wake_plan
:MIN-INTERVAL: 10m
:END:
* wake_plan
:PROPERTIES:
:KIND: wake
:NEXT: wake_add
:END:

Read it as a loop. #+START: names the entry state. :KIND: wake runs the agent definition; :KIND: rem is a quiet beat that hands off to the dream phase and never calls the model. :REPEAT: 3 means wake_add must complete three successful ticks before it takes its :NEXT: edge. :MIN-INTERVAL: 10m on rem is a time gate: that state refuses to run until ten minutes have passed since it last ran. The canonical loop is: add three times, audit once, rest, plan, and back to adding.

stateDiagram-v2
  [*] --> wake_add
  wake_add --> wake_add: done (hits < 3)
  wake_add --> wake_audit: 3rd success
  wake_audit --> rem: done
  rem --> wake_plan: done · gated 10m
  wake_plan --> wake_add: done
  note right of rem
    KIND rem = dream, no model
    MIN-INTERVAL 10m time gate
  end note

The state machine is the deterministic skeleton; what the agent does inside wake_add stays non-deterministic — that's the definition's job, not the engine's. And the spec file is re-read every tick, kept deliberately dumb, so a spec you hot-edit is picked up next beat. You can change the shape of a live agent's day without restarting it. This is distinct from a workflow's plan.org — that's the ongoing task DAG that never "completes"; the lifecycle is the recurring skeleton the agent walks while it works that plan.
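
Re-reading the spec every tick is cheap because the grammar is small. A minimal sketch of a reader for the lifecycle shown above, assuming only the constructs that appear in this page (a #+START: line, single-star headings, and :KEY: value drawer lines); parse_lifecycle and parse_duration are hypothetical names, and the duration suffixes s, m, and h are an assumption beyond the "10m" the page shows:

```python
import re


def parse_lifecycle(text):
    """Return (start_state, {state: {PROPERTY: value}}) from an org spec."""
    start, states, current = None, {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("#+START:"):
            start = line.split(":", 1)[1].strip()
        elif line.startswith("* "):
            current = line[2:].strip()
            states[current] = {}
        elif current is not None:
            m = re.match(r"^:([A-Z-]+):\s*(.*)$", line)
            # :PROPERTIES: and :END: only delimit the drawer; skip them.
            if m and m.group(1) not in ("PROPERTIES", "END"):
                states[current][m.group(1)] = m.group(2).strip()
    return start, states


def parse_duration(value):
    """Turn '10m' into seconds. Suffixes s/m/h are assumed, not confirmed."""
    units = {"s": 1, "m": 60, "h": 3600}
    return int(value[:-1]) * units[value[-1]]
```

Calling parse_lifecycle on the file's contents at the top of each tick is all a hot edit needs: there is no cache to invalidate.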

four outcomes, one rule EACH

Every tick that actually runs ends in exactly one of four outcomes, and each one does a single, predictable thing to the agent's position, the pair {state, hits}. (A tick that meets a closed gate is a no-op and leaves the position alone.) Learn these four rules and you can predict the next six ticks of any lifecycle.

outcome	what it means	effect on position	next delay
:done	real work happened	hits + 1; at `:REPEAT:` take `:NEXT:`, reset hits	hot cadence — base interval
:no_work	result began with `NO-WORK`	collapse remaining repeats, take `:NEXT:` now	backoff (see below)
:failed	the run errored	hold — retry the same state next tick	base interval
:killed	blew the 15-min wall clock	hold — retry the same state next tick	base interval

The headline is the bottom two rows. A crash or a timeout holds position — the cadence clock is never lost, the agent retries the same state on its next beat. Your place in the day is durable against the worst thing a run can do.

The :no_work rule has a subtlety worth stating precisely: it's a fast-forward of repeats, never a skip of states. If there's genuinely nothing to add, wake_add collapses its remaining repeats and moves on immediately — but every state in the declared order still runs. The audit still happens, the rest still happens. The engine speeds through empty repeats without ever letting a state get skipped, so cadenced work like audits and dreams keeps its rhythm.
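
The four rules fit in one pure function. A sketch, assuming the position is a (state, hits) tuple and the spec is a plain dict hand-copied from the lifecycle above; step and SPEC are hypothetical names, not the engine's:

```python
# A hand-written subset of the lifecycle shown earlier (REPEAT defaults to 1).
SPEC = {
    "wake_add": {"REPEAT": 3, "NEXT": "wake_audit"},
    "wake_audit": {"NEXT": "rem"},
    "rem": {"NEXT": "wake_plan"},
    "wake_plan": {"NEXT": "wake_add"},
}


def step(position, outcome, spec=SPEC):
    """Apply one tick's outcome to a (state, hits) position."""
    state, hits = position
    repeat = spec[state].get("REPEAT", 1)
    if outcome == "done":
        hits += 1
        # At :REPEAT: successes, take the :NEXT: edge and reset hits.
        return (spec[state]["NEXT"], 0) if hits >= repeat else (state, hits)
    if outcome == "no_work":
        # Fast-forward: drop the remaining repeats, but only move one state on.
        return (spec[state]["NEXT"], 0)
    if outcome in ("failed", "killed"):
        # Hold position; the same state retries on the next beat.
        return (state, hits)
    raise ValueError(f"unknown outcome: {outcome}")
```

Note that :no_work moves exactly one edge along, to the state's own :NEXT:, which is what "never a skip of states" means in practice.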

cadence that survives the MACHINE

Depth rung — skippable, but it's where "a model on a schedule" becomes trustworthy. The agent's whole sense of time lives in three tiny files on disk, which is what makes a redeploy resume mid-cadence instead of starting over.

file	contents	what it preserves
`keeper-last-run`	unix time of the last tick	catch-up math across a restart
`lifecycle-pos`	literally `wake_add 2`	your place — the 2nd of 3 adds
`lifecycle-ran-<state>`	last time a gated state ran	the `MIN-INTERVAL` clock

The catch-up arithmetic is one expression: on restart the next delay is max(60s, interval − elapsed). If an hourly agent was forty minutes into its hour when you redeployed, it waits the remaining twenty — not a fresh hour. The sixty-second floor is a boot grace so a restart loop can't hammer the model. Failure mode number two — the reset clock — solved by reading a file.
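
That expression, written out as a sketch. The function name and arguments are hypothetical; the only input it needs from disk is the unix time stored in keeper-last-run:

```python
def restart_delay(interval_s, last_run_unix, now_unix, floor_s=60):
    """Seconds to wait after a restart: max(60s, interval - elapsed)."""
    elapsed = now_unix - last_run_unix
    return max(floor_s, interval_s - elapsed)
```

For the hourly agent redeployed forty minutes in, restart_delay(3600, t, t + 2400) gives 1200 seconds, the remaining twenty minutes. An agent that is already overdue gets the sixty-second floor rather than firing immediately.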

Two more details earn their keep. The current state is mirrored into process memory so the public plane can read "where is this agent in its day" without a blocking call. That matters because a tick runs synchronously, so asking a busy worker directly would make you wait the whole run. And a never-run gated state reads as infinitely elapsed, so its gate starts open. The position is a fact on disk, not a fact in a fragile process.
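
The "infinitely elapsed" rule and the position file are both small enough to sketch. This assumes a missing lifecycle-ran-<state> file is represented as None and that lifecycle-pos holds a state name and a hit count separated by whitespace, as in the table above; the function names are hypothetical:

```python
import math


def gate_open(last_ran_unix, min_interval_s, now_unix):
    """True when a gated state may run. None means the state never ran."""
    elapsed = math.inf if last_ran_unix is None else now_unix - last_ran_unix
    return elapsed >= min_interval_s


def read_position(text):
    """Parse the contents of lifecycle-pos, e.g. 'wake_add 2'."""
    state, hits = text.split()
    return state, int(hits)
```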

the price of IDLENESS

Depth rung. Failure mode number three — paying for idle polls — gets its own mechanism, because it's the one that costs real money. The convention is blunt: an agent with nothing to do ends its run with output that begins with the literal word NO-WORK. The engine reads that and, instead of running the next tick at full cadence, backs off exponentially.

The next delay on a NO-WORK streak is max(base, 1 min · 2^(streak−1)), capped at thirty minutes. For a continuous worker on the default 45-second breather, the schedule reads:

NO-WORK streak	0	1	2	3	4	5	6+
next delay	45s	1m	2m	4m	8m	16m	30m (cap)

The verdict of that row: an idle agent quiets itself from roughly eighty model calls an hour (one every 45 seconds) down to two (one every 30 minutes). And the recovery is instant: a single :done resets the streak and snaps the agent straight back to hot cadence. It sleeps deeper the longer it's idle and wakes fully the moment there's work. This is the engine's answer to the fair question "doesn't a standing agent cost a fortune doing nothing?" No, because doing nothing is the one thing it gets cheap at.
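
The backoff schedule can be reproduced with a few lines. A sketch with a hypothetical function name; it applies the cap after the max, which is one literal reading of "capped at thirty minutes" and is an assumption for bases longer than the cap:

```python
def next_delay(base_s, streak, step_s=60, cap_s=30 * 60):
    """Delay before the next tick given the current NO-WORK streak."""
    if streak <= 0:
        return base_s  # hot cadence: no idle streak to back off from
    # max(base, 1 min * 2^(streak - 1)), then capped (assumed order).
    return min(cap_s, max(base_s, step_s * 2 ** (streak - 1)))
```

With a 45-second base, streaks 0 through 6 give 45, 60, 120, 240, 480, 960, and 1800 seconds, matching the table above.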

many workers, one BOX

Failure mode number five appears the instant you run two agents. The engine runs a set of them from one manifest — headings are agent names, each with a :DEF: (its definition), an optional :LIFECYCLE:, and an :INTERVAL: fallback cadence. The full manifest grammar belongs to the fleets deep dive; here's the shape:

* wren
:PROPERTIES:
:DEF: defs/writer.org
:LIFECYCLE: lifecycle.org
:INTERVAL: 10m
:END:
* moss
:PROPERTIES:
:DEF: defs/editor.org
:INTERVAL: 20m
:END:

Each member gets a full, independent worker — its own definition, its own lifecycle, its persistence namespaced by name. N agents tick in parallel with no shared cadence state. Two things keep that from becoming chaos. First, a stagger: the i-th worker's first tick is delayed by thirty seconds times its index, so they don't all hit the model at boot. Second, a concurrency gate — a counting semaphore, default max two — that every run must pass through.

sequenceDiagram
  participant A as worker wren
  participant G as the Gate (max 2)
  participant B as worker moss
  A->>G: acquire — slot free, granted
  Note over A: run (holds a slot)
  B->>G: acquire — slots full, parks FIFO
  Note over A: run blows the 15-min wall clock
  A-->>G: release (after-clause, even on timeout)
  G-->>B: slot handed to queue head
  Note over B: run begins

The gate's important property is in that last exchange. The worker wraps its run so the slot is released in an after clause — which means a run that crashes or blows its wall clock still returns its slot. A wedged worker can never deadlock the others. The cap protects the box from a thundering herd of model calls; it is not a fairness algorithm beyond first-in-first-out.
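
Both mechanisms, the stagger and the release-in-after gate, can be sketched in Python. asyncio.Semaphore stands in for the engine's counting semaphore, try/finally stands in for the after clause, and the names first_tick_delay and gated_run are hypothetical:

```python
import asyncio


def first_tick_delay(index, stagger_s=30):
    """Boot stagger: the i-th worker waits 30 seconds times its index."""
    return index * stagger_s


async def gated_run(gate, run):
    """Hold one gate slot for the duration of a run, and always return it."""
    await gate.acquire()
    try:
        return await run()
    finally:
        # Runs on success, on a crash, and on cancellation alike,
        # so a failed run cannot leak its slot.
        gate.release()
```

A gate for the default cap would be created as asyncio.Semaphore(2) inside the running event loop and shared by every worker.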

claims, not LOCKS

Now the subtle one. Two agents share a board — what stops them grabbing the same task? The deliberate answer: the runtime locks nothing. Claiming is a definition-level protocol, not a runtime-enforced mutex. An agent claims a task by changing its board state and writing an :AGENT: property — then committing that to git before it does the work, where every peer can see it.

sequenceDiagram
  participant W as wren
  participant Git as the board (git)
  participant M as moss
  W->>Git: read board — task-7 is NEXT
  W->>Git: commit: task-7 → DOING · :AGENT: wren
  Note over W: now does the work
  M->>Git: tick — pull first
  Git-->>M: task-7 is DOING, claimed by wren
  Note over M: takes task-8 instead

The runtime's job is narrower and stronger than locking: it isolates runs — each worker has its own workdir and state — and it throttles them through the gate. Coordination correctness lives in the definition and the board, which is exactly where the non-deterministic interior of a state already lives. The git commit makes the claim visible, auditable, and revertible — a social protocol, the way a person grabs a ticket off the wall before starting, not a database row-lock. It narrows the race; it does not, by itself, eliminate it (more on that next).
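
The discipline a definition follows (pull, check, claim, commit, then work) can be sketched without any real git. Everything here is hypothetical: the board is a list of dicts, and pull and commit are callbacks standing in for whatever git operations the definition actually performs:

```python
def claim_next(agent, pull, commit):
    """Claim the first unclaimed NEXT task, or return None if there is none.

    pull() returns a fresh copy of the board; commit(board, message)
    publishes it. Both are stand-ins for the definition's git steps.
    """
    board = pull()  # always look at the latest board before claiming
    for task in board:
        if task["state"] == "NEXT" and "AGENT" not in task:
            task["state"] = "DOING"
            task["AGENT"] = agent
            # Publish the claim before doing any work, so peers can see it.
            commit(board, f"claim {task['id']} for {agent}")
            return task["id"]
    return None
```

Nothing in this function locks anything. Two agents that pull at the same moment can still both claim the same task, which is exactly the window the protocol narrows rather than closes.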

where the EDGES are

Honesty section. This engine is small and rigid, and its limits come from exactly that.

Claims are unenforced. A badly written definition that works before it pulls, or skips the commit-first discipline, can double-grab a task. The protocol makes a collision visible and auditable — it doesn't make it impossible. Correctness is culture here, carried by the definition.
Failed states retry forever. A state that keeps erroring holds position and retries on every beat. There's no poison-state escape hatch beyond the backoff slowing it down — a permanently broken state is a permanently stuck cadence until a human looks.
The gate caps the box, not fairness. It bounds how many runs hit the model at once and serves its queue first-in-first-out. It makes no per-agent fairness guarantee beyond that order.
Min-interval is elapsed time, not a calendar. A gated state fires when enough wall-clock has passed since it last ran — it is not "every weekday at nine." For calendar scheduling you want the workflow layer's timestamps, not a lifecycle gate.
One box, no distributed keeper. The workers run on a single engine. There is no cross-machine cadence coordination — scaling past one box is not this design.

questions people actually ASK

Is this just cron?

No — three differences that matter. Cron fires on the clock and forgets everything else. This engine is outcome-aware (a crash holds your place, real work advances it), position-preserving across restarts (catch-up math from a file on disk), and cost-backed-off (an idle agent quiets itself instead of paying for every poll). Cron can't lose its place because it never had one.

Can I change a live agent's lifecycle?

Yes. The spec file is re-read every single tick, deliberately. Hot-edit the state machine and the next beat picks it up — no restart, no redeploy. The agent keeps its current position unless you removed the state it was standing in, in which case it resets to the spec's start.
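
That rule is a one-branch check against the freshly read spec. A sketch with a hypothetical name; resetting hits to zero on fallback is an assumption, since the page only says the agent resets to the spec's start:

```python
def resolve_position(position, start, states):
    """Keep the saved position if its state still exists, else restart."""
    state, hits = position
    if state in states:
        return (state, hits)
    return (start, 0)  # state was removed by a hot edit (hits reset assumed)
```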

What if two agents claim the same task?

The git commit makes the race visible and auditable rather than silent — you can see both claims in history. The definition's discipline of pulling before it claims narrows the window hard. The runtime itself does not lock, by design; coordination correctness lives in the definition layer, beside the rest of the agent's judgment.

How do I see where an agent is in its day?

The public plane exposes per-agent status — whether it's running, when it last ran, when it runs next, and its current lifecycle position. It reads from process memory, so it answers instantly even while a tick is mid-run. When an agent "isn't running," that status plus the on-disk position files are your two debugging tools.

What triggers a run right now?

A single watched manual tick — it sends the worker the same :tick message the timer would, immediately, for validation. It's the one-off you reach for when you want to see a run with your own eyes rather than wait for the cadence.

Where does a rest beat actually go?

A rem state hands off to the dream phase — a consolidation step that runs without the model. The MIN-INTERVAL on the state is what keeps that phase on a humane rhythm rather than every beat.

keep GOING

This deep dive is the when around the agent's what — the parent and its neighbors fill in the rest.

Agents: the parent, the model this schedules

Fleets: the agent manifest in full

Workflows: the plan a cadence walks

The Autopoet: a standing agent on this exact machinery