the machinery of the SCHEDULE
This is a deep dive under Agents, and it picks up exactly one thread the parent left dangling. The agents lesson defined a standing agent as a worker that needs four things — files, tools, memory, and time. It spent its pages on the first three and handed you the fourth as a single word: a model on a schedule.
That word hides an engine. A schedule that's worth trusting has to survive crashes, redeploys, idle stretches, and the moment you run a second agent beside the first. This page is that engine — about four hundred lines of tick loop plus a state machine you declare in the grammar. If the parent lesson is new to you, read it first; everything here is the when around the what it already taught.
cron is not a COLLEAGUE
The naive version is one line: a cron job that calls a model. It works in the demo and fails in production, in five specific ways you've either hit or will:
- The schedule dies with the run. The model call throws, the process that scheduled it goes down, and your standing agent is now a stopped agent — discovered hours later.
- A redeploy resets the clock. You push a fix at 2pm; the hourly agent that was due at 2:05 now fires a fresh hour later. Every deploy silently skips a beat.
- You pay for idle polls. Nothing to do at this tick, but the model still gets called to discover that — a full LLM bill every poll, forever, to be told nothing.
- The days have no shape. The agent does "stuff" every tick with no structure — no sense that mornings add work, afternoons audit it, and there should be a quiet stretch in between.
- Two of them grab the same task. The instant you run a second agent on the same board, both reach for the top ticket and you've paid twice for one result.
Orchestration is the answer to all five. And the thing worth sitting with: almost none of the answer is intelligence. It's a careful loop, a few files on disk, and a state machine in plain text.
the DEFINITION
1. the deterministic skeleton around non-deterministic work — a tick engine that owns when an agent runs, a lifecycle declared in org that owns what kind of work each tick does, and a claims protocol that says who took which task. One worker is one agent is one cadence.
The whole design is a split, and it's written into the code as a comment: "org owns the spec; this module just interprets and steps it." The runtime is rigid on purpose: it decides timing, position, gates, and retries, and it never improvises. The intelligence lives inside a state, where the agent decides what the work actually is. Determinism on the outside, judgment on the inside. Everything below is a consequence of that one line.
anatomy of a TICK
Start with one heartbeat. A worker — one per agent — gets a
:tick message and runs a fixed sequence. It records the time so a
restart can do the arithmetic later; optionally pulls the tenant's git origin
so a push from anywhere becomes live within one tick; consults the lifecycle
for which state it's in; then runs the agent definition inside a killable
task under a fifteen-minute wall clock. Whatever comes back maps to one of
four outcomes, and that outcome decides the next delay.
flowchart TD
t([":tick"]) --> rec["record last-run time"]
rec --> git{"WB_GITOPS?"}
git -- yes --> pull["pull tenant origin · best-effort"]
git -- no --> lc
pull --> lc["consult lifecycle — which state?"]
lc --> gate{"state gated? (MIN-INTERVAL)"}
gate -- "not elapsed" --> hold["no-op · hold position"]
gate -- "open" --> run["run def in Task.async · wall clock 15 min"]
run --> out{"outcome"}
out -- ":done" --> adv["advance position"]
out -- ":no_work" --> ff["fast-forward repeats"]
out -- ":failed / :killed" --> keep["hold same state · retry"]
out -- ":no_work streak" --> back["exponential backoff"]
adv --> sched["Process.send_after(next)"]
ff --> sched
keep --> sched
hold --> sched
back --> sched
style t fill:#9fc4e8,stroke:#121316,stroke-width:2.5px
style run fill:#13d943,stroke:#121316,stroke-width:2.5px
style sched fill:#f2ddb0,stroke:#121316
Two safety properties are doing quiet work in that picture. The run executes
in a linked task the worker can yield on or
shut down with a brutal kill, so a run that hangs is bounded at
fifteen minutes, and a run that crashes never takes the worker down with
it. The schedule outlives the run, which is failure mode number one,
solved. And the run is the same call path as the public /api/run
endpoint — the standing agent and a one-off request execute identically, which
is why a once-correct run is a forever-correct run.
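The mapping from "whatever comes back" to one of the four outcomes can be sketched in a few lines. This is a minimal illustration in Python, not the engine's code: `run_tick` and its parameters are hypothetical names, and because Python cannot force-stop a thread the way a killable task can be shut down, the timeout branch only classifies the run as killed rather than actually killing it.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as RunTimeout

WALL_CLOCK_S = 15 * 60  # the fifteen-minute bound described in the text


def run_tick(run, wall_clock_s=WALL_CLOCK_S):
    """Run one tick and classify it as done, no_work, failed, or killed.

    `run` is any zero-argument callable returning the run's text output
    (a hypothetical stand-in for executing the agent definition).
    """
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(run)
    try:
        text = future.result(timeout=wall_clock_s)
    except RunTimeout:
        # Assumption: a real engine would kill the task here; this sketch
        # only stops waiting for it.
        return "killed"
    except Exception:
        # A crash inside the run is contained; the caller keeps scheduling.
        return "failed"
    finally:
        pool.shutdown(wait=False)
    return "no_work" if text.startswith("NO-WORK") else "done"
```

The point of the shape is that the caller always gets an outcome back, whatever the run did, so the code that schedules the next tick never depends on the run succeeding.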
a day declared in ORG
A schedule that just fires every hour has no shape. The lifecycle gives the day a shape — and it's a deterministic state machine declared in org, executed one transition per tick. The headings are states; the property drawers — the same drawer convention the workflow layer uses — carry the edges and gates:
#+START: wake_add
* wake_add
:PROPERTIES:
:KIND: wake
:REPEAT: 3
:NEXT: wake_audit
:END:
* wake_audit
:PROPERTIES:
:KIND: wake
:NEXT: rem
:END:
* rem
:PROPERTIES:
:KIND: rem
:NEXT: wake_plan
:MIN-INTERVAL: 10m
:END:
* wake_plan
:PROPERTIES:
:KIND: wake
:NEXT: wake_add
:END:
Read it as a loop. #+START: names the entry state.
:KIND: wake runs the agent definition; :KIND: rem is
a quiet beat — it hands off to the dream phase and never
calls the model. :REPEAT: 3 means wake_add must log three
successful ticks before it takes its :NEXT: edge.
:MIN-INTERVAL: 10m on rem is a time gate — that state
refuses to run until ten minutes have passed since it last did. The canonical
loop is: add three times, audit once, rest, plan, and back to adding.
stateDiagram-v2
[*] --> wake_add
wake_add --> wake_add: done (hits < 3)
wake_add --> wake_audit: 3rd success
wake_audit --> rem: done
rem --> wake_plan: done · gated 10m
wake_plan --> wake_add: done
note right of rem
KIND rem = dream, no model
MIN-INTERVAL 10m time gate
end note
The state machine is the deterministic skeleton; what the agent does
inside wake_add stays non-deterministic — that's the definition's
job, not the engine's. And the spec file is re-read every tick, kept
deliberately dumb, so a spec you hot-edit is picked up next beat. You can change
the shape of a live agent's day without restarting it. This is distinct from a
workflow's plan.org — that's the ongoing
task DAG that never "completes"; the lifecycle is the recurring skeleton the
agent walks while it works that plan.
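Because the spec is plain text that gets re-read every tick, reading it can be very simple. The sketch below is one hedged way to turn the lifecycle grammar described above (headings as states, property drawers carrying edges and gates) into a dictionary. The function names, the default values for a missing :KIND: or :REPEAT:, and the `s`/`h` duration units are assumptions for illustration; only `10m` appears in the text.

```python
import re

SPEC = """\
#+START: wake_add
* wake_add
:PROPERTIES:
:KIND: wake
:REPEAT: 3
:NEXT: wake_audit
:END:
* wake_audit
:PROPERTIES:
:KIND: wake
:NEXT: rem
:END:
* rem
:PROPERTIES:
:KIND: rem
:NEXT: wake_plan
:MIN-INTERVAL: 10m
:END:
* wake_plan
:PROPERTIES:
:KIND: wake
:NEXT: wake_add
:END:
"""

UNITS = {"s": 1, "m": 60, "h": 3600}  # assumed; the text only shows "10m"


def parse_duration(text):
    """Turn a value like '10m' into seconds."""
    return int(text[:-1]) * UNITS[text[-1]]


def parse_lifecycle(spec):
    """Return (start_state, {state: properties}) for a lifecycle spec."""
    start, states, current = None, {}, None
    for raw in spec.splitlines():
        line = raw.strip()
        if line.startswith("#+START:"):
            start = line.split(":", 1)[1].strip()
        elif line.startswith("* "):
            current = line[2:].strip()
            # Assumed defaults: one repeat, no time gate.
            states[current] = {"kind": "wake", "repeat": 1,
                               "next": None, "min_interval_s": 0}
        elif current is not None:
            # Matches ":KEY: value"; ":PROPERTIES:" and ":END:" carry no
            # value, so they fall through and are ignored.
            match = re.fullmatch(r":([A-Z-]+):\s*(\S.*)", line)
            if not match:
                continue
            key, value = match.groups()
            if key == "KIND":
                states[current]["kind"] = value
            elif key == "REPEAT":
                states[current]["repeat"] = int(value)
            elif key == "NEXT":
                states[current]["next"] = value
            elif key == "MIN-INTERVAL":
                states[current]["min_interval_s"] = parse_duration(value)
    return start, states
```

Calling `parse_lifecycle(SPEC)` on every tick is what makes a hot edit visible on the next beat: there is no cached copy to invalidate.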
four outcomes, one rule EACH
Every tick ends in exactly one of four outcomes, and each one does a single,
predictable thing to the agent's position — the pair {state, hits}.
Learn these four rules and you can predict the next six ticks of any
lifecycle.
| outcome | what it means | effect on position | next delay |
|---|---|---|---|
| :done | real work happened | hits + 1; at :REPEAT: take :NEXT:, reset hits | hot cadence — base interval |
| :no_work | result began with NO-WORK | collapse remaining repeats, take :NEXT: now | backoff (see below) |
| :failed | the run errored | hold — retry the same state next tick | base interval |
| :killed | blew the 15-min wall clock | hold — retry the same state next tick | base interval |
The headline is the bottom two rows. A crash or a timeout holds position — the cadence clock is never lost, the agent retries the same state on its next beat. Your place in the day is durable against the worst thing a run can do.
The :no_work rule has a subtlety worth stating precisely: it's
a fast-forward of repeats, never a skip of states. If there's genuinely
nothing to add, wake_add collapses its remaining repeats and moves
on immediately — but every state in the declared order still runs. The
audit still happens, the rest still happens. The engine speeds through empty
repeats without ever letting a state get skipped, so cadenced work like audits
and dreams keeps its rhythm.
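The four rules fit in one small function. The sketch below is an illustration under stated assumptions rather than the engine's implementation: `step` and the `STATES` dictionary are hypothetical, and `hits` is taken to count successful ticks in the current state.

```python
# A trimmed lifecycle: only the fields the stepping rules need.
STATES = {
    "wake_add":   {"repeat": 3, "next": "wake_audit"},
    "wake_audit": {"repeat": 1, "next": "rem"},
    "rem":        {"repeat": 1, "next": "wake_plan"},
    "wake_plan":  {"repeat": 1, "next": "wake_add"},
}


def step(states, position, outcome):
    """Return the next (state, hits) position after one tick's outcome."""
    state, hits = position
    spec = states[state]
    if outcome == "done":
        hits += 1
        if hits >= spec["repeat"]:
            return (spec["next"], 0)   # repeats satisfied: take the edge
        return (state, hits)           # stay and count the success
    if outcome == "no_work":
        return (spec["next"], 0)       # collapse remaining repeats
    # "failed" and "killed" both hold position for a retry.
    return (state, hits)
```

Note that `no_work` moves exactly one edge forward. It never jumps over `wake_audit` or `rem`, which is the fast-forward-not-skip property described above.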
cadence that survives the MACHINE
Depth rung — skippable, but it's where "a model on a schedule" becomes trustworthy. The agent's whole sense of time lives in three tiny files on disk, which is what makes a redeploy resume mid-cadence instead of starting over.
| file | contents | what it preserves |
|---|---|---|
| keeper-last-run | unix time of the last tick | catch-up math across a restart |
| lifecycle-pos | literally wake_add 2 | your place: the 2nd of 3 adds |
| lifecycle-ran-<state> | last time a gated state ran | the MIN-INTERVAL clock |
The catch-up arithmetic is one expression: on restart the next delay is
max(60s, interval − elapsed). If an hourly agent was forty minutes
into its hour when you redeployed, it waits the remaining twenty — not a fresh
hour. The sixty-second floor is a boot grace so a restart loop can't hammer the
model. Failure mode number two — the reset clock — solved by reading a file.
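As a worked check of that expression, here is a minimal sketch. The function and parameter names are hypothetical; the arithmetic is the `max(60s, interval − elapsed)` rule from the text, with the last-run time passed in rather than read from the `keeper-last-run` file.

```python
BOOT_GRACE_S = 60  # the sixty-second floor


def catch_up_delay(interval_s, last_run_unix, now_unix):
    """Seconds to wait before the first tick after a restart."""
    elapsed = now_unix - last_run_unix
    return max(BOOT_GRACE_S, interval_s - elapsed)


# Hourly agent, redeployed forty minutes into its hour: waits twenty minutes.
twenty_minutes = catch_up_delay(3600, last_run_unix=0, now_unix=40 * 60)

# Same agent, down for two hours: overdue, so only the boot grace applies.
overdue = catch_up_delay(3600, last_run_unix=0, now_unix=2 * 3600)
```

Here `twenty_minutes` is 1200 seconds and `overdue` is 60 seconds.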
Two more details earn their keep. The current state is mirrored into process memory so the public plane can read "where is this agent in its day" without a blocking call. That matters because a tick runs synchronously, so asking a busy worker directly would make you wait the whole run. And a never-run gated state reads as infinitely elapsed, so its gate starts open. The position is a fact on disk, not a fact in a fragile process.
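The "infinitely elapsed" rule is easy to get wrong with a sentinel like zero, so a small sketch may help. The names below are hypothetical, and treating a gate as open when elapsed time exactly equals the interval is an assumption; the text does not say which side the boundary falls on.

```python
import math


def elapsed_since(last_ran_unix, now_unix):
    """Elapsed seconds, or infinity for a state that has never run."""
    if last_ran_unix is None:
        return math.inf
    return now_unix - last_ran_unix


def gate_open(min_interval_s, last_ran_unix, now_unix):
    """True when a MIN-INTERVAL gate allows the state to run."""
    return elapsed_since(last_ran_unix, now_unix) >= min_interval_s
```

With a ten-minute gate, `gate_open(600, None, now)` is open for any `now`, while a state that ran five minutes ago stays closed until the remaining five have passed.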
the price of IDLENESS
Depth rung. Failure mode number three — paying for idle polls — gets its own
mechanism, because it's the one that costs real money. The convention is blunt:
an agent with nothing to do ends its run with output that begins with the
literal word NO-WORK. The engine reads that and, instead of
running the next tick at full cadence, backs off exponentially.
The next delay on a NO-WORK streak is
max(base, 1 min · 2^(streak−1)), capped at thirty minutes.
For a continuous worker on the default 45-second breather, the schedule reads:
| NO-WORK streak | 0 | 1 | 2 | 3 | 4 | 5 | 6+ |
|---|---|---|---|---|---|---|---|
| next delay | 45s | 1m | 2m | 4m | 8m | 16m | 30m (cap) |
The verdict of that row: an idle agent quiets itself from roughly eighty
model calls an hour down to two. And the recovery is instant — a single
:done resets the streak and snaps the agent straight back to hot
cadence. It sleeps deeper the longer it's idle and wakes fully the moment there's
work. This is the engine's answer to the fair question, "doesn't a standing
agent cost a fortune doing nothing?" No, because doing nothing is the one
thing it gets cheap at.
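The table can be reproduced from the formula directly. This is a sketch with hypothetical names, and it makes one assumption worth flagging: the thirty-minute cap is applied to the exponential term, so a base interval longer than the cap is never shortened. For any base under thirty minutes, including the 45-second default, that reading and a cap applied to the whole expression give identical results.

```python
CAP_S = 30 * 60   # thirty-minute ceiling on the backoff
STEP_S = 60       # the "1 min" factor in the formula


def next_delay(streak, base_s=45):
    """Delay in seconds before the next tick, given the NO-WORK streak."""
    if streak <= 0:
        return base_s  # hot cadence: no idle streak in progress
    backoff = min(CAP_S, STEP_S * 2 ** (streak - 1))
    return max(base_s, backoff)


schedule = [next_delay(n) for n in range(8)]
```

`schedule` comes out as `[45, 60, 120, 240, 480, 960, 1800, 1800]`, matching the row above: 45s, 1m, 2m, 4m, 8m, 16m, then the 30m cap.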
many workers, one BOX
Failure mode number five appears the instant you run two agents. The engine
runs a set of them from one manifest — headings are agent names, each with a
:DEF: (its definition), an optional :LIFECYCLE:, and
an :INTERVAL: fallback cadence. The full manifest grammar belongs
to the fleets deep dive; here's the shape:
* wren
:PROPERTIES:
:DEF: defs/writer.org
:LIFECYCLE: lifecycle.org
:INTERVAL: 10m
:END:
* moss
:PROPERTIES:
:DEF: defs/editor.org
:INTERVAL: 20m
:END:
Each member gets a full, independent worker — its own definition, its own lifecycle, its persistence namespaced by name. N agents tick in parallel with no shared cadence state. Two things keep that from becoming chaos. First, a stagger: the i-th worker's first tick is delayed by thirty seconds times its index, so they don't all hit the model at boot. Second, a concurrency gate — a counting semaphore, default max two — that every run must pass through.
sequenceDiagram
participant A as worker wren
participant G as the Gate (max 2)
participant B as worker moss
A->>G: acquire (slot free, granted)
Note over A: run (holds a slot)
B->>G: acquire (slots full, parks FIFO)
Note over A: run blows the 15-min wall clock
A-->>G: release (after-clause, even on timeout)
G-->>B: slot handed to queue head
Note over B: run begins
The gate's important property is in that last exchange. The worker wraps its
run so the slot is released in an after clause — which means a run
that crashes or blows its wall clock still returns its slot. A wedged
worker can never deadlock the others. The cap protects the box from a thundering
herd of model calls; it is not a fairness algorithm beyond first-in-first-out.
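The release-on-every-path property is the part worth seeing in code. The sketch below uses Python's `try`/`finally` as a stand-in for the `after` clause, plus the thirty-second stagger from earlier in this section. The `Gate` class and `first_tick_delay` are hypothetical names, and unlike the gate described here, Python's semaphore makes no first-in-first-out promise about which waiter wakes next.

```python
import threading

STAGGER_S = 30  # per-index delay before a worker's first tick


def first_tick_delay(index):
    """Delay in seconds before the i-th worker's first tick."""
    return STAGGER_S * index


class Gate:
    """Counting semaphore that every run must pass through."""

    def __init__(self, max_concurrent=2):
        self.slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, fn):
        self.slots.acquire()          # parks here when all slots are taken
        try:
            return fn()
        finally:
            # Runs on success, on an exception, and on an early return,
            # so a failed run can never keep its slot.
            self.slots.release()
```

If the release lived after `fn()` instead of in `finally`, a single crashing run would leak a slot, and with a cap of two it would take only two crashes to park every worker on the box forever.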
claims, not LOCKS
Now the subtle one. Two agents share a board — what stops them grabbing the
same task? The deliberate answer: the runtime locks nothing. Claiming is
a definition-level protocol, not a runtime-enforced mutex. An agent
claims a task by changing its board state and writing an
:AGENT: property — then committing that to git before it does
the work, where every peer can see it.
sequenceDiagram
participant W as wren
participant Git as the board (git)
participant M as moss
W->>Git: read board, task-7 is NEXT
W->>Git: commit: task-7 → DOING · :AGENT: wren
Note over W: now does the work
M->>Git: tick, pull first
Git-->>M: task-7 is DOING, claimed by wren
Note over M: takes task-8 instead
The runtime's job is narrower and stronger than locking: it isolates runs — each worker has its own workdir and state — and it throttles them through the gate. Coordination correctness lives in the definition and the board, which is exactly where the non-deterministic interior of a state already lives. The git commit makes the claim visible, auditable, and revertible — a social protocol, the way a person grabs a ticket off the wall before starting, not a database row-lock. It narrows the race; it does not, by itself, eliminate it (more on that next).
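The ordering the protocol depends on (pull, pick, commit the claim, and only then work) can be shown with a toy model. This is deliberately not real git: the shared list stands in for the board's origin, copying it stands in for a pull, and writing back stands in for a commit. Every name here is hypothetical.

```python
def claim_next(origin, agent):
    """Pull the board, claim the first NEXT task, and publish the claim.

    Returns the claimed task id, or None when nothing is claimable.
    """
    board = [dict(task) for task in origin]   # "pull": fresh local view
    for i, task in enumerate(board):
        if task["state"] == "NEXT":
            # "commit": the claim becomes visible before any work starts.
            origin[i] = {**task, "state": "DOING", "agent": agent}
            return task["id"]
    return None


origin = [
    {"id": "task-7", "state": "NEXT", "agent": None},
    {"id": "task-8", "state": "NEXT", "agent": None},
]
wren_task = claim_next(origin, "wren")   # wren pulls first and takes task-7
moss_task = claim_next(origin, "moss")   # moss sees the claim, takes task-8
```

Nothing in `claim_next` is atomic. Two agents that both pull before either commits would both see task-7 as NEXT, which is the residual race the text describes: the protocol narrows the window and leaves a visible record, and it stops there.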
where the EDGES are
Honesty section. This engine is small and rigid, and its limits come from exactly that.
- Claims are unenforced. A badly written definition that works before it pulls, or skips the commit-first discipline, can double-grab a task. The protocol makes a collision visible and auditable — it doesn't make it impossible. Correctness is culture here, carried by the definition.
- Failed states retry forever. A state that keeps erroring holds position and retries on every beat at the base interval; the exponential backoff applies only to NO-WORK streaks, so nothing slows the retries down. There's no poison-state escape hatch: a permanently broken state is a permanently stuck cadence until a human looks.
- The gate caps the box, not fairness. It bounds how many runs hit the model at once and serves its queue first-in-first-out. It makes no per-agent fairness guarantee beyond that order.
- Min-interval is elapsed time, not a calendar. A gated state fires when enough wall-clock has passed since it last ran — it is not "every weekday at nine." For calendar scheduling you want the workflow layer's timestamps, not a lifecycle gate.
- One box, no distributed keeper. The workers run on a single engine. There is no cross-machine cadence coordination — scaling past one box is not this design.
questions people actually ASK
Is this just cron?
No — three differences that matter. Cron fires on the clock and forgets everything else. This engine is outcome-aware (a crash holds your place, real work advances it), position-preserving across restarts (catch-up math from a file on disk), and cost-backed-off (an idle agent quiets itself instead of paying for every poll). Cron can't lose its place because it never had one.
Can I change a live agent's lifecycle?
Yes. The spec file is re-read every single tick, deliberately. Hot-edit the state machine and the next beat picks it up — no restart, no redeploy. The agent keeps its current position unless you removed the state it was standing in, in which case it resets to the spec's start.
What if two agents claim the same task?
The git commit makes the race visible and auditable rather than silent — you can see both claims in history. The definition's discipline of pulling before it claims narrows the window hard. The runtime itself does not lock, by design; coordination correctness lives in the definition layer, beside the rest of the agent's judgment.
How do I see where an agent is in its day?
The public plane exposes per-agent status — whether it's running, when it last ran, when it runs next, and its current lifecycle position. It reads from process memory, so it answers instantly even while a tick is mid-run. When an agent "isn't running," that status plus the on-disk position files are your two debugging tools.
What triggers a run right now?
A single watched manual tick — it sends the worker the same
:tick message the timer would, immediately, for validation. It's
the one-off you reach for when you want to see a run with your own eyes rather
than wait for the cadence.
Where does a rest beat actually go?
A rem state hands off to the dream
phase — a consolidation step that runs without the model. The
MIN-INTERVAL on the state is what keeps that phase on a humane
rhythm rather than every beat.
keep GOING
This deep dive is the when around the agent's what — the parent and its neighbors fill in the rest.