who is awake at 3AM?
The nexus lesson made a promise: a docked workbook is stateful — schedules fire overnight, the work keeps moving while the page is closed. That's a lovely sentence. It's also the exact sentence every agent demo breaks. The terminal closes, the session ends, the context window evaporates, and the "always-on agent" turns out to be a person holding a laptop open.
So this page is the honest version of that promise. Something has to be awake at three in the morning, and it has to be a real process you can name — not a vibe in a chat window. The fair questions are blunt: what process is ticking, what decides what it does on each tick, and what happens when that process crashes, hangs forever, or wakes up to find there's nothing to do? A scheduler that can't answer the third question isn't a scheduler — it's a hope.
The answer is a keeper, and the satisfying part is how boring it turns out to be. No platform, no daemon mesh — one supervised Elixir worker, a timer message, a wall-clock bound, and a text file holding the last-run timestamp. About four hundred lines. The rest of this page is that worker, walked end to end.
the DEFINITION
1. a supervised worker that ticks an agent definition on a cadence, entirely on-box — reading a loaded artifact, never calling the public plane. One per agent; the runtime owns when, the def owns what.
A keeper belongs entirely to the host layer. It
reads a loaded artifact — your .org agent def — but it never
touches the public plane and never crosses the Dock membrane. When it runs the
agent, it goes through the exact same AgentDef.run path as
an interactive API call. A scheduled run and a request you typed are the same
code; only the trigger differs.
There's a trinity of names worth fixing now, because they recur. The worker is the tick engine — instantiable, one per agent. The keeper is the singleton facade: it starts exactly one worker under a legacy name, with zero-regression persistence keys, for the case of a single standing agent. The crew is a supervisor that starts one worker per agent from a manifest. The singleton is just a crew of one. Everything below is the worker; the other two are wrappers around it.
anatomy of a TICK
Here is one tick, whole. The worker wakes on a timer message, and in order: it writes the current time to its last-run file; it optionally reconciles git (pulling human and CI pushes, opt-in); it decides what this tick is (a lifecycle-aware step, covered below); it runs the agent def inside a linked task with a wall-clock bound; it reads the outcome; and it schedules the next tick. That's the loop.
sequenceDiagram
participant T as timer
participant W as keeper worker
participant G as git (tenant repo)
participant A as AgentDef.run
participant R as tenant repo + disk
T->>W: :tick
W->>R: write keeper-last-run (unix seconds)
opt WB_GITOPS=1
W->>G: pull origin — integrate human / CI pushes
end
W->>W: decide this tick (lifecycle step)
W->>A: run def — workdir = page's git repo
Note over A: linked Task · 15-min wall clock
A-->>W: :done | :failed | :killed | :no_work
W->>R: commit (if work done) — IS the changelog
W->>T: schedule next tick (delay from outcome)
Three details make this survivable rather than fragile. First, the run
executes in a linked task with a wall-clock bound — default fifteen
minutes. The worker waits with Task.yield; if the run blows the
clock, Task.shutdown kills it brutally and the outcome is
:killed. A wedged run can't hang the schedule forever.
Second, the worker traps exits. A crash inside the run task (and an LLM-driven agent crashes in a hundred ways) never takes the worker down. The exit message is absorbed; the worker reschedules and lives on.
Third, the agent reports its result as one of the four outcomes above, and each one steers the next delay differently.
The run itself is one line, and it carries the tenant: the working directory is the page's git repo, so every commit the keeper makes is the public changelog. (A real war story lives in a code comment here: before the tenant was threaded through correctly, the landing site committed four posts that all 404'd, because they'd auto-published to a build directory instead of the served root. The fix was making the run's workdir the tenant repo, not guessing.)
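The wall-clock bound and the crash absorption are easier to trust once you've seen them small. What follows is a Python analogue, not the keeper's Elixir code: it stands in an OS subprocess for the linked task, and the function name `run_bounded` is made up for illustration. It only covers three of the four outcomes; `:no_work` depends on what the agent says, not on how the process ended.

```python
import subprocess
import sys


def run_bounded(argv, timeout_s):
    """Run argv as a child process under a wall-clock bound.

    Returns "done", "failed", or "killed". Illustrative only: the real
    keeper bounds an Elixir task, not an OS subprocess.
    """
    try:
        proc = subprocess.run(argv, timeout=timeout_s, capture_output=True)
    except subprocess.TimeoutExpired:
        # subprocess.run kills and reaps the child before raising here,
        # so a wedged run cannot outlive its clock.
        return "killed"
    except OSError:
        # The child could not be started at all.
        return "failed"
    # A crash in the child is just a non-zero exit code to the caller;
    # the supervising process itself keeps running.
    return "done" if proc.returncode == 0 else "failed"


py = sys.executable
print(run_bounded([py, "-c", "pass"], timeout_s=10))
print(run_bounded([py, "-c", "raise SystemExit(3)"], timeout_s=10))
print(run_bounded([py, "-c", "import time; time.sleep(30)"], timeout_s=0.5))
```

The three calls print `done`, `failed`, and `killed` in that order. The point of the shape is that the caller always gets an answer back within the bound, whatever the child does.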
time that survives RESTARTS
The most important fact about a keeper is where its schedule lives. Not in
RAM. On disk. Each tick writes the current unix time to a plain text file —
keeper-last-run — on the data volume, beside the work. On boot,
the worker reads that file and computes its first delay as
max(60s, interval − elapsed). A redeploy never resets the cadence
clock; it picks up where it left off.
flowchart TD
boot([worker boots]) --> read{keeper-last-run
exists?}
read -- no --> first["first tick after the
60s boot-grace floor"]
read -- yes --> calc["elapsed = now − last-run
delay = max 60s, interval − elapsed"]
calc --> due{elapsed ≥
interval?}
due -- yes --> grace["fire in 60s
boot-grace floor"]
due -- no --> wait["fire in the remainder"]
style first fill:#aee5c2,stroke:#121316
style grace fill:#13d943,stroke:#121316,stroke-width:2.5px
style wait fill:#ffffff,stroke:#121316
Walk the branches. If there's no last-run file — a fresh keeper — the first
tick fires after the sixty-second boot-grace floor, not on boot.
Booting isn't doing work, and the grace floor keeps a flapping deployment from
hammering the model the instant the engine comes up. If the file exists and the
interval already elapsed while you were down, the tick is due — but that same
sixty-second boot-grace floor still applies, so a restart can't fire a run on
the spot. Otherwise: fire in the remainder. A keeper that last ran eleven minutes ago, on a fifteen-minute
interval, redeployed right now, ticks in four minutes —
max(60s, 900s − 660s) — not fifteen.
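The boot arithmetic fits in a few lines. This is a Python sketch of the rule as stated, with hypothetical names (`first_delay_s`, `BOOT_GRACE_S`); the real worker is Elixir and may structure it differently.

```python
BOOT_GRACE_S = 60  # the sixty-second boot-grace floor from the text


def first_delay_s(now_s, last_run_s, interval_s):
    """Seconds until the first tick after boot.

    last_run_s is the unix time read from keeper-last-run, or None when
    the file does not exist (a fresh keeper).
    """
    if last_run_s is None:
        return BOOT_GRACE_S
    elapsed = now_s - last_run_s
    return max(BOOT_GRACE_S, interval_s - elapsed)


now = 1_700_000_000
# Eleven minutes into a fifteen-minute interval: four minutes remain.
print(first_delay_s(now, now - 660, 900))   # 240
# Overdue while down: still held to the boot-grace floor.
print(first_delay_s(now, now - 3600, 900))  # 60
# Fresh keeper, no file yet.
print(first_delay_s(now, None, 900))        # 60
```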
There's a second mode for agents that should run nearly continuously. Continuous mode replaces the full interval between ticks with a short breather — default forty-five seconds — so the agent loops with a pause to catch its breath instead of sleeping a full hour. Same machinery, a different gap.
what each tick IS
Depth rung — skippable, but it's where "a timer that runs an agent" becomes a real orchestrator. So far each tick has been the same: run the def. But the orchestrator agent has a true workflow — a deterministic state machine declared in native org and executed one transition per tick. The org file owns the spec; the runtime just interprets and steps it. What the agent does inside a state stays non-deterministic — that's the def's job. The skeleton is deterministic — that's this file's job.
Here's the canonical loop, verbatim. Headings are states; the drawer properties are the edges and gates:
#+START: wake_add
* wake_add
:PROPERTIES:
:KIND: wake            ← run the agent def
:REPEAT: 3             ← hold for 3 successful ticks
:NEXT: wake_audit
:END:
* wake_audit
:PROPERTIES:
:KIND: wake
:NEXT: rem
:END:
* rem
:PROPERTIES:
:KIND: rem             ← dream, no agent runs
:NEXT: wake_plan
:MIN-INTERVAL: 10m     ← time gate; a gated tick is a no-op
:END:
* wake_plan
:PROPERTIES:
:KIND: wake
:NEXT: wake_add        ← back to the top
:END:
The loop is: add three times, audit once, dream (if at least ten minutes
have passed since the last dream), plan, and back to adding. A wake
state runs the agent; a rem state skips the agent entirely and
dreams instead — consolidating the cycle's telemetry, no model call against the
page. Position is a pair, {state, hits}, and it persists to
lifecycle-pos beside the last-run file. A redeploy resumes
mid-cadence — on the 2nd of 3 adds, the file literally reading
wake_add 2.
The state machine, with its edges:
stateDiagram-v2
[*] --> wake_add
wake_add --> wake_add: :done (hits < 3)
wake_add --> wake_audit: :done ×3, or :no_work fast-forwards
wake_audit --> rem: :done
rem --> wake_plan: dream done (gated 10m → no-op, hold)
wake_plan --> wake_add: :done
note right of wake_add
:failed / :killed → hold position, retry same state next tick
end note
The transition rule is the crisp part. On :done, hits go up by
one; at :REPEAT:, take :NEXT: and reset hits. On
:no_work, collapse the remaining repeats and take the edge
now — but this is a fast-forward of repeats only, never a skip:
every state in the declared order still runs, so the audit and the dream keep
their cadence. On :failed or :killed, hold position
and retry the same state next tick — the global failure rule, so cadence
survives crashes and timeouts. And the spec file is re-read every single tick,
kept deliberately dumb, so a hot-edited schedule is picked up live — no
redeploy to change the loop.
A worked trace makes the rules concrete. Start at wake_add:
| tick | state (hits) | outcome | next position |
|---|---|---|---|
| 1 | wake_add (0) | :done | wake_add (1) |
| 2 | wake_add (1) | :failed | wake_add (1) — retry, cadence held |
| 3 | wake_add (1) | :done | wake_add (2) |
| 4 | wake_add (2) | :no_work | wake_audit (0) — repeats collapsed, never skipped |
| 5 | wake_audit (0) | :done | rem (0) |
| 6 | rem — 4 min since last dream | gated | no-op, position held |
Read tick 2 and tick 4 together: a failure costs you nothing but a retry, and a no-work doesn't let the agent dodge its audit — it just stops grinding adds that have nothing left to add. Tick 6 is the time gate doing its job: the dream wants ten minutes of distance, only four have passed, so the tick is a no-op and the position waits.
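The transition rule is small enough to write down and run against the trace. This is a Python model under stated assumptions, not the runtime's code: the spec is hand-translated into a dict, a state without `:REPEAT:` is assumed to repeat once, `"gated"` is treated as a pseudo-outcome, and the names `SPEC`, `step`, and `is_gated` are invented here.

```python
SPEC = {
    "start": "wake_add",
    "states": {
        "wake_add":   {"kind": "wake", "repeat": 3, "next": "wake_audit"},
        "wake_audit": {"kind": "wake", "next": "rem"},
        "rem":        {"kind": "rem", "next": "wake_plan", "min_interval_s": 600},
        "wake_plan":  {"kind": "wake", "next": "wake_add"},
    },
}


def is_gated(node, now_s, last_s):
    """True when a MIN-INTERVAL gate has not yet opened (assumed semantics)."""
    gap = node.get("min_interval_s")
    return gap is not None and last_s is not None and now_s - last_s < gap


def step(spec, pos, outcome):
    """Apply one tick's outcome to a (state, hits) position."""
    state, hits = pos
    node = spec["states"][state]
    if outcome in ("failed", "killed", "gated"):
        return (state, hits)        # hold position, retry next tick
    if outcome == "no_work":
        return (node["next"], 0)    # collapse remaining repeats, take the edge
    if outcome == "done":
        hits += 1
        if hits >= node.get("repeat", 1):
            return (node["next"], 0)
        return (state, hits)
    raise ValueError(f"unknown outcome: {outcome!r}")


pos = (SPEC["start"], 0)
for outcome in ["done", "failed", "done", "no_work", "done", "gated"]:
    pos = step(SPEC, pos, outcome)
    print(outcome, "->", pos)
```

Fed the six outcomes from the table, the loop lands on the same positions, ending parked at `("rem", 0)`. Note that `no_work` only ever takes the state's own `:NEXT:` edge, which is why it can shorten the adds but cannot jump past the audit.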
the economics of doing NOTHING
Every wake tick is an LLM call, and an LLM call costs real money. So the expensive question isn't "what does a busy keeper do?" — it's "what does an idle one cost?" The honest answer used to be: too much. An early crew burned a model call every ten seconds just to report there was nothing to do. Two mechanisms fixed that, and they're worth understanding because the economics are designed in, not bolted on.
First, the NO-WORK protocol. An agent signals an empty tick by
beginning its result with the literal token NO-WORK. The worker
matches that prefix and records the outcome as :no_work — no
commit, no fuss. The agent decides; the runtime reads.
Second, exponential idle backoff. Consecutive no-work ticks stretch
the gap: the next delay is max(base cadence, 60s · 2^(streak−1)),
capped at thirty minutes. One :done snaps the cadence straight
back to hot. An idle keeper cools off geometrically; a keeper that finds work
is instantly responsive again.
| no-work streak | backoff term | next delay (base cadence ≤ 1 min) |
|---|---|---|
| 0 (just did work) | — | base cadence (hot) |
| 1 | 60s · 2⁰ | 1 min |
| 2 | 60s · 2¹ | 2 min |
| 3 | 60s · 2² | 4 min |
| 4 | 60s · 2³ | 8 min |
| 5 | 60s · 2⁴ | 16 min |
| 6+ | 60s · 2⁵… | 30 min (capped) |
The verdict of that table in one line: a keeper with nothing to do settles into a thirty-minute heartbeat within six idle ticks — and the moment it finds work, it's back to full cadence on the next beat. Doing nothing should be nearly free, and here it is.
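Both mechanisms reduce to a prefix check and one formula. The Python sketch below makes two assumptions the text leaves open: the `NO-WORK` match is an exact prefix with no whitespace tolerance, and the thirty-minute cap applies to the backoff term before the `max` with the base cadence. The names `classify` and `next_delay_s` are illustrative.

```python
BACKOFF_UNIT_S = 60        # the 60s factor in the formula
BACKOFF_CAP_S = 30 * 60    # thirty-minute cap


def classify(result_text):
    """Map an agent's result text to an outcome by its leading token."""
    return "no_work" if result_text.startswith("NO-WORK") else "done"


def next_delay_s(base_cadence_s, no_work_streak):
    """Delay before the next tick, given consecutive no-work ticks so far."""
    if no_work_streak <= 0:
        return base_cadence_s  # just did work: back to hot
    term = BACKOFF_UNIT_S * 2 ** (no_work_streak - 1)
    # Assumption: cap the backoff term, then never go faster than base.
    return max(base_cadence_s, min(term, BACKOFF_CAP_S))


# The table's column, for a keeper with a ten-second base cadence.
print([next_delay_s(10, streak) for streak in range(0, 8)])
# [10, 60, 120, 240, 480, 960, 1800, 1800]
```

With a fifteen-minute base cadence the early rows never show: `next_delay_s(900, 1)` is still 900, because the `max` keeps the base until the backoff term overtakes it at streak 5.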
from one keeper to a NEWSROOM
One standing agent is the singleton. Several agents ticking together is a
crew — a supervisor that starts one worker per member, read from a
manifest. The manifest is plain org: headings are agent names, and a
:PROPERTIES: drawer carries each member's config. Here's the real
bit.ml newsroom:
* desk
:PROPERTIES:
:DEF: /data/agents/desk.org
:INTERVAL: 45m
:END:
* moss
:PROPERTIES:
:DEF: /data/agents/researcher.org
:INTERVAL: 15m
:END:
* wren
:PROPERTIES:
:DEF: /data/agents/writer.org
:INTERVAL: 15m
:END:
* hale
:PROPERTIES:
:DEF: /data/agents/editor.org
:INTERVAL: 20m
:END:
Four agents — a desk editor, a researcher, a writer, a copy editor — each
on its own interval, each running its own def. :DEF: is required
(a member without it is skipped); :INTERVAL: accepts durations
like 15m, 2h, 90s, or bare
milliseconds, defaulting to an hour; an optional :LIFECYCLE:
points at a state-machine spec, and its absence means plain interval ticks.
The manifest is re-read each call, so adding an agent doesn't need a redeploy.
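The `:INTERVAL:` grammar is worth pinning down with an example. This is a Python sketch of the accepted forms as the text lists them; the function name is hypothetical, and falling back to the default on an unparseable value is an assumption, not documented behavior.

```python
import re

DEFAULT_INTERVAL_MS = 60 * 60 * 1000  # "defaulting to an hour"
UNIT_MS = {"s": 1_000, "m": 60_000, "h": 3_600_000}


def parse_interval_ms(raw):
    """Parse '15m', '2h', '90s', or bare milliseconds into milliseconds."""
    if raw is None:
        return DEFAULT_INTERVAL_MS
    match = re.fullmatch(r"\s*(\d+)\s*([smh]?)\s*", raw)
    if match is None:
        # Assumption: an unreadable value behaves like a missing one.
        return DEFAULT_INTERVAL_MS
    value, unit = int(match.group(1)), match.group(2)
    # No unit suffix means the number is already in milliseconds.
    return value * UNIT_MS.get(unit, 1)


for raw in ["15m", "2h", "90s", "1500", None]:
    print(raw, "->", parse_interval_ms(raw))
```

For the newsroom above, `desk` resolves to 2,700,000 ms, `moss` and `wren` to 900,000 ms, and `hale` to 1,200,000 ms.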
Two design choices keep a crew from being a mess. Staggered boots:
worker i gets a first-tick delay of i times a stagger
(default thirty seconds), so four agents fire at zero, thirty, sixty, and
ninety seconds — they overlap visibly without a thundering herd of LLM calls
at boot. And fully namespaced state: every persistence file is
per-agent — keeper-last-run-wren, lifecycle-pos-wren
— so N agents tick independently with no shared cadence clock. The two modes
are mutually exclusive in the supervision tree: a crew manifest takes
precedence, then a single def, then neither child exists at all.
| | singleton | crew |
|---|---|---|
| activated by | WB_KEEPER_DEF | WB_CREW_DEF (takes precedence) |
| workers | one, legacy name | one per manifest member |
| last-run file | keeper-last-run | keeper-last-run-<name> |
| lifecycle file | lifecycle-pos | lifecycle-pos-<name> |
| status key | {Keeper, :status} | {Keeper, :status, "-<name>"} |
| concurrency gate | no-op | FIFO semaphore, default 2 |
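Stagger and namespacing are both pure functions of a member's name and position, which a short Python sketch makes visible. It assumes workers are indexed in manifest order; `crew_plan` and its field names are invented for illustration.

```python
STAGGER_S = 30  # default stagger from the text


def crew_plan(names, stagger_s=STAGGER_S):
    """Per-member boot offset and persistence file names."""
    return [
        {
            "name": name,
            "first_tick_delay_s": index * stagger_s,
            "last_run_file": f"keeper-last-run-{name}",
            "lifecycle_file": f"lifecycle-pos-{name}",
        }
        for index, name in enumerate(names)
    ]


for member in crew_plan(["desk", "moss", "wren", "hale"]):
    print(member)
```

The four members get offsets of 0, 30, 60, and 90 seconds, and no two share a file, so nothing about one agent's cadence can be read or clobbered by another.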
the gate and the CLAIM
Depth rung. Four agents on intervals of fifteen to forty-five minutes will sometimes wake at
once, and four simultaneous LLM-driven runs is a way to set money on fire. So a
crew shares one global concurrency cap — a counting semaphore, default
two concurrent runs. A worker must acquire a slot before it runs
and release it after. The release runs in an after
block, so a crashed or timed-out run always returns its slot — a
failure can't leak capacity.
sequenceDiagram
participant M as moss
participant W as wren
participant H as hale
participant G as gate (2 slots)
M->>G: acquire ✓ (slot 1)
W->>G: acquire ✓ (slot 2)
H->>G: acquire, parks (FIFO queue)
Note over H,G: hale blocks with :infinity, no starvation
M->>G: release (run done)
G->>H: slot handed directly to queue head
Note over H: hale runs now
The queue is FIFO with an infinite wait, so no agent starves; release hands the slot directly to the head of the queue rather than racing for it. The singleton passes a no-op acquire and release — one agent never contends.
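The direct handoff is the subtle part, so here is the gate's bookkeeping as a single-threaded Python model. It is not the real gate: a real `acquire` blocks the caller, while this one just reports whether the caller got a slot or was queued. The class and method names are illustrative.

```python
from collections import deque


class Gate:
    """Counting semaphore with a FIFO queue and direct slot handoff."""

    def __init__(self, slots=2):
        self.free = slots
        self.holders = set()
        self.queue = deque()

    def acquire(self, name):
        """Return True if name holds a slot now, False if it parked."""
        if self.free > 0:
            self.free -= 1
            self.holders.add(name)
            return True
        self.queue.append(name)
        return False

    def release(self, name):
        """Return the waiter the slot was handed to, or None."""
        self.holders.remove(name)
        if self.queue:
            head = self.queue.popleft()
            # The slot goes straight to the queue head and is never
            # marked free, so a newcomer cannot race in ahead of it.
            self.holders.add(head)
            return head
        self.free += 1
        return None


gate = Gate(slots=2)
print(gate.acquire("moss"), gate.acquire("wren"), gate.acquire("hale"))
print(gate.release("moss"))
```

This replays the diagram: `moss` and `wren` get slots, `hale` parks, and the release from `moss` returns `hale`. In a real caller the release belongs in a `try`/`finally`, the Python counterpart of the `after` block described above.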
Now the sharp line. You might expect the runtime to lock a task so
two agents can't grab the same one. It doesn't. Board claims are a
def-level protocol, not a runtime lock. An agent claims work by making a
board state change and committing an :AGENT: property to git
before it starts — a task flipped to DOING with :AGENT: wren
on it. That commit is visible to every peer in git before any work happens. The
runtime's job is narrower and honest: it isolates runs and throttles
concurrency. Coordination is the agents' protocol, played out in the open, in
version control — not a mutex the platform holds.
watching it WORK
A process awake at 3am is only trustworthy if you can see it. The public
plane exposes two read-only endpoints, and they read the keeper's state
without ever blocking it. GET /_changes returns the recent git log
plus the keeper's status — the commits the keeper made are the
changelog. GET /_activity returns live telemetry: for a singleton,
the agent's status, its last few steps, and its current thought; for a crew,
a per-agent breakdown plus a merged wire of recent steps, each tagged
with the agent that emitted it.
The way status is read matters more than it looks. It's published to
:persistent_term — a lock-free shared term — and read from there,
never via a call to the worker. The reason is exact: a tick runs
synchronously inside the worker, so a status call would block for the entire
run, up to fifteen minutes. Reading the published term instead means the
dashboard is always instant even while the agent is mid-thought. Here's the
crew shape:
{ "agents": [
{ "name": "wren", "running": true,
"lifecycle": { "state": "wake_add", "hits": 1 },
"steps": [ ... last 5 ... ], "thought": "..." } ],
"wire": [ /* recent steps, each tagged with its agent */ ],
"agent": { /* legacy block: the busiest / most-recent agent */ } }
You can also nudge a keeper by hand: run_once sends a single
watched tick immediately, useful for testing a def without waiting out the
interval. And the idle keeper has a voice — between substantive runs it can
daydream, a short public-only note, never committed; a full
dream happens only after an audit and past a staleness gate. The
living landing site is the proof of all of this: its keeper agent — Waldo,
the Workbook Autonomous Live Document Operator — runs every fifteen minutes,
the fourth run each hour being the audit pass, and every commit it makes
appears in the page's right-hand timeline. Open copies of the page act it out
with a live cursor. The schedule is real, the agent is real, and you can watch
it from a browser.
schedules in the SOURCE
There are several places declared time lives in this ecosystem, and it's
worth being precise about which one actually ticks. A
workflow headline can carry a :SCHEDULE:
cron property or a native SCHEDULED: timestamp; the kernel
surfaces that as a schedule field per workflow, which
Workflow.list returns and the plan endpoint
(POST /api/workflow?plan=1) reports without executing anything. A
crew member declares its own cadence with :INTERVAL:. A lifecycle
state declares a time gate with :MIN-INTERVAL:.
Here's the honest framing, because it's easy to blur. Nothing in the host
cron-executes a workflow's :SCHEDULE: on its own. That
field is declared metadata, surfaced in plan output and run records.
The thing that actually wakes up and runs on a clock is the keeper: its
interval, its lifecycle, its gates. The workflow schedule is what a keeper's
agent reads and acts on; the keeper is the engine that does the waking. Don't
picture a built-in cron daemon firing :SCHEDULE: lines — picture a
keeper ticking, and an agent inside it consulting the plan.
what keepers AREN'T
Honesty section. A keeper is an interval-and-lifecycle ticker, and it's sharpest when you know what it isn't.
It is not a cron engine for workflow schedules. As the last section
said, :SCHEDULE: on a workflow is surfaced metadata today, not an
autonomous trigger. The keeper's own interval is the live clock; the workflow
schedule is a thing its agent reads.
It is not a task-lock service. Two agents avoid grabbing the same
task through a def-level protocol committed to git — a visible
:AGENT: claim made before the work — not through a runtime mutex.
The runtime isolates and throttles; it does not adjudicate ownership.
It is not free. Every wake tick is a model call with a real bill, which is the whole reason the NO-WORK protocol and exponential backoff exist: the economics are a first-class design concern, not an afterthought.
It is not transactional. The fifteen-minute wall-clock kill is genuinely brutal: a wedged run is shut down hard, mid-work. There's no rollback; a half-finished run is recovered the honest way, by git and a retry of the same state next tick. The cadence survives the crash; the partial work is just abandoned and redone.
questions people actually ASK
What happens if a run hangs forever?
It's killed. Each run executes in a linked task with a wall-clock bound —
fifteen minutes by default. The worker waits with a yield; if the run blows
the clock it's shut down brutally, the outcome is :killed, and
under a lifecycle the position holds so the same state retries next tick. A
hung run costs you one wasted interval, not your schedule.
Does a redeploy reset the schedule?
No — because the schedule isn't in memory, it's in files on the data
volume. The last-run timestamp lives in keeper-last-run and the
lifecycle position in lifecycle-pos. On boot the worker reads
them and computes its next delay as the remainder of the interval, never less
than the sixty-second boot-grace floor, so it
resumes mid-cadence: on the 2nd of 3 adds, not back at the top.
How do two agents avoid grabbing the same task?
By a def-level protocol, not a runtime lock. An agent claims work by
committing an :AGENT: property to git before it starts — visible
to every peer in version control. The runtime only isolates runs and caps
concurrency at, by default, two at once. Coordination happens in the open, in
git, where you can read it.
Can I change the cadence without redeploying?
Yes. The lifecycle spec is re-read on every tick — kept deliberately dumb — so a hot edit to the state machine is picked up live. The crew manifest is re-read each call too. You change the org file; the next tick honors it. No rebuild, no restart.
What does an idle keeper cost?
Almost nothing. An agent signals an empty tick by prefixing its result
with NO-WORK, and consecutive no-work ticks back off
exponentially — one minute, two, four, up to a thirty-minute cap. One real
unit of work snaps the cadence straight back to hot. Doing nothing settles
into a half-hour heartbeat within six ticks.
Is a keeper the same as the Autopoet?
No — they're peers. The Autopoet is a standing agent that tends the system's own config; a keeper is the mechanism that ticks any standing agent on a cadence. They sit side by side in the same supervision tree. A keeper could run the Autopoet; the Autopoet is not a kind of keeper.
keep GOING
This sub-lesson proves a promise the parent made — start there if the process model is new.