
the process AWAKE at three a.m.

A keeper is the runtime's own scheduler — a supervised worker that ticks an agent definition on a cadence, entirely on-box. The runtime owns when: intervals, gates, backoff, all declared. The def owns what: the non-deterministic interior of a tick. And the schedule itself lives on disk, so a redeploy resumes mid-cadence — never resets the clock.

keepers · 13 min read
[Image: a lone watchkeeper before a monumental brass clock-tower mechanism, one small lamp lit against towering gears that turn slowly on their own]

who is awake at 3AM?

The nexus lesson made a promise: a docked workbook is stateful — schedules fire overnight, the work keeps moving while the page is closed. That's a lovely sentence. It's also the exact sentence every agent demo breaks. The terminal closes, the session ends, the context window evaporates, and the "always-on agent" turns out to be a person holding a laptop open.

So this page is the honest version of that promise. Something has to be awake at three in the morning, and it has to be a real process you can name — not a vibe in a chat window. The fair questions are blunt: what process is ticking, what decides what it does on each tick, and what happens when that process crashes, hangs forever, or wakes up to find there's nothing to do? A scheduler that can't answer the third question isn't a scheduler — it's a hope.

The answer is a keeper, and the satisfying part is how boring it turns out to be. No platform, no daemon mesh — one supervised Elixir worker, a timer message, a wall-clock bound, and a text file holding the last-run timestamp. About four hundred lines. The rest of this page is that worker, walked end to end.

the DEFINITION

keep·er /ˈkiː·pər/ noun

1. a supervised worker that ticks an agent definition on a cadence, entirely on-box — reading a loaded artifact, never calling the public plane. One per agent; the runtime owns when, the def owns what.

A keeper belongs entirely to the host layer. It reads a loaded artifact — your .org agent def — but it never touches the public plane and never crosses the Dock membrane. When it runs the agent, it goes through the exact same AgentDef.run path as an interactive API call. A scheduled run and a request you typed are the same code; only the trigger differs.

There's a trinity of names worth pinning down now, because they recur. The worker is the tick engine: instantiable, one per agent. The keeper is the singleton facade: it starts exactly one worker under a legacy name and keeps the original persistence keys, so the case of a single standing agent sees zero regression. The crew is a supervisor that starts one worker per agent from a manifest. The singleton is just a crew of one. Everything below is the worker; the other two are wrappers around it.

anatomy of a TICK

Here is one tick, whole. The worker wakes on a timer message, and in order: it writes the current time to its last-run file; it optionally reconciles git (pulling human and CI pushes, opt-in); it decides what this tick is (a lifecycle-aware step, covered below); it runs the agent def inside a linked task with a wall-clock bound; it reads the outcome; and it schedules the next tick. That's the loop.

sequenceDiagram
  participant T as timer
  participant W as keeper worker
  participant G as git (tenant repo)
  participant A as AgentDef.run
  participant R as tenant repo + disk
  T->>W: :tick
  W->>R: write keeper-last-run (unix seconds)
  opt WB_GITOPS=1
    W->>G: pull origin — integrate human / CI pushes
  end
  W->>W: decide this tick (lifecycle step)
  W->>A: run def — workdir = page's git repo
  Note over A: linked Task · 15-min wall clock
  A-->>W: :done | :failed | :killed | :no_work
  W->>R: commit (if work done) — IS the changelog
  W->>T: schedule next tick (delay from outcome)
  

Three details make this survivable rather than fragile. First, the run executes in a linked task with a wall-clock bound — default fifteen minutes. The worker waits with Task.yield; if the run blows the clock, Task.shutdown kills it brutally and the outcome is :killed. A wedged run can't hang the schedule forever.

Second, the worker traps exits. A crash inside the run task (and an LLM-driven agent crashes in a hundred ways) never takes the worker down. The exit message is absorbed; the worker reschedules and lives on.

Third, every run resolves to one of the four outcomes above (:done, :failed, :killed, :no_work), and each one steers the next delay differently.

The run itself is one line, and it carries the tenant: the working directory is the page's git repo, so every commit the keeper makes is the public changelog. (A real war story lives in a code comment here: before the tenant was threaded through correctly, the landing site committed four posts that all 404'd, because they had auto-published to a build directory instead of the served root. The fix was making the run's workdir the tenant repo, not guessing.)
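The wall-clock bound and the crash absorption can be sketched outside Elixir. The worker itself uses a linked task with Task.yield and Task.shutdown; the Python below is only an analogy that runs the work in a child process, and the name run_bounded is hypothetical. It covers three of the four outcomes; :no_work depends on the text of the result and is left out here.

```python
import subprocess
import sys

WALL_CLOCK_SECONDS = 15 * 60  # the default bound named in the text


def run_bounded(argv, timeout=WALL_CLOCK_SECONDS):
    """Run one tick's work in a child process.

    Returns 'done', 'failed', or 'killed'. The caller never raises,
    which mirrors a worker that absorbs crashes and reschedules.
    """
    try:
        # On timeout, subprocess.run kills the child, waits for it,
        # and then raises TimeoutExpired.
        result = subprocess.run(argv, capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return "killed"
    except OSError:
        # The child could not even be started (assumption: count as failed).
        return "failed"
    return "done" if result.returncode == 0 else "failed"
```

Called with a command that sleeps past the timeout, the function returns "killed" and the caller carries on, which is the property the schedule depends on.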

time that survives RESTARTS

The most important fact about a keeper is where its schedule lives. Not in RAM. On disk. Each tick writes the current unix time to a plain text file — keeper-last-run — on the data volume, beside the work. On boot, the worker reads that file and computes its first delay as max(60s, interval − elapsed). A redeploy never resets the cadence clock; it picks up where it left off.

flowchart TD
  boot([worker boots]) --> read{"keeper-last-run<br/>exists?"}
  read -- no --> first["first tick after the<br/>60s boot-grace floor"]
  read -- yes --> calc["elapsed = now − last-run<br/>delay = max(60s, interval − elapsed)"]
  calc --> due{"elapsed ≥<br/>interval?"}
  due -- yes --> grace["fire in 60s<br/>boot-grace floor"]
  due -- no --> wait["fire in the remainder"]
  style first fill:#aee5c2,stroke:#121316
  style grace fill:#13d943,stroke:#121316,stroke-width:2.5px
  style wait fill:#ffffff,stroke:#121316

Walk the branches. If there's no last-run file — a fresh keeper — the first tick fires after the sixty-second boot-grace floor, not on boot. Booting isn't doing work, and the grace floor keeps a flapping deployment from hammering the model the instant the engine comes up. If the file exists and the interval already elapsed while you were down, the tick is due — but that same sixty-second boot-grace floor still applies, so a restart can't fire a run on the spot. Otherwise: fire in the remainder. A keeper that last ran eleven minutes ago, on a fifteen-minute interval, redeployed right now, ticks in four minutes — max(60s, 900s − 660s) — not fifteen.
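The boot arithmetic is small enough to write down. This is a Python sketch of the formula as stated, not the worker's Elixir code; the function names are hypothetical, and treating an unreadable file the same as a missing one is an assumption of the sketch.

```python
from pathlib import Path
from typing import Optional

BOOT_GRACE_SECONDS = 60  # the boot-grace floor from the text


def read_last_run(path: Path) -> Optional[int]:
    """Return the stored unix-seconds timestamp, or None for a fresh keeper."""
    try:
        return int(path.read_text().strip())
    except (FileNotFoundError, ValueError):
        # Assumption: a missing or unparsable file counts as "never ran".
        return None


def first_delay(now: int, interval: int, last_run: Optional[int]) -> int:
    """Seconds until the first tick after boot: max(60s, interval - elapsed)."""
    if last_run is None:
        return BOOT_GRACE_SECONDS
    elapsed = now - last_run
    return max(BOOT_GRACE_SECONDS, interval - elapsed)
```

With a 900-second interval and a last run 660 seconds ago, first_delay returns 240, the four minutes from the example above. An overdue keeper and a fresh one both get the 60-second floor.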

There's a second mode for agents that should run nearly continuously. Continuous mode replaces the full interval between ticks with a short breather — default forty-five seconds — so the agent loops with a pause to catch its breath instead of sleeping a full hour. Same machinery, a different gap.

what each tick IS

Depth rung — skippable, but it's where "a timer that runs an agent" becomes a real orchestrator. So far each tick has been the same: run the def. But the orchestrator agent has a true workflow — a deterministic state machine declared in native org and executed one transition per tick. The org file owns the spec; the runtime just interprets and steps it. What the agent does inside a state stays non-deterministic — that's the def's job. The skeleton is deterministic — that's this file's job.

Here's the canonical loop, verbatim. Headings are states; the drawer properties are the edges and gates:

#+START: wake_add

* wake_add
:PROPERTIES:
:KIND: wake          ← run the agent def
:REPEAT: 3           ← hold for 3 successful ticks
:NEXT: wake_audit
:END:

* wake_audit
:PROPERTIES:
:KIND: wake
:NEXT: rem
:END:

* rem
:PROPERTIES:
:KIND: rem           ← dream — no agent runs
:NEXT: wake_plan
:MIN-INTERVAL: 10m   ← time gate; a gated tick is a no-op
:END:

* wake_plan
:PROPERTIES:
:KIND: wake
:NEXT: wake_add      ← back to the top
:END:

The loop is: add three times, audit once, dream (if at least ten minutes have passed since the last dream), plan, and back to adding. A wake state runs the agent; a rem state skips the agent entirely and dreams instead, consolidating the cycle's telemetry with no model call against the page. Position is a pair, {state, hits}, and it persists to lifecycle-pos beside the last-run file. A redeploy resumes mid-cadence: with two of the three adds done, the file literally reads wake_add 2.
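To show how little the spec format asks of its reader, here is a minimal Python sketch of a parser for it. It is not the runtime's interpreter: parse_lifecycle is a hypothetical name, and the sketch expects the spec without the arrow annotations used above for explanation.

```python
import re

START = re.compile(r"^#\+START:\s*(\S+)\s*$")
HEADING = re.compile(r"^\*\s+(\S+)\s*$")
PROPERTY = re.compile(r"^:([A-Z-]+):\s*(.*?)\s*$")

SPEC = """\
#+START: wake_add
* wake_add
:PROPERTIES:
:KIND: wake
:REPEAT: 3
:NEXT: wake_audit
:END:
* wake_audit
:PROPERTIES:
:KIND: wake
:NEXT: rem
:END:
* rem
:PROPERTIES:
:KIND: rem
:NEXT: wake_plan
:MIN-INTERVAL: 10m
:END:
* wake_plan
:PROPERTIES:
:KIND: wake
:NEXT: wake_add
:END:
"""


def parse_lifecycle(text):
    """Parse the org spec into (start_state, {state: {PROPERTY: value}})."""
    start, states, current = None, {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if m := START.match(line):
            start = m.group(1)
        elif m := HEADING.match(line):
            current = states.setdefault(m.group(1), {})
        elif (m := PROPERTY.match(line)) and current is not None:
            key, value = m.groups()
            if key not in ("PROPERTIES", "END"):  # drawer delimiters
                current[key] = value
    return start, states
```

Headings become states and drawer properties become that state's edges and gates, so parse_lifecycle(SPEC) yields wake_add as the start state and four entries in declaration order.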

The state machine, with its edges:

stateDiagram-v2
  [*] --> wake_add
  wake_add --> wake_add: :done (hits < 3)
  wake_add --> wake_audit: :done ×3 — or :no_work fast-forwards
  wake_audit --> rem: :done
  rem --> wake_plan: dream done (gated 10m → no-op, hold)
  wake_plan --> wake_add: :done
  note right of wake_add
    :failed / :killed → hold position, retry same state next tick
  end note

The transition rule is the crisp part. On :done, hits go up by one; at :REPEAT:, take :NEXT: and reset hits. On :no_work, collapse the remaining repeats and take the edge now — but this is a fast-forward of repeats only, never a skip: every state in the declared order still runs, so the audit and the dream keep their cadence. On :failed or :killed, hold position and retry the same state next tick — the global failure rule, so cadence survives crashes and timeouts. And the spec file is re-read every single tick, kept deliberately dumb, so a hot-edited schedule is picked up live — no redeploy to change the loop.
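The transition rule reads naturally as a pure function over the position pair. The Python below is a sketch under two assumptions: a state without :REPEAT: advances after one :done, and a gated tick is passed in as the outcome "gated". The names step and SPEC are hypothetical, and the dictionary stands in for the parsed org file.

```python
SPEC = {
    "wake_add": {"repeat": 3, "next": "wake_audit"},
    "wake_audit": {"repeat": 1, "next": "rem"},
    "rem": {"repeat": 1, "next": "wake_plan"},
    "wake_plan": {"repeat": 1, "next": "wake_add"},
}


def step(spec, position, outcome):
    """Apply one tick's outcome to a (state, hits) position."""
    state, hits = position
    if outcome in ("failed", "killed", "gated"):
        return position                      # hold; retry the same state
    if outcome == "no_work":
        return (spec[state]["next"], 0)      # collapse the remaining repeats
    if outcome == "done":
        hits += 1
        if hits >= spec[state]["repeat"]:
            return (spec[state]["next"], 0)  # REPEAT reached: take NEXT
        return (state, hits)
    raise ValueError(f"unknown outcome: {outcome!r}")
```

Note that :no_work only ever moves to the current state's :NEXT:, so it can shorten a run of repeats but cannot jump over a declared state.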

A worked trace makes the rules concrete. Start at wake_add:

| tick | state (hits) | outcome | next position |
|------|--------------|---------|---------------|
| 1 | wake_add (0) | :done | wake_add (1) |
| 2 | wake_add (1) | :failed | wake_add (1): retry, cadence held |
| 3 | wake_add (1) | :done | wake_add (2) |
| 4 | wake_add (2) | :no_work | wake_audit (0): repeats collapsed, never skipped |
| 5 | wake_audit (0) | :done | rem (0) |
| 6 | rem (4 min since last dream) | gated | no-op, position held |

Read tick 2 and tick 4 together: a failure costs you nothing but a retry, and a no-work doesn't let the agent dodge its audit — it just stops grinding adds that have nothing left to add. Tick 6 is the time gate doing its job: the dream wants ten minutes of distance, only four have passed, so the tick is a no-op and the position waits.

the economics of doing NOTHING

Every wake tick is an LLM call, and an LLM call costs real money. So the expensive question isn't "what does a busy keeper do?" — it's "what does an idle one cost?" The honest answer used to be: too much. An early crew burned a model call every ten seconds just to report there was nothing to do. Two mechanisms fixed that, and they're worth understanding because the economics are designed in, not bolted on.

First, the NO-WORK protocol. An agent signals an empty tick by beginning its result with the literal token NO-WORK. The worker matches that prefix and records the outcome as :no_work — no commit, no fuss. The agent decides; the runtime reads.
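The prefix match is a one-liner. A Python sketch, with hypothetical function names; tolerating leading whitespace before the token is this sketch's choice, not something the source states.

```python
NO_WORK_TOKEN = "NO-WORK"


def classify_result(result_text: str) -> str:
    """Map a successful run's result text to 'no_work' or 'done'."""
    # Assumption: leading whitespace before the token is ignored.
    if result_text.lstrip().startswith(NO_WORK_TOKEN):
        return "no_work"
    return "done"


def should_commit(outcome: str) -> bool:
    """Only a tick that did work produces a commit."""
    return outcome == "done"
```

The token has to lead the result: a run that merely mentions NO-WORK somewhere in its output still counts as :done.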

Second, exponential idle backoff. Consecutive no-work ticks stretch the gap: the next delay is max(base cadence, 60s · 2^(streak−1)), capped at thirty minutes. One :done snaps the cadence straight back to hot. An idle keeper cools off geometrically; a keeper that finds work is instantly responsive again.

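| no-work streak | backoff term | next delay |
|----------------|--------------|------------|
| 0 (just did work) | | base cadence (hot) |
| 1 | 60s · 2⁰ | 1 min |
| 2 | 60s · 2¹ | 2 min |
| 3 | 60s · 2² | 4 min |
| 4 | 60s · 2³ | 8 min |
| 5 | 60s · 2⁴ | 16 min |
| 6+ | 60s · 2⁵… | 30 min (capped) |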

The verdict of that table in one line: a keeper with nothing to do settles into a thirty-minute heartbeat within six idle ticks — and the moment it finds work, it's back to full cadence on the next beat. Doing nothing should be nearly free, and here it is.
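The backoff formula can be checked directly. This Python sketch follows the stated rule, max(base cadence, 60s · 2^(streak−1)) with a thirty-minute cap; applying the cap to the backoff term before taking the max is an assumption, since the wording leaves the order open (the two readings agree whenever the base cadence is thirty minutes or less). The name next_delay is hypothetical.

```python
BACKOFF_UNIT_SECONDS = 60
BACKOFF_CAP_SECONDS = 30 * 60  # thirty-minute cap


def next_delay(base_cadence: int, no_work_streak: int) -> int:
    """Seconds until the next tick, given consecutive no-work ticks so far."""
    if no_work_streak <= 0:
        return base_cadence  # just did work: back to hot
    term = BACKOFF_UNIT_SECONDS * 2 ** (no_work_streak - 1)
    # Assumption: the cap limits the backoff term, not the base cadence.
    return max(base_cadence, min(term, BACKOFF_CAP_SECONDS))
```

The table's delays match a base cadence under a minute, such as the forty-five-second continuous breather. With a fifteen-minute base, the first four idle ticks stay at fifteen minutes, because the max keeps the base cadence as a floor.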

from one keeper to a NEWSROOM

One standing agent is the singleton. Several agents ticking together is a crew — a supervisor that starts one worker per member, read from a manifest. The manifest is plain org: headings are agent names, and a :PROPERTIES: drawer carries each member's config. Here's the real bit.ml newsroom:

* desk
:PROPERTIES:
:DEF: /data/agents/desk.org
:INTERVAL: 45m
:END:
* moss
:PROPERTIES:
:DEF: /data/agents/researcher.org
:INTERVAL: 15m
:END:
* wren
:PROPERTIES:
:DEF: /data/agents/writer.org
:INTERVAL: 15m
:END:
* hale
:PROPERTIES:
:DEF: /data/agents/editor.org
:INTERVAL: 20m
:END:

Four agents — a desk editor, a researcher, a writer, a copy editor — each on its own interval, each running its own def. :DEF: is required (a member without it is skipped); :INTERVAL: accepts durations like 15m, 2h, 90s, or bare milliseconds, defaulting to an hour; an optional :LIFECYCLE: points at a state-machine spec, and its absence means plain interval ticks. The manifest is re-read each call, so adding an agent doesn't need a redeploy.
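The :INTERVAL: forms listed above are easy to normalize. A Python sketch with a hypothetical parse_interval; falling back to the one-hour default on unparsable input is an assumption, as the source only says what happens when the property is absent.

```python
import re

DEFAULT_INTERVAL_MS = 60 * 60 * 1000  # one hour
UNIT_MS = {"s": 1_000, "m": 60_000, "h": 3_600_000}
DURATION = re.compile(r"^(\d+)([smh]?)$")


def parse_interval(value):
    """Convert '15m', '2h', '90s', or bare milliseconds to milliseconds."""
    if value is None:
        return DEFAULT_INTERVAL_MS
    match = DURATION.match(value.strip())
    if not match:
        # Assumption: malformed values fall back to the default.
        return DEFAULT_INTERVAL_MS
    number, unit = match.groups()
    return int(number) * UNIT_MS.get(unit, 1)  # no suffix: bare milliseconds
```

Normalizing to one unit up front means the rest of a scheduler only ever compares integers.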

Two design choices keep a crew from being a mess. Staggered boots: worker i gets a first-tick delay of i times a stagger (default thirty seconds), so four agents fire at zero, thirty, sixty, and ninety seconds. They overlap visibly without a thundering herd of LLM calls at boot. And fully namespaced state: every persistence file is per-agent (keeper-last-run-wren, lifecycle-pos-wren), so N agents tick independently with no shared cadence clock.

Singleton and crew are mutually exclusive in the supervision tree: a crew manifest takes precedence, then a single def, and with neither set, neither child exists at all.
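Both choices reduce to a line of arithmetic and a line of string formatting. A Python sketch with hypothetical names; that worker indices follow manifest order is an assumption.

```python
STAGGER_SECONDS = 30  # default stagger from the text


def first_tick_delays(names, stagger=STAGGER_SECONDS):
    """First-tick delay per worker: index times stagger.

    Assumption: indices follow the order of members in the manifest.
    """
    return {name: i * stagger for i, name in enumerate(names)}


def state_file(base, agent=None):
    """Per-agent persistence file name; the singleton keeps the bare name."""
    return base if agent is None else f"{base}-{agent}"
```

For the four-member newsroom this gives delays of 0, 30, 60, and 90 seconds, and file names such as keeper-last-run-wren.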

| | singleton | crew |
|---|-----------|------|
| activated by | WB_KEEPER_DEF | WB_CREW_DEF (takes precedence) |
| workers | one, legacy name | one per manifest member |
| last-run file | keeper-last-run | keeper-last-run-<name> |
| lifecycle file | lifecycle-pos | lifecycle-pos-<name> |
| status key | {Keeper, :status} | {Keeper, :status, "-<name>"} |
| concurrency gate | no-op | FIFO semaphore, default 2 |

the gate and the CLAIM

Depth rung. Four agents on fifteen-minute intervals will sometimes wake at once, and four simultaneous LLM-driven runs is a way to set money on fire. So a crew shares one global concurrency cap — a counting semaphore, default two concurrent runs. A worker must acquire a slot before it runs and release it after. The release runs in an after block, so a crashed or timed-out run always returns its slot — a failure can't leak capacity.

sequenceDiagram
  participant M as moss
  participant W as wren
  participant H as hale
  participant G as gate (2 slots)
  M->>G: acquire ✓ (slot 1)
  W->>G: acquire ✓ (slot 2)
  H->>G: acquire — parks (FIFO queue)
  Note over H,G: hale blocks with :infinity<br/>no starvation
  M->>G: release (run done)
  G->>H: slot handed directly to queue head
  Note over H: hale runs now

The queue is FIFO with an infinite wait, so no agent starves; release hands the slot directly to the head of the queue rather than racing for it. The singleton passes a no-op acquire and release — one agent never contends.
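The gate's behavior (FIFO queue, unbounded wait, direct hand-off, release that always runs) can be sketched with threads. This is a Python analogy of the described semantics, not the Elixir implementation; the class Gate and its method names are hypothetical, and try/finally plays the role of the after block.

```python
import threading
from collections import deque
from contextlib import contextmanager


class Gate:
    """Counting semaphore with a FIFO queue and direct hand-off on release."""

    def __init__(self, slots=2):
        self._lock = threading.Lock()
        self._free = slots
        self._waiters = deque()  # one Event per parked caller, oldest first

    def acquire(self):
        with self._lock:
            if self._free > 0:
                self._free -= 1
                return
            parked = threading.Event()
            self._waiters.append(parked)
        parked.wait()  # no timeout: wait until a slot is handed over

    def release(self):
        with self._lock:
            if self._waiters:
                # Hand the slot straight to the queue head; the free
                # count stays unchanged, so nobody can race for it.
                self._waiters.popleft().set()
            else:
                self._free += 1

    @contextmanager
    def slot(self):
        self.acquire()
        try:
            yield
        finally:
            self.release()  # runs even if the body raises
```

Because release passes the slot to the oldest waiter instead of incrementing the counter, a newly arriving caller can never overtake one that is already parked.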

Now the sharp line. You might expect the runtime to lock a task so two agents can't grab the same one. It doesn't. Board claims are a def-level protocol, not a runtime lock. An agent claims work by making a board state change and committing an :AGENT: property to git before it starts — a task flipped to DOING with :AGENT: wren on it. That commit is visible to every peer in git before any work happens. The runtime's job is narrower and honest: it isolates runs and throttles concurrency. Coordination is the agents' protocol, played out in the open, in version control — not a mutex the platform holds.

watching it WORK

A process awake at 3am is only trustworthy if you can see it. The public plane exposes two read-only endpoints, and they read the keeper's state without ever blocking it. GET /_changes returns the recent git log plus the keeper's status — the commits the keeper made are the changelog. GET /_activity returns live telemetry: for a singleton, the agent's status, its last few steps, and its current thought; for a crew, a per-agent breakdown plus a merged wire of recent steps, each tagged with the agent that emitted it.

The way status is read matters more than it looks. It's published to :persistent_term — a lock-free shared term — and read from there, never via a call to the worker. The reason is exact: a tick runs synchronously inside the worker, so a status call would block for the entire run, up to fifteen minutes. Reading the published term instead means the dashboard is always instant even while the agent is mid-thought. Here's the crew shape:

{ "agents": [
    { "name": "wren", "running": true,
      "lifecycle": { "state": "wake_add", "hits": 1 },
      "steps": [ ... last 5 ... ], "thought": "..." } ],
  "wire":  [ /* recent steps, each tagged with its agent */ ],
  "agent": { /* legacy block: the busiest / most-recent agent */ } }

You can also nudge a keeper by hand: run_once sends a single watched tick immediately, useful for testing a def without waiting out the interval. And the idle keeper has a voice — between substantive runs it can daydream, a short public-only note, never committed; a full dream happens only after an audit and past a staleness gate. The living landing site is the proof of all of this: its keeper agent — Waldo, the Workbook Autonomous Live Document Operator — runs every fifteen minutes, the fourth run each hour being the audit pass, and every commit it makes appears in the page's right-hand timeline. Open copies of the page act it out with a live cursor. The schedule is real, the agent is real, and you can watch it from a browser.
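A dashboard consuming the crew-shaped /_activity payload needs very little. The Python sketch below uses only the fields shown in the crew shape above (agents, name, running, lifecycle.state, lifecycle.hits); the function name, the "idle" label, and the "interval" fallback for agents without a lifecycle are assumptions, and the sample is illustrative rather than captured output.

```python
import json


def summarize_crew(activity):
    """One line per agent from a crew-shaped /_activity payload."""
    lines = []
    for agent in activity.get("agents", []):
        lifecycle = agent.get("lifecycle") or {}
        # Assumption: no lifecycle block means plain interval ticks.
        state = lifecycle.get("state", "interval")
        hits = lifecycle.get("hits", 0)
        status = "running" if agent.get("running") else "idle"
        lines.append(f"{agent['name']}: {status}, {state} ({hits})")
    return lines


SAMPLE = json.loads("""
{ "agents": [
    { "name": "wren", "running": true,
      "lifecycle": { "state": "wake_add", "hits": 1 },
      "steps": [], "thought": "" } ],
  "wire": [],
  "agent": {} }
""")
```

Since the endpoint is read-only and served from published state, polling it this way does not slow a running tick.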

schedules in the SOURCE

There are several places declared time lives in this ecosystem, and it's worth being precise about which one actually ticks. A workflow headline can carry a :SCHEDULE: cron property or a native SCHEDULED: timestamp; the kernel surfaces that as a schedule field per workflow, which Workflow.list returns and the plan endpoint (POST /api/workflow?plan=1) reports without executing anything. A crew member declares its own cadence with :INTERVAL:. A lifecycle state declares a time gate with :MIN-INTERVAL:.

Here's the honest framing, because it's easy to blur. Nothing in the host cron-executes a workflow's :SCHEDULE: on its own — that field is declared and surfaced metadata, plan output and run records. The thing that actually wakes up and runs on a clock is the keeper: its interval, its lifecycle, its gates. The workflow schedule is what a keeper's agent reads and acts on; the keeper is the engine that does the waking. Don't picture a built-in cron daemon firing :SCHEDULE: lines — picture a keeper ticking, and an agent inside it consulting the plan.

what keepers AREN'T

Honesty section. A keeper is an interval-and-lifecycle ticker, and it's sharpest when you know what it isn't.

It is not a cron engine for workflow schedules. As the last section said, :SCHEDULE: on a workflow is surfaced metadata today, not an autonomous trigger. The keeper's own interval is the live clock; the workflow schedule is a thing its agent reads.

It is not a task-lock service. Two agents avoid grabbing the same task through a def-level protocol committed to git — a visible :AGENT: claim made before the work — not through a runtime mutex. The runtime isolates and throttles; it does not adjudicate ownership.

It is not free. Every wake tick is a model call with a real bill, which is the whole reason the NO-WORK protocol and exponential backoff exist. The economics are a first-class design concern, not an afterthought.

It is not transactional. The fifteen-minute wall-clock kill is genuinely brutal: a wedged run is shut down hard, mid-work. There's no rollback; a half-finished run is recovered the honest way, by git and a retry of the same state next tick. The cadence survives the crash; the partial work is just abandoned and redone.

questions people actually ASK

What happens if a run hangs forever?

It's killed. Each run executes in a linked task with a wall-clock bound — fifteen minutes by default. The worker waits with a yield; if the run blows the clock it's shut down brutally, the outcome is :killed, and under a lifecycle the position holds so the same state retries next tick. A hung run costs you one wasted interval, not your schedule.

Does a redeploy reset the schedule?

No. The schedule isn't in memory; it's in files on the data volume. The last-run timestamp lives in keeper-last-run and the lifecycle position in lifecycle-pos. On boot the worker reads them and computes its next delay as the remainder of the interval, never less than the sixty-second boot-grace floor, so it resumes mid-cadence: on the 2nd of 3 adds, not back at the top.

How do two agents avoid grabbing the same task?

By a def-level protocol, not a runtime lock. An agent claims work by committing an :AGENT: property to git before it starts — visible to every peer in version control. The runtime only isolates runs and caps concurrency at, by default, two at once. Coordination happens in the open, in git, where you can read it.

Can I change the cadence without redeploying?

Yes. The lifecycle spec is re-read on every tick — kept deliberately dumb — so a hot edit to the state machine is picked up live. The crew manifest is re-read each call too. You change the org file; the next tick honors it. No rebuild, no restart.

What does an idle keeper cost?

Almost nothing. An agent signals an empty tick by prefixing its result with NO-WORK, and consecutive no-work ticks back off exponentially — one minute, two, four, up to a thirty-minute cap. One real unit of work snaps the cadence straight back to hot. Doing nothing settles into a half-hour heartbeat within six ticks.

Is a keeper the same as the Autopoet?

No — they're peers. The Autopoet is a standing agent that tends the system's own config; a keeper is the mechanism that ticks any standing agent on a cadence. They sit side by side in the same supervision tree. A keeper could run the Autopoet; the Autopoet is not a kind of keeper.

keep GOING

This sub-lesson proves a promise the parent made — start there if the process model is new.