claims — honesty as a state machine

the talk outruns the BUILD

Founder speech is the densest spec a company ever produces. The 91-minute voice memo on the drive home, the pitch run-up paced around the kitchen, the meeting where the whole architecture finally clicked out loud — that is the product, described by the one person who holds all of it at once. And it all rots, identically, in a notes app, as a transcript nobody re-reads.

Worse than rot: the talk always outruns the build. Every founder ships sentences before they ship systems — you describe the thing you mean to make in the present tense long before it exists, because that is how you think it through. Which is fine, until nobody tracks which sentences are still fiction. The gap between the spoken system and the shipped system is invisible — right up until a buyer, or a security review, finds it for you and names it in the room.

The usual fix is to talk less, or to caveat everything into mush. Both are losses. The better move is to treat the talk as raw material and build a discipline that knows, line by line, which claims are load-bearing and which are air.

the DEFINITION

claim /kleɪm/ noun

1. a load-bearing spoken assertion, extracted from an untouched source as an org TODO item whose state is its honesty — moving from CLAIMED toward a verdict, and reaching a terminal state only with cited evidence: a file, an epic, a test, or a live run.

The move hiding in that definition is the whole lesson: a transcript is a source, not a document. The working document is the ledger you derive from it. And because org lets you declare your own TODO keyword ladder in a single line, the derived document can make honesty itself into a state machine — the same TODO → DONE you already trust, but with a conscience bolted on.

seven words of STATE machine

Here is the entire mechanism. One line, at the top of the audit file, declares a custom keyword ladder — this is real org grammar, the same #+TODO: line a workflow uses to name its states:

#+TODO: CLAIMED(c) VALIDATING(v) | KEPT(k) PARTIAL(p) UNKEPT(u) FUTURE(f) RETRACTED(r)

The | is the spine. Left of it are the two working states — a claim still in flight. Right of it are the five verdicts — terminal, and each forbidden without a citation. The letters in parentheses are Emacs fast-selection keys; after C-c C-t, you press c to stamp a headline CLAIMED. Read the line left to right and it is a life: a sentence gets pulled from the recording — CLAIMED — gets checked against the actual repo — VALIDATING — and lands on exactly one truth.

stateDiagram-v2
  [*] --> CLAIMED: extracted from the source
  CLAIMED --> VALIDATING: checking against the repo
  VALIDATING --> KEPT: it's true
  VALIDATING --> PARTIAL: half true
  VALIDATING --> UNKEPT: said, not built
  VALIDATING --> FUTURE: not yet, on purpose
  VALIDATING --> RETRACTED: the claim was wrong
  note right of VALIDATING
    every terminal edge needs
    a cited file, epic,
    test, or live run
  end note

Walk that picture once. A claim enters at CLAIMED and can only leave for VALIDATING. From VALIDATING there are five exits — KEPT for what is true, PARTIAL for what is half true, UNKEPT for what was said but not built, FUTURE for what you deliberately have not built yet, and RETRACTED for a claim that turned out plain wrong. The note on the side is the rule that gives the whole diagram its weight: no claim crosses into a verdict without dragging evidence across with it. The seven words are the same shape as TODO → DONE — only here, reaching DONE means you proved something.
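The diagram can be sketched as a small transition table. This is a minimal illustration, not the project's engine: the names Claim, EDGES and advance are hypothetical, and only the edges drawn in the diagram are modeled (a re-audit edge back out of a verdict is left out).

```python
from dataclasses import dataclass, field
from typing import List, Optional

VERDICTS = {"KEPT", "PARTIAL", "UNKEPT", "FUTURE", "RETRACTED"}

# Allowed edges, copied from the state diagram. Hypothetical structure.
EDGES = {
    "CLAIMED": {"VALIDATING"},
    "VALIDATING": set(VERDICTS),
}


@dataclass
class Claim:
    text: str
    state: str = "CLAIMED"
    evidence: List[str] = field(default_factory=list)


def advance(claim: Claim, new_state: str, evidence: Optional[str] = None) -> Claim:
    """Move a claim along one edge; a verdict is refused without a citation."""
    if new_state not in EDGES.get(claim.state, set()):
        raise ValueError(f"no edge {claim.state} -> {new_state}")
    if new_state in VERDICTS and not evidence:
        raise ValueError(f"{new_state} needs cited evidence")
    if evidence:
        claim.evidence.append(evidence)
    claim.state = new_state
    return claim


claim = Claim("compilers run inside the sandbox")
advance(claim, "VALIDATING")
advance(claim, "KEPT", evidence="epic wb-zyl done")
```

The only rule the sketch adds to plain TODO → DONE is the second check: the evidence string has to exist before the state changes. Whether that evidence is any good is still a judgment the code cannot make.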

quote, then EVIDENCE

A claim entry has a fixed anatomy, and the discipline lives in keeping its two halves in two different voices. The headline carries the status and the claim as one sentence. The body carries a Quote: — the verbatim words from the source, never paraphrased — and then an Evidence: or Reality: line: what is actually true in the repo. One voice is what was said; the other is what is. Here is a real KEPT entry from the one live audit:

** KEPT a workbook is an HTML file: CSS + JS + WebAssembly + SQLite, gzip-packed
   Quote: "if you looked at a workbook, you would at first glance see an HTML
   file… natively zipped… a SQLite binary file."
   Evidence: the workbooks-authoring surface ships single-file .html
   mini-apps bundling a WASM runtime + SQLite + their own source.

The quote is the founder, word for word. The evidence is the repo, answering. They are allowed to disagree — that disagreement is the entire point of the document — so they must never be written by the same hand in the same breath. The fields and who fills them:

field	whose voice	what it must cite
headline — status + claim	the auditor	one verdict word from the ladder
Quote:	the source, verbatim	the untouched transcript — nothing invented
Evidence: / Reality:	the repo	a file, an epic, a test, or a live run
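One way to read that anatomy mechanically is sketched below. It assumes the layout of the KEPT entry shown above (a starred headline, then indented Quote: and Evidence: or Reality: fields that may wrap across lines); Entry and parse_entry are hypothetical names, not part of any existing tool.

```python
import re
from dataclasses import dataclass
from typing import Optional

STATES = "CLAIMED VALIDATING KEPT PARTIAL UNKEPT FUTURE RETRACTED".split()
HEADLINE = re.compile(r"^\*+ (%s) (.+)$" % "|".join(STATES))
FIELD = re.compile(r"^(Quote|Evidence|Reality):\s*(.*)$")


@dataclass
class Entry:
    status: str
    claim: str
    quote: Optional[str] = None      # the source's voice
    evidence: Optional[str] = None   # the repo's voice


def parse_entry(text: str) -> Entry:
    """Split one claim entry into headline, quote and evidence."""
    lines = text.strip().splitlines()
    head = HEADLINE.match(lines[0].strip())
    if not head:
        raise ValueError("first line is not a status headline")
    entry = Entry(status=head.group(1), claim=head.group(2))
    current = None
    for raw in lines[1:]:
        line = raw.strip()
        found = FIELD.match(line)
        if found:
            current = "quote" if found.group(1) == "Quote" else "evidence"
            setattr(entry, current, found.group(2))
        elif current and line:
            # A wrapped continuation line belongs to the field above it.
            setattr(entry, current, getattr(entry, current) + " " + line)
    return entry


SAMPLE = """\
** KEPT a workbook is an HTML file: CSS + JS + WebAssembly + SQLite, gzip-packed
   Quote: "if you looked at a workbook, you would at first glance see an HTML
   file… natively zipped… a SQLite binary file."
   Evidence: the workbooks-authoring surface ships single-file .html
   mini-apps bundling a WASM runtime + SQLite + their own source.
"""
entry = parse_entry(SAMPLE)
```

Keeping quote and evidence as separate fields is the point of the sketch: the two voices never share a string.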

The verbatim rule is not pedantry. The moment you paraphrase the quote, you have quietly edited the claim toward the version you can defend — and the audit becomes a record of what you wish you had said. The source stays untouched precisely so the gap can stay honest.
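The verbatim rule is checkable in principle. The sketch below assumes that "…" (or "...") in a Quote: marks an elision, so each fragment between elisions must appear in the transcript, in order; only whitespace is normalized, since the source wraps lines differently. The function name is hypothetical and no such checker is described in the method itself.

```python
import re


def squash(text: str) -> str:
    """Collapse runs of whitespace so line wrapping does not matter."""
    return re.sub(r"\s+", " ", text).strip()


def is_verbatim(quote: str, transcript: str) -> bool:
    """True if every fragment of the quote occurs, in order, in the transcript."""
    haystack = squash(transcript)
    body = quote.strip().strip('"\u201c\u201d')
    position = 0
    for fragment in re.split(r"…|\.\.\.", body):
        fragment = squash(fragment)
        if not fragment:
            continue
        found = haystack.find(fragment, position)
        if found < 0:
            return False
        position = found + len(fragment)
    return True
```

A paraphrase fails this check by construction, which is the behavior the rule asks for: the quote either matches the untouched source or it is not a quote.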

the five VERDICTS

Depth rung — skippable. The ladder earns its keep on the right-hand side, so here is one real specimen for each terminal verdict, pulled straight from the audit of the 91-minute memo. Each pairs the claim with the evidence its verdict is allowed to cite; the RETRACTED row is a retraction the audit recommends, not one it has stamped yet:

verdict	a real claim	its cited reality
KEPT	compilers run inside the sandbox — C, Zig, Rust, Go, JS/TS	epic wb-zyl done; the compilers ghcr package; recipes under runtime/compilers/
PARTIAL	bash is emulated in WebAssembly, no native exec anywhere	the posix shape and the no-native-exec invariant exist — but a hardened interim real-exec tool still does too (epic wb-9ja removes it)
UNKEPT	most of Python compiles and runs in the sandbox	there is no Python lane in the compiler set — clang / mrustc / zig / go-yaegi / quickjs. Build it or stop saying it
FUTURE	a non-LLM, state-space model trained on runtime telemetry diffuses changes into the engine	deliberately unbuilt — named, scoped, and parked as direction, not pretended as shipped
RETRACTED	Erlang gives nine-9s fault tolerance — 99.9999999%	folklore from one Ericsson anecdote, widely contested — recommend dropping the number, keeping the true supervision-tree story

The UNKEPT and the RETRACTED are the ones that earn the discipline. The Python claim is the kind of sentence that sounds true because the sentence next to it is true — and an audit is the only thing that catches it before a customer does. The nine-9s line is sharper still: not half-built, but wrong — and the verdict's job is to say keep the story, drop the number, rather than quietly defend an indefensible statistic.

audio in, ledger OUT

The ledger is the end of a pipeline, and every step of it is agent-runnable — which is exactly why the whole method is a candidate toolkit: audio goes in, a claims ledger comes out. The real path that produced the one live audit:

flowchart LR
  rec["91-min voice memo<br/>5454.4 sec"] --> stt["ElevenLabs<br/>scribe_v1"]
  stt --> src
  subgraph src["sources/ — untouched, append-only"]
    v["verbatim.txt<br/>8,578 words"]
    w["words.json<br/>per-word timing + logprobs"]
  end
  src --> audit["audits/spoken-thesis.org<br/>claims, evidence-checked"]
  audit --> back["backlog<br/>every UNKEPT / PARTIAL"]
  audit -. "#+SOURCE: points back" .-> src
  style src fill:#fbfaf6,stroke:#121316
  style audit fill:#f2ddb0,stroke:#121316,stroke-width:2.5px
  style back fill:#13d943,stroke:#121316

Read the flow left to right. A 91-minute memo — 5,454 seconds — goes through ElevenLabs scribe_v1 and lands as two untouched files in a dated source directory: verbatim.txt, exactly 8,578 words, and words.json, which carries every word with its start time, end time, and the model's log-probability — the transcript knows not just what was said but when, and how sure it was. From those sources the audit is derived, and its #+SOURCE: line points back at them — the dotted arrow. The sources are never edited; the audit is the only thing that changes. And the output edge on the right is the payoff: every UNKEPT and every PARTIAL falls straight into a backlog.
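The output edge can be sketched in a few lines: pull every UNKEPT and PARTIAL headline out of the ledger as a candidate ticket. This assumes claims sit on starred headlines with the status as the first word; the backlog function is hypothetical, and the sample headlines borrow their wording from the verdict table rather than from the ledger file itself.

```python
import re
from typing import List, Tuple

OPEN_VERDICTS = ("UNKEPT", "PARTIAL")
HEADLINE = re.compile(r"^\*+ (%s) (.+)$" % "|".join(OPEN_VERDICTS))


def backlog(ledger: str) -> List[Tuple[str, str]]:
    """Return (verdict, claim) for every headline that still implies work."""
    matches = map(HEADLINE.match, ledger.splitlines())
    return [(m.group(1), m.group(2).strip()) for m in matches if m]


# Illustrative ledger fragment, not a copy of the real file.
SAMPLE = """\
** KEPT compilers run inside the sandbox
** PARTIAL bash is emulated in WebAssembly, no native exec anywhere
** UNKEPT most of Python compiles and runs in the sandbox
"""
tickets = backlog(SAMPLE)
```

KEPT lines are skipped on purpose: they are verdicts with nothing left to build.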

The append-only rule on sources/ is what makes the whole thing trustworthy over time. You can re-audit the same memo a year from now and the words you are checking against have not drifted. The document moves; the truth it was built from does not.
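Because words.json carries timing and log-probabilities per word, a quote can in principle be pointed at a moment in the recording. The sketch below assumes each word is an object with text, start, end and logprob keys; that shape is an assumption about the file, not something confirmed here, and locate is a hypothetical helper.

```python
from typing import Any, Dict, List, Optional, Tuple


def locate(phrase: str, words: List[Dict[str, Any]]) -> Optional[Tuple[float, float, float]]:
    """Return (start, end, lowest logprob) for the first run of words matching phrase."""
    target = phrase.lower().split()
    if not target:
        return None
    # Compare on lowercased tokens with trailing punctuation removed.
    tokens = [str(w["text"]).lower().strip(".,!?") for w in words]
    for i in range(len(tokens) - len(target) + 1):
        if tokens[i:i + len(target)] == target:
            span = words[i:i + len(target)]
            lowest = min(float(w["logprob"]) for w in span)
            return float(span[0]["start"]), float(span[-1]["end"]), lowest
    return None
```

The lowest log-probability in the span is a rough hint for the auditor: a quote resting on a word the model was unsure of deserves a listen before it earns a verdict.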

the gap is the BACKLOG

This is the thesis the audit states about itself, verbatim: the gap between the spoken system and the shipped system is the backlog. An UNKEPT is not a confession — it is a ticket. The method does not stop at judging; it dispatches. Here is the loop that actually runs on this project, from a claim to filed work:

sequenceDiagram
  participant F as founder
  participant G as groundskeeper agent
  participant A as author model (mercury-2)
  participant W as workflow run
  participant L as the ledger
  F->>G: on a live call — name a weak claim
  G->>G: capture to sources/captures/.org
  G->>A: dispatch a research goal
  A->>W: author an org TODO outline (:done-when: gates)
  W->>L: results land on TASKS.org
  L->>F: next call picks up where this one left off

Trace that exchange. On a live call the founder names a claim that worries him — in the real capture file, workbook-as-container is flagged as the weakest pitch line. The groundskeeper agent saves the thought to a dated capture file and dispatches a research goal. An author model — inception/mercury-2 — writes that goal up as an org TODO outline, where each leaf carries a :done-when: shell gate that must pass for the step to count as done. The run executes, and its results land on TASKS.org, the task ledger generated from the runtime and never hand-edited. The next call picks up from there.
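A :done-when: gate, as described, is a shell command that must pass. A minimal sketch of that idea follows, assuming "pass" means exit status 0 and that a hung gate should count as a failure; gate_passes is a hypothetical name and says nothing about how the workflow engine actually runs its gates.

```python
import subprocess


def gate_passes(command: str, timeout: float = 60.0) -> bool:
    """Run a shell gate; the step counts as done only on exit status 0."""
    try:
        result = subprocess.run(
            command, shell=True, capture_output=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0
```

The appeal of a gate like this is that "done" stops being a word someone types and becomes a command someone else can rerun.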

The loop closes most sharply on the scariest CLAIMED in the whole audit — thousands of agents per server, insanely high concurrency, whose cited reality is brutal: the first live co-tenant run overloaded a four-agent box. On a real archived call the founder converted that worry into work by voice, in one sentence — file an issue titled load benchmark for concurrent agents, description: we claim thousands of agents per server with no proof, build a benchmark. The agent's reply: issue filed. One claim, one verdict still pending, one benchmark now on the backlog — the method, end to end, by talking.

a ledger you can GREP

Depth rung — skippable. Because every status is a headline keyword in a plain text file, the health of your pitch is one shell command away. You do not need a dashboard; you need grep:

$ grep -c '^\*\* UNKEPT'  audits/spoken-thesis.org   # 1  — said, not built
$ grep -c '^\*\* CLAIMED' audits/spoken-thesis.org   # 6  — still unchecked
$ grep '^\*\* ' audits/spoken-thesis.org | awk '{print $2}' | sort | uniq -c
   # KEPT 7  ·  PARTIAL 3  ·  CLAIMED 6  ·  UNKEPT 1  ·  FUTURE 1

Those are the real counts from the one live ledger. They tell you something the prose can't: this audit is ahead of its own data. The ladder declares RETRACTED and VALIDATING, yet neither has a single live instance — six claims still sit unchecked at CLAIMED, and the recommended retraction of the nine-9s line has not been stamped RETRACTED yet, only argued for. That is honest, and grep is what makes it visible. Pitch health becomes arithmetic: count the UNKEPTs, list every claim still missing evidence, diff this week's audit against last week's. The discipline is legible to a human, an agent, and a one-line script at the same time.
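The same arithmetic can be sketched in Python when the counts need to feed another script. This assumes claims sit on level-two headlines, as the grep patterns above do; tally and never_used are hypothetical names.

```python
import re
from collections import Counter
from typing import List

STATES = ["CLAIMED", "VALIDATING", "KEPT", "PARTIAL", "UNKEPT", "FUTURE", "RETRACTED"]
HEADLINE = re.compile(r"^\*\* (%s)\b" % "|".join(STATES))


def tally(ledger: str) -> Counter:
    """Count level-two headlines per status keyword."""
    matches = map(HEADLINE.match, ledger.splitlines())
    return Counter(m.group(1) for m in matches if m)


def never_used(counts: Counter) -> List[str]:
    """States the ladder declares but no headline carries."""
    return [state for state in STATES if counts[state] == 0]
```

Run against a ledger with the counts reported above, never_used would name VALIDATING and RETRACTED: the two declared states with no live instance.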

more than a CHECKLIST

Depth rung — skippable. A claims audit is not just a column of verdicts. The same pass over a single 91-minute ramble extracts a full working document. From the one real audit:

also extracted	what it captures
the argument skeleton	the pitch rebuilt as a numbered structure — the spine the claims hang on
the audiences	who each part of the talk is actually for
the contrast objects	named competitors and foils — E2B, Daytona, Firecracker, StackBlitz, Letta, Tauri
the open questions	the dangling threads the founder left unresolved out loud

One unscripted ramble, run through this method, yields a checked claims ledger, a positioning map, a competitor list, and a question backlog — a working document, not a checklist. The transcript was never the deliverable. The audit is.

where it BITES

Honesty section — and a claims lesson that hid its own UNKEPTs would be a bad joke. Four places this bites.

First, status is judgment, not proof. Someone — a human or an agent — reads the repo and decides KEPT or PARTIAL. The evidence rule constrains that judgment but does not automate it; a generous auditor can still stamp KEPT on a half-truth. The discipline raises the cost of self-deception; it does not abolish it.

Second, KEPT rots. Evidence ages. A claim that was true against last quarter's repo can quietly go stale, and a verdict carries no expiry date. A KEPT is a snapshot, not a guarantee — which is why re-auditing is part of the method, not a failure of it.

Third — and this one is precise — the runtime can't execute this exact ladder yet. The workflow engine reads a file's own #+TODO: line, but its parser does not strip the Emacs fast-selection keys: it reads KEPT(k) as one token, so a headline stamped ** KEPT does not match and parses as no state at all. Emacs org-mode itself strips the (k); our runtime parser doesn't. So today the claims ladder is fully readable by humans, agents, and grep — but it is not executable workflow state in the engine. That is a real gap, named here because the method demands it.
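The missing step is small. The sketch below shows one way a parser could drop the fast-selection suffix before comparing keywords; it is an illustration of the gap, not the runtime's code, and parse_todo_line is a hypothetical name. It follows the org convention that, with no | on the line, the last keyword is the done state.

```python
import re
from typing import List, Tuple

FAST_KEY = re.compile(r"\([^)]*\)$")  # trailing "(k)", "(w@/!)" and similar


def parse_todo_line(line: str) -> Tuple[List[str], List[str]]:
    """Split a #+TODO: line into (working, terminal) keywords without fast keys."""
    prefix, _, body = line.partition(":")
    if prefix.strip() != "#+TODO":
        raise ValueError("not a #+TODO: line")
    left, bar, right = body.partition("|")

    def clean(part: str) -> List[str]:
        return [FAST_KEY.sub("", token) for token in part.split()]

    working, terminal = clean(left), clean(right)
    if not bar and working:
        # No separator: org treats the last keyword as the done state.
        working, terminal = working[:-1], working[-1:]
    return working, terminal


LADDER = "#+TODO: CLAIMED(c) VALIDATING(v) | KEPT(k) PARTIAL(p) UNKEPT(u) FUTURE(f) RETRACTED(r)"
working, terminal = parse_todo_line(LADDER)
```

Splitting the same line on whitespace alone yields the token KEPT(k), which never equals the KEPT on a headline. That is the mismatch described above.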

Fourth, n equals one. One 91-minute memo has been audited this way. And the drift validator the method implies — a general checker that reads an org claims file and flags a verdict without evidence — does not exist yet. wb content check validates a CMS content tree, and :done-when: gates are unit tests for org workflows, but a generic claims-drift linter is itself on the backlog this method generated. The method's sharpest promise is one of its own UNKEPT items. We would rather tell you that than imply otherwise.
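To make the missing piece concrete, here is a rough sketch of the narrowest version of such a checker: it flags any headline that carries a verdict but has no Evidence: or Reality: line beneath it. It is hypothetical, it is not wb content check, and it only tests that a citation is present, not that the cited file or epic exists.

```python
import re
from typing import List, Optional

VERDICTS = ("KEPT", "PARTIAL", "UNKEPT", "FUTURE", "RETRACTED")
HEADLINE = re.compile(r"^\*+ (\S+) (.+)$")
CITATION = re.compile(r"^\s*(Evidence|Reality):\s*\S")


def uncited_verdicts(ledger: str) -> List[str]:
    """Return verdict headlines with no Evidence:/Reality: line in their body."""
    problems: List[str] = []
    current: Optional[str] = None
    cited = False
    # The sentinel headline flushes the last real entry.
    for line in ledger.splitlines() + ["* END sentinel"]:
        head = HEADLINE.match(line)
        if head:
            if current and not cited:
                problems.append(current)
            current = line.strip() if head.group(1) in VERDICTS else None
            cited = False
        elif CITATION.match(line):
            cited = True
    return problems
```

Working states such as CLAIMED are ignored, since they are allowed to lack evidence. A real linter would also need to resolve each citation against the repo, which this sketch does not attempt.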

questions people actually ASK

Isn't this just fact-checking the pitch?

No — fact-checking ends at a verdict column. This doesn't. VALIDATING and the verdicts feed a backlog: an UNKEPT is a ticket, not a scarlet letter, and the loop dispatches research that closes it. The output isn't a grade, it's work.

Who assigns the status?

Whoever can cite evidence — a human or an agent, it doesn't matter. The rule isn't about authority, it's about the citation: no terminal status crosses into the file without a file, an epic, a test, or a live run behind it.

Why org and not a spreadsheet?

Because the ladder lives in the same grammar as the workflows it spawns. A claim is an org TODO; the research that closes it is an org TODO; the task ledger is org. A spreadsheet would make the audit a fifth system of record — the exact fragmentation org is here to end.

Can I retract a KEPT?

Yes, and you should when evidence ages. A verdict is a snapshot of a moment, not a permanent ruling. Re-audit; if the repo moved out from under a KEPT, move the claim back to VALIDATING and check it again.

Do I need the runtime to do this?

No. The whole method runs on a transcript and a text file. The seven-word #+TODO: line, the quote-then-evidence anatomy, and the evidence-or-no-verdict rule are a discipline, not a product — you can audit your own last pitch tonight in any editor.

Where does an UNKEPT claim actually go?

Into the backlog seam the autopoet reads — the same file_issue lane a working agent uses to flag its own gaps. An UNKEPT spoken in a pitch and a bug filed by an agent mid-task land in the same place: work, in the declarative layer.

keep GOING

Claims are the sharpest argument the grammar makes — here is where it comes from and what it feeds.

Org, the grammar: why a plan, or a pitch, can be checked at all

Workflows: the org TODO outlines that close a claim into research

Agents: the groundskeeper that captures claims mid-call

The Autopoet: the backlog seam an UNKEPT claim feeds