search finds words, not MEANING
The disk lesson ended on a magic trick: your file system is a
database, so a listing is a query and "every file the agent touched this week" is one
SELECT. True — and not enough. SQL, and its rougher cousin grep, match
characters. Ask the disk for "the bit about retry backoff" and if the note
actually says "wait longer between attempts", you get zero rows. The words
didn't match. The meaning was right there.
The industry's standard fix is heavy. Stand up a vector database — a whole new service. Call an embedding API for every query — a per-call cost, your text leaving the box, a key to manage. Glue the two together and keep them in sync. For a disk that rides inside a single file, that's absurd overhead. And the obvious cheap alternative — just grep harder — is the exact thing that already failed.
So the missing piece isn't a bigger database or a smarter API key. It's a second way to ask the disk a question — by meaning — that's light enough to ride in the same file, and free enough to leave on. That's what this lesson is about. It's the semantic twin of the literal-query lesson: same disk, same store, a different way to reach the bytes.
the DEFINITION
1. a tenant-scoped semantic index riding the same database as the rest of the workbook — text and files turned into vectors at store time, so a query can find by meaning, not just by word. One query modality of a queryable workbook — Query, not Memory.
The framing matters more than it looks. Vector search here is one
modality — the literal SQL of the queries lesson is
done, semantic is this page, a graph modality is future work. An agent's "memory" is
one consumer of it, not its definition. So the construct is called
Query, not Memory — and embeddings are pluggable and not
required. Literal queries need no model at all; turn the embedder off and the disk is
still fully queryable, just with one fewer way to ask.
four rungs of EMBEDDER
An embedder is the thing that turns text into a vector — a list of numbers
where closeness means similar meaning. One environment variable, WB_EMBED,
picks which one runs, and the rungs climb from zero-dependency to anything-you-want.
The default is deliberately the cheapest one that needs no download:
| rung | WB_EMBED | dim | quality | cost | where it runs |
|---|---|---|---|---|---|
| hash | hash (default) | 256 | lexical only — not semantic | zero — no model | pure Elixir, in-process |
| local | local | 256 | real semantics — dog↔puppy 0.70 | free after one ~30MB download | pure Elixir, in-process |
| openrouter | openrouter | 1536 | transformer quality | per-call API — text leaves the box | OpenRouter, over the network |
| http / clip | http:<url> | any | anything, any modality | your endpoint, your bill | Modal / Replicate / your sidecar / opt-in CLIP build |
The honest read of that table: hash is the zero-config baseline — it hashes
character three-grams and word tokens into signed buckets, so it's a smarter grep, but
it is lexically driven, not semantic. The moment you want real meaning, you flip
WB_EMBED=local and get Model2Vec — genuine paraphrase matching, no
GPU, no API, no recurring cost. Above that, openrouter reuses your
OPENROUTER_API_KEY for transformer-grade quality at a per-call price; and
the top rung, http, posts to any endpoint you name — your own GPU sidecar, a
hosted model, or the opt-in CLIP build for images. Every rung answers the identical
search call; nothing downstream knows which one ran. The next two sections
open the two most interesting rungs.
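To make the hash rung concrete, here is a minimal Python sketch of feature hashing over word tokens and character three-grams into 256 signed buckets. It is an illustration under assumptions, not the source's Elixir code: the hash function (BLAKE2b), the `w:`/`g:` feature prefixes, the space padding around words, and the name `hash_embed` are all hypothetical choices.

```python
import hashlib
import math
import re

DIM = 256  # the hash rung's dimension, per the table above


def _bucket_and_sign(feature: str) -> tuple[int, float]:
    """Map a feature string to a (bucket index, +1/-1 sign) pair."""
    digest = hashlib.blake2b(feature.encode("utf-8"), digest_size=8).digest()
    value = int.from_bytes(digest, "little")
    bucket = value % DIM                               # low bits pick the bucket
    sign = 1.0 if (value >> 63) & 1 == 0 else -1.0     # top bit picks the sign
    return bucket, sign


def hash_embed(text: str) -> list[float]:
    """Lexical embedding: word tokens plus character three-grams, hashed."""
    words = re.findall(r"\w+", text.lower())
    features = [f"w:{w}" for w in words]
    for w in words:
        padded = f" {w} "  # assumed padding so word edges form their own grams
        features += [f"g:{padded[i:i + 3]}" for i in range(len(padded) - 2)]

    vec = [0.0] * DIM
    for feature in features:
        bucket, sign = _bucket_and_sign(feature)
        vec[bucket] += sign

    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec


def cosine(a: list[float], b: list[float]) -> float:
    """Dot product; equals cosine similarity for unit-length inputs."""
    return sum(x * y for x, y in zip(a, b))
```

Because the features are surface strings, "retrying backoff" scores close to "retry backoff" while "wait longer between attempts" does not, which is exactly the lexical-only behaviour the table warns about.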
a model that is just a MATRIX
depth rung · skippable — how the free semantic rung works
Here's the aha that makes free semantic search possible: an embedding model doesn't have to be a model runtime. The local rung is Model2Vec, and a Model2Vec "model" is two files — a vocabulary and a precomputed matrix of vectors, roughly 8 to 30 MB of 32-bit floats. There is no transformer forward pass, no GPU, no native code. "Inference" is four steps of arithmetic: tokenize the text, look up each token's row in the matrix, average them, and normalize the result to length one. The whole thing is about two hundred lines of pure Elixir.
flowchart LR
q["retry backoff"]
subgraph tok["pure-Elixir WordPiece"]
direction TB
t1["retry"]
t2["backoff"]
end
subgraph ids["token ids — line number in vocab.txt"]
direction TB
i1["id 8431"]
i2["id 12907"]
end
subgraph mat["the matrix — one refc binary, ~30MB"]
direction TB
o1["offset 8431·256·4 → 256 floats"]
o2["offset 12907·256·4 → 256 floats"]
end
avg["average the rows
÷ L2 norm"]
vec["the query vector
256 numbers"]
q --> tok --> ids --> mat --> avg --> vec
style q fill:#ffffff,stroke:#121316
style tok fill:#fbfaf6,stroke:#121316
style ids fill:#fbfaf6,stroke:#121316
style mat fill:#fbfaf6,stroke:#121316
style avg fill:#aee5c2,stroke:#121316
style vec fill:#a8d4f0,stroke:#121316,stroke-width:2.5px
Walk that graph as the words retry backoff: the pure-Elixir WordPiece
tokenizer downcases and splits into retry and backoff by
longest-prefix matching against the vocabulary — an unmatchable word becomes
[UNK]. Each token's id is simply its line number in vocab.txt.
That id is the only address you need: the token's vector is a byte slice of the matrix,
offset = id · 256 · 4 bytes in, decoded as 256 little-endian floats.
Average the two rows, divide by the L2 norm, and that's the entire inference.
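The four steps can be sketched in a few lines of Python. This is a toy under stated assumptions, not the source's Elixir module: the five-word vocabulary, the 4-wide rows, the 0-based ids, and the `##` continuation prefix (the usual WordPiece convention) are all hypothetical stand-ins for the real `vocab.txt` and 256-wide matrix.

```python
import math
import struct

DIM = 4  # toy width; the source's local rung uses 256

# Hypothetical vocabulary: a token's id is its position in the list.
VOCAB = ["[UNK]", "retry", "backoff", "back", "##up"]
TOKEN_ID = {tok: i for i, tok in enumerate(VOCAB)}

# The "model": one flat bytes object of little-endian float32 rows.
ROWS = [
    [0.0, 0.0, 0.0, 0.0],  # [UNK]
    [1.0, 0.0, 0.0, 0.0],  # retry
    [0.0, 1.0, 0.0, 0.0],  # backoff
    [0.0, 0.0, 1.0, 0.0],  # back
    [0.0, 0.0, 0.0, 1.0],  # ##up
]
MATRIX = b"".join(struct.pack(f"<{DIM}f", *row) for row in ROWS)


def wordpiece(word: str) -> list[str]:
    """Greedy longest-prefix split; an unmatchable word becomes [UNK]."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in TOKEN_ID:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]
        pieces.append(match)
        start = end
    return pieces


def row(token_id: int) -> tuple[float, ...]:
    """A token's vector is a byte slice: offset = id * DIM * 4."""
    return struct.unpack_from(f"<{DIM}f", MATRIX, token_id * DIM * 4)


def embed(text: str) -> list[float]:
    """Tokenize, look up rows, average them, normalize to length one."""
    tokens = [p for w in text.lower().split() for p in wordpiece(w)]
    if not tokens:
        return [0.0] * DIM
    rows = [row(TOKEN_ID[t]) for t in tokens]
    mean = [sum(col) / len(rows) for col in zip(*rows)]
    norm = math.sqrt(sum(x * x for x in mean))
    return [x / norm for x in mean] if norm > 0 else mean
```

Note that `row` never builds a dictionary of vectors: the id alone addresses the bytes, which mirrors the by-offset lookup the walkthrough describes.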
Two design choices keep it cheap. The matrix loads once into
:persistent_term and one resident copy serves every tenant —
shared compute, never shared data, the same isolation rule the disk follows. And it
stays a single reference-counted binary with by-offset lookup rather than a map of
float-lists, which would balloon inside the BEAM. The model file itself is a
safetensors blob whose data section is the matrix — the parser is six lines:
read the header length, read the JSON header, and the rest of the file is the numbers.
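For readers who have not seen the safetensors layout, this Python sketch shows why the parser can be so short. The format is an 8-byte little-endian header length, a JSON header, then the raw tensor bytes. The tensor name `embeddings` and the helper `build_toy_blob` are hypothetical, used only to make the example self-contained.

```python
import json
import struct


def parse_safetensors(blob: bytes) -> tuple[dict, bytes]:
    """Read the header length, read the JSON header, return the rest as data."""
    (header_len,) = struct.unpack_from("<Q", blob, 0)   # unsigned 64-bit, LE
    header = json.loads(blob[8:8 + header_len])
    data = blob[8 + header_len:]
    return header, data


def build_toy_blob(rows: list[list[float]]) -> bytes:
    """Build a tiny single-tensor file in memory (tensor name is assumed)."""
    dim = len(rows[0])
    data = b"".join(struct.pack(f"<{dim}f", *r) for r in rows)
    header = {
        "embeddings": {
            "dtype": "F32",
            "shape": [len(rows), dim],
            "data_offsets": [0, len(data)],
        }
    }
    header_bytes = json.dumps(header).encode("utf-8")
    return struct.pack("<Q", len(header_bytes)) + header_bytes + data
```

Treating "the rest of the file" as the matrix only holds when the file carries a single tensor; with several tensors, each one's `data_offsets` would be needed to slice it out.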
Does it actually work? The receipt: dog and puppy land at cosine 0.70 — synonym and
paraphrase matching the lexical hash baseline simply cannot do. Configuration is one
variable each: WB_EMBED=local turns it on; the default model is
minishlab/potion-base-8M, fetched once from HuggingFace to
<WB_DATA>/_models/ (or pre-baked into the image at
/opt/models for zero download); WB_EMBED_MODEL swaps the
potion variant.
embed on WRITE, not on read
Here's the load-bearing decision that makes any embedder, local or paid, affordable: you embed content at store time, not at query time. Writes are rare; reads are hot. So a file's chunks hit the embedder once, when the file is saved, and never again on the way out. A query embeds only its own short string and is otherwise a pure lookup against vectors that already exist. With a network embedder you pay per call, that means a query against a thousand stored vectors costs one embedding call, not a thousand.
sequenceDiagram
participant W as a write — store/3
participant I as index — chunk + embed
participant E as the embedder
participant V as the vector store
participant Q as a query — search
rect rgb(251,250,246)
Note over W,V: the heavy path — runs on WRITES (rare)
W->>I: Vector.forget · then re-index
I->>I: chunk — org by headline, code by 20-line windows
I->>E: batch embed every chunk
E-->>I: vectors
I->>V: upsert each chunk + its source metadata
end
rect rgb(232,243,250)
Note over Q,V: the skinny path — runs on READS (hot)
Q->>E: embed the query string once
Q->>V: nearest-neighbour lookup — no model touched per row
V-->>Q: ranked hits, highest score first
end
Read the diagram as two paths of very different weight. The top half is the write
path inside Library.store: it forgets any old index for the member so a
re-index is always clean, chunks the content, batch-embeds every chunk, and upserts
each one — carrying its source workbook, path, and headline so a hit can point back to
where it came from. An embed failure returns a clean error and never crashes the store.
The bottom half is the read path: embed the query string once, then a nearest-neighbour
lookup that touches no model per row. Heavy work, done rarely; light work, done
constantly.
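A rough Python sketch of the write path, following the diagram's order (forget, embed the batch, upsert with source metadata). Everything named here is hypothetical: `VectorStore`, `index_on_store`, and the tuple-shaped results stand in for the Elixir store and its `{:error, "embed failed: …"}` return, and the in-memory list stands in for the database.

```python
from typing import Callable


class VectorStore:
    """In-memory stand-in for the vector table."""

    def __init__(self) -> None:
        self.rows: list[dict] = []

    def forget(self, tenant: str, workbook: str, path: str) -> None:
        """Drop every chunk previously indexed for this member."""
        key = (tenant, workbook, path)
        self.rows = [
            r for r in self.rows
            if (r["tenant"], r["workbook"], r["path"]) != key
        ]

    def upsert(self, row: dict) -> None:
        self.rows.append(row)


def index_on_store(
    store: VectorStore,
    embed_batch: Callable[[list[str]], list[list[float]]],
    tenant: str,
    workbook: str,
    path: str,
    chunks: list[tuple[str, str]],  # (headline, text) pairs
) -> tuple[str, object]:
    """Forget the old index, batch-embed the chunks, upsert each one."""
    store.forget(tenant, workbook, path)          # a re-index starts clean
    try:
        vectors = embed_batch([text for _, text in chunks])
    except Exception as exc:                      # never crash the store
        return ("error", f"embed failed: {exc}")
    for (headline, text), vec in zip(chunks, vectors):
        store.upsert({
            "tenant": tenant, "workbook": workbook, "path": path,
            "headline": headline, "text": text, "vec": vec,
        })
    return ("ok", len(chunks))
```

The embedder is called once per write with the whole batch, so the read path never has to touch it for stored rows.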
Chunking falls straight out of the grammar. An .org
file splits into one chunk per heading section, headline extracted; everything else —
code, markdown, plain text across a generous allowlist — splits into 20-line windows
labelled lines 1–20 and so on. And a member referenced across workbooks by
its identity (DID) is resolved and indexed too, so a query can reach into a federated
document, not just the local tree.
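The two chunking rules can be sketched as follows. This is a Python approximation with assumptions labelled: how text before the first heading is treated (here it becomes a `(preamble)` chunk), whether nested headings split too (here every heading level does), and the function names are all hypothetical.

```python
import re

WINDOW = 20  # window size for non-org files, per the text above


def chunk_org(text: str) -> list[tuple[str, str]]:
    """One (headline, section text) chunk per org heading section."""
    chunks: list[tuple[str, str]] = []
    headline, body = None, []

    def flush() -> None:
        if headline is not None or any(line.strip() for line in body):
            chunks.append((headline or "(preamble)", "\n".join(body)))

    for line in text.splitlines():
        if re.match(r"^\*+ ", line):      # an org heading starts a new chunk
            flush()
            headline, body = line.lstrip("*").strip(), [line]
        else:
            body.append(line)
    flush()
    return chunks


def chunk_windows(text: str, size: int = WINDOW) -> list[tuple[str, str]]:
    """Fixed windows labelled 'lines 1–20', 'lines 21–40', and so on."""
    lines = text.splitlines()
    return [
        (f"lines {i + 1}–{min(i + size, len(lines))}",
         "\n".join(lines[i:i + size]))
        for i in range(0, len(lines), size)
    ]


def chunk(path: str, text: str) -> list[tuple[str, str]]:
    """Pick the rule from the file extension."""
    return chunk_org(text) if path.endswith(".org") else chunk_windows(text)
```

Either way each chunk carries a label (a headline or a line range), which is what lets a search hit point back to where it came from.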
one interface, two ENGINES
The cleverest part of the store is the part you never see. There is one
Vector.search(tenant, query_vec, k: 5) call, and it runs against one of
two completely different engines depending on a single fact about your environment —
and no caller anywhere branches on which.
| | SQLite — the default | pgvector — set WB_DATABASE_URL |
|---|---|---|
| storage | vectors as JSON text in a column | a real vector(dim) column |
| ranking | brute-force cosine, in Elixir | ANN in the DB via the <=> operator |
| complexity | O(n) per query — every row scanned | sub-linear, indexable |
| scope filter | filtered in Elixir, after the scan | workbook = ANY(...) pushed into SQL |
| when | live-tested default; fine to tens of thousands | built and shape-tested; the answer at scale |
The verdict of that table: you chose your vector database when you chose your
database. With no WB_DATABASE_URL, you're on SQLite — vectors stored as
JSON, cosine computed in Elixir over every row, sorted, top-k taken. It's O(n), and it
is honest about being O(n). Set WB_DATABASE_URL and the same calls become
real pgvector: the store auto-runs CREATE EXTENSION IF NOT EXISTS vector,
sizes the column from the active embedder's dimension, and ranks in-database with
ORDER BY vec <=> q LIMIT k — indexable and sub-linear. Flipping that
one variable is the entire migration.
-- SQLite: brute-force, honest about the cost
SELECT … FROM vectors WHERE tenant = ?1;
-- then: cosine in Elixir over every row, sort, take k. O(n).
-- Postgres: the database does the ranking
SELECT id, workbook, path, headline,
       1 - (vec <=> $q::vector) AS score, text
FROM vectors WHERE tenant = $1
ORDER BY vec <=> $q::vector LIMIT 5;
-- sub-linear, indexable. same call, no caller changed.
And the O(n) cost on SQLite is made visible rather than hidden. Past 25,000 vectors, the store logs exactly once: "brute-force search over N vectors on SQLite (O(n) per query) — for sub-linear ANN at scale, set WB_DATABASE_URL → pgvector". The design reasoning is deliberate: a worker pool would only buy a constant-factor speedup (your core count), while the real fix at scale is sub-linear ANN. So instead of hiding the linear cost behind threads, the store tells you when you've outgrown the cheap default and points at the door.
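As a sketch of what the SQLite path does per query, here is a brute-force search in Python over rows whose vectors are stored as JSON text. It is illustrative only: the row shape, the function name `search`, the shortened log wording, and counting the threshold against the scanned tenant rows are assumptions, not the source's Elixir code.

```python
import json
import logging
import math

log = logging.getLogger("vector_store_sketch")
WARN_THRESHOLD = 25_000   # the source's log-once threshold
_warned = False


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def search(rows, tenant, query_vec, k=5, workbooks=None):
    """Scan every row for the tenant, score, filter by scope, take top k."""
    global _warned
    scanned = [r for r in rows if r["tenant"] == tenant]   # tenant isolation
    if len(scanned) > WARN_THRESHOLD and not _warned:
        _warned = True                                     # log exactly once
        log.warning("brute-force search over %d vectors (O(n) per query)",
                    len(scanned))

    hits = []
    for r in scanned:
        score = cosine(query_vec, json.loads(r["vec"]))    # vec is JSON text
        # Scope filter applied after the scan, as in the table's SQLite column.
        if workbooks is None or r["workbook"] in workbooks:
            hits.append((score, r["path"]))
    hits.sort(key=lambda h: h[0], reverse=True)
    return hits[:k]
```

Every query decodes and scores every row for the tenant, which is why a thread pool only buys a constant factor while an ANN index changes the growth rate.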
the machine PROBE
How does the system know which rung your hardware can actually run? It probes. At
boot, a pure-detection capability check reads the OS, core count, RAM, and accelerator —
CUDA if nvidia-smi is on the PATH, Metal on Apple silicon, otherwise none —
and runs a cond ladder to a recommended tier. No inference happens; it's
detection only.
flowchart TD
start["probe: os · cores · ram · accelerator"]
q1{"accelerator ≠ none
AND ram ≥ 16 GB?"}
q2{"cores ≥ 4
AND ram ≥ 4 GB?"}
big["big_multimodal
a 7B VLM, run locally"]
clip["clip
image + text, joint space"]
m2v["model2vec
text only — always works"]
start --> q1
q1 -- yes --> big
q1 -- no --> q2
q2 -- yes --> clip
q2 -- no --> m2v
style start fill:#ffffff,stroke:#121316
style q1 fill:#fbfaf6,stroke:#121316
style q2 fill:#fbfaf6,stroke:#121316
style big fill:#f3c5a3,stroke:#121316
style clip fill:#a8d4f0,stroke:#121316
style m2v fill:#aee5c2,stroke:#121316,stroke-width:2.5px
Walk the ladder top to bottom. If there's a real accelerator and at least 16 GB of
RAM, it recommends big_multimodal — the class that runs a 7B vision-language
model locally. Failing that, if you have at least four cores and 4 GB, it recommends
clip — image and text in one joint space. And if neither holds, it lands on
model2vec: text only, and the one that always works on any machine. One honest
caveat: this recommendation is advisory. It's printed in the boot line —
search config — embed: <adapter>, vectors: <backend>, machine recommends:
<tier> — but it does not auto-switch your embedder. WB_EMBED stays
the operator's call; the probe only tells you what your box could handle.
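The ladder is small enough to write out. This Python sketch mirrors the thresholds in the flowchart; the function names and the way the accelerator is detected (`shutil.which` for `nvidia-smi`, `platform` checks for Apple silicon) are assumptions standing in for the source's Elixir probe.

```python
import platform
import shutil


def detect_accelerator() -> str:
    """Detection only, no inference: 'cuda', 'metal', or 'none'."""
    if shutil.which("nvidia-smi"):
        return "cuda"
    if platform.system() == "Darwin" and platform.machine() == "arm64":
        return "metal"
    return "none"


def recommend_tier(accelerator: str, cores: int, ram_gb: float) -> str:
    """Walk the ladder top to bottom; the first rung that holds wins."""
    if accelerator != "none" and ram_gb >= 16:
        return "big_multimodal"
    if cores >= 4 and ram_gb >= 4:
        return "clip"
    return "model2vec"   # text only, the rung that always works
```

The result is only a recommendation string for the boot line; nothing in this sketch, or in the source's description, changes `WB_EMBED`.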
semantic, AND literal
depth rung · skippable — how the default search fuses two rankings
Pure semantic search has a blind spot: ask for an exact error code or a function name and meaning-similarity can drift right past the literal string you typed. Pure literal search has the opposite blind spot — it's the grep we started by rejecting. So the default search mode is hybrid: it runs both and fuses them.
flowchart TD
query["the query"]
sem["semantic pool
vector search, top 50"]
lit["literal pass
count query terms in each chunk"]
rrf["reciprocal-rank fusion
score = Σ 1 / (60 + rank)
absent ranking → rank 1000"]
out["one ranked list
highest score first"]
query --> sem --> rrf
query --> lit --> rrf --> out
style query fill:#ffffff,stroke:#121316
style sem fill:#a8d4f0,stroke:#121316
style lit fill:#f2ddb0,stroke:#121316
style rrf fill:#aee5c2,stroke:#121316
style out fill:#13d943,stroke:#121316,stroke-width:2.5px
Read the graph as two scouts reporting to one judge. The first scout, semantic,
pulls a pool of fifty meaning-matches from the vector store. The second, literal, scores
each chunk by how many of your query terms it literally contains. The judge combines
them with reciprocal-rank fusion: each chunk's score is the sum, over both lists, of
1 / (60 + its rank in that list), and a chunk absent from a ranking is penalised as if it
had placed 1,000th there. The result is one list where a chunk has to do well in at least one
scout's eyes to surface, and a chunk strong in both rockets to the top. When you
want only one scout, --semantic and --literal are the escape
hatches — but fusion is the right default, because semantic finds the meaning and literal
pins the exact terms, and most real questions need both.
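The fusion rule fits in a dozen lines. In this Python sketch the constants 60 and 1000 come from the diagram; the 1-based ranks, the exclusion of zero-count chunks from the literal ranking, and the names `literal_ranking` and `fuse` are assumptions.

```python
import re

RRF_K = 60          # the constant in score = sum of 1 / (60 + rank)
ABSENT_RANK = 1000  # rank assigned when a chunk is missing from a ranking


def literal_ranking(query: str, chunks: dict[str, str]) -> list[str]:
    """Rank chunk ids by how many distinct query terms each chunk contains."""
    terms = set(re.findall(r"\w+", query.lower()))
    counts = {}
    for chunk_id, text in chunks.items():
        found = len(terms & set(re.findall(r"\w+", text.lower())))
        if found:
            counts[chunk_id] = found
    return sorted(counts, key=lambda c: counts[c], reverse=True)


def fuse(semantic: list[str], literal: list[str]) -> list[str]:
    """Reciprocal-rank fusion of two ranked id lists, best first."""
    def rank(ranking: list[str], chunk_id: str) -> int:
        return ranking.index(chunk_id) + 1 if chunk_id in ranking else ABSENT_RANK

    ids = list(dict.fromkeys(semantic + literal))   # union, order preserved
    scores = {
        cid: 1 / (RRF_K + rank(semantic, cid)) + 1 / (RRF_K + rank(literal, cid))
        for cid in ids
    }
    return sorted(ids, key=scores.get, reverse=True)
```

With `semantic = ["a", "b", "c"]` and `literal = ["c", "d"]`, chunk `c` wins even though it was only third semantically, because it is the one chunk both scouts reported.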
the files ARE the memory
This is where the "Query, not Memory" framing pays off concretely. An
agent here has a search tool described as recall by
meaning — and crucially, there is no separate memory store behind it. The tool calls
a stateless recall over the agent's working directory: it chunks, embeds, and
ranks the actual files on the fly, with no stored index to drift out of sync. To
"remember" something, the agent writes an org file with vfs_write; it
becomes searchable automatically, because searching reads the real
org and code context, not a summary of it. This is the retrieval half
of the memory-is-a-workbook-of-plain-files claim.
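A stateless recall of this kind can be sketched as a single function that walks a directory, chunks, embeds, and ranks in one pass, keeping nothing afterwards. This is a Python illustration under assumptions: the bag-of-words `toy_embed` stands in for whichever embedder is active, every file is chunked in 20-line windows for brevity, and `search_dir` here is a hypothetical function that only borrows its name from the source.

```python
import math
import os
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in embedder: a sparse bag-of-words vector."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def search_dir(root: str, query: str, k: int = 5, embed=toy_embed, window: int = 20):
    """Chunk, embed, and rank the real files on the fly; no stored index."""
    q = embed(query)
    hits = []
    for dirpath, _dirs, files in os.walk(root):
        for name in sorted(files):
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8") as fh:
                    lines = fh.read().splitlines()
            except (UnicodeDecodeError, OSError):
                continue                      # skip unreadable or binary files
            for i in range(0, len(lines), window):
                text = "\n".join(lines[i:i + window])
                score = cosine(q, embed(text))
                hits.append((score, os.path.relpath(path, root), i + 1, text))
    hits.sort(key=lambda h: h[0], reverse=True)
    return hits[:k]
```

Since nothing is persisted, a file written a moment ago is searchable on the very next call, at the price of re-embedding the directory each time.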
And the same recall is available without an agent at all. A workflow
task can carry a retrieve block that runs as a non-LLM step — semantic recall, zero model
reasoning, dropping ranked chunks into scratch/ for the next task to read:
#+begin_src retrieve :k 5
how do we handle auth headers
#+end_src
→ runs Library.search_dir, ranks the real org/code,
writes the top 5 chunks to scratch/ — no LLM call.
That's the deeper identity of this whole page. Vector search isn't an AI feature bolted onto a disk; it's a query modality, and an agent is just one of its consumers. The same surface answers a shell command, a cron job, and an agent alike — none of them assumes a model is in the loop.
where the index ENDS
Honesty section. This index has real ceilings, and naming them is the point.
- The SQLite path is O(n). It scans every vector per query. It stays pleasant into the tens of thousands and then tells you, once, to move to pgvector. That's a designed ceiling with a clear door, not a silent cliff.
- Static embeddings trade quality for size. Model2Vec is genuinely semantic, but a real transformer embedder beats it on hard paraphrase. The rungs above (OpenRouter, your own http: endpoint, CLIP) exist for exactly when you've outgrown it.
- The default isn't semantic. hash ships on so the construct works with zero download, but it's lexical. Real meaning starts at WB_EMBED=local, and that line is the single most important flip on this page.
- Separate models mean separate spaces. A text embedder and an image embedder that aren't the same model can't find each other: text-finds-text, image-finds-image. Cross-modal search (a text query finding an image) needs one shared space: the CLIP build or a single multimodal endpoint via WB_EMBED_MULTIMODAL=http:<url>.
- Images are opt-in. In-BEAM CLIP lives behind a WB_CLIP=1 build (~170 MB of models and runtime); the lean default carries zero native ML and gives an honest error if you ask for CLIP without building it. Audio isn't covered at all: video is handled as keyframes embedded like images, but there's no audio model.
- The faster SQLite ANN isn't built. sqlite-vec is named as the future upgrade for when the extension is loadable; it isn't wired today. And the ~30 MB local model downloads once from HuggingFace unless you bake it into the image.
questions people actually ASK
Does my text leave the machine?
Only if you choose a rung that sends it. On hash and
WB_EMBED=local nothing leaves — the embedder runs in-process, pure Elixir.
The openrouter and http: rungs do call out by design, because
that's the trade you opted into for higher quality. The default keeps your text in the
box.
Do tenants share the model?
Shared compute, never shared data. The local matrix loads once into
:persistent_term and one resident copy serves every tenant — but a tenant
is the first argument of every store and search call, and one tenant's vectors are never
visible to another. The same scoping rule as the disk itself.
What happens if the embedder is down?
Nothing crashes. An embed failure on write returns a clean
{:error, "embed failed: …"}; an unavailable embedder on read returns an
empty result, not an exception. Indexing is best-effort by design — a flaky embedder
degrades search, it doesn't break the workbook.
Can a text query find images?
Only when text and images live in one shared space. That means the CLIP build or a single multimodal endpoint, where a text query is embedded in the image model's space — for CLIP, that's the CLIP-text tower. Wire up a separate image model and a separate text model and they each only find their own kind; cross-modal needs the one joint space.
Why not just use a real vector database?
You can — that's what WB_DATABASE_URL does, turning the store into real
pgvector with ANN. The point is you don't have to stand one up to get useful
semantic search on a project-sized disk. The default rides in the same database as
everything else, and you graduate to a dedicated engine the day your scale asks for it,
by changing one variable.
Why is the default hash and not local?
So the construct works the instant you turn it on, with no download and no model
file — a zero-dependency baseline that's a smarter grep. It is not semantic, and the
page is loud about that. The first thing most people should do is set
WB_EMBED=local for real meaning at no recurring cost.
keep GOING
This index makes the disk findable-by-meaning — but it sits among neighbours, and it reads best once you've met them.