vectors — search the workbook by meaning

search finds words, not MEANING

The disk lesson ended on a magic trick: your file system is a database, so a listing is a query and "every file the agent touched this week" is one SELECT. True, and not enough. SQL, and its rougher cousin grep, match characters. Ask the disk for "the bit about retry backoff" and, if the note actually says "wait longer between attempts", you get zero rows. The words didn't match. The meaning was right there.

The industry's standard fix is heavy. Stand up a vector database — a whole new service. Call an embedding API for every query — a per-call cost, your text leaving the box, a key to manage. Glue the two together and keep them in sync. For a disk that rides inside a single file, that's absurd overhead. And the obvious cheap alternative — just grep harder — is the exact thing that already failed.

So the missing piece isn't a bigger database or a smarter API key. It's a second way to ask the disk a question — by meaning — that's light enough to ride in the same file, and free enough to leave on. That's what this lesson is about. It's the semantic twin of the literal-query lesson: same disk, same store, a different way to reach the bytes.

the DEFINITION

vec·tors /ˈvek·tərz/ noun

1. a tenant-scoped semantic index riding the same database as the rest of the workbook — text and files turned into vectors at store time, so a query can find by meaning, not just by word. One query modality of a queryable workbook — Query, not Memory.

The framing matters more than it looks. Vector search here is one modality — the literal SQL of the queries lesson is done, semantic is this page, a graph modality is future work. An agent's "memory" is one consumer of it, not its definition. So the construct is called Query, not Memory — and embeddings are pluggable and not required. Literal queries need no model at all; turn the embedder off and the disk is still fully queryable, just with one fewer way to ask.

four rungs of EMBEDDER

An embedder is the thing that turns text into a vector — a list of numbers where closeness means similar meaning. One environment variable, WB_EMBED, picks which one runs, and the rungs climb from zero-dependency to anything-you-want. The default is deliberately the cheapest one that needs no download:

rung	WB_EMBED	dim	quality	cost	where it runs
hash	`hash` (default)	256	lexical only — not semantic	zero — no model	pure Elixir, in-process
local	`local`	256	real semantics — dog↔puppy 0.70	free after one ~30MB download	pure Elixir, in-process
openrouter	`openrouter`	1536	transformer quality	per-call API — text leaves the box	OpenRouter, over the network
http / clip	`http:<url>`	any	anything, any modality	your endpoint, your bill	Modal / Replicate / your sidecar / opt-in CLIP build

The honest read of that table: hash is the zero-config baseline. It hashes character three-grams and word tokens into signed buckets, so it's a smarter grep, but it is lexically driven, not semantic. The moment you want real meaning, you flip WB_EMBED=local and get Model2Vec: genuine paraphrase matching, no GPU, no API, no recurring cost. Above that, openrouter reuses your OPENROUTER_API_KEY for transformer-grade quality at a per-call price; and the top rung, http, posts to any endpoint you name, whether that's your own GPU sidecar, a hosted model, or the opt-in CLIP build for images. Every rung answers the identical search call; nothing downstream knows which one ran. The next section opens the most interesting rung, the free local one.
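To make "signed buckets" concrete, here is a minimal sketch of feature hashing over word tokens and character three-grams. It illustrates the general technique only: the module name, the use of `:erlang.phash2/2`, and the tokenizing regex are assumptions, not the workbook's actual hash embedder.

```elixir
defmodule HashEmbedSketch do
  # Hypothetical module: illustrates signed feature hashing, not the real adapter.
  @dim 256

  # Turn text into a unit-length list of @dim floats.
  def embed(text) do
    words =
      text
      |> String.downcase()
      |> String.split(~r/[^\p{L}\p{N}]+/u, trim: true)

    features =
      Enum.map(words, &{:word, &1}) ++
        Enum.flat_map(words, fn w -> Enum.map(trigrams(w), &{:tri, &1}) end)

    # Each feature lands in one bucket, with a hashed sign of +1 or -1.
    counts =
      Enum.reduce(features, %{}, fn feat, acc ->
        bucket = :erlang.phash2({:bucket, feat}, @dim)
        sign = if :erlang.phash2({:sign, feat}, 2) == 0, do: 1.0, else: -1.0
        Map.update(acc, bucket, sign, &(&1 + sign))
      end)

    vec = for i <- 0..(@dim - 1), do: Map.get(counts, i, 0.0)
    normalize(vec)
  end

  # Dot product; equals cosine similarity for unit-length inputs.
  def cosine(a, b) do
    a |> Enum.zip(b) |> Enum.reduce(0.0, fn {x, y}, s -> s + x * y end)
  end

  defp trigrams(word) do
    word
    |> String.graphemes()
    |> Enum.chunk_every(3, 1, :discard)
    |> Enum.map(&Enum.join/1)
  end

  defp normalize(vec) do
    norm = :math.sqrt(Enum.reduce(vec, 0.0, fn x, s -> s + x * x end))
    if norm == 0.0, do: vec, else: Enum.map(vec, &(&1 / norm))
  end
end
```

Two texts score high only when they share words or character runs, which is why this rung behaves like a smarter grep rather than a semantic model.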

a model that is just a MATRIX

depth rung · skippable — how the free semantic rung works

Here's the aha that makes free semantic search possible: an embedding model doesn't have to be a model runtime. The local rung is Model2Vec, and a Model2Vec "model" is two files — a vocabulary and a precomputed matrix of vectors, roughly 8 to 30 MB of 32-bit floats. There is no transformer forward pass, no GPU, no native code. "Inference" is four steps of arithmetic: tokenize the text, look up each token's row in the matrix, average them, and normalize the result to length one. The whole thing is about two hundred lines of pure Elixir.

flowchart LR
  q["retry backoff"]
  subgraph tok["pure-Elixir WordPiece"]
    direction TB
    t1["retry"]
    t2["backoff"]
  end
  subgraph ids["token ids — line number in vocab.txt"]
    direction TB
    i1["id 8431"]
    i2["id 12907"]
  end
  subgraph mat["the matrix — one refc binary, ~30MB"]
    direction TB
    o1["offset 8431·256·4 → 256 floats"]
    o2["offset 12907·256·4 → 256 floats"]
  end
  avg["average the rows
÷ L2 norm"]
  vec["the query vector
256 numbers"]
  q --> tok --> ids --> mat --> avg --> vec
  style q fill:#ffffff,stroke:#121316
  style tok fill:#fbfaf6,stroke:#121316
  style ids fill:#fbfaf6,stroke:#121316
  style mat fill:#fbfaf6,stroke:#121316
  style avg fill:#aee5c2,stroke:#121316
  style vec fill:#a8d4f0,stroke:#121316,stroke-width:2.5px

Walk that graph with the words "retry backoff": the pure-Elixir WordPiece tokenizer downcases the text and splits it into "retry" and "backoff" by longest-prefix matching against the vocabulary; an unmatchable word becomes [UNK]. Each token's id is simply its line number in vocab.txt. That id is the only address you need: the token's vector is a byte slice of the matrix, starting id · 256 · 4 bytes in and decoded as 256 little-endian floats. Average the two rows, divide by the L2 norm, and that's the entire inference.
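The lookup, average, and normalize steps can be sketched in a few lines. This is a toy under stated assumptions: the module name is hypothetical, the width is 4 instead of 256 so the matrix stays readable, and a whitespace split stands in for the real WordPiece tokenizer.

```elixir
defmodule StaticEmbedSketch do
  # Toy width for readability; the lesson's local rung uses 256.
  @dim 4
  @row_bytes @dim * 4

  # vocab: %{token => id}; matrix: one binary of little-endian float32 rows.
  def embed(text, vocab, matrix) do
    unk = Map.fetch!(vocab, "[UNK]")

    text
    |> String.downcase()
    |> String.split()
    |> Enum.map(fn token -> row(matrix, Map.get(vocab, token, unk)) end)
    |> average()
    |> normalize()
  end

  # A token's vector is a byte slice: offset = id * dim * 4.
  defp row(matrix, id) do
    bytes = binary_part(matrix, id * @row_bytes, @row_bytes)
    for <<f::float-little-32 <- bytes>>, do: f
  end

  defp average([]), do: []

  defp average(rows) do
    n = length(rows)
    Enum.zip_with(rows, fn column -> Enum.sum(column) / n end)
  end

  defp normalize(vec) do
    norm = :math.sqrt(Enum.reduce(vec, 0.0, fn x, acc -> acc + x * x end))
    if norm == 0.0, do: vec, else: Enum.map(vec, &(&1 / norm))
  end

  # Helper for the toy example: pack rows of floats into one binary.
  def pack(rows) do
    for row <- rows, x <- row, into: <<>>, do: <<x::float-little-32>>
  end
end

# Toy vocabulary and matrix (made-up values, not real model weights).
vocab = %{"[UNK]" => 0, "retry" => 1, "backoff" => 2}

matrix =
  StaticEmbedSketch.pack([
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0]
  ])

IO.inspect(StaticEmbedSketch.embed("Retry backoff", vocab, matrix))
```

Keeping the matrix as one binary and slicing by offset mirrors the by-offset lookup the lesson describes; no per-token list of floats exists until a row is actually read.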

Two design choices keep it cheap. The matrix loads once into :persistent_term and one resident copy serves every tenant — shared compute, never shared data, the same isolation rule the disk follows. And it stays a single reference-counted binary with by-offset lookup rather than a map of float-lists, which would balloon inside the BEAM. The model file itself is a safetensors blob whose data section is the matrix — the parser is six lines: read the header length, read the JSON header, and the rest of the file is the numbers.
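As a rough illustration of how small that parser can be: a safetensors file starts with an 8-byte little-endian header length, then that many bytes of JSON, then the raw tensor data. The sketch below splits those three parts and parks the data section in `:persistent_term`. The module name and the storage key are hypothetical, and it assumes a single-tensor file, so the data section is the matrix.

```elixir
defmodule SafetensorsSketch do
  # Hypothetical key; the real store may name it differently.
  @key {__MODULE__, :matrix}

  # 8-byte little-endian length, then the JSON header, then the tensor bytes.
  def split(<<header_len::unsigned-little-64, header::binary-size(header_len), data::binary>>),
    do: {:ok, header, data}

  def split(_other), do: {:error, :not_safetensors}

  # Load once: the data section stays a single shared binary.
  def load(blob) do
    with {:ok, _header_json, data} <- split(blob) do
      :persistent_term.put(@key, data)
      :ok
    end
  end

  # Every caller reads the same resident copy.
  def matrix, do: :persistent_term.get(@key)
end
```

Reads from `:persistent_term` do not copy the term into the calling process, which is what lets one resident matrix serve every tenant.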

Does it actually work? The receipt: dog and puppy land at cosine 0.70 — synonym and paraphrase matching the lexical hash baseline simply cannot do. Configuration is one variable each: WB_EMBED=local turns it on; the default model is minishlab/potion-base-8M, fetched once from HuggingFace to <WB_DATA>/_models/ (or pre-baked into the image at /opt/models for zero download); WB_EMBED_MODEL swaps the potion variant.

embed on WRITE, not on read

Here's the load-bearing decision that makes any embedder, local or paid, affordable: you embed content at store time, not query time. Writes are rare; reads are hot. So each chunk hits the embedder once, when its file is saved, and never again on the way out. A query embeds only its own short string, then becomes a pure lookup against vectors that already exist. Even with a network embedder you pay per call, the bulk of the bill lands on writes: a query against a thousand stored vectors costs one embedding call, not a thousand.

sequenceDiagram
  participant W as a write — store/3
  participant I as index — chunk + embed
  participant E as the embedder
  participant V as the vector store
  participant Q as a query — search
  rect rgb(251,250,246)
  Note over W,V: the heavy path — runs on WRITES (rare)
  W->>I: Vector.forget · then re-index
  I->>I: chunk — org by headline, code by 20-line windows
  I->>E: batch embed every chunk
  E-->>I: vectors
  I->>V: upsert each chunk + its source metadata
  end
  rect rgb(232,243,250)
  Note over Q,V: the skinny path — runs on READS (hot)
  Q->>E: embed the query string once
  Q->>V: nearest-neighbour lookup — no model touched per row
  V-->>Q: ranked hits, highest score first
  end

Read the diagram as two paths of very different weight. The top half is the write path inside Library.store: it forgets any old index for the member so a re-index is always clean, chunks the content, batch-embeds every chunk, and upserts each one — carrying its source workbook, path, and headline so a hit can point back to where it came from. An embed failure returns a clean error and never crashes the store. The bottom half is the read path: embed the query string once, then a nearest-neighbour lookup that touches no model per row. Heavy work, done rarely; light work, done constantly.

Chunking falls straight out of the grammar. An .org file splits into one chunk per heading section, headline extracted; everything else — code, markdown, plain text across a generous allowlist — splits into 20-line windows labelled lines 1–20 and so on. And a member referenced across workbooks by its identity (DID) is resolved and indexed too, so a query can reach into a federated document, not just the local tree.
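The fixed-window half of that chunking is simple enough to sketch. This is an illustration with a hypothetical module name and map shape; the real chunker may treat trailing newlines or labels differently.

```elixir
defmodule WindowChunkSketch do
  @window 20

  # Split text into 20-line windows, each labelled with its line range.
  def chunk(text) do
    text
    |> String.split("\n")
    |> Enum.chunk_every(@window)
    |> Enum.with_index()
    |> Enum.map(fn {lines, i} ->
      first = i * @window + 1
      last = first + length(lines) - 1

      %{
        first: first,
        last: last,
        headline: "lines #{first}–#{last}",
        text: Enum.join(lines, "\n")
      }
    end)
  end
end
```

A 45-line file yields three chunks covering lines 1 to 20, 21 to 40, and 41 to 45, so a hit can point back to a specific region rather than the whole file.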

one interface, two ENGINES

The cleverest part of the store is the part you never see. There is one Vector.search(tenant, query_vec, k: 5) call, and it runs against one of two completely different engines depending on a single fact about your environment — and no caller anywhere branches on which.

	SQLite — the default	pgvector — set WB_DATABASE_URL
storage	vectors as JSON text in a column	a real `vector(dim)` column
ranking	brute-force cosine, in Elixir	ANN in the DB via the `<=>` operator
complexity	O(n) per query — every row scanned	sub-linear, indexable
scope filter	filtered in Elixir, after the scan	`workbook = ANY(...)` pushed into SQL
when	live-tested default; fine to tens of thousands	built and shape-tested; the answer at scale

The verdict of that table: you chose your vector database when you chose your database. With no WB_DATABASE_URL, you're on SQLite — vectors stored as JSON, cosine computed in Elixir over every row, sorted, top-k taken. It's O(n), and it is honest about being O(n). Set WB_DATABASE_URL and the same calls become real pgvector: the store auto-runs CREATE EXTENSION IF NOT EXISTS vector, sizes the column from the active embedder's dimension, and ranks in-database with ORDER BY vec <=> q LIMIT k — indexable and sub-linear. Flipping that one variable is the entire migration.

-- SQLite: brute-force, honest about the cost
SELECT … FROM vectors WHERE tenant = ?1;
-- → cosine in Elixir over every row → sort → take k.  O(n).

-- Postgres: the database does the ranking ($2 is the query vector)
SELECT id, workbook, path, headline,
       1 - (vec <=> $2::vector) AS score, text
  FROM vectors WHERE tenant = $1
 ORDER BY vec <=> $2::vector LIMIT 5;
-- → sub-linear, indexable.  same call, no caller changed.

And the O(n) cost on SQLite is made visible rather than hidden. Past 25,000 vectors, the store logs exactly once: brute-force search over N vectors on SQLite (O(n) per query) — for sub-linear ANN at scale, set WB_DATABASE_URL → pgvector. The design reasoning is deliberate: a worker pool would only buy a constant-factor speedup (your core count), while the real fix at scale is sub-linear ANN. So instead of hiding the linear cost behind threads, the store tells you when you've outgrown the cheap default and points at the door.
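For a feel of what the SQLite path does once the rows are in memory, here is a minimal brute-force sketch: score every row, filter by scope, sort, take k. The module name, row shape, and `:workbooks` option are assumptions for illustration, and the rows are assumed to be already decoded from their JSON column.

```elixir
defmodule BruteForceSketch do
  # rows: [%{id: term, workbook: String.t(), vec: [number]}], one tenant only.
  def search(rows, query_vec, opts \\ []) do
    k = Keyword.get(opts, :k, 5)
    workbooks = Keyword.get(opts, :workbooks)

    rows
    # Scope filter happens in Elixir, not in the database.
    |> Enum.filter(fn r -> workbooks == nil or r.workbook in workbooks end)
    # O(n): every remaining row gets a cosine score.
    |> Enum.map(fn r -> Map.put(r, :score, cosine(r.vec, query_vec)) end)
    |> Enum.sort_by(& &1.score, :desc)
    |> Enum.take(k)
  end

  def cosine(a, b) do
    denom = :math.sqrt(dot(a, a)) * :math.sqrt(dot(b, b))
    if denom == 0.0, do: 0.0, else: dot(a, b) / denom
  end

  defp dot(a, b) do
    a |> Enum.zip(b) |> Enum.reduce(0.0, fn {x, y}, s -> s + x * y end)
  end
end
```

Spreading the map step across cores would divide the work by a constant, but every query would still touch every row, which is the lesson's argument for pointing at pgvector instead.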

the machine PROBE

How does the system know which rung your hardware can actually run? It probes. At boot, a pure-detection capability check reads the OS, core count, RAM, and accelerator — CUDA if nvidia-smi is on the PATH, Metal on Apple silicon, otherwise none — and runs a cond ladder to a recommended tier. No inference happens; it's detection only.

flowchart TD
  start["probe: os · cores · ram · accelerator"]
  q1{"accelerator ≠ none
AND ram ≥ 16 GB?"}
  q2{"cores ≥ 4
AND ram ≥ 4 GB?"}
  big["big_multimodal
a 7B VLM, run locally"]
  clip["clip
image + text, joint space"]
  m2v["model2vec
text only — always works"]
  start --> q1
  q1 -- yes --> big
  q1 -- no --> q2
  q2 -- yes --> clip
  q2 -- no --> m2v
  style start fill:#ffffff,stroke:#121316
  style q1 fill:#fbfaf6,stroke:#121316
  style q2 fill:#fbfaf6,stroke:#121316
  style big fill:#f3c5a3,stroke:#121316
  style clip fill:#a8d4f0,stroke:#121316
  style m2v fill:#aee5c2,stroke:#121316,stroke-width:2.5px

Walk the ladder top to bottom. If there's a real accelerator and at least 16 GB of RAM, it recommends big_multimodal — the class that runs a 7B vision-language model locally. Failing that, if you have at least four cores and 4 GB, it recommends clip — image and text in one joint space. And if neither holds, it lands on model2vec: text only, and the one that always works on any machine. One honest caveat: this recommendation is advisory. It's printed in the boot line — search config — embed: <adapter>, vectors: <backend>, machine recommends: <tier> — but it does not auto-switch your embedder. WB_EMBED stays the operator's call; the probe only tells you what your box could handle.
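The ladder itself fits in one `cond`. The sketch below is an assumption-laden illustration: the module name, the map shape, and the Apple-silicon check are hypothetical, and RAM is passed in rather than measured, since the source does not say how it is read.

```elixir
defmodule ProbeSketch do
  # Pure decision: map detected facts to a recommended tier. No inference runs.
  def recommend(%{accelerator: acc, ram_gb: ram, cores: cores}) do
    cond do
      acc != :none and ram >= 16 -> :big_multimodal
      cores >= 4 and ram >= 4 -> :clip
      true -> :model2vec
    end
  end

  # Detection only: CUDA if nvidia-smi is on the PATH, Metal on Apple silicon.
  def accelerator do
    cond do
      System.find_executable("nvidia-smi") -> :cuda
      apple_silicon?() -> :metal
      true -> :none
    end
  end

  def cores, do: System.schedulers_online()

  # Assumption: Apple silicon shows up as darwin on an aarch64 architecture.
  defp apple_silicon? do
    arch = :erlang.system_info(:system_architecture) |> List.to_string()
    match?({:unix, :darwin}, :os.type()) and String.starts_with?(arch, "aarch64")
  end
end
```

Because `recommend/1` only returns an atom, printing it in a boot line while leaving WB_EMBED untouched keeps the probe advisory, as the lesson describes.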

semantic, AND literal

depth rung · skippable — how the default search fuses two rankings

Pure semantic search has a blind spot: ask for an exact error code or a function name and meaning-similarity can drift right past the literal string you typed. Pure literal search has the opposite blind spot — it's the grep we started by rejecting. So the default search mode is hybrid: it runs both and fuses them.

flowchart TD
  query["the query"]
  sem["semantic pool
vector search, top 50"]
  lit["literal pass
count query terms in each chunk"]
  rrf["reciprocal-rank fusion
score = Σ 1 / (60 + rank)
absent ranking → rank 1000"]
  out["one ranked list
highest score first"]
  query --> sem --> rrf
  query --> lit --> rrf --> out
  style query fill:#ffffff,stroke:#121316
  style sem fill:#a8d4f0,stroke:#121316
  style lit fill:#f2ddb0,stroke:#121316
  style rrf fill:#aee5c2,stroke:#121316
  style out fill:#13d943,stroke:#121316,stroke-width:2.5px

Read the graph as two scouts reporting to one judge. The first scout, semantic, pulls a pool of fifty meaning-matches from the vector store. The second, literal, scores each chunk by how many of your query terms it literally contains. The judge combines them with reciprocal-rank fusion: each chunk's score is the sum, over the two lists, of 1 / (60 + its rank in that list), and a chunk absent from a list is penalised as if it had placed 1000th there. The result is one list where a chunk has to do well in at least one scout's eyes to surface, and a chunk strong in both rockets to the top. When you want only one scout, --semantic and --literal are the escape hatches. Fusion is the right default, though, because semantic finds the meaning and literal pins the exact terms, and most real questions need both.
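A minimal sketch of the literal pass and the fusion step, using the constants from the diagram (60 and rank 1000). The module and function names are hypothetical, ranks are assumed to start at 1, and the literal pass uses simple substring matching as a stand-in for whatever term counting the real search performs.

```elixir
defmodule RrfSketch do
  @k 60
  @absent_rank 1000

  # Literal pass: rank chunk ids by how many query terms each text contains.
  def literal_rank(chunks, query) do
    terms = query |> String.downcase() |> String.split()

    chunks
    |> Enum.map(fn {id, text} ->
      lower = String.downcase(text)
      {id, Enum.count(terms, &String.contains?(lower, &1))}
    end)
    |> Enum.filter(fn {_id, hits} -> hits > 0 end)
    |> Enum.sort_by(fn {_id, hits} -> hits end, :desc)
    |> Enum.map(fn {id, _hits} -> id end)
  end

  # Fuse two rankings (lists of ids, best first) into {id, score}, best first.
  def fuse(semantic, literal) do
    sem = ranks(semantic)
    lit = ranks(literal)

    (semantic ++ literal)
    |> Enum.uniq()
    |> Enum.map(fn id ->
      score =
        1 / (@k + Map.get(sem, id, @absent_rank)) +
          1 / (@k + Map.get(lit, id, @absent_rank))

      {id, score}
    end)
    |> Enum.sort_by(fn {_id, score} -> score end, :desc)
  end

  # 1-based rank of each id in its list.
  defp ranks(ids), do: ids |> Enum.with_index(1) |> Map.new()
end
```

With a semantic list of `[:a, :b, :c]` and a literal list of `[:c, :d]`, chunk `:c` comes out first: it is the only one present in both lists, so it collects two full-sized terms while the others get one plus the small rank-1000 penalty term.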

the files ARE the memory

This is where the "Query, not Memory" framing pays off concretely. An agent here has a search tool described as recall by meaning — and crucially, there is no separate memory store behind it. The tool calls a stateless recall over the agent's working directory: it chunks, embeds, and ranks the actual files on the fly, with no stored index to drift out of sync. To "remember" something, the agent writes an org file with vfs_write; it becomes searchable automatically, because searching reads the real org and code context, not a summary of it. This is the retrieval half of the memory-is-a-workbook-of-plain-files claim.

And the same recall is available without an agent at all. A workflow task can carry a retrieve block that runs as a non-LLM step — semantic recall, zero model reasoning, dropping ranked chunks into scratch/ for the next task to read:

#+begin_src retrieve :k 5
how do we handle auth headers
#+end_src
   → runs Library.search_dir, ranks the real org/code,
     writes the top 5 chunks to scratch/ — no LLM call.

That's the deeper identity of this whole page. Vector search isn't an AI feature bolted onto a disk; it's a query modality, and an agent is just one of its consumers. The same surface answers a shell command, a cron job, and an agent alike — none of them assumes a model is in the loop.

where the index ENDS

Honesty section. This index has real ceilings, and naming them is the point.

The SQLite path is O(n). It scans every vector per query. It stays pleasant into the tens of thousands and then tells you, once, to move to pgvector. That's a designed ceiling with a clear door, not a silent cliff.
Static embeddings trade quality for size. Model2Vec is genuinely semantic, but a real transformer embedder beats it on hard paraphrase. The rungs above — OpenRouter, your own http: endpoint, CLIP — exist for exactly when you've outgrown it.
The default isn't semantic. hash ships on so the construct works with zero download — but it's lexical. Real meaning starts at WB_EMBED=local, and that line is the single most important flip on this page.
Separate models mean separate spaces. A text embedder and an image embedder that aren't the same model can't find each other — text-finds-text, image-finds-image. Cross-modal search (a text query finding an image) needs one shared space: the CLIP build or a single multimodal endpoint via WB_EMBED_MULTIMODAL=http:<url>.
Images are opt-in. In-BEAM CLIP lives behind a WB_CLIP=1 build (~170 MB of models and runtime); the lean default carries zero native ML and gives an honest error if you ask for CLIP without building it. Audio isn't covered at all — video is handled as keyframes embedded like images, but there's no audio model.
The faster SQLite ANN isn't built. sqlite-vec is named as the future upgrade for when the extension is loadable — it isn't wired today. And the ~30 MB local model downloads once from HuggingFace unless you bake it into the image.

questions people actually ASK

Does my text leave the machine?

Only if you choose a rung that sends it. On hash and WB_EMBED=local nothing leaves — the embedder runs in-process, pure Elixir. The openrouter and http: rungs do call out by design, because that's the trade you opted into for higher quality. The default keeps your text in the box.

Do tenants share the model?

Shared compute, never shared data. The local matrix loads once into :persistent_term and one resident copy serves every tenant — but a tenant is the first argument of every store and search call, and one tenant's vectors are never visible to another. The same scoping rule as the disk itself.

What happens if the embedder is down?

Nothing crashes. An embed failure on write returns a clean {:error, "embed failed: …"}; an unavailable embedder on read returns an empty result, not an exception. Indexing is best-effort by design — a flaky embedder degrades search, it doesn't break the workbook.

Can a text query find images?

Only when text and images live in one shared space. That means the CLIP build or a single multimodal endpoint, where a text query is embedded in the image model's space — for CLIP, that's the CLIP-text tower. Wire up a separate image model and a separate text model and they each only find their own kind; cross-modal needs the one joint space.

Why not just use a real vector database?

You can — that's what WB_DATABASE_URL does, turning the store into real pgvector with ANN. The point is you don't have to stand one up to get useful semantic search on a project-sized disk. The default rides in the same database as everything else, and you graduate to a dedicated engine the day your scale asks for it, by changing one variable.

Why is the default hash and not local?

So the construct works the instant you turn it on, with no download and no model file — a zero-dependency baseline that's a smarter grep. It is not semantic, and the page is loud about that. The first thing most people should do is set WB_EMBED=local for real meaning at no recurring cost.

keep GOING

This index makes the disk findable-by-meaning — but it sits among neighbours, and it reads best once you've met them.

The VFS: the disk this index makes findable by meaning

Queries: the literal SQL twin of this page

Agents: the consumer with the recall tool

Org: where chunking comes from, one chunk per headline