Search & Indexing
mdvs builds a search index by chunking your markdown content, embedding it with a local model, and storing chunks + vectors + frontmatter in a single LanceDB dataset. Queries are served by LanceDB natively — semantic (vector), full-text (BM25), or hybrid (both, reranked) — with optional SQL filtering on frontmatter fields.
Building the index
mdvs build (or mdvs init with auto-build) creates the search index in three steps: chunk, embed, store.
Chunking
Each file’s markdown body is split into semantic chunks — respecting headings, paragraphs, and code blocks rather than cutting at arbitrary character boundaries. The maximum chunk size is configurable (default 1024 characters) via the [chunking] section in mdvs.toml:
[chunking]
max_chunk_size = 1024
Each chunk tracks its start and end line numbers in the original file, so search results can point to the exact location.
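To make the line tracking concrete, here is a minimal sketch of a size-limited chunker. It is not mdvs's actual chunker: it only splits on blank lines and headings (it does not understand code blocks), and the names `Chunk` and `chunk_markdown` are hypothetical. Line numbers are relative to the text passed in, so a real tool would presumably add an offset for the frontmatter.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    start_line: int  # 1-based, inclusive
    end_line: int    # 1-based, inclusive


def chunk_markdown(body: str, max_chunk_size: int = 1024) -> list[Chunk]:
    """Greedy packer: split on blank lines, then merge neighbouring
    paragraphs while the merged text fits in max_chunk_size characters.
    A heading always starts a new chunk."""
    # Collect paragraphs as (text, first_line, last_line).
    paragraphs = []
    current, start = [], 0
    for lineno, line in enumerate(body.splitlines(), start=1):
        if line.strip():
            if not current:
                start = lineno
            current.append(line)
        elif current:
            paragraphs.append(("\n".join(current), start, lineno - 1))
            current = []
    if current:
        paragraphs.append(("\n".join(current), start, start + len(current) - 1))

    chunks: list[Chunk] = []
    for text, first, last in paragraphs:
        prev = chunks[-1] if chunks else None
        fits = prev is not None and len(prev.text) + 2 + len(text) <= max_chunk_size
        if fits and not text.startswith("#"):
            prev.text += "\n\n" + text
            prev.end_line = last
        else:
            # Assumption of this sketch: an oversized paragraph is kept whole.
            chunks.append(Chunk(text, first, last))
    return chunks
```

Each returned chunk carries the line span it came from, which is the information a search result needs to point back into the file.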
Embedding
Chunks are embedded into dense vectors using a local Model2Vec model by Minish — static embeddings that run on CPU with no external services or GPU required. The model is downloaded from HuggingFace to the local cache on first use.
[embedding_model]
provider = "model2vec"
name = "minishlab/potion-base-8M"
The default is potion-base-8M, a good balance of size and quality. The full POTION family:
| Model | Parameters | Notes |
|---|---|---|
| minishlab/potion-base-2M | 2M | Smallest, fastest |
| minishlab/potion-base-8M | 8M | Default (good balance) |
| minishlab/potion-base-32M | 32M | Higher quality, slower |
| minishlab/potion-retrieval-32M | 32M | Optimized for retrieval tasks |
| minishlab/potion-multilingual-128M | 128M | 101 languages |
Any Model2Vec-compatible model on HuggingFace works — set the name to its model ID. You can pin a specific revision for reproducibility.
Storage
A single Lance dataset is written to .mdvs/index.lance/ — one row per chunk, with everything you need on the same row:
| Column | Purpose |
|---|---|
| chunk_id, file_id, chunk_index, start_line, end_line | Chunk identity and source location |
| chunk_text | The plain-text chunk body; used by the full-text index and shown as the snippet in verbose results |
| embedding | Dense vector for semantic search (FixedSizeList<Float32>) |
| filepath, content_hash, built_at | Per-file metadata (duplicated on each of that file’s chunks) |
| data | Frontmatter as an Arrow Struct (nested for dotted-name fields); this is what --where filters query against |
Inside the dataset, two indexes are built at mdvs build time:
- A full-text BM25 index on chunk_text, always built.
- A cosine IVF-PQ vector index on embedding, only built when the index has at least ~10,000 chunks. Smaller vaults use LanceDB’s exact flat scan, which is plenty fast at that scale.
Incremental builds
Build only re-embeds what changed. Each file’s markdown body (excluding frontmatter) is hashed, and the hash is compared against the existing index:
| Classification | Condition | Action |
|---|---|---|
| New | File not in index | Chunk, embed, add |
| Edited | Hash changed | Re-chunk, re-embed, replace chunks |
| Unchanged | Hash matches | Keep existing chunks |
| Removed | In index but not on disk | Drop file and its chunks |
Frontmatter-only changes (adding a tag, fixing a typo in author) rewrite the data column on every chunk row without re-embedding — the body hash hasn’t changed, so the vectors are still valid.
When nothing needs embedding, the model isn’t even loaded. A --force flag triggers a full rebuild regardless of hashes.
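The classification table can be sketched in a few lines. This is an illustration, not mdvs's code: the hash function (SHA-256 here) is an assumption, since the algorithm is not named above, and `body_hash`, `classify`, and `needs_model` are hypothetical names.

```python
import hashlib


def body_hash(body: str) -> str:
    # Assumption: SHA-256 over the UTF-8 body (frontmatter already stripped).
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def classify(on_disk: dict[str, str], indexed: dict[str, str]) -> dict[str, list[str]]:
    """on_disk maps filepath -> markdown body; indexed maps filepath -> stored hash."""
    plan: dict[str, list[str]] = {"new": [], "edited": [], "unchanged": [], "removed": []}
    for path, body in on_disk.items():
        if path not in indexed:
            plan["new"].append(path)
        elif indexed[path] != body_hash(body):
            plan["edited"].append(path)
        else:
            plan["unchanged"].append(path)
    plan["removed"] = [path for path in indexed if path not in on_disk]
    return plan


def needs_model(plan: dict[str, list[str]]) -> bool:
    # Only new or edited files require embedding, so only they require the model.
    return bool(plan["new"] or plan["edited"])
```

Because frontmatter is excluded from the hash, a frontmatter-only edit lands in "unchanged" and `needs_model` stays false.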
How search works
When you run mdvs search "query" example_kb, LanceDB does the heavy lifting. The shape of the work depends on --mode (default hybrid):
- semantic: the query is embedded with the same model used during build, and chunks are ranked by cosine similarity against embedding. Up to ~10,000 chunks, LanceDB does an exact flat scan; above that, the IVF-PQ vector index narrows the candidate set first.
- fulltext: the query is tokenized and scored against the BM25 full-text index on chunk_text. No model load needed.
- hybrid: both of the above run in parallel and their result lists are combined by LanceDB’s Reciprocal Rank Fusion reranker. Default mode because it tolerates queries that are either keyword-y or fuzzy.
For guidance on which mode to reach for, see Search Modes.
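For intuition about how Reciprocal Rank Fusion combines two ranked lists, here is a small standalone sketch of the textbook formula: each item scores the sum of 1 / (k + rank) over the lists it appears in. The function name `rrf_fuse` is hypothetical, and k = 60 is the constant commonly used for RRF; the value LanceDB applies internally is not stated here.

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several best-first rankings into one list of (item, score)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            # An item gains more the nearer it is to the top of a list.
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


semantic = ["c1", "c2", "c3"]   # chunk ids ranked by vector similarity
fulltext = ["c3", "c1", "c4"]   # chunk ids ranked by BM25
fused = rrf_fuse([semantic, fulltext])
```

Only ranks enter the formula, never the raw cosine or BM25 values, which is why the two score scales do not need to be normalized against each other.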
After LanceDB returns ranked chunk rows, mdvs deduplicates to the best chunk per file (a file with one highly relevant section ranks above a file with uniformly mediocre content) and then trims to --limit (default 10). LanceDB is asked for limit × 3 candidates to make sure dedupe has enough material to work with.
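The dedupe-then-trim step can be sketched as follows, assuming the rows arrive best-first and each carries a filepath key (as in the storage table). The function name is hypothetical and this is an illustration rather than mdvs's implementation.

```python
def best_chunk_per_file(rows: list[dict], limit: int = 10) -> list[dict]:
    """Keep the first (highest-ranked) row for each filepath, up to limit files."""
    seen: set[str] = set()
    results: list[dict] = []
    for row in rows:  # rows are assumed to be sorted best-first
        if row["filepath"] in seen:
            continue
        seen.add(row["filepath"])
        results.append(row)
        if len(results) == limit:
            break
    return results


# With limit = 2, six candidates (limit x 3) would be requested upstream.
candidates = [
    {"filepath": "a.md", "chunk_index": 2, "score": 0.9},
    {"filepath": "a.md", "chunk_index": 0, "score": 0.8},
    {"filepath": "b.md", "chunk_index": 1, "score": 0.7},
    {"filepath": "c.md", "chunk_index": 0, "score": 0.6},
]
top = best_chunk_per_file(candidates, limit=2)
```

Over-fetching matters because several top candidates can belong to the same file; without the extra rows, dedupe could return fewer files than the limit.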
Scores
The score column in search output depends on the mode:
- Semantic: cosine similarity, a value in roughly [0, 1] (higher = more similar).
- Fulltext: BM25 relevance score, unbounded above (higher = better match).
- Hybrid: RRF score, a sum of reciprocal-rank terms; values are small and only meaningful relative to each other (higher = better match).
Scores depend on the mode, the model, and the content, so there’s no universal threshold for “relevant.” Compare scores relative to each other within a single query.
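As a reference point for the semantic score, this is the standard cosine similarity formula in plain Python. It is a generic sketch, not code from mdvs or LanceDB. Mathematically the value lies in [-1, 1]; negative values need vectors pointing in opposing directions, which is why the range is described as only roughly [0, 1].

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    # Convention of this sketch: a zero vector is similar to nothing.
    return dot / norm if norm else 0.0


same = cosine_similarity([1.0, 0.0], [1.0, 0.0])         # identical direction
partial = cosine_similarity([1.0, 1.0], [1.0, 0.0])      # 45 degrees apart
unrelated = cosine_similarity([1.0, 0.0], [0.0, 1.0])    # orthogonal
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])    # opposing direction
```

The score depends only on direction, not vector length, so two chunks of very different sizes can still score as near-identical.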
Filtering with --where
Add a SQL filter to narrow results by frontmatter fields:
mdvs search "calibration" example_kb --where "status = 'active'"
The --where clause filters on frontmatter fields — only chunks whose file matches the filter are included in the results. The filter and similarity ranking are combined in a single LanceDB query, so non-matching rows are excluded efficiently.
You can use any SQL expression that LanceDB’s filter supports:
--where "draft = false"
--where "status = 'active' AND author = 'Giulia Ferretti'"
--where "sample_count > 10"
Array fields, nested objects, and field names with special characters require specific syntax — see the Search Guide for the full reference.
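When the filter is assembled by a script, quoting is the usual stumbling block. The sketch below builds the argument list for the search command using only the flags mentioned on this page (--where, --limit, --mode). The helper names are hypothetical, and doubling single quotes is standard SQL escaping that is assumed, not confirmed, to apply to LanceDB's filter dialect.

```python
def sql_string(value: str) -> str:
    """Quote a Python string as a SQL string literal (single quotes doubled)."""
    return "'" + value.replace("'", "''") + "'"


def search_argv(query, path, where=None, limit=None, mode=None) -> list[str]:
    """Build an argv list for the search command; no shell quoting is needed."""
    argv = ["mdvs", "search", query, path]
    if where:
        argv += ["--where", where]
    if limit is not None:
        argv += ["--limit", str(limit)]
    if mode:
        argv += ["--mode", mode]
    return argv


where = f"status = 'active' AND author = {sql_string('Giulia Ferretti')}"
argv = search_argv("calibration", "example_kb", where=where, limit=5)
# Pass argv to subprocess.run(argv) to execute it without going through a shell.
```

Passing the arguments as a list keeps the SQL filter as one argument, so its spaces and quotes are never reinterpreted by a shell.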
Model identity
Search refuses to run if the model configured in mdvs.toml doesn’t match the model that was used to build the index. This is a hard error, not a warning.
Embeddings from different models are incompatible — cosine similarity between vectors from different models produces meaningless scores. If you change the model, rebuild the index with mdvs build --force.
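The guard amounts to an equality check that fails loudly. The following sketch assumes the model used at build time is recorded somewhere alongside the index; where and how mdvs stores it is not described here, so `built_with`, `check_model`, and `ModelMismatchError` are hypothetical.

```python
class ModelMismatchError(RuntimeError):
    """Raised when the configured model differs from the one that built the index."""


def check_model(configured: dict, built_with: dict) -> None:
    # Compare the fields shown in the [embedding_model] section of mdvs.toml.
    for key in ("provider", "name"):
        if configured.get(key) != built_with.get(key):
            raise ModelMismatchError(
                f"{key} mismatch: config has {configured.get(key)!r}, "
                f"index was built with {built_with.get(key)!r}; "
                "rebuild with: mdvs build --force"
            )


config = {"provider": "model2vec", "name": "minishlab/potion-base-8M"}
check_model(config, dict(config))  # identical models: returns without raising
```

Raising instead of warning matches the reasoning above: a query vector from one model compared against stored vectors from another would still return scores, just meaningless ones.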