Introduction
mdvs treats your markdown directory like a database. It scans your files, infers a typed schema from frontmatter, validates it, and builds a local search index — all in a single binary with no external services.
Not a document database. A database for documents.
The problem
Markdown directories grow organically. You start with a few notes, add frontmatter when it’s useful, and eventually have hundreds of files with inconsistent metadata. Tags are misspelled. Required fields are missing. You can’t find anything without grep.
mdvs gives you structure without forcing you to change how you write.
Frontmatter
Frontmatter is the YAML block between --- fences at the top of a markdown file. It stores structured metadata alongside your content:
---
title: "Experiment A-017: SPR-A1 baseline calibration" # String
status: completed # String
author: Giulia Ferretti # String
draft: false # Boolean
priority: 2 # Integer
drift_rate: 0.023 # Float
tags: # Array(String)
- calibration
- SPR-A1
- baseline
---
# Your markdown content starts here...
mdvs recognizes these types automatically. When it scans your files, it infers the type of each field from the values it finds — no configuration needed.
TOML (+++) and JSON ({...}) frontmatter are also supported, auto-detected per file. This guide uses YAML throughout; see [scan].frontmatter_format for the format knob and the Hugo recipe for mixed-format vaults.
Directory-aware schema
mdvs infers a three-dimensional schema from your files:
- Types — boolean, integer, float, string, arrays, nested objects. Inferred automatically, with widening when files disagree.
- Paths — which fields belong in which directories. draft only in blog/, sensor_type only in projects/alpha/notes/. Captured as allowed and required glob patterns.
- Nullability — whether a field can be null. Tracked per field.
This means different directories can have different fields with different constraints — all inferred automatically from your existing files.
Tightest fit: mdvs init infers the strictest schema that’s consistent with your existing files. A field is inferred as allowed in a directory if at least one file there has it. It’s inferred as required if every file there has it. These rules propagate up — if every subdirectory requires a field, the parent directory does too. The result is the tightest set of constraints where check still returns zero violations. You can always loosen them later.
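The tightest-fit rule can be sketched in a few lines of Python. This is a simplified illustration and not mdvs's actual implementation: the input shape (a mapping from file path to the set of field names in that file) and the function name infer_paths are hypothetical, each directory is treated on its own, and the upward propagation and glob generation steps are left out.

```python
from collections import defaultdict
from pathlib import PurePosixPath

def infer_paths(files):
    """files: {relative_path: set_of_field_names} (hypothetical input shape)."""
    by_dir = defaultdict(list)
    for path, fields in files.items():
        by_dir[str(PurePosixPath(path).parent)].append(fields)

    allowed = defaultdict(set)   # field -> directories where at least one file has it
    required = defaultdict(set)  # field -> directories where every file has it
    for directory, field_sets in by_dir.items():
        for field in set().union(*field_sets):
            allowed[field].add(directory)
            if all(field in fs for fs in field_sets):
                required[field].add(directory)
    return dict(allowed), dict(required)

# Made-up files, not taken from example_kb.
files = {
    "blog/a.md": {"title", "draft"},
    "blog/b.md": {"title"},
    "people/x.md": {"title"},
}
allowed, required = infer_paths(files)
```

In this made-up input, draft is allowed in blog because one file there has it, but not required, because blog/b.md lacks it. title is required in both directories.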
Two layers
mdvs has two distinct capabilities that work independently:
Validation — Scan your files, infer what frontmatter fields exist, which directories they appear in, and what types they have. Write the result to mdvs.toml. Then validate files against that schema. No model, no index, nothing to download.
Search — Chunk your markdown, embed it with a lightweight local model, store the chunks and vectors in a Lance dataset under .mdvs/, and query with natural language. Choose semantic (vector), full-text (BM25), or hybrid (both, reranked) — and filter results on any frontmatter field using standard SQL.
You need validation without search? Run mdvs init, customize the fields in mdvs.toml, and run mdvs check.
You want search without validation? Just run mdvs init and mdvs search. The inferred schema is used to extract metadata for search results, but you don’t have to worry about it if you don’t want to.
Use them together for the best experience, or separately if that’s what you need.
Using a nested directory of markdown files as a database
You can think of mdvs as a layer on top of your markdown files that gives you database-like capabilities. Here’s a rough mapping of concepts and commands:
| Concept | Database | mdvs |
|---|---|---|
| Define structure | CREATE TABLE | mdvs init |
| Per-table columns | Different columns per table | Per-directory fields via allowed/required globs |
| Enforce constraints | Constraint validation | mdvs check |
| Evolve structure | ALTER TABLE | mdvs update |
| Create an index | CREATE INDEX | mdvs build |
| Query | SELECT ... WHERE ... ORDER BY | mdvs search --where |
Two artifacts: mdvs.toml (your schema, to be committed) and .mdvs/ (the search index, can be ignored by version control).
What this book covers
This book uses a fictional research lab knowledge base (example_kb) as a running example. Every command, every output, every query is real and reproducible.
- Getting Started — Install mdvs and run it on the example vault
- Concepts — How schema inference, types, and validation work
- Commands — Full reference for all 8 commands
- Configuration — The mdvs.toml file explained
- Search Guide — SQL filtering, array queries, and ranking
- Recipes — Obsidian setup, CI integration
Getting Started
Install mdvs, run it on a real directory, and run your first search — all in under five minutes.
Install
cargo install mdvs
You need a working Rust toolchain. Prebuilt binaries will be available once the crate is published.
Get the example files
This book uses a fixture called example_kb — a fictional research lab’s knowledge base with ~46 markdown files, varied frontmatter, and a few deliberate inconsistencies. Clone the repo to follow along:
git clone https://github.com/edochi/mdvs.git
cd mdvs
Initialize
Run mdvs init on the example directory:
mdvs init example_kb
mdvs scans every markdown file, extracts frontmatter, and infers a typed schema. Each discovered field is shown as its own key-value table:
Initialized 43 files — 37 field(s)
┌ draft ───────────────────┬───────────────────────────────────────────────────┐
│ type │ Boolean │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ files │ 8 out of 43 │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ nullable │ false │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ required │ blog/** │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ allowed │ blog/** │
└──────────────────────────┴───────────────────────────────────────────────────┘
...
┌ sensor_type ─────────────┬───────────────────────────────────────────────────┐
│ type │ String │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ files │ 3 out of 43 │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ nullable │ false │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ required │ projects/alpha/notes/** │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ allowed │ projects/alpha/notes/** │
└──────────────────────────┴───────────────────────────────────────────────────┘
...
┌ title ───────────────────┬───────────────────────────────────────────────────┐
│ type │ String │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ files │ 37 out of 43 │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ nullable │ false │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ required │ blog/** │
│ │ meetings/** │
│ │ people/** │
│ │ projects/** │
│ │ reference/protocols/** │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ allowed │ blog/** │
│ │ meetings/** │
│ │ people/** │
│ │ projects/** │
│ │ reference/protocols/** │
└──────────────────────────┴───────────────────────────────────────────────────┘
Initialized mdvs in 'example_kb'
That command did three things:
- Scanned 43 markdown files and extracted their YAML frontmatter
- Inferred 37 typed fields — strings, integers, floats, booleans, arrays, even a nested object (calibration)
- Wrote mdvs.toml with the inferred schema
Notice the files row: draft appears in 8 out of 43 files — all in blog/. sensor_type in 3 out of 43 — all in projects/alpha/notes/. mdvs captured not just the types, but where each field belongs, via the required and allowed glob patterns.
Here’s what a field definition looks like in mdvs.toml:
[[fields.field]]
name = "sensor_type"
type = "String"
allowed = ["projects/alpha/notes/**"]
required = ["projects/alpha/notes/**"]
nullable = false
This means sensor_type is allowed only in experiment notes, and required there. If it appears in a blog post, check will flag it. If it’s missing from an experiment note, check will flag that too.
One artifact is created by init: mdvs.toml — the schema file. Commit this to version control. The .mdvs/ directory (search index) is created later on first build or search.
Validate
Check that every file conforms to the schema:
mdvs check example_kb
Checked 43 files — no violations
Since mdvs init just inferred the schema from these same files, everything passes. The power of check comes after you tighten the schema — or when files drift from it. Try adding sensor_type: SPR-A1 to a blog post — mdvs will flag it as Disallowed because that field doesn’t belong there.
What violations look like
Open mdvs.toml and make a few changes to tighten the constraints:
- Require observation_notes in all experiment files (currently optional)
- Change convergence_ms type from Integer to Boolean (simulating a type mismatch)
- Set drift_rate to non-nullable (one file has drift_rate: null)
- Restrict firmware_version to only appear in people/interns/** (it currently appears in people/*)
Run check again:
mdvs check example_kb
Checked 43 files — 4 violation(s)
Violations (4):
┌ convergence_ms ──────────┬───────────────────────────────────────────────────┐
│ kind │ Wrong type │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ rule │ type Boolean │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ files │ projects/beta/notes/initial-findings.md (got Inte │
│ │ ger) │
└──────────────────────────┴───────────────────────────────────────────────────┘
┌ drift_rate ──────────────┬───────────────────────────────────────────────────┐
│ kind │ Null value not allowed │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ rule │ not nullable │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ files │ projects/alpha/notes/experiment-2.md │
└──────────────────────────┴───────────────────────────────────────────────────┘
┌ firmware_version ────────┬───────────────────────────────────────────────────┐
│ kind │ Not allowed │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ rule │ allowed in ["people/interns/**"] │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ files │ people/remo.md │
└──────────────────────────┴───────────────────────────────────────────────────┘
┌ observation_notes ───────┬───────────────────────────────────────────────────┐
│ kind │ Missing required │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ rule │ required in ["projects/alpha/notes/**"] │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ files │ projects/alpha/notes/experiment-1.md │
│ │ projects/alpha/notes/experiment-2.md │
└──────────────────────────┴───────────────────────────────────────────────────┘
Four violation types, each catching a different kind of problem:
| Violation | Meaning |
|---|---|
| Missing required | A file in a required path is missing the field |
| Wrong type | The value doesn’t match the declared type |
| Null value not allowed | The field is present but null, and nullable is false |
| Not allowed | The field appears in a file outside its allowed paths |
Each violation table shows the field name, the kind of violation, the violated rule, and the affected files. See check for the full reference.
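To make the four kinds concrete, here is a minimal Python sketch of how a single field in a single file could be classified. It is an illustration under stated assumptions, not mdvs's code: the rule dictionary, its py_type key, the MISSING sentinel, and the order of the checks are all hypothetical, and fnmatch only approximates real glob semantics.

```python
from fnmatch import fnmatchcase

MISSING = object()  # sentinel: the field is absent from the frontmatter

def matches(path, globs):
    # fnmatch's "*" also crosses "/", so this only approximates glob rules.
    return any(fnmatchcase(path, g) for g in globs)

def check_field(path, value, rule):
    """Return the violation kind for one field in one file, or None."""
    if value is MISSING:
        return "Missing required" if matches(path, rule["required"]) else None
    if not matches(path, rule["allowed"]):
        return "Not allowed"
    if value is None:
        # Null values skip the type check entirely.
        return None if rule["nullable"] else "Null value not allowed"
    if not isinstance(value, rule["py_type"]):
        return "Wrong type"
    return None

# Mirrors the sensor_type definition shown earlier; py_type is an assumption.
sensor_type = {
    "allowed": ["projects/alpha/notes/**"],
    "required": ["projects/alpha/notes/**"],
    "nullable": False,
    "py_type": str,
}
```

With this rule, a blog post that sets sensor_type is classified as Not allowed, and an experiment note without it as Missing required.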
Revert your changes to mdvs.toml before continuing (or re-run mdvs init example_kb --force to regenerate it).
Search
Query the index with natural language. On first run, search auto-builds the index:
Note: The first search or build downloads the embedding model from HuggingFace (~30 MB for the default model). This is a one-time download — subsequent runs use the cached model and start instantly.
mdvs search "calibration" example_kb
Searched "calibration" — 10 hits
┌──────────────────────────┬───────────────────────────────────────────────────┐
│ query │ calibration │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ model │ minishlab/potion-base-8M │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ limit │ 10 │
└──────────────────────────┴───────────────────────────────────────────────────┘
┌ #1 ──────────────────────┬───────────────────────────────────────────────────┐
│ file │ projects/alpha/meetings/2031-06-15.md │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ score │ 0.585 │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ lines │ 14-22 │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ text │ # Alpha Kickoff — Calibration Campaign ... │
└──────────────────────────┴───────────────────────────────────────────────────┘
┌ #2 ──────────────────────┬───────────────────────────────────────────────────┐
│ file │ projects/alpha/meetings/2031-10-10.md │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ score │ 0.501 │
...
...
By default mdvs search runs in hybrid mode — it combines a semantic (vector) match with a full-text (BM25) match and reranks the results, so a typo-friendly natural-language query and an exact-keyword query both work. The score is a relevance score from the reranker (higher is better). Pass --mode semantic or --mode fulltext to use one signal alone. The text row shows the best-matching chunk from each file.
Filtering with --where
Add a SQL filter on any frontmatter field:
mdvs search "quantum" example_kb --where "status = 'active'"
Searched "quantum" — 3 hits
┌──────────────────────────┬───────────────────────────────────────────────────┐
│ query │ quantum │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ model │ minishlab/potion-base-8M │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ limit │ 10 │
└──────────────────────────┴───────────────────────────────────────────────────┘
┌ #1 ──────────────────────┬───────────────────────────────────────────────────┐
│ file │ projects/beta/overview.md │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ score │ 0.123 │
...
...
Only files with status: active in their frontmatter are included. The --where clause supports any SQL expression — boolean logic, comparisons, array functions, and more. See the Search Guide for the full syntax.
What’s next
- Concepts — How schema inference, types, and validation work under the hood
- Commands — Full reference for every command and flag
- Configuration — Customize mdvs.toml to tighten your schema
- Search Guide — Complex queries: arrays, nested objects, combined filters
Concepts
mdvs has two layers — validation and search — each with its own set of concepts. These pages explain how things work under the hood.
- Types & Widening — The type system, how types are inferred from values, and what happens when files disagree
- Schema Inference — How mdvs scans your directory and computes field paths, requirements, and constraints
- Validation — What check verifies, the five violation types, and how to read the output
- Constraints — Categorical constraints, auto-inference heuristics, and manual overrides
- Search & Indexing — Chunking, embeddings, incremental builds, and how results are ranked
Types & Widening
mdvs infers a type for every frontmatter field it encounters. When the same field appears with different types across files, mdvs resolves the conflict automatically through type widening.
The supported types
| Type | YAML example | example_kb field |
|---|---|---|
| Boolean | draft: false | draft in blog posts |
| Integer | sample_count: 24 | sample_count in experiments |
| Float | drift_rate: 0.023 | drift_rate in experiments |
| String | author: Giulia Ferretti | author across many files |
| Date | joined: 2023-02-01 | joined, date, commission_date, last_reviewed |
| DateTime | synced_at: "2024-04-02T16:14:30+02:00" | synced_at in experiments |
| Array(Scalar) | tags: [calibration, SPR-A1] | tags in projects and blog |
The on-disk type grammar is tight:
Type := Scalar | Array(Scalar)
Scalar := String | Integer | Float | Boolean | Date | DateTime
Array(Array(...)) and Array(Object{...}) are not representable on disk — see Arrays of structured items below for the workaround.
Date and DateTime are described in detail in Date and DateTime below.
Nested Objects in YAML are expressed as dotted-name leaf fields in mdvs.toml. A frontmatter shape like:
calibration:
  baseline:
    wavelength: 632.8
    intensity: 0.95
  adjusted:
    wavelength: 633.1
    intensity: 0.97
infers as four separate leaf fields, one per nested path:
- calibration.baseline.wavelength → Float
- calibration.baseline.intensity → Float
- calibration.adjusted.wavelength → Float
- calibration.adjusted.intensity → Float
Each leaf gets its own nullability and allowed/required glob set. This avoids the readability and per-leaf-validation problems of monolithic Object types. Top-level Object types are not supported in mdvs.toml, and neither are Objects nested inside Array fields — see Arrays of structured items below.
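The flattening of nested objects into dotted-name leaves can be sketched as a small recursive walk. This is an illustrative Python sketch, not the mdvs implementation; the function name flatten is hypothetical, and the input is the parsed frontmatter as a plain dictionary.

```python
def flatten(mapping, prefix=""):
    """Yield (dotted_path, value) for every non-dict leaf."""
    for key, value in mapping.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            yield from flatten(value, path)
        else:
            yield path, value

frontmatter = {
    "calibration": {
        "baseline": {"wavelength": 632.8, "intensity": 0.95},
        "adjusted": {"wavelength": 633.1, "intensity": 0.97},
    }
}
leaves = dict(flatten(frontmatter))
```

Each key of leaves is one dotted path, so a type, nullability flag, and glob set can be tracked per leaf.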
Arrays of structured items
A YAML field like:
measurements:
  - timestamp: "14:02:11"
    value: 0.612
  - timestamp: "14:03:00"
    value: 0.598
has no first-class representation on disk in v0. Inference detects the Array(Object{...}) shape, skips the field, and emits a warning to stderr:
warning: skipped field 'measurements' — Array(Object{...}) isn't representable on disk.
Consider parallel scalar arrays (see TODO-0156). (first observed in projects/alpha/notes/experiment-2.md)
The recommended workaround is parallel scalar arrays — one field per element-leaf. Replace the YAML above with:
measurement_timestamps: ["14:02:11", "14:03:00"]
measurement_values: [0.612, 0.598]
and the corresponding mdvs.toml:
[[fields.field]]
name = "measurement_timestamps"
type = "Array(String)"
[[fields.field]]
name = "measurement_values"
type = "Array(Float)"
The downside is the loss of per-element grouping — there’s no schema-level guarantee that measurement_timestamps[3] and measurement_values[3] belong to the same record. A first-class Array-of-structured-item representation is tracked in TODO-0156.
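If you have many files to migrate, the conversion to parallel arrays is mechanical. The following Python sketch shows one way to do it, assuming the list of records has already been parsed from frontmatter; the helper name to_parallel_arrays and the naive "prefix_keys" naming scheme are assumptions made for this example, not something mdvs provides.

```python
def to_parallel_arrays(items, prefix):
    """Turn a list of dicts into {prefix_<key>s: [values...]}."""
    keys = list(items[0]) if items else []
    if any(list(item) != keys for item in items):
        raise ValueError("items must share the same keys in the same order")
    return {f"{prefix}_{key}s": [item[key] for item in items] for key in keys}

measurements = [
    {"timestamp": "14:02:11", "value": 0.612},
    {"timestamp": "14:03:00", "value": 0.598},
]
parallel = to_parallel_arrays(measurements, "measurement")
```

The key check at the top is a small guard for the grouping problem described above: it refuses input where records disagree on their fields, since the resulting arrays would no longer line up by index.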
Date and DateTime
Both types use RFC 3339 as the canonical wire format — a strict subset of ISO 8601 designed for machine interoperability.
Date — calendar date, no time
Date = YYYY-MM-DD
Rules:
- Exactly 4-digit year, 2-digit month, 2-digit day (no 2024-1-1 shorthand).
- Hyphen separators.
- Calendar-valid: 2024-13-01 (month 13) and 2024-02-30 (no Feb 30th) are rejected.
- No time component, no timezone.
Accepted:
2023-02-01
1990-05-12
2024-02-29 ← valid leap-year date
Rejected:
2024-1-1 ← single-digit components not allowed
2024-13-01 ← month must be 01-12
"see 2024-01-15" ← must be the whole string
2024/01/15 ← only hyphens
Stored as Arrow Date32 (days since 1970-01-01). Native date arithmetic works in --where queries — e.g. WHERE date > '2024-01-01', WHERE date_part('year', published) = 2024, WHERE date BETWEEN '2024-06-01' AND '2024-06-30'. See Date and DateTime in --where queries for worked examples including EXTRACT, INTERVAL, date subtraction, and compound filters.
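The Date rules above amount to a shape check followed by a calendar check. Here is a minimal Python sketch of that two-step test, plus the days-since-epoch conversion that a Date32 value represents. It is an approximation for illustration, not mdvs's parser; the function names are hypothetical.

```python
import re
from datetime import date

# Exactly YYYY-MM-DD with ASCII digits and hyphen separators.
DATE_SHAPE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")

def parse_date(text):
    """Return a date for a strict YYYY-MM-DD string, else None."""
    if not DATE_SHAPE.fullmatch(text):
        return None
    try:
        return date(int(text[0:4]), int(text[5:7]), int(text[8:10]))
    except ValueError:  # month 13, Feb 30th, and similar
        return None

def days_since_epoch(d):
    """Number of days between 1970-01-01 and d."""
    return (d - date(1970, 1, 1)).days
```

fullmatch enforces the "must be the whole string" rule, and the date constructor rejects values that have the right shape but are not real calendar dates.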
DateTime — date + time, mandatory timezone
DateTime = YYYY-MM-DDTHH:MM:SS[.frac]<tz>
<tz> = 'Z' ← UTC shorthand
| '+HH:MM' ← positive offset
| '-HH:MM' ← negative offset
Rules:
- Date part: same as Date above.
- T separator between date and time is mandatory — no space alternative.
- HH:MM:SS (24-hour, all two digits). Seconds are required.
- Fractional seconds optional, any number of digits.
- Timezone is mandatory — naive 2024-01-15T14:30:00 is rejected (not valid RFC 3339).
Accepted:
2024-01-15T14:30:00Z ← Zulu = UTC
2024-01-15T14:30:00+00:00 ← same moment, explicit offset
2024-04-02T16:14:30+02:00 ← positive offset
2024-01-15T14:30:00-08:00 ← negative offset
2024-01-15T14:30:00.123Z ← fractional seconds
2024-01-15T14:30:00.123456789Z ← nanosecond precision
Rejected:
2024-01-15T14:30:00 ← no timezone
2024-01-15 14:30:00Z ← space instead of T
2024-01-15T14:30 ← seconds required
2024-13-01T14:30:00Z ← invalid month
2024-01-15T25:30:00Z ← invalid hour
Stored as Arrow Timestamp(Millisecond, "UTC"). Offsets are normalized to UTC at storage time — 2024-04-02T16:14:30+02:00 and 2024-04-02T14:14:30Z are the same absolute moment and store identically. The original offset is intentionally not preserved.
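The normalization to UTC can be illustrated with a short Python sketch that parses the shape described above and returns milliseconds since the Unix epoch. This is a hedged approximation of the behavior, not mdvs's code: the function name is hypothetical, leap seconds are rejected, and truncating (not rounding) extra fractional digits is an assumption.

```python
import re
from datetime import datetime, timedelta, timezone

DATETIME_SHAPE = re.compile(
    r"([0-9]{4})-([0-9]{2})-([0-9]{2})T([0-9]{2}):([0-9]{2}):([0-9]{2})"
    r"(?:\.([0-9]+))?(Z|[+-][0-9]{2}:[0-9]{2})"
)
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def to_utc_millis(text):
    """Return milliseconds since the Unix epoch (UTC), or None if rejected."""
    m = DATETIME_SHAPE.fullmatch(text)
    if not m:
        return None
    year, month, day, hour, minute, second = (int(g) for g in m.groups()[:6])
    frac, tz = m.group(7), m.group(8)
    if tz == "Z":
        offset = timedelta(0)
    else:
        sign = 1 if tz[0] == "+" else -1
        offset = sign * timedelta(hours=int(tz[1:3]), minutes=int(tz[4:6]))
    try:
        local = datetime(year, month, day, hour, minute, second,
                         tzinfo=timezone(offset))
    except ValueError:  # invalid month, day, hour, offset, ...
        return None
    # Keep millisecond precision only; extra digits are truncated (assumption).
    millis = int((frac or "0").ljust(3, "0")[:3])
    whole_seconds = (local - EPOCH) // timedelta(seconds=1)
    return whole_seconds * 1000 + millis
```

Two strings that name the same moment with different offsets produce the same integer, which is why the original offset cannot be recovered after storage.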
example_kb demonstration
Both types are auto-inferred in the example vault:
| Field | Type | Files |
|---|---|---|
| joined | Date | people/** |
| date | Date | meetings + blog/published |
| commission_date | Date | people/* |
| last_reviewed | Date | reference/protocols/** |
| synced_at | DateTime | experiment-1.md uses Z, experiment-2.md uses +02:00 |
No manual configuration was needed for any of these — inference detects the RFC 3339 shape and assigns the appropriate type. See Type widening in practice below for the inference rule.
Validation
JSON Schema’s format: date and format: date-time keywords validate values at check time. Bad shapes (invalid calendar dates, missing timezones, wrong separators) produce WrongType violations with a rule like format date or format date-time.
Constraints
categories applies (e.g. categories = ["2024-01-01", "2024-12-31"] on a Date field; values are strings, and the runtime format validator catches malformed entries). pattern, min, max, min_length, and max_length do not apply, because the type’s format is itself the pattern. Bounded date ranges (e.g. “published in 2024”) are tracked as a future feature.
Preprocessors
No preprocessor applies to Date or DateTime in v1. Unlike String (which can opt in to coerce-to-string) or Float (which can opt in to widen-int-to-float), date types are strict — either the string parses as RFC 3339 or it doesn’t.
Type hierarchy
When two values have different types, mdvs widens to a common type. The hierarchy looks like this:
graph BT
Integer --> Float
Float --> String
Boolean --> String
Date --> String
DateTime --> String
Array["Array(T)"] --> String
Each arrow means “widens to.” String is the top type — every type eventually reaches it.
The one special case is Integer → Float: integers widen to floats (not directly to String) because the conversion is lossless. Date and DateTime have no internal cross-promotion — mixed Date + DateTime observations widen to String (the two shapes are disjoint).
Two same-category combinations widen internally instead of jumping to String:
- Array + Array — element types are widened recursively (e.g., Array(Integer) + Array(String) → Array(String))
- Object + Object — at the leaf level: each dotted path’s type is widened independently across files. A file with cal.wave = 850 (Integer) and another with cal.wave = 632.8 (Float) yields cal.wave: Float. New leaf paths in some files are added to the schema; leaves absent from some files affect nullability/required-globs naturally.
Everything else (Boolean + any other type, Array + scalar, Object + scalar) widens to String. The one exception is Array containing Object — Array(Object{...}) isn’t representable on disk, so inference drops the field with a warning instead of widening to String (see Arrays of structured items).
Type widening in practice
When mdvs scans your files and the same field has different types, it picks the least upper bound — the most specific type that covers all observed values.
Integer + Float → Float
In example_kb, the wavelength_nm field appears in three experiment notes:
# experiment-1.md
wavelength_nm: 850 # Integer
# experiment-2.md
wavelength_nm: 632.8 # Float
# experiment-3.md
wavelength_nm: 780.0 # Float
Result: wavelength_nm is inferred as Float. The integer 850 is safely represented as a float.
Integer + String → String
The priority field uses numbers in one project and text in another:
# projects/alpha/overview.md
priority: 1 # Integer
# projects/beta/overview.md
priority: high # String
Result: priority is inferred as String. There’s no numeric type that can hold "high", so mdvs widens to String.
Boolean + any non-Boolean → String
If the same field is true in one file and 3 in another, there’s no numeric or boolean type that can hold both. The result is String.
This doesn’t happen in example_kb because booleans (draft) are used consistently — but it’s a common mistake in organically grown vaults where someone writes draft: yes (String) instead of draft: true (Boolean).
Date and DateTime inference
A string is inferred as Date or DateTime when every observation across all files matches the RFC 3339 shape AND parses as a real value. A single non-matching value downgrades the whole field to String.
Pure-date observations across files:
# people/alice.md
joined: 2023-02-01
# people/bob.md
joined: 2024-09-15
Result: joined is inferred as Date.
One non-date value forces String:
# people/alice.md
joined: 2023-02-01
# people/carol.md
joined: "see HR records" # not a date
Result: joined widens to String — the second observation can’t be typed as Date, and Date + String → String is the widening rule.
Same logic for invalid calendar dates:
# fileA.md
published: 2024-06-01
# fileB.md
published: 2024-13-01 # invalid month — typed String per-value
Result: published widens to String. The typo gets silently absorbed into String typing; the user only catches it via a WrongType violation if they manually set type = "Date" in mdvs.toml.
Date + DateTime are cross-shape — never auto-promote:
# meeting/a.md
when: 2024-01-15 # Date
# meeting/b.md
when: 2024-01-15T14:30:00Z # DateTime
Result: when widens to String. Pick one shape consistently to get a typed field.
Array element widening
The tags field is a string array in most files, but one file accidentally used integers:
# projects/alpha/overview.md
tags:
- biosensor
- metamaterial # Array(String)
# projects/beta/notes/replication.md
tags:
- 1
- 2
- 3 # Array(Integer)
Result: tags is inferred as Array(String). The array element types (String vs Integer) are widened to String, giving Array(String).
Object leaf merging (dotted-name flattening)
When two files have nested keys at the same paths, each leaf is inferred independently. New leaves seen in one file but not another are added to the schema; their required glob naturally narrows to just the files that contain them.
In example_kb, the calibration object appears in two experiment files with different structures:
# experiment-1.md (simpler calibration, integer values)
calibration:
  baseline:
    wavelength: 850 # Integer
    intensity: 1 # Integer
    notes: "initial reference" # only in this file
# experiment-2.md (full calibration, float values)
calibration:
  baseline:
    wavelength: 632.8 # Float
    intensity: 0.95 # Float
  adjusted: # only in this file
    wavelength: 633.1
    intensity: 0.97
Result: five dotted-name leaf fields are inferred in mdvs.toml:
[[fields.field]]
name = "calibration.adjusted.intensity"
type = "Float"
[[fields.field]]
name = "calibration.adjusted.wavelength"
type = "Float"
[[fields.field]]
name = "calibration.baseline.intensity"
type = "Float"
preprocess = ["widen-int-to-float"] # Integer + Float mix → opted in
[[fields.field]]
name = "calibration.baseline.notes"
type = "String"
[[fields.field]]
name = "calibration.baseline.wavelength"
type = "Float"
preprocess = ["widen-int-to-float"]
What happened:
- calibration.baseline.wavelength seen as both Integer (850) and Float (632.8) → widened to Float with widen-int-to-float preprocessor recording the mix
- calibration.baseline.intensity similar: Integer (1) + Float (0.95) → Float with the preprocessor
- calibration.baseline.notes only in experiment-1 → still inferred as String (with a required glob narrowed to just the files that have it)
- calibration.adjusted.* only in experiment-2 → inferred from that file alone
The user-facing schema is flat, but its semantics still match the YAML’s nested shape. Validation, storage, and --where queries all operate on the natural nested structure — the dotted-name form is purely a mdvs.toml UX choice.
The full widening matrix
Every possible combination of types and its result:
| | Boolean | Integer | Float | String | Date | DateTime | Array | Object |
|---|---|---|---|---|---|---|---|---|
| Boolean | Boolean | String | String | String | String | String | String | String |
| Integer | String | Integer | Float | String | String | String | String | String |
| Float | String | Float | Float | String | String | String | String | String |
| String | String | String | String | String | String | String | String | String |
| Date | String | String | String | String | Date | String | String | String |
| DateTime | String | String | String | String | String | DateTime | String | String |
| Array | String | String | String | String | String | String | Array* | dropped** |
| Object | String | String | String | String | String | String | dropped** | Object* |
* Array + Array: element types are widened recursively.
* Object + Object: not a top-level on-disk type. Nested Objects in YAML flatten to dotted-name leaves before widening; each leaf path is widened independently.
** Inference observed Array(Object{…}) — not representable on disk in v0. The field is dropped from the schema and a warning is emitted (see Arrays of structured items).
Date and DateTime are cross-shape — they never auto-promote into each other. The single non-trivial pair is Date + DateTime → String.
The matrix is symmetric — widen(A, B) always equals widen(B, A).
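The scalar and array parts of the matrix collapse into a very small function. The Python sketch below is one way to express it, under some assumptions: types are modelled as plain strings, arrays as ("Array", element_type) tuples, and Objects are left out because they are flattened to dotted leaves before widening. The name widen and this representation are illustrative, not mdvs's internals.

```python
def widen(a, b):
    """Least upper bound of two inferred types.

    Scalars are strings such as "Integer"; arrays are ("Array", element_type).
    """
    if a == b:
        return a
    if isinstance(a, tuple) and isinstance(b, tuple):
        # Array + Array: widen the element types recursively.
        return ("Array", widen(a[1], b[1]))
    if {a, b} == {"Integer", "Float"}:
        return "Float"
    # Every other mixed pair, including Date + DateTime, lands on String.
    return "String"
```

Because the only non-String mixed results come from an unordered set comparison and a recursive call on both arguments, swapping the arguments never changes the result, which matches the symmetry of the matrix.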
Nullable
Separately from the type, mdvs tracks whether null was observed for a field. This is shown as a ? suffix in output — e.g., Float? means “Float, but sometimes null.”
How it works
In example_kb, the drift_rate field is Float in two experiment files but null in a third:
# experiment-1.md
drift_rate: 0.023 # Float
# experiment-2.md
drift_rate: null # sensor malfunction — Giulia discarded the data
# experiment-3.md
drift_rate: 0.012 # Float
Result: drift_rate is inferred as Float? — the type is Float (null doesn’t affect the type), and nullable is set to true.
Null-only fields
If the only value ever observed is null, the type defaults to String:
# blog/drafts/grant-ideas.md
review_score: null # no real values seen
Result: review_score is inferred as String?.
Key rules
- Null is transparent in widening — it doesn’t affect the inferred type
- Null-only fields default to String (the safest fallback)
- nullable is a separate boolean, not part of the type itself
- In validation: null values skip type checks, but a non-nullable required field with a null value triggers a NullNotAllowed violation (see Validation)
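The first three rules can be captured in a short Python sketch that infers a scalar field's type and nullability from a list of per-file observations. It is a simplified illustration that handles scalar types only; the function name and the use of None to stand for an observed null are assumptions for this example.

```python
def infer_field(observed_types):
    """observed_types: per-file inferred types, with None standing for null."""
    nullable = any(t is None for t in observed_types)
    concrete = [t for t in observed_types if t is not None]
    if not concrete:
        return "String", nullable  # null-only fields fall back to String
    result = concrete[0]
    for t in concrete[1:]:
        if t != result:
            # Scalar widening only: Integer + Float -> Float, else String.
            result = "Float" if {t, result} == {"Integer", "Float"} else "String"
    return result, nullable
```

For the drift_rate observations above (Float, null, Float) this returns Float with nullable set, which corresponds to Float? in the output notation.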
Widening and preprocessors
Widening picks the type. Preprocessors are how the schema declares what coercions were needed to get there. Inference auto-populates them — you rarely write them by hand.
When inference observes a field as a mix of types (some files have priority: 1, others priority: high), it widens to String and writes:
[[fields.field]]
name = "priority"
type = "String"
preprocess = ["coerce-to-string"]
The coerce-to-string entry tells validation: “before checking this value is a string, serialize whatever you find to its JSON representation.” Without it, the field is strict — integers and booleans fail validation.
Same for Float: a mix of 5 and 5.0 widens to Float with preprocess = ["widen-int-to-float"]. Without it, integers fail the float check.
The two built-in Stage 2 preprocessors:
| Preprocessor | Applies to | Effect |
|---|---|---|
| coerce-to-string | String, Array(String) | Serialize non-strings to their JSON string representation before validation |
| widen-int-to-float | Float, Array(Float) | Treat integer values as their float equivalent |
preprocess = [] means strict. If you delete a preprocessor from mdvs.toml, the field rejects values that would have been coerced. Conversely, you can hand-add a preprocessor to a strict-inferred field if you want to accept type variation.
No preprocessor applies to Date or DateTime. Those types are strict by design — values either parse as RFC 3339 or they don’t. There is no parse-loose-date opt-in; non-ISO formats fall back to String (and the user can add a pattern constraint if they want a custom shape).
In storage — when validation accepts a coerced value, the coerced form is what gets stored. A priority: 1 value with coerce-to-string becomes "1" in the search index. No data is silently dropped.
Re-run mdvs update reinfer <field> to refresh both the inferred type and the inferred preprocessors after editing source files.
Edge cases
- Empty arrays [] default to Array(String). If real values are added later, the field must be re-inferred with mdvs update reinfer <field> to pick up the new element type.
- Empty frontmatter (--- followed immediately by ---) is a file with zero fields, not a bare file. It still counts as “having frontmatter” for inference purposes.
- Bare files (no --- fences at all) are handled differently; see Schema Inference.
Schema Inference
mdvs infers a typed schema from your files automatically — no manual schema definition needed. Run mdvs init, and it scans every markdown file, extracts frontmatter, infers types, and computes path patterns that describe where each field appears. The result is mdvs.toml, which you can then tighten by hand.
What gets scanned
mdvs walks your directory and includes every .md and .markdown file that matches the glob pattern in [scan]:
[scan]
glob = "**"
include_bare_files = true
skip_gitignore = false
Three settings control what’s included:
| Setting | Default | Effect |
|---|---|---|
| glob | "**" | Which files to scan. Use narrower globs to exclude subtrees. |
| include_bare_files | true | Whether to include files without any YAML frontmatter |
| skip_gitignore | false | Whether to ignore .gitignore patterns during scan |
mdvs also respects .mdvsignore files (same syntax as .gitignore) for excluding paths from scanning without touching your .gitignore.
Bare files vs empty frontmatter
These look similar but are different:
Bare file — no frontmatter fences at all:
This file has no frontmatter. Just content.
Empty frontmatter — fences with nothing between them:
---
---
This file has frontmatter, but zero fields.
In example_kb, four files are bare (scratch.md, lab-values.md, reference/tools.md, reference/glossary.md) and one has empty frontmatter (reference/quick-start.md).
Both types contribute zero fields to inference. The difference matters for validation: a bare file is excluded entirely when include_bare_files = false, while an empty-frontmatter file is always included (it has frontmatter — just none with fields).
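The distinction can be expressed as a small classifier. This Python sketch is illustrative: `classify` and `is_scanned` are hypothetical names, only YAML --- fences are handled, and an unclosed opening fence is treated as bare purely to keep the example short.

```python
def classify(text):
    """Return 'bare', 'empty', or 'fields' for a markdown file's text.

    Sketch only: handles YAML '---' fences, not TOML or JSON frontmatter.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return "bare"
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            body = [l for l in lines[1:i] if l.strip()]
            return "fields" if body else "empty"
    return "bare"  # assumption: an unclosed fence is treated as no frontmatter


def is_scanned(text, include_bare_files):
    """Bare files drop out when include_bare_files is false; others stay."""
    return include_bare_files or classify(text) != "bare"


print(classify("This file has no frontmatter. Just content.\n"))
print(classify("---\n---\nThis file has frontmatter, but zero fields.\n"))
```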
From files to fields
For each scanned file, mdvs extracts the YAML frontmatter and infers a type for every key. When the same field appears across multiple files, its type is widened to a common type (see Types & Widening for the full rules).
In example_kb, scanning 43 files produces 37 distinct field names. Some fields like title appear in 37 files. Others like unit_id appear in just one.
The output of this step is a list of fields, each with:
- A name
- A type (widened across all files where it appears)
- A nullable flag (true if any file had a null value)
- The set of files where it was found
Path patterns
The most interesting part of inference is how mdvs computes where each field belongs. It produces two sets of glob patterns per field:
- allowed: where the field may appear. Any file matching these patterns can have the field without triggering a violation.
- required: where the field must appear. Any file matching these patterns that’s missing the field triggers a MissingRequired violation.
How patterns are computed
mdvs builds a directory tree from the scanned files and works bottom-up:
- For each directory, it tracks which fields appear in all files (intersection) and which appear in any file (union)
- When a field appears in every file under a directory and its subdirectories, it collapses into a recursive glob (dir/**)
- When a field appears in some but not all files, only allowed gets the glob; required does not
The result is a minimal set of globs that describes the field’s distribution.
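As a rough illustration of the required side of this computation, the Python sketch below finds the topmost directories in which every file carries a field. It is a simplification under stated assumptions: `required_globs` is a hypothetical helper, the input is a plain mapping from file path to the set of field names in that file, and the allowed side is not modelled at all.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def required_globs(files, field):
    """Return dir/** globs for the topmost directories where every file has field.

    files: mapping of relative path -> set of field names found in that file.
    """
    # Collect, for each directory, every file beneath it (recursively).
    under = defaultdict(list)
    for path in files:
        for parent in PurePosixPath(path).parents:
            under[str(parent)].append(path)

    # A directory is "full" when all files below it have the field.
    full = {d for d, paths in under.items() if all(field in files[p] for p in paths)}

    # Keep only directories whose parent is not itself full (collapse upward).
    top = [d for d in full if d == "." or str(PurePosixPath(d).parent) not in full]
    return sorted("**" if d == "." else f"{d}/**" for d in top)


files = {
    "people/interns/a.md": {"email"},
    "people/interns/b.md": {"email"},
    "people/remo.md": {"email"},
    "people/other.md": set(),
    "scratch.md": set(),
}
print(required_globs(files, "email"))  # ['people/interns/**']
```

The file names in the usage example are invented; the point is that one file without the field is enough to stop a directory from collapsing into a required glob.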
Examples from example_kb
Narrow and consistent — sensor_type appears in all three experiment notes and nowhere else:
[[fields.field]]
name = "sensor_type"
type = "String"
allowed = ["projects/alpha/notes/**"]
required = ["projects/alpha/notes/**"]
allowed and required are the same — every file that has this field is in the same directory, and every file in that directory has it.
Broad and consistent — title appears in 37 of 43 files across many directories:
[[fields.field]]
name = "title"
type = "String"
allowed = ["blog/**", "meetings/**", "people/**", "projects/**", "reference/protocols/**"]
required = ["blog/**", "meetings/**", "people/**", "projects/**", "reference/protocols/**"]
Again, allowed equals required: every file in those directories has a title. The files without title sit at the root and directly under reference/, where the bare and empty-frontmatter files live.
Allowed broader than required — email exists in all people/ files except one:
[[fields.field]]
name = "email"
type = "String"
allowed = ["people/**"]
required = ["people/interns/**"]
allowed is people/** — the field may appear anywhere under people/. But required is only people/interns/** — the one subdirectory where every file happens to have it. In people/* (the non-intern profiles), some have email and some don’t, so it can’t be required there.
Present but never required — ambient_humidity appears in only one of three experiment notes:
[[fields.field]]
name = "ambient_humidity"
type = "Float"
allowed = ["projects/alpha/notes/**"]
required = []
required is empty — the field never appears in every file under any directory, so mdvs can’t require it anywhere.
The pattern
The general rule is required ⊆ allowed — you can’t require a field somewhere it’s not allowed. Within that:
- required = allowed when every file in a directory has the field
- required ⊂ allowed when the field is consistent in some directories but sporadic in others
- required = [] when the field is sporadic: present in some files but not consistently in any directory
The three field states
Every field in mdvs.toml is in one of three states:
Constrained
Listed under [[fields.field]]. Validation enforces type, allowed paths, required paths, and nullable. mdvs update preserves constrained fields unless you explicitly use update reinfer.
[[fields.field]]
name = "draft"
type = "Boolean"
allowed = ["blog/**"]
required = ["blog/**"]
nullable = false
Only name is required — properties you omit use permissive defaults:
| Property | Default | Meaning |
|---|---|---|
| type | String | Strict string check (add preprocess = ["coerce-to-string"] to accept any JSON value) |
| allowed | ["**"] | Allowed in every file |
| required | [] | Not required anywhere |
| nullable | true | Null values accepted |
| preprocess | [] | No value coercion before validation |
A [[fields.field]] with just a name is effectively unconstrained, but still known — useful when you want to acknowledge a field without committing to specific constraints yet.
Ignored
Listed in the ignore array. The field is known but not validated — no type checks, no path checks. mdvs update skips ignored fields entirely.
[fields]
ignore = ["internal_notes", "scratch_data"]
Use this for fields you don’t want to enforce — temporary fields, fields in flux, or fields you’ve decided aren’t worth constraining.
Unknown
Not mentioned in mdvs.toml at all. When mdvs update finds a field that isn’t constrained or ignored, it reports it as a new field and adds it to the schema.
A field can be in exactly one state. Moving a field from constrained to ignored means removing its [[fields.field]] entry and adding its name to ignore. Moving it back means the reverse.
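The three states can be read straight off a parsed config. In this Python sketch, `config` is a plain dict standing in for a parsed mdvs.toml, and `field_state` is a hypothetical helper rather than part of mdvs.

```python
def field_state(name, config):
    """Classify a field name as 'constrained', 'ignored', or 'unknown'."""
    fields = config.get("fields", {})
    if any(entry["name"] == name for entry in fields.get("field", [])):
        return "constrained"
    if name in fields.get("ignore", []):
        return "ignored"
    return "unknown"


# Stand-in for the parsed [fields] section of mdvs.toml.
config = {
    "fields": {
        "ignore": ["internal_notes", "scratch_data"],
        "field": [{"name": "draft", "type": "Boolean"}],
    }
}
for name in ("draft", "scratch_data", "algorithm"):
    print(name, "->", field_state(name, config))
```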
Keeping the schema current
After initial inference with mdvs init, the schema is a snapshot of your files at that moment. As files change — new fields appear, old ones shift — use mdvs update to bring the schema up to date.
Default mode
mdvs update example_kb
Only new fields are added. Existing fields are left untouched, even if their types or paths have changed. This is conservative by design — your manual edits to mdvs.toml are preserved.
Fields that disappear from all files still stay in the toml. This prevents accidental removal when files are temporarily missing.
Re-inferring specific fields
mdvs update example_kb reinfer tags
Treats tags as if it had never been seen — removes it from the schema, re-scans, and infers it fresh. Use this when you’ve fixed bad data (like a tags: [1, 2, 3] that should have been strings) and want the type or paths to update.
Re-inferring everything
mdvs update example_kb reinfer
When no fields are named, every field is reinferred. The entire [[fields.field]] section is rebuilt from scratch, but all other config ([scan], [embedding_model], etc.) is preserved.
This is different from mdvs init --force, which overwrites the entire mdvs.toml including non-field config.
Edge cases
- Fields in a single file: get a narrow allowed glob matching just that file’s directory. Example: unit_id only in people/remo.md → allowed = ["people/*"].
- Null-only fields: type defaults to String (see Types & Widening). Example: review_score is always null → String?.
- Special characters in field names: names with spaces (lab section), single quotes (author's_note), or double quotes (notes"v2") are preserved as-is. They need quoting in --where clauses (see Search Guide).
- Empty arrays []: element type defaults to String, giving Array(String). If real values appear later, use update reinfer to pick up the correct element type.
- Nested objects in frontmatter: flattened into dotted-name leaf fields. A YAML key like calibration: { baseline: { wavelength: 850.0 } } becomes a [[fields.field]] entry named calibration.baseline.wavelength with type Float. Each leaf gets its own nullability and allowed/required glob set. Top-level Object types are not supported in mdvs.toml; only nested Objects inside Array fields keep their inline shape (see Types & Widening).
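The flattening of nested objects into dotted leaf names can be sketched as a short recursive function. `flatten` is a hypothetical name, and the sketch assumes keys do not themselves contain dots and that arrays are kept whole.

```python
def flatten(frontmatter, prefix=""):
    """Flatten nested mappings into dotted-name leaves; arrays stay whole."""
    leaves = {}
    for key, value in frontmatter.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            leaves.update(flatten(value, name))  # descend into nested object
        else:
            leaves[name] = value
    return leaves


doc = {"title": "A-017", "calibration": {"baseline": {"wavelength": 850.0}}}
print(flatten(doc))
# {'title': 'A-017', 'calibration.baseline.wavelength': 850.0}
```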
Validation
mdvs check validates every file’s frontmatter against the schema in mdvs.toml. It’s read-only, deterministic, and produces no side effects — it just tells you what’s wrong.
The seven violations
| Violation | Meaning |
|---|---|
| WrongType | The value doesn’t match the declared type (or fails a pattern regex) |
| Disallowed | The field appears in a file outside its allowed paths |
| MissingRequired | A file matches a required glob but doesn’t have the field |
| NullNotAllowed | The field is present but null, and nullable is false |
| InvalidCategory | The value is not in the field’s declared categories |
| OutOfRange | A numeric value violates min/max, or a length violates min_length/max_length |
| FrontmatterUnrepresentable | The file’s frontmatter can’t be represented as JSON (NaN/inf, non-string keys, non-object top-level) |
WrongType
Fires when a value doesn’t match the declared type. If convergence_ms is declared as Boolean but a file has convergence_ms: 42, the integer value fails the boolean check.
The check is strict by default; two opt-in preprocessors can relax it. See Type checking rules below.
Disallowed
Fires when a field appears in a file whose path doesn’t match any of the field’s allowed globs. For example, if firmware_version has allowed = ["people/interns/**"] but appears in people/remo.md, that file is outside the allowed paths.
MissingRequired
Fires when a file’s path matches one of the field’s required globs, but the file doesn’t contain that field at all.
For example, if observation_notes has required = ["projects/alpha/notes/**"], then every file under projects/alpha/notes/ must have it. Files that don’t → MissingRequired.
NullNotAllowed
Fires when a field is present with an explicit null value, but nullable is false. For example, if drift_rate has nullable = false and a file has drift_rate: null.
This is distinct from a missing field — see Null vs absent below.
InvalidCategory
Fires when a field has a categories constraint and the value is not in the declared list. For example, if status has categories = ["draft", "published", "archived"] and a file has status: pending, the value "pending" is not in the list.
For array fields, each element is checked individually. The violation detail lists the specific offending elements.
This check only runs on non-null values that pass the type check. If the value has the wrong type, only WrongType fires — InvalidCategory is skipped. If the value is null and the field is nullable, the category check is skipped entirely.
See Constraints for how categories are configured and auto-inferred.
OutOfRange
Fires when a value violates a numeric or length bound:
- min/max on numeric fields. Example: rating: 7 with min = 1, max = 5 is above max.
- min_length/max_length on string fields. Example: slug: "a" with min_length = 3 is too short.
- min_items/max_items on array fields (when emitted by inference). Applies to the array’s length.
For array fields, numeric-element bounds are checked individually. The violation detail lists the specific offending elements or, for length checks, the actual length.
This check only runs on non-null values that pass the type check, same as InvalidCategory.
See Constraints for how bounds are configured.
FrontmatterUnrepresentable
Fires when a file’s YAML frontmatter parses successfully but can’t be represented as JSON. Causes include NaN / inf floats, non-string mapping keys, or a top-level value that isn’t a mapping. The violation is reported at the document level with the sentinel field name <frontmatter>.
Pre-Wave-B mdvs silently dropped these files; they’re now surfaced explicitly so the schema can’t lie about what’s actually in your vault.
Type checking rules
Type checking is strict — a String field rejects integers, a Boolean field rejects strings, and so on. Two opt-in adjustments cover the common YAML pain points:
Preprocessors normalize before validation. A field’s preprocess array runs before jsonschema sees the value. Two built-ins:
- coerce-to-string: non-string values (booleans, integers, arrays) are serialized to their JSON string representation, then validated as strings. Auto-inferred when the inferred type widened to String because of mixed-type observations.
- widen-int-to-float: integers are widened to equivalent floats. Auto-inferred when the inferred type widened to Float because some files used 5 and others 5.0. Without it, a Float field rejects integer values.
Fields with empty preprocess arrays are validated strictly — there are no implicit leniencies. See Types & Widening for how inference picks the preprocessors.
Recursion. Arrays check element types recursively — an Array(Integer) field rejects ["a", "b"] because the string elements fail the Integer check. Nested frontmatter structure is validated per leaf: a config entry named calibration.baseline.wavelength is checked against the value at the corresponding nested path in the YAML. Missing intermediate Objects mean the leaf is absent — handled by the MissingRequired check.
Pattern. A pattern constraint on a String field is enforced as a regex; pattern failures surface as WrongType (with detail naming the offending value).
Date and DateTime format validation. Date and DateTime fields use JSON Schema’s format: date / format: date-time keywords. Non-conforming values (invalid calendar dates, missing timezones, wrong separators) fire WrongType with a rule like format date or format date-time. See Date and DateTime for the exact accepted shapes.
Engine
Per-value validation runs through the jsonschema crate. mdvs translates mdvs.toml’s [fields] block into a JSON Schema 2020-12 document, compiles one validator per field, runs Stage 2 preprocessors, then validates each value. Errors from jsonschema are mapped exhaustively into the seven ViolationKinds above.
One subtype check runs in Rust ahead of jsonschema: a Float field without widen-int-to-float rejects integer-backed values (5 is rejected, 5.0 is accepted). JSON Schema’s "number" accepts both — but YAML and TOML preserve the int/float distinction at parse time, and so does mdvs.
Null handling
Null interacts with validation in specific ways:
The checks are independent. A null value is checked like any other value — each violation type is evaluated separately:
- WrongType: null is accepted by any type, so this never fires on null.
- Disallowed: the field is present (the key exists), so Disallowed fires if the path isn’t in allowed.
- MissingRequired: null counts as “present”, so this never fires on null.
- NullNotAllowed: fires when the value is null and nullable = false.
- InvalidCategory: null skips the category check (same as WrongType), so this never fires on null.
- OutOfRange: null skips the range check (same as InvalidCategory), so this never fires on null.
A single null field can trigger both Disallowed and NullNotAllowed at the same time.
Null vs absent. These are different situations with different outcomes:
| Situation | Example | Result |
|---|---|---|
| Field is absent | File has no drift_rate key at all | MissingRequired (if path matches required) |
| Field is null, nullable = true | drift_rate: null | Passes |
| Field is null, nullable = false | drift_rate: null | NullNotAllowed |
A null value counts as “present” — the field key exists in the frontmatter, it just has no value. So null never triggers MissingRequired. An absent field is genuinely missing — it can trigger MissingRequired but never NullNotAllowed.
Note: In YAML, unquoted null is a null value, not the string "null". To store the literal string, write drift_rate: "null" (with quotes).
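The null-versus-absent rules in this section can be sketched as follows. This is not the mdvs implementation: `presence_violations` is a hypothetical helper, a sentinel object marks an absent key, and Python's `fnmatchcase` is only a rough stand-in for real glob matching (its `*` also crosses `/`).

```python
from fnmatch import fnmatchcase

MISSING = object()  # sentinel: the key is not in the frontmatter at all


def presence_violations(path, frontmatter, field):
    """Return the presence-related violations for one field in one file."""

    def matches(globs):
        return any(fnmatchcase(path, g) for g in globs)

    value = frontmatter.get(field["name"], MISSING)
    found = []
    if value is MISSING:
        # Absent: only MissingRequired can fire.
        if matches(field.get("required", [])):
            found.append("MissingRequired")
        return found
    # Present (null counts as present): path and null checks are independent.
    if not matches(field.get("allowed", ["**"])):
        found.append("Disallowed")
    if value is None and not field.get("nullable", True):
        found.append("NullNotAllowed")
    return found


field = {
    "name": "drift_rate",
    "allowed": ["projects/**"],
    "required": ["projects/**"],
    "nullable": False,
}
print(presence_violations("projects/a.md", {}, field))
print(presence_violations("projects/a.md", {"drift_rate": None}, field))
print(presence_violations("blog/a.md", {"drift_rate": None}, field))
```

The last call shows a single null value producing both Disallowed and NullNotAllowed.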
New fields
When mdvs check encounters a frontmatter field that isn’t in mdvs.toml — neither constrained under [[fields.field]] nor listed in ignore — it reports it as a new field.
New fields are informational only. They don’t count as violations and don’t affect the exit code:
Checked 43 files — no violations, 1 new field(s)
╭──────────────────────────────┬─────────────────────┬─────────────────────────╮
│ "algorithm" │ new │ 2 files │
╰──────────────────────────────┴─────────────────────┴─────────────────────────╯
They’re shown in the output so you know to either run mdvs update to add them to the schema, or add them to the ignore list.
Bare files
When include_bare_files = true in [scan], bare files (no frontmatter at all) are included in validation. Since they have no fields, they trigger MissingRequired for any required glob matching their path.
For example, if title has required = ["**"] and scratch.md is a bare file, it triggers MissingRequired for title. This is often why the inferred schema uses narrower required globs — bare files at the root prevent required = ["**"] from being inferred for fields that don’t appear in them.
Check and build
mdvs build runs the same validation internally before embedding. If any violations are found, build aborts — no dirty data reaches the index. The violations are the same ones check would report.
This means you can use check as a dry run before building, but you don’t have to — build will catch the same problems.
Exit codes
| Exit code | Meaning |
|---|---|
| 0 | No violations (new fields don’t count) |
| 1 | One or more violations found |
| 2 | Scan or config error (couldn’t run validation) |
Constraints
Constraints are validation rules that go beyond type checking. While types ensure a value is a String or Integer, constraints refine what values are actually valid — for example, restricting a String field to a specific set of allowed values.
Constraints are not a new type. They’re an optional layer on top of the existing type system. A field without constraints is validated by type alone; a field with constraints gets an additional check.
Categories
The categories constraint restricts a field’s values to a declared set. It applies to:
- String — the value must be one of the listed strings
- Integer — the value must be one of the listed integers
- Date — each category is a string in RFC 3339 full-date shape
- DateTime — each category is a string in RFC 3339 datetime shape
- Array(String), Array(Integer), Array(Date), Array(DateTime) — each element must be one of the listed values
Boolean, Float, and Object fields don’t support categories — Boolean is already two-valued, Float is continuous, and Object is structural.
TOML representation
Categories live in a [fields.field.constraints] sub-table:
[[fields.field]]
name = "status"
type = "String"
allowed = ["**"]
required = ["blog/**"]
nullable = false
[fields.field.constraints]
categories = ["active", "archived", "completed", "draft", "published"]
Integer categories:
[[fields.field]]
name = "priority"
type = "Integer"
[fields.field.constraints]
categories = [1, 2, 3]
Array categories constrain each element:
[[fields.field]]
name = "tags"
type = "Array(String)"
[fields.field.constraints]
categories = ["go", "python", "rust"]
A field without a [fields.field.constraints] section (or without a categories key) is unconstrained.
Validation
When a value doesn’t match any of the declared categories, check reports an InvalidCategory violation. For arrays, the violation lists the specific offending elements. See Validation for details.
Null values on categorical fields follow the existing nullable logic — if nullable = true, null skips the category check. The category constraint only fires on non-null values that pass the type check.
Auto-inference
During init and update reinfer, mdvs automatically detects categorical fields using a heuristic with two conditions (both must hold):
- Max distinct values: the field has at most max_categories distinct values (default: 10)
- Minimum repetition: total occurrences / distinct values >= min_category_repetition (default: 3)
For array fields, distinct values and occurrences are counted at the element level.
Examples
- status with 3 distinct values across 30 files: distinct=3, repetition=10 → categorical
- title with 28 distinct values across 30 files: distinct=28 (exceeds cap) → not categorical
- author with 5 distinct values across 5 files: repetition=1 (below threshold) → not categorical
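The two-condition heuristic is small enough to write out. The Python sketch below uses the documented defaults. `is_categorical` is a hypothetical name, and the sample data is made up to match the three examples above.

```python
def is_categorical(values, max_categories=10, min_category_repetition=3):
    """Apply the two-condition heuristic to the non-null values of one field."""
    distinct = len(set(values))
    if distinct == 0 or distinct > max_categories:
        return False  # no data, or too many distinct values
    return len(values) / distinct >= min_category_repetition


status = ["draft"] * 10 + ["published"] * 10 + ["archived"] * 10
title = [f"title-{i}" for i in range(28)] + ["title-0", "title-1"]
author = ["a", "b", "c", "d", "e"]

print(is_categorical(status))  # True
print(is_categorical(title))   # False
print(is_categorical(author))  # False
```

For array fields the same function would be fed the flattened list of elements rather than one value per file.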
Configurable thresholds
The thresholds are configurable in [fields]:
[fields]
max_categories = 10
min_category_repetition = 3
These control automatic inference only. Manually written categories in the TOML are unaffected by thresholds.
CLI flags on update reinfer override the TOML values per-invocation:
mdvs update example_kb reinfer --max-categories 15 --min-repetition 3
Range
The range constraint restricts a numeric field’s value to an inclusive [min, max] interval. It applies to:
- Integer: value must satisfy min <= value <= max
- Float: same, with float comparison
- Array(Integer): each element must satisfy the range
- Array(Float): same, element-wise
Both min and max are optional — you can specify just one bound. Boolean, String, Date, DateTime, and Object fields don’t support range. Date / DateTime bounds (e.g. “published after 2024-01-01”) aren’t supported in v1 — they require JSON Schema’s formatMinimum/formatMaximum vocab and are tracked as a follow-up.
TOML representation
[[fields.field]]
name = "rating"
type = "Integer"
[fields.field.constraints]
min = 1
max = 5
Float bounds (with optional integer bound on a Float field — bounds widen to f64 for comparison):
[[fields.field]]
name = "score"
type = "Float"
[fields.field.constraints]
min = 0
max = 100
Array example — each element checked against the bounds:
[[fields.field]]
name = "ratings"
type = "Array(Integer)"
[fields.field.constraints]
min = 1
max = 10
Validation
When a value is out of bounds, check reports an OutOfRange violation with the rule (min = N, max = N) and the offending value. For arrays, the violation lists the specific elements that are out of range.
Null values follow the existing nullable logic — if nullable = true, null skips the range check.
Type rules
Bound types must match the field type:
- Integer fields require integer bounds. Float bounds (e.g., min = 0.5) are rejected at config load; they are likely a mistake, since an integer can never equal 0.5.
- Float fields accept both integer and float bounds (integer bounds widen to f64).
If both bounds are present, min must be <= max — otherwise rejected at config load.
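These config-load rules could be checked along the following lines. The Python sketch is illustrative: `check_range_bounds` is a hypothetical function, and the error messages are invented rather than taken from mdvs.

```python
def check_range_bounds(field_type, minimum=None, maximum=None):
    """Raise ValueError if min/max are invalid for the given field type."""

    def is_int(x):
        return isinstance(x, int) and not isinstance(x, bool)

    def is_num(x):
        return is_int(x) or isinstance(x, float)

    # Array(Integer) and Array(Float) follow the rules of their element type.
    element = field_type[6:-1] if field_type.startswith("Array(") else field_type
    if element not in ("Integer", "Float"):
        raise ValueError(f"{field_type} does not support min/max")

    for name, bound in (("min", minimum), ("max", maximum)):
        if bound is None:
            continue  # both bounds are optional
        ok = is_int(bound) if element == "Integer" else is_num(bound)
        if not ok:
            raise ValueError(f"{name} = {bound!r} is not valid for {field_type}")

    if minimum is not None and maximum is not None and minimum > maximum:
        raise ValueError("min must be <= max")


check_range_bounds("Integer", 1, 5)   # accepted
check_range_bounds("Float", 0, 100)   # integer bounds on a Float field are fine
try:
    check_range_bounds("Integer", minimum=0.5)
except ValueError as err:
    print("rejected:", err)
```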
Manual overrides
Use the --with flag on update reinfer to override the default heuristic for specific fields:
# Force categorical (skip heuristic threshold)
mdvs update example_kb reinfer title --with=categorical
# Infer min/max from observed numeric values
mdvs update example_kb reinfer sample_count --with=range
# Strip all constraints
mdvs update example_kb reinfer status --with=none
--with takes a comma-separated list of constraint kinds: categorical, range, or none. Incompatible kinds (e.g., range,categorical on the same field) are rejected at parse time. --with=none cannot be combined with other kinds. The flag requires named fields.
Manual TOML edit — you can also add or remove constraints by hand. Running update (without reinfer) preserves existing constraints as-is. Only update reinfer re-evaluates them.
Length
The length constraint bounds string length or array length. It applies to:
- String: min_length <= len(value) <= max_length, where length is the Unicode scalar count
- Array(T): min_length <= array length <= max_length
[[fields.field]]
name = "slug"
type = "String"
[fields.field.constraints]
min_length = 3
max_length = 64
Both bounds are optional. Integer fields, Float fields, and Boolean fields don’t support length. Length violations surface as OutOfRange. If both bounds are present, min_length <= max_length is enforced at config load.
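Counting Unicode scalars rather than bytes or rendered characters matters for accented text. In Python, `len()` on a well-formed `str` gives the same count, so a length check can be sketched like this (`length_ok` is a hypothetical helper):

```python
def length_ok(value, min_length=None, max_length=None):
    """Check a string (scalar count) or a list (item count) against the bounds."""
    n = len(value)
    if min_length is not None and n < min_length:
        return False
    if max_length is not None and n > max_length:
        return False
    return True


print(length_ok("a", min_length=3))                             # False: too short
print(length_ok("spr-a1-baseline", min_length=3, max_length=64))  # True

# The same visible word can have different scalar counts:
composed = "caf\u00e9"     # precomposed e-acute: 4 scalars
decomposed = "cafe\u0301"  # 'e' plus a combining accent: 5 scalars
print(len(composed), len(decomposed))
```

A slug that looks four characters long on screen can therefore count as five, depending on how it was typed or normalized.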
Pattern
The pattern constraint runs a regular expression against String values:
[[fields.field]]
name = "version"
type = "String"
[fields.field.constraints]
pattern = '^v\d+\.\d+\.\d+$'
The regex is compiled at config load time — invalid syntax fails fast. Pattern is currently String-only. Pattern violations surface as WrongType (with detail naming the offending value). Categorical fields can’t also have a pattern — categories already enumerate the legal forms. Date and DateTime fields don’t accept pattern either — the type’s format is itself the pattern (see Date and DateTime).
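To try a pattern before putting it in mdvs.toml, a quick check in Python is often enough. This sketch uses Python's `re` module as a stand-in for whatever regex engine mdvs uses, so edge cases (for example, how `$` treats a trailing newline) may differ. `matches_pattern` is a hypothetical helper.

```python
import re

# Same pattern as in the TOML example above; compiling up front fails fast
# on invalid syntax, much like the config-load check described here.
VERSION = re.compile(r"^v\d+\.\d+\.\d+$")


def matches_pattern(value):
    """Return True when the string satisfies the version pattern."""
    return VERSION.search(value) is not None


for candidate in ("v1.2.3", "1.2.3", "v1.2"):
    print(candidate, matches_pattern(candidate))
```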
Conflicts between constraint kinds
Some combinations are mutually exclusive on the same field:
- categories + anything else: categories enumerate the legal values; other constraints would be redundant or contradictory. Rejected at config load.
- range + length: range bounds numeric values; length bounds size. They apply to different field types (numeric vs. String/Array), so they should never collide in practice; the check is still enforced.
Compatible combinations: min/max together; min_length/max_length together; pattern with min_length/max_length.
Constraint kinds summary
| Constraint | Field types | Violation |
|---|---|---|
| categories | String, Integer, Date, DateTime, Array(String), Array(Integer), Array(Date), Array(DateTime) | InvalidCategory |
| min / max | Integer, Float, Array(Integer), Array(Float) | OutOfRange |
| min_length / max_length | String, Array(T) | OutOfRange |
| pattern | String | WrongType |
Each constraint kind is a key in the [fields.field.constraints] sub-table. Compatibility is checked at config load time.
Search & Indexing
mdvs builds a search index by chunking your markdown content, embedding it with a local model, and storing chunks + vectors + frontmatter in a single LanceDB dataset. Queries are served by LanceDB natively — semantic (vector), full-text (BM25), or hybrid (both, reranked) — with optional SQL filtering on frontmatter fields.
Building the index
mdvs build (or mdvs init with auto-build) creates the search index in three steps: chunk, embed, store.
Chunking
Each file’s markdown body is split into semantic chunks — respecting headings, paragraphs, and code blocks rather than cutting at arbitrary character boundaries. The maximum chunk size is configurable (default 1024 characters) via the [chunking] section in mdvs.toml:
[chunking]
max_chunk_size = 1024
Each chunk tracks its start and end line numbers in the original file, so search results can point to the exact location.
Embedding
Chunks are embedded into dense vectors using a local Model2Vec model by Minish — static embeddings that run on CPU with no external services or GPU required. The model is downloaded from HuggingFace to the local cache on first use.
[embedding_model]
provider = "model2vec"
name = "minishlab/potion-base-8M"
The default is potion-base-8M, a good balance of size and quality. The full POTION family:
| Model | Parameters | Notes |
|---|---|---|
| minishlab/potion-base-2M | 2M | Smallest, fastest |
| minishlab/potion-base-8M | 8M | Default, good balance |
| minishlab/potion-base-32M | 32M | Higher quality, slower |
| minishlab/potion-retrieval-32M | 32M | Optimized for retrieval tasks |
| minishlab/potion-multilingual-128M | 128M | 101 languages |
Any Model2Vec-compatible model on HuggingFace works — set the name to its model ID. You can pin a specific revision for reproducibility.
Storage
A single Lance dataset is written to .mdvs/index.lance/ — one row per chunk, with everything you need on the same row:
| Column | Purpose |
|---|---|
| chunk_id, file_id, chunk_index, start_line, end_line | Chunk identity and source location |
| chunk_text | The plain-text chunk body; used by the full-text index and shown as the snippet in verbose results |
| embedding | Dense vector for semantic search (FixedSizeList<Float32>) |
| filepath, content_hash, built_at | Per-file metadata (duplicated on each of that file’s chunks) |
| data | Frontmatter as an Arrow Struct (nested for dotted-name fields); this is what --where filters query against |
Inside the dataset, two indexes are built at mdvs build time:
- A full-text BM25 index on chunk_text, always built.
- A cosine IVF-PQ vector index on embedding, only built when the index has at least ~10,000 chunks. Smaller vaults use LanceDB’s exact flat scan, which is plenty fast at that scale.
Incremental builds
Build only re-embeds what changed. Each file’s markdown body (excluding frontmatter) is hashed, and the hash is compared against the existing index:
| Classification | Condition | Action |
|---|---|---|
| New | File not in index | Chunk, embed, add |
| Edited | Hash changed | Re-chunk, re-embed, replace chunks |
| Unchanged | Hash matches | Keep existing chunks |
| Removed | In index but not on disk | Drop file and its chunks |
Frontmatter-only changes (adding a tag, fixing a typo in author) rewrite the data column on every chunk row without re-embedding — the body hash hasn’t changed, so the vectors are still valid.
When nothing needs embedding, the model isn’t even loaded. A --force flag triggers a full rebuild regardless of hashes.
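The classification table above boils down to comparing body hashes. Here is a minimal Python sketch. SHA-256 is an assumption (the hash function mdvs uses is not stated here), and `body_hash` and `classify_files` are hypothetical names.

```python
import hashlib


def body_hash(body):
    """Hash the markdown body only (frontmatter excluded); SHA-256 is assumed."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def classify_files(on_disk, indexed):
    """on_disk: path -> body text; indexed: path -> hash stored in the index."""
    result = {"new": [], "edited": [], "unchanged": [], "removed": []}
    for path, body in sorted(on_disk.items()):
        if path not in indexed:
            result["new"].append(path)        # chunk, embed, add
        elif indexed[path] != body_hash(body):
            result["edited"].append(path)     # re-chunk, re-embed, replace
        else:
            result["unchanged"].append(path)  # keep existing chunks
    result["removed"] = sorted(set(indexed) - set(on_disk))
    return result


indexed = {"a.md": body_hash("old"), "b.md": body_hash("same"), "gone.md": body_hash("x")}
on_disk = {"a.md": "new text", "b.md": "same", "c.md": "fresh"}
print(classify_files(on_disk, indexed))
```

If both the `new` and `edited` lists come back empty, nothing needs embedding, which is the case where the model is never loaded.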
How search works
When you run mdvs search "query" example_kb, LanceDB does the heavy lifting. The shape of the work depends on --mode (default hybrid):
- semantic: the query is embedded with the same model used during build, and chunks are ranked by cosine similarity against embedding. Up to ~10,000 chunks, LanceDB does an exact flat scan; above that, the IVF-PQ vector index narrows the candidate set first.
- fulltext: the query is tokenized and scored against the BM25 full-text index on chunk_text. No model load needed.
- hybrid: both of the above run in parallel and their result lists are combined by LanceDB’s Reciprocal Rank Fusion reranker. Default mode because it tolerates queries that are either keyword-y or fuzzy.
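For intuition about how hybrid mode merges the two rankings, here is the textbook form of Reciprocal Rank Fusion in Python. It is a sketch, not LanceDB's code: the constant k = 60 and the 1-based ranks are the conventional choices and are assumptions here, and `rrf` is a hypothetical function.

```python
def rrf(result_lists, k=60):
    """Fuse ranked lists: each item scores sum(1 / (k + rank)) over the lists."""
    scores = {}
    for results in result_lists:
        for rank, item in enumerate(results, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    # Best score first; ties broken by name to keep the output deterministic.
    return sorted(scores, key=lambda item: (-scores[item], str(item)))


semantic_hits = ["a", "b", "c"]  # ranked chunk ids from the vector search
fulltext_hits = ["b", "d"]       # ranked chunk ids from BM25
print(rrf([semantic_hits, fulltext_hits]))  # ['b', 'a', 'd', 'c']
```

A chunk that appears in both lists ("b") outranks one that tops only a single list, which is why hybrid tolerates both keyword-style and fuzzy queries.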
For guidance on which mode to reach for, see Search Modes.
After LanceDB returns ranked chunk rows, mdvs deduplicates to the best chunk per file (a file with one highly relevant section ranks above a file with uniformly mediocre content) and then trims to --limit (default 10). LanceDB is asked for limit × 3 candidates to make sure dedupe has enough material to work with.
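The dedupe-then-trim step can be sketched as a single pass over the ranked rows. `best_chunk_per_file` is a hypothetical helper, and rows are simplified to (filepath, score) pairs that are already sorted best first.

```python
def best_chunk_per_file(rows, limit=10):
    """Keep the first (best) chunk seen for each file, then stop at limit."""
    seen = set()
    kept = []
    for filepath, score in rows:
        if filepath in seen:
            continue  # a better chunk from this file was already kept
        seen.add(filepath)
        kept.append((filepath, score))
        if len(kept) == limit:
            break
    return kept


# With limit = 2, the engine would be asked for limit * 3 = 6 candidate rows.
rows = [("a.md", 0.9), ("a.md", 0.8), ("b.md", 0.7), ("c.md", 0.6)]
print(best_chunk_per_file(rows, limit=2))  # [('a.md', 0.9), ('b.md', 0.7)]
```

Over-fetching candidates is what keeps the final list full even when several top rows belong to the same file.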
Scores
The score column in search output depends on the mode:
- Semantic: cosine similarity, a value in roughly [0, 1] (higher = more similar).
- Fulltext: BM25 relevance score, unbounded above (higher = better match).
- Hybrid: RRF score, computed from each chunk’s rank in the two result lists rather than from the raw scores (higher = better match).
Scores depend on the mode, the model, and the content, so there’s no universal threshold for “relevant.” Compare scores relative to each other within a single query.
Filtering with --where
Add a SQL filter to narrow results by frontmatter fields:
mdvs search "calibration" example_kb --where "status = 'active'"
The --where clause filters on frontmatter fields — only chunks whose file matches the filter are included in the results. The filter and similarity ranking are combined in a single LanceDB query, so non-matching rows are excluded efficiently.
You can use any SQL expression that LanceDB’s filter supports:
--where "draft = false"
--where "status = 'active' AND author = 'Giulia Ferretti'"
--where "sample_count > 10"
Array fields, nested objects, and field names with special characters require specific syntax — see the Search Guide for the full reference.
Model identity
Search refuses to run if the model configured in mdvs.toml doesn’t match the model that was used to build the index. This is a hard error, not a warning.
Embeddings from different models are incompatible — cosine similarity between vectors from different models produces meaningless scores. If you change the model, rebuild the index with mdvs build --force.
Search Modes
mdvs search runs in one of three modes, controlled by --mode:
mdvs search "<query>" [path] --mode {semantic|fulltext|hybrid}
The default is hybrid. The right mode depends on what kind of question you’re asking and how confident you are about the wording.
TL;DR — which mode when
| You want to find… | Pick |
|---|---|
| Something whose wording you can paraphrase but not quote | semantic |
| An exact identifier, acronym, error message, or filename | fulltext |
| Anything — let mdvs combine both signals | hybrid (default) |
If you’re not sure, leave it on hybrid. It tends to do at least as well as either alone, at the cost of one extra index lookup that’s effectively free.
What each mode actually does
semantic — meaning, not words
The query is embedded into a vector with the same Model2Vec model used to build the index, and chunks are ranked by cosine similarity to that vector. Two chunks score similarly when they’re about similar things, even if they share no words.
This is the mode that does the magic:
mdvs search "how to get in touch" --mode semantic
# matches a chunk that says "reach out via Slack" with no shared words
It’s also the mode that has nothing to fall back on when your query is an acronym or a unique string that the model doesn’t have a meaningful embedding for.
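As a rough illustration of the ranking step, here is cosine similarity over plain Python lists. The vectors and helper names are made up for the example; in mdvs the vectors come from the embedding model and the ranking runs inside LanceDB.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)

def rank_chunks(query_vec, chunk_vecs):
    """chunk_vecs: chunk id -> vector. Returns chunk ids, most similar first."""
    return sorted(
        chunk_vecs,
        key=lambda cid: cosine_similarity(query_vec, chunk_vecs[cid]),
        reverse=True,
    )
```

Only the direction of the vectors matters, not their length, which is why two chunks can score similarly without sharing any words.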
fulltext — words, not meaning
The query is tokenized and scored against the BM25 inverted index on the persisted chunk_text column. No embedding model is loaded; this mode also works when no model has been downloaded yet.
Use it when you know the exact term you’re after:
mdvs search "SPR-A1" --mode fulltext # exact equipment ID
mdvs search "calibration.toml" --mode fulltext # exact filename
mdvs search "TODO-0159" --mode fulltext # exact ticket reference
BM25 doesn’t care about meaning at all. A search for "how to get in touch" in fulltext mode will only match chunks that contain some of those exact words.
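To make the "words, not meaning" point concrete, here is a minimal BM25 sketch. It uses lowercase whitespace tokenization and the common defaults k1 = 1.2 and b = 0.75; these are assumptions for illustration, and LanceDB's tokenizer and parameters may differ.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumption: lowercase + whitespace split; real tokenizers are more involved.
    return text.lower().split()

def bm25_scores(query, chunks, k1=1.2, b=0.75):
    """Return one BM25 score per chunk, in input order."""
    docs = [tokenize(c) for c in chunks]
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n if n else 0.0
    # Document frequency: in how many chunks each term appears.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in set(tokenize(query)):
            if term not in tf:
                continue  # no lexical overlap, no contribution
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

A query such as "how to get in touch" scores zero against a chunk that says "reach out via slack", because none of the query terms occur in it.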
hybrid — both, reranked
Hybrid runs both semantic and fulltext queries, then merges the two ranked lists with LanceDB’s Reciprocal Rank Fusion reranker. The result is a single ranking that promotes documents which scored well on either signal.
In practice this means:
- A natural-language query that has no exact lexical matches still ranks the semantically-closest chunks at the top.
- An exact-identifier query still surfaces the chunk that contains it verbatim, even if its surrounding context is semantically unremarkable.
- Queries that are both — a phrase that mixes a concept with a specific term — get the best of both rankings.
Hybrid is the default because it makes the system tolerate vague queries and precise queries with the same flag.
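Reciprocal Rank Fusion itself is simple enough to sketch. The version below is the textbook formula, where each list contributes 1 / (k + rank) for every document it contains. The constant k = 60 is the value commonly used in the literature and is an assumption here, as is the function name; LanceDB's reranker may differ in detail.

```python
def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of ids (best first) into one ranking.

    Returns (id, fused_score) pairs sorted by fused score, best first.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

For example, fusing a semantic ranking `["a", "b", "c"]` with a fulltext ranking `["c", "a", "d"]` puts `a` and `c` ahead of `b` and `d`, because they appear in both lists. Only ranks enter the formula, so the raw cosine and BM25 scores never need to be on the same scale.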
Scores aren’t comparable across modes
The score column in the output means something different in each mode:
| Mode | Score |
|---|---|
| semantic | Cosine similarity. Roughly [0, 1]. Higher = more similar in meaning. |
| fulltext | BM25 relevance score. Unbounded; depends on corpus size and term rarity. Higher = better lexical match. |
| hybrid | RRF relevance score. Unbounded but small. Higher = better. |
Don’t compare scores across runs that used different modes. Within a single run, the ordering of the hits is what matters.
Performance and indexing
- semantic needs the embedding model loaded. On the first run that’s a one-time ~30 MB download (default model). Subsequent runs reuse the cached model.
- fulltext doesn’t need the model at all and works as soon as mdvs build has been run.
- hybrid does the semantic + fulltext work in parallel; the only extra cost over semantic alone is the BM25 lookup, which is negligible at most vault sizes.
All three modes use the same Lance dataset under .mdvs/. The BM25 full-text index on chunk_text is built every time mdvs build runs; the cosine IVF-PQ vector index on embedding is built only when the index exceeds 10,000 chunks (smaller vaults rely on LanceDB’s exact flat scan, which is plenty fast at that scale). See Search & Indexing for the storage layout.
Combining with --where
Mode is independent of --where. Any mode can be paired with any SQL filter:
mdvs search "drift" --mode fulltext --where "status = 'active'"
mdvs search "how the project ended" --mode semantic --where "joined < '2025-01-01'"
mdvs search "calibration" --where "draft = false" # default mode is hybrid
The filter narrows which chunks LanceDB considers; the mode decides how they’re ranked within that narrowed set. See the Search Guide for the full --where reference.
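Conceptually, the division of labor between the filter and the mode looks like the sketch below. It is a plain-Python stand-in, not how LanceDB executes the query: `where` is a hypothetical predicate over a chunk's frontmatter (held under a `data` key here), and `score` stands for whichever ranking the chosen mode produces.

```python
def filtered_search(chunks, where, score, limit=10):
    """Keep chunks whose frontmatter passes `where`, then rank the survivors."""
    kept = [c for c in chunks if where(c["data"])]
    return sorted(kept, key=score, reverse=True)[:limit]
```

A chunk that fails the filter never reaches the ranking step, however well it would have scored.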
Commands
mdvs provides eight commands covering the full workflow — from schema setup to search.
Schema & validation:
- init: Scan a directory, infer a typed schema, and write mdvs.toml (or import via --from-jsonschema)
- check: Validate frontmatter against the schema (optionally --jsonschema to override)
- update: Re-scan files, infer new fields, and update the schema
- export-jsonschema: Translate mdvs.toml’s [fields] into a JSON Schema 2020-12 document
Search index:
- build: Validate, embed, and write the search index
- search: Query the index with natural language
Utilities:
init
Scan a directory, infer a typed schema, and write mdvs.toml.
Usage
mdvs init [path] [flags]
Flags
| Flag | Default | Description |
|---|---|---|
| path | . | Directory to scan |
| --glob | ** | Glob pattern for matching markdown files |
| --force | | Overwrite existing mdvs.toml |
| --dry-run | | Preview the inferred schema without writing anything |
| --ignore-bare-files | | Exclude files without YAML frontmatter |
| --skip-gitignore | | Don’t read .gitignore patterns during scan |
| --from-jsonschema PATH | | Import a JSON Schema file (.json or .toml) as the source of fields instead of scanning |
Global flags (-o, -v, --logs) are described in Configuration.
What it does
init scans every markdown file, extracts YAML frontmatter, infers a typed schema with path patterns, and writes mdvs.toml. It does not build the search index — run build or search for that.
See Getting Started for a full walkthrough with output, and Schema Inference for how types and path patterns are computed.
One artifact is created: mdvs.toml — the schema file. Commit this to version control.
If mdvs.toml or .mdvs/ already exists, init refuses to run unless you pass --force. With --force, both mdvs.toml and .mdvs/ are deleted before proceeding. To update an existing schema without overwriting it, use update instead.
init --force vs update reinfer
Both re-infer the schema from scratch, but they differ in scope:
- init --force overwrites the entire mdvs.toml: all sections, including [scan], [fields], and any build sections. Any manual edits are lost. .mdvs/ is also deleted.
- update reinfer re-infers only the [fields] section. All other config is preserved.
Output
Compact (default)
mdvs init example_kb
Each discovered field is shown as its own key-value table with the field name on the top border. Only a few fields are shown here — the full output includes all 43:
Initialized 43 files — 43 field(s)
┌ action_items ────────────┬───────────────────────────────────────────────────┐
│ type │ Array(String) │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ files │ 9 out of 43 │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ nullable │ false │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ required │ meetings/all-hands/** │
│ │ projects/alpha/meetings/** │
│ │ projects/beta/meetings/** │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ allowed │ meetings/** │
│ │ projects/alpha/meetings/** │
│ │ projects/beta/meetings/** │
└──────────────────────────┴───────────────────────────────────────────────────┘
...
┌ drift_rate ──────────────┬───────────────────────────────────────────────────┐
│ type │ Float │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ files │ 3 out of 43 │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ nullable │ true │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ required │ projects/alpha/notes/** │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ allowed │ projects/alpha/notes/** │
└──────────────────────────┴───────────────────────────────────────────────────┘
...
┌ title ───────────────────┬───────────────────────────────────────────────────┐
│ type │ String │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ files │ 37 out of 43 │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ nullable │ false │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ required │ blog/** │
│ │ meetings/** │
│ │ people/** │
│ │ projects/** │
│ │ reference/protocols/** │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ allowed │ blog/** │
│ │ meetings/** │
│ │ people/** │
│ │ projects/** │
│ │ reference/protocols/** │
└──────────────────────────┴───────────────────────────────────────────────────┘
Initialized mdvs in 'example_kb'
Each table shows the inferred type, file count, nullable status, and inferred required/allowed glob patterns. Fields with special characters in their name (e.g., lab section) include a hints row with --where syntax advice (see Search Guide).
Verbose (-v)
Verbose output adds pipeline timing lines before the result:
mdvs init example_kb -v
Scan: 43 files (5ms)
Infer: 43 field(s) (0ms)
Write config: example_kb/mdvs.toml (0ms)
Initialized 43 files — 43 field(s)
┌ action_items ────────────┬───────────────────────────────────────────────────┐
│ type │ Array(String) │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ files │ 9 out of 43 │
...
The field tables are identical in both modes — verbose only adds the step lines showing processing times.
Examples
Preview the schema
Use --dry-run to see what init would infer without writing anything:
mdvs init example_kb --dry-run --force
Nothing is written — the output shows the same discovery table, followed by (dry run, nothing written).
Exclude bare files
By default, files without frontmatter are included in the scan. This affects field counts — a bare file at the root means title appears in 37 out of 43 files instead of 37 out of 37:
mdvs init example_kb --dry-run --force --ignore-bare-files
With --ignore-bare-files, only 37 files are scanned. The files row for title becomes 37 out of 37. This also affects the inferred required patterns — without bare files diluting the counts, more fields can be required in broader paths.
Import a JSON Schema (no scan)
--from-jsonschema PATH skips scanning and infers nothing. The file at PATH (.json or .toml) is the source of fields:
mdvs init example_kb --from-jsonschema fields.json
The schema is gated against mdvs’s supported keyword set before translation — unsupported features (oneOf, $ref, format, etc.) error out with an explanation. Path-scoping (allowed / required) and preprocessor stages are read from x-mdvs.* extension keys, so files exported via export-jsonschema round-trip losslessly.
The [scan], [embedding_model], [chunking], and [search] sections are not populated by this flow — the imported file only describes fields. Add build sections by hand or via a subsequent build.
Errors
| Error | Cause |
|---|---|
| mdvs.toml already exists | Config exists and --force not passed |
| is not a directory | Path doesn’t exist or isn’t a directory |
| no markdown files found | No .md files match the glob pattern |
check
Validate frontmatter against the schema.
Usage
mdvs check [path] [flags]
Flags
| Flag | Default | Description |
|---|---|---|
| path | . | Directory containing mdvs.toml |
| --no-update | | Skip auto-update before validating |
| --jsonschema PATH | | Override the [fields] block of mdvs.toml with an external JSON Schema file (.json or .toml) for this run only |
Global flags (-o, -v, --logs) are described in Configuration.
What it does
check reads mdvs.toml, scans every markdown file, and validates each field value against the declared constraints.
By default, check auto-updates the schema before validating (see [check].auto_update). Use --no-update to skip this and validate against the current mdvs.toml as-is.
It reports seven kinds of violations:
- WrongType: value doesn’t match the declared type (or fails a pattern regex)
- Disallowed: field appears in a file whose path doesn’t match any allowed glob
- MissingRequired: file matches a required glob but the field is absent
- NullNotAllowed: field is null but nullable = false
- InvalidCategory: value is not in the field’s declared categories (see Constraints)
- OutOfRange: numeric value violates min/max, or length violates min_length/max_length
- FrontmatterUnrepresentable: file’s YAML can’t be represented as JSON (NaN/inf, non-string keys, non-object top-level)
Fields not in mdvs.toml (and not in the ignore list) are reported as new fields — these are informational and don’t count as violations.
check is read-only — it never modifies mdvs.toml or any files. See Validation for the full rules, including preprocessor handling and null behavior.
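The violation kinds can be illustrated with a small validator. This is a sketch under several assumptions, not the mdvs algorithm: the field spec is a plain dict, only scalar types and numeric min/max are handled, and Python's `fnmatch` stands in for the real glob matcher (its `*` also matches `/`, so the semantics may not be identical).

```python
from fnmatch import fnmatch

# Scalar types only; arrays and nested objects are left out of this sketch.
TYPE_CHECKS = {
    "Boolean": lambda v: isinstance(v, bool),
    "Integer": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "Float": lambda v: isinstance(v, float),
    "String": lambda v: isinstance(v, str),
}

def check_file(path, frontmatter, fields):
    """Return a sorted list of (field, violation_kind) for one file."""
    violations = []
    for name, spec in fields.items():
        required = any(fnmatch(path, g) for g in spec.get("required", []))
        allowed = any(fnmatch(path, g) for g in spec.get("allowed", []))
        if name not in frontmatter:
            if required:
                violations.append((name, "MissingRequired"))
            continue
        value = frontmatter[name]
        if not allowed:
            violations.append((name, "Disallowed"))
        if value is None:
            if not spec.get("nullable", False):
                violations.append((name, "NullNotAllowed"))
            continue
        if not TYPE_CHECKS[spec["type"]](value):
            violations.append((name, "WrongType"))
            continue  # range and category checks assume the right type
        if "categories" in spec and value not in spec["categories"]:
            violations.append((name, "InvalidCategory"))
        too_low = "min" in spec and value < spec["min"]
        too_high = "max" in spec and value > spec["max"]
        if too_low or too_high:
            violations.append((name, "OutOfRange"))
    return sorted(violations)
```

Fields present in the frontmatter but absent from `fields` are simply skipped here, mirroring the point that new fields are informational rather than violations.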
Validate against an external schema
--jsonschema PATH replaces the [fields] block for this run only. Useful for one-off validation against a contract, or for cross-checking a vault against someone else’s schema:
mdvs check example_kb --jsonschema partner-contract.json
mdvs.toml is not modified. If no mdvs.toml exists, a minimal config is synthesized in memory so the rest of the pipeline runs normally.
Output
Compact (default)
When everything passes:
mdvs check example_kb
Checked 43 files — no violations
When violations are found, each violation is shown as a key-value table with the field name, violation kind, the violated rule, and the affected files:
Checked 43 files — 3 violation(s)
Violations (3):
┌ drift_rate ──────────────┬───────────────────────────────────────────────────┐
│ kind │ Null value not allowed │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ rule │ not nullable │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ files │ projects/alpha/notes/experiment-2.md │
└──────────────────────────┴───────────────────────────────────────────────────┘
┌ priority ────────────────┬───────────────────────────────────────────────────┐
│ kind │ Wrong type │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ rule │ type Integer │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ files │ projects/beta/notes/initial-findings.md (got Stri │
│ │ ng) │
│ │ projects/beta/overview.md (got String) │
└──────────────────────────┴───────────────────────────────────────────────────┘
┌ title ───────────────────┬───────────────────────────────────────────────────┐
│ kind │ Missing required │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ rule │ required in ["**"] │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ files │ README.md │
│ │ lab-values.md │
│ │ reference/glossary.md │
│ │ reference/quick-start.md │
│ │ reference/tools.md │
│ │ scratch.md │
└──────────────────────────┴───────────────────────────────────────────────────┘
WrongType violations include the actual type in parentheses (e.g., got String).
Verbose (-v)
Verbose output adds pipeline timing lines before the result:
Read config: example_kb/mdvs.toml (3ms)
Scan: 43 files (2ms)
Validate: 43 files — 3 violation(s) (78ms)
Checked 43 files — 3 violation(s)
Violations (3):
┌ drift_rate ──────────────┬───────────────────────────────────────────────────┐
│ kind │ Null value not allowed │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ rule │ not nullable │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ files │ projects/alpha/notes/experiment-2.md │
└──────────────────────────┴───────────────────────────────────────────────────┘
...
The violation tables are identical in both modes — verbose only adds the step lines showing processing times.
Exit codes
| Code | Meaning |
|---|---|
| 0 | All files valid, no violations |
| 1 | Violations found |
| 2 | Pipeline error (missing mdvs.toml, invalid config, scan failure) |
New fields don’t affect the exit code — they’re informational only.
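In a CI script, the exit codes above can be turned into a readable status. The wrapper below is a hypothetical helper: it assumes `mdvs` is on `PATH` and uses only the `mdvs check [path]` invocation shown in this section.

```python
import subprocess

# Meanings taken from the exit code table above.
CHECK_EXIT_MEANINGS = {0: "valid", 1: "violations", 2: "pipeline error"}

def interpret_check_exit(code: int) -> str:
    return CHECK_EXIT_MEANINGS.get(code, "unknown")

def run_check(path: str = "."):
    """Run `mdvs check <path>` and return (meaning, stdout)."""
    proc = subprocess.run(["mdvs", "check", path], capture_output=True, text=True)
    return interpret_check_exit(proc.returncode), proc.stdout
```

Separating code 1 from code 2 lets a pipeline fail differently for "the notes are wrong" and "the tool could not run".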
Errors
| Error | Cause |
|---|---|
| no mdvs.toml found | Config doesn’t exist; run mdvs init first |
| mdvs.toml is invalid | TOML parsing or schema error; fix the file or run mdvs init --force |
update
Re-scan files, infer new fields, and update the schema.
Usage
mdvs update [path] [--dry-run]
mdvs update [path] reinfer [fields..] [flags]
Flags
| Flag | Default | Description |
|---|---|---|
| path | . | Directory containing mdvs.toml |
| --dry-run | | Preview changes without writing anything |
Global flags (-o, -v, --logs) are described in Configuration.
What it does
update re-scans the directory using the existing [scan] config, infers types and path patterns from the current files, and merges the results into mdvs.toml. Unlike init, it preserves all existing configuration — only the [fields] section changes.
Default mode
By default, update only discovers new fields — fields that appear in frontmatter but aren’t yet in mdvs.toml (either as [[fields.field]] entries or in the ignore list). Existing fields are protected: their types, allowed/required patterns, nullable flags, and constraints don’t change.
Fields that disappear (no longer in any file) are kept in mdvs.toml by default. This is conservative — removing a field from the schema is an explicit action.
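The default merge behavior can be summarized in code. This is a sketch under the assumption that field definitions are plain dicts keyed by field name; `merge_new_fields` is a hypothetical name, not part of mdvs.

```python
def merge_new_fields(existing, inferred, ignored=()):
    """Add newly discovered fields without touching existing definitions.

    existing / inferred: field name -> definition.
    Returns (merged, sorted list of added field names).
    """
    merged = dict(existing)  # existing fields, including vanished ones, are kept
    added = []
    for name, definition in inferred.items():
        if name in existing or name in ignored:
            continue  # protected or explicitly ignored
        merged[name] = definition
        added.append(name)
    return merged, sorted(added)
```

Note that a field whose freshly inferred type differs from the stored one is left alone; changing it takes an explicit reinfer.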
reinfer subcommand
Re-infer field definitions from scratch. This is a subcommand of update with its own flags:
| Flag | Description |
|---|---|
| fields.. | Fields to reinfer (all if none specified) |
| --with <kinds> | Comma-separated constraint kinds to apply (categorical, range, none). Requires named fields. |
| --max-categories <N> | Override max distinct values for categorical inference |
| --min-repetition <N> | Override min average repetition for categorical inference |
| --dry-run | Preview changes without writing anything |
Reinfer specific fields:
mdvs update example_kb reinfer drift_rate priority
The named fields are removed from mdvs.toml and re-inferred from scratch, as if they’d never been seen. All other fields stay protected. Fails if a named field isn’t in mdvs.toml.
Without --with, reinfer applies the default heuristic (categorical detection — see Constraints). Use --with to override:
# Force categorical (skip heuristic threshold)
mdvs update example_kb reinfer title --with=categorical
# Infer min/max from observed numeric values
mdvs update example_kb reinfer sample_count --with=range
# Strip all constraints
mdvs update example_kb reinfer status --with=none
--with takes a comma-separated list. Incompatible kinds (e.g., range,categorical on the same field) are rejected at parse time. --with=none cannot be combined with other kinds. --with requires named fields.
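The rules for --with can be written down as a small parser. This is an illustrative sketch rather than the actual argument handling: `parse_with` is a hypothetical function, and the "unknown constraint kind" message is invented for the example (the other messages follow the error table in this section).

```python
VALID_KINDS = {"categorical", "range", "none"}

def parse_with(value, named_fields):
    """Validate a --with value such as 'range' or 'categorical'."""
    kinds = [k.strip() for k in value.split(",") if k.strip()]
    if not named_fields:
        raise ValueError("--with requires named fields")
    unknown = [k for k in kinds if k not in VALID_KINDS]
    if unknown:
        # Hypothetical message; the real wording is not documented here.
        raise ValueError(f"unknown constraint kind: {unknown[0]}")
    if "none" in kinds and len(kinds) > 1:
        raise ValueError("--with=none cannot be combined with other kinds")
    if "range" in kinds and "categorical" in kinds:
        raise ValueError("--with: range and categorical are mutually exclusive")
    return kinds
```

Rejecting bad combinations before any file is scanned keeps the failure cheap and the schema untouched.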
Reinfer all fields:
mdvs update example_kb reinfer
When no fields are specified, all [[fields.field]] entries are removed and rebuilt from the current files. Fields that no longer exist in any file are reported as removed.
All other config sections ([scan], [embedding_model], [chunking], [search], [update]) are preserved. This is the key difference from init --force, which rewrites the entire mdvs.toml.
Output
Compact (default)
When the schema is already up to date:
Scanned 43 files — no changes (37 unchanged) (dry run)
When new fields are discovered, they appear in an “Added” section with the same key-value format as init:
Scanned 44 files — 1 field(s) changed (37 unchanged) (dry run)
Added (1):
┌ category ────────────────┬───────────────────────────────────────────────────┐
│ type │ String │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ files │ 3 out of 44 │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ nullable │ false │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ required │ (none) │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ allowed │ projects/alpha/notes/** │
└──────────────────────────┴───────────────────────────────────────────────────┘
When reinfer detects a type change, the “Changed” section shows old and new values with an arrow:
Scanned 43 files — 1 field(s) changed (36 unchanged)
Changed (1):
┌ drift_rate ──────────────┬───────────────────────────────────────────────────┐
│ type │ Float → String │
└──────────────────────────┴───────────────────────────────────────────────────┘
When a reinferred field no longer exists in any file:
Scanned 43 files — 1 field(s) changed (36 unchanged)
Removed (1):
┌ category ────────────────┬───────────────────────────────────────────────────┐
│ previously allowed │ projects/alpha/notes/** │
└──────────────────────────┴───────────────────────────────────────────────────┘
Verbose (-v)
Verbose output adds pipeline timing lines before the result:
Read config: example_kb/mdvs.toml (2ms)
Scan: 44 files (3ms)
Infer: 38 field(s) (0ms)
Write config: example_kb/mdvs.toml (1ms)
Scanned 44 files — 1 field(s) changed (37 unchanged)
Added (1):
┌ category ────────────────┬───────────────────────────────────────────────────┐
│ type │ String │
...
The field tables are identical in both modes — verbose only adds the step lines showing processing times.
Exit codes
| Code | Meaning |
|---|---|
| 0 | Success (changes written, or no changes needed) |
| 2 | Pipeline error (missing config, scan failure, build failure) |
Errors
| Error | Cause |
|---|---|
| no mdvs.toml found | Config doesn’t exist; run mdvs init first |
| field '<name>' is not in mdvs.toml | reinfer names a field that doesn’t exist |
| --with requires named fields | --with flag used without specifying fields |
| --with: <X> and <Y> are mutually exclusive | Incompatible constraint kinds in the same --with list |
| --with=none cannot be combined with other kinds | none mixed with other kinds in --with |
| field name conflicts with internal column | New field name collides with reserved names |
build
Validate, embed, and write the search index.
Usage
mdvs build [path] [flags]
Flags
| Flag | Default | Description |
|---|---|---|
| path | . | Directory containing mdvs.toml |
| --set-model | | Change embedding model (requires --force) |
| --set-revision | | Pin model to a specific HuggingFace revision (requires --force) |
| --set-chunk-size | | Change max chunk size in characters (requires --force) |
| --force | | Confirm config changes or trigger a full rebuild |
| --no-update | | Skip auto-update before building |
Global flags (-o, -v, --logs) are described in Configuration.
What it does
build creates (or updates) the search index in .mdvs/. The pipeline:
- Read config: parse mdvs.toml. If [embedding_model], [chunking], or [search] sections are missing, they’re added with defaults and written back.
- Scan: walk the directory and extract frontmatter.
- Validate: check frontmatter against the schema (same as check). If violations are found, the build aborts.
- Classify: compare scanned files against the existing index to determine what needs embedding.
- Load model: download or load the cached embedding model. Skipped if nothing needs embedding.
- Embed: chunk and embed new/edited files.
- Write index: write the Lance dataset at .mdvs/index.lance/ (one row per chunk) and create indexes inside it: a full-text BM25 index on chunk_text (always) and a cosine IVF-PQ vector index on embedding (only above 10,000 chunks; smaller vaults rely on LanceDB’s exact flat scan).
By default, build auto-updates the schema before building (see [build].auto_update). Use --no-update to skip this.
See Search & Indexing for details on chunking, embedding, and how the index is structured.
Incremental builds
Build is incremental by default. It classifies each file by comparing its content hash against the existing index:
| Status | Condition | Action |
|---|---|---|
| new | file not in existing index | chunk + embed |
| edited | file in index, content changed | chunk + re-embed |
| unchanged | file in index, content matches | keep existing chunks |
| removed | file in index, no longer on disk | drop from index |
Content hash covers the file body only (after frontmatter extraction). Frontmatter-only changes don’t trigger re-embedding — but every chunk row is rewritten with fresh frontmatter from the current scan.
When nothing needs embedding, the model is never loaded.
Config changes
build detects when the embedding configuration has changed since the last build by comparing mdvs.toml against metadata stored on the Lance dataset. If a mismatch is found, the build refuses to proceed unless you pass --force:
config changed since last build:
model: 'minishlab/potion-base-8M' → 'minishlab/potion-base-32M'
Use --force to rebuild with new config
The same check covers schema changes. A hash of the post-translation JSON Schema is stored on the Lance dataset; if the current schema doesn’t match, the build refuses with:
schema: fields, types, constraints, path-scoping, or preprocessors have changed
Use --force to rebuild with new schema
This catches edits to [[fields.field]] definitions, constraint changes, preprocessor changes, and path-scoping changes — anything that affects what gets stored in the data column of the index.
The --set-model, --set-revision, and --set-chunk-size flags update mdvs.toml and require --force (since they change the config and trigger a full re-embed). For example, to switch to a larger model:
mdvs build --set-model minishlab/potion-base-32M --force
--set-revision pins the model to a specific HuggingFace commit SHA, ensuring reproducible embeddings even if the model is updated upstream:
mdvs build --set-revision abc123def --force
The revision is stored in mdvs.toml under [embedding_model].revision and checked against the Lance dataset metadata on subsequent builds. See Embedding for the full list of available models.
On the first build (no existing .mdvs/), --force is never needed.
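The guard described in this section amounts to comparing the current config with what was recorded at the last build. The sketch below assumes both are flat dicts and that `stored` is `None` when no index exists yet; the key names and function names are illustrative, not the actual metadata layout.

```python
def config_changes(current, stored):
    """Return one 'key: old → new' line per setting that differs."""
    keys = sorted(set(current) | set(stored))
    return [
        f"{k}: {stored.get(k)!r} → {current.get(k)!r}"
        for k in keys
        if current.get(k) != stored.get(k)
    ]

def may_build(current, stored, force=False):
    """Decide whether a build may proceed."""
    if stored is None:
        return True  # first build: nothing to conflict with
    return force or not config_changes(current, stored)
```

With this shape, switching the model yields a line like `model: 'minishlab/potion-base-8M' → 'minishlab/potion-base-32M'`, and the build proceeds only when `force` is set.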
Output
Compact (default)
When nothing needs embedding (incremental build, all files unchanged):
Built index — 43 files, 59 chunks
┌──────────────────────────┬───────────────────────────────────────────────────┐
│ full rebuild │ false │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ files total │ 43 │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ files embedded │ 0 │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ files unchanged │ 43 │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ files removed │ 0 │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ chunks total │ 59 │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ chunks embedded │ 0 │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ chunks unchanged │ 59 │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ chunks removed │ 0 │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ new fields │ (none) │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ embedded files │ (none) │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ removed files │ (none) │
└──────────────────────────┴───────────────────────────────────────────────────┘
When violations are found, the build aborts:
Build aborted — 6 violation(s) found. Run `mdvs check` for details.
Verbose (-v)
Verbose output adds pipeline timing lines before the result:
Read config: example_kb/mdvs.toml (4ms)
Scan: 43 files (4ms)
Infer: 37 field(s) (0ms)
Validate: 43 files — no violations (87ms)
Classify: 43 files (full rebuild) (0ms)
Load model: minishlab/potion-base-8M (24ms)
Embed: 43 files, 59 chunks (12ms)
Write index: 43 files, 59 chunks (1ms)
Built index — 43 files, 59 chunks (full rebuild)
┌──────────────────────────┬───────────────────────────────────────────────────┐
│ full rebuild │ true │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ files total │ 43 │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ files embedded │ 43 │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ files unchanged │ 0 │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ ... │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ embedded files │ README.md (7 chunks) │
│ │ blog/drafts/grant-ideas.md (2 chunks) │
│ │ ... │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ removed files │ (none) │
└──────────────────────────┴───────────────────────────────────────────────────┘
The key-value table is identical in both modes — verbose only adds the step lines showing processing times. When files are embedded, the embedded files row lists each file with its chunk count.
Exit codes
| Code | Meaning |
|---|---|
| 0 | Build completed successfully |
| 1 | Violations found, build aborted |
| 2 | Pipeline error (missing config, scan failure, config mismatch, model failure) |
Errors
| Error | Cause |
|---|---|
| no mdvs.toml found | Config doesn’t exist; run mdvs init first |
| config changed since last build | Config differs from Lance dataset metadata; use --force |
| --set-model requires --force | Changing model triggers full re-embed |
| --set-chunk-size requires --force | Changing chunk size triggers full re-embed |
| dimension mismatch | Model produces different dimensions than existing index (incremental build only; --force bypasses this) |
search
Query the index with natural language.
Usage
mdvs search <query> [path] [flags]
Flags
| Flag | Default | Description |
|---|---|---|
| query | (required) | Natural language search query |
| path | . | Directory containing mdvs.toml |
| --mode | hybrid | Search mode: semantic, fulltext, or hybrid |
| --limit / -n | 10 | Maximum number of results |
| --where | | SQL WHERE clause for filtering on frontmatter fields |
| --no-update | | Skip auto-update |
| --no-build | | Skip auto-build before searching |
The default limit can be changed in mdvs.toml via [search].default_limit.
Global flags (-o, -v, --logs) are described in Configuration.
What it does
search loads the Lance index from .mdvs/, runs the query through LanceDB, and ranks files by their best-matching chunk. The exact ranking depends on --mode:
- semantic: embed the query with the same model that built the index, cosine-rank chunks against embedding.
- fulltext: BM25 rank chunks against the persisted chunk_text (no model load needed).
- hybrid (default): run both and combine with LanceDB’s Reciprocal Rank Fusion reranker.
Each file’s score is the best chunk match across all its chunks (see scoring). Results are sorted descending (higher = better match).
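To make the ranking concrete, here is a minimal Python sketch of the two ideas described above: fusing two ranked chunk lists with Reciprocal Rank Fusion, then collapsing chunk scores to one score per file by keeping the best chunk. It is illustrative only. The function names, the sample chunk ids, and the constant k = 60 (a commonly used RRF value) are assumptions, not mdvs or LanceDB internals.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of chunk ids; k=60 is an assumed, commonly used constant."""
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            fused[chunk_id] += 1.0 / (k + rank)
    return dict(fused)


def best_chunk_per_file(chunk_scores, chunk_to_file):
    """Score each file by its best-scoring chunk, sorted descending."""
    best = {}
    for chunk_id, score in chunk_scores.items():
        path = chunk_to_file[chunk_id]
        if path not in best or score > best[path]:
            best[path] = score
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


# Hypothetical chunk ids: one ranked list per search mode.
semantic = ["a#1", "b#1", "a#2"]
fulltext = ["b#1", "c#1", "a#1"]
chunk_to_file = {"a#1": "a.md", "a#2": "a.md", "b#1": "b.md", "c#1": "c.md"}

fused = reciprocal_rank_fusion([semantic, fulltext])
ranked_files = best_chunk_per_file(fused, chunk_to_file)
```

In this toy data, b.md ranks first because its single chunk sits near the top of both lists, while a.md is represented only by its stronger chunk.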
By default, search auto-builds the index before querying, which includes auto-updating the schema (see [search].auto_build). Use --no-build to query the existing index as-is, or --no-update to build without updating the schema first.
See Search & Indexing for details on chunking, embedding, scoring, and model identity.
First run
Note: The very first time search (or build) runs, mdvs downloads the embedding model from HuggingFace to a local cache. This is a one-time download; subsequent runs use the cached model and start instantly.
Download size depends on the model:
| Model | Size |
|---|---|
| potion-base-2M | ~8 MB |
| potion-base-8M (default) | ~30 MB |
| potion-base-32M | ~120 MB |
| potion-multilingual-128M | ~480 MB |
After the model is cached, a full build of 500+ files completes in under a second.
--where
Filter results by frontmatter fields using SQL syntax. The filter and similarity ranking are combined in a single query, so files that don’t match are excluded efficiently.
Scalar comparisons:
mdvs search "experiment" --where "status = 'active'"
mdvs search "experiment" --where "sample_count > 20"
mdvs search "experiment" --where "status = 'active' AND priority = 1"
Array fields (via LanceDB’s SQL array functions):
mdvs search "calibration" --where "array_has(tags, 'biosensor')"
--where clauses that reference Array(Float) fields are rejected up front with a clear error. See the Search Guide for the full explanation and the workaround.
Field names with spaces need double-quoting:
mdvs search "query" --where "\"lab section\" = 'optics'"
See Search Guide for the full --where reference, including nested objects, escaping rules, and more examples.
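When a --where clause is generated by a script rather than typed by hand, the two quoting layers (SQL and shell) are easy to mix up. The sketch below builds the clause in Python and returns an argument list, so no shell escaping is involved. The helper names are hypothetical, and doubling embedded quotes is the standard SQL convention, assumed here to apply; the Search Guide remains the authority on escaping rules.

```python
import re


def quote_ident(name):
    """Double-quote a field name unless it is a plain identifier."""
    if re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", name):
        return name
    return '"' + name.replace('"', '""') + '"'


def quote_str(value):
    """Single-quote a string literal, doubling embedded quotes (standard SQL, assumed here)."""
    return "'" + value.replace("'", "''") + "'"


def search_argv(query, field, value, path="."):
    """Build an argument list for `mdvs search` with one equality filter."""
    clause = f"{quote_ident(field)} = {quote_str(value)}"
    return ["mdvs", "search", query, path, "--where", clause]


argv = search_argv("query", "lab section", "optics")
```

Passing argv to subprocess.run without shell=True hands the clause to mdvs unchanged, which avoids the backslash escaping needed on a shell command line.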
Output
Compact (default)
mdvs search "experiment" example_kb -n 3
A header table shows the query metadata, followed by one key-value table per hit numbered #1, #2, etc. Each hit includes the file, similarity score, line range, and the best-matching chunk text:
Searched "experiment" — 3 hits
┌──────────────────────────┬───────────────────────────────────────────────────┐
│ query │ experiment │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ model │ minishlab/potion-base-8M │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ limit │ 3 │
└──────────────────────────┴───────────────────────────────────────────────────┘
┌ #1 ──────────────────────┬───────────────────────────────────────────────────┐
│ file │ projects/archived/gamma/lessons-learned.md │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ score │ 0.487 │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ lines │ 26-28 │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ text │ ## On REMO │
│ │ │
│ │ REMO's environmental monitoring data from the out │
│ │ door tests was the most useful output of the enti │
│ │ re project. ... │
└──────────────────────────┴───────────────────────────────────────────────────┘
┌ #2 ──────────────────────┬───────────────────────────────────────────────────┐
│ file │ blog/published/2031/founding-story.md │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ score │ 0.470 │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ lines │ 21-21 │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ text │ We are a small lab and we intend to stay small... │
└──────────────────────────┴───────────────────────────────────────────────────┘
┌ #3 ──────────────────────┬───────────────────────────────────────────────────┐
│ file │ projects/archived/gamma/post-mortem.md │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ score │ 0.457 │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ lines │ 11-21 │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ text │ # Project Gamma — Post-Mortem ... │
└──────────────────────────┴───────────────────────────────────────────────────┘
With --where filtering, only files matching the SQL clause are included:
mdvs search "experiment" example_kb --where "status = 'active'" -n 5
Searched "experiment" — 3 hits
┌──────────────────────────┬───────────────────────────────────────────────────┐
│ query │ experiment │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ model │ minishlab/potion-base-8M │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ limit │ 5 │
└──────────────────────────┴───────────────────────────────────────────────────┘
┌ #1 ──────────────────────┬───────────────────────────────────────────────────┐
│ file │ projects/alpha/overview.md │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ score │ 0.391 │
...
Verbose (-v)
Verbose output adds pipeline timing lines before the result:
mdvs search "experiment" example_kb -v -n 3
Read config: example_kb/mdvs.toml (2ms)
Scan: 43 files (2ms)
...
Load model: minishlab/potion-base-8M (22ms)
Embed query: "experiment" (0ms)
Execute search: 3 hits (5ms)
Searched "experiment" — 3 hits
┌──────────────────────────┬───────────────────────────────────────────────────┐
│ query │ experiment │
...
The hit tables are identical in both modes — verbose only adds the step lines showing processing times.
Exit codes
| Code | Meaning |
|---|---|
| 0 | Search completed (even with 0 results) |
| 2 | Pipeline error (missing config, missing index, model mismatch, invalid --where) |
Errors
| Error | Cause |
|---|---|
| no mdvs.toml found | Config doesn’t exist; run mdvs init first |
| index not found | .mdvs/ doesn’t exist; run mdvs build first |
| model mismatch | Config model differs from index; run mdvs build to rebuild |
| Invalid --where | SQL syntax error or unknown field name |
info
Show config and index status.
Usage
mdvs info [path]
Flags
| Flag | Default | Description |
|---|---|---|
| path | . | Directory containing mdvs.toml |
Global flags (-o, -v, --logs) are described in Configuration.
What it does
info reads mdvs.toml, counts files on disk, and reads the index metadata from .mdvs/ (if it exists). It displays the current schema and index status without modifying anything.
Use it to check which fields are configured, whether the index is up to date, or if the config has changed since the last build.
Output
Compact (default)
mdvs info example_kb
The output is organized into sections: Config, Index (if built), and one key-value table per field. Only a few fields are shown here:
43 files, 43 fields, 59 chunks
Config:
┌──────────────────────────┬───────────────────────────────────────────────────┐
│ scan glob │ ** │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ ignored fields │ (none) │
└──────────────────────────┴───────────────────────────────────────────────────┘
Index:
┌──────────────────────────┬───────────────────────────────────────────────────┐
│ model │ minishlab/potion-base-8M │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ revision │ none │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ chunk size │ 1024 │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ built │ 2026-03-29T15:22:21.347671+00:00 │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ config │ match │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ files │ 43 out of 43 │
└──────────────────────────┴───────────────────────────────────────────────────┘
43 fields:
┌ action_items ────────────┬───────────────────────────────────────────────────┐
│ type │ Array(String) │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ files │ 9 out of 43 │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ nullable │ false │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ required │ meetings/all-hands/** │
│ │ projects/alpha/meetings/** │
│ │ projects/beta/meetings/** │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ allowed │ meetings/** │
│ │ projects/alpha/meetings/** │
│ │ projects/beta/meetings/** │
└──────────────────────────┴───────────────────────────────────────────────────┘
...
┌ drift_rate ──────────────┬───────────────────────────────────────────────────┐
│ type │ Float │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ files │ 3 out of 43 │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ nullable │ true │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ required │ projects/alpha/notes/** │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ allowed │ projects/alpha/notes/** │
└──────────────────────────┴───────────────────────────────────────────────────┘
...
The config row shows match when mdvs.toml matches the index metadata, or changed when the config has been modified since the last build. The files row shows indexed files vs files on disk.
When no index has been built:
43 files, 43 fields
Config:
┌──────────────────────────┬───────────────────────────────────────────────────┐
│ scan glob │ ** │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ ignored fields │ (none) │
└──────────────────────────┴───────────────────────────────────────────────────┘
43 fields:
...
The Index section is omitted and the summary shows only files and fields (no chunk count).
Verbose (-v)
Verbose output adds pipeline timing lines before the result:
Read config: example_kb/mdvs.toml (2ms)
Scan: 43 files (3ms)
Read index: 43 files, 59 chunks (2ms)
43 files, 43 fields, 59 chunks
Config:
...
The tables are identical in both modes — verbose only adds the step lines showing processing times.
Exit codes
| Code | Meaning |
|---|---|
| 0 | Success (including when no index exists) |
| 2 | Pipeline error (missing config, Lance dataset read failure) |
Errors
| Error | Cause |
|---|---|
| no mdvs.toml found | Config doesn’t exist; run mdvs init first |
clean
Delete the search index.
Usage
mdvs clean [path]
Flags
| Flag | Default | Description |
|---|---|---|
| path | . | Directory containing mdvs.toml |
Global flags (-o, -v, --logs) are described in Configuration.
What it does
clean deletes the .mdvs/ directory, which contains the Lance dataset that makes up the search index (plus the cached embedding model). The mdvs.toml configuration file is never touched — you can rebuild the index at any time with build.
The command is idempotent — running it when .mdvs/ doesn’t exist is a no-op. It also refuses to delete if .mdvs/ is a symlink, as a safety measure.
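The behavior described above (delete the index, treat a missing directory as a no-op, refuse symlinks, leave mdvs.toml alone) can be sketched in a few lines of Python. This is an illustration of the contract, not mdvs's implementation; the function name clean_index is hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def clean_index(project_dir):
    """Remove <project_dir>/.mdvs; return True if something was removed."""
    index_dir = Path(project_dir) / ".mdvs"
    if index_dir.is_symlink():
        # Refuse to follow or delete symlinks, mirroring the safety rule above.
        raise RuntimeError(f"{index_dir} is a symlink")
    if not index_dir.exists():
        return False  # nothing to clean: a no-op, not an error
    shutil.rmtree(index_dir)
    return True


# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / ".mdvs").mkdir()
    (Path(tmp) / "mdvs.toml").write_text("[scan]\n")
    first = clean_index(tmp)   # removes the index
    second = clean_index(tmp)  # second call is a no-op
    config_kept = (Path(tmp) / "mdvs.toml").exists()
```

The symlink test runs before the existence test on purpose: a dangling symlink reports as non-existent, and it should still be refused rather than silently skipped.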
Output
Compact (default)
mdvs clean example_kb
Cleaned "example_kb/.mdvs"
┌──────────────────────────┬───────────────────────────────────────────────────┐
│ removed │ true │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ path │ example_kb/.mdvs │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ files removed │ 2 │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ size │ 113.7 KB │
└──────────────────────────┴───────────────────────────────────────────────────┘
When there’s nothing to clean:
Nothing to clean — "example_kb/.mdvs" does not exist
┌──────────────────────────┬───────────────────────────────────────────────────┐
│ removed │ false │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ path │ example_kb/.mdvs │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ files removed │ 0 │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ size │ 0 B │
└──────────────────────────┴───────────────────────────────────────────────────┘
Verbose (-v)
Verbose output adds pipeline timing lines before the result:
Delete index: example_kb/.mdvs (2 files, 113.8 KB) (0ms)
Cleaned "example_kb/.mdvs"
┌──────────────────────────┬───────────────────────────────────────────────────┐
│ removed │ true │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ path │ example_kb/.mdvs │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ files removed │ 2 │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ size │ 113.8 KB │
└──────────────────────────┴───────────────────────────────────────────────────┘
Exit codes
| Code | Meaning |
|---|---|
| 0 | Success (including when nothing to clean) |
| 2 | Pipeline error (symlink detected, I/O failure) |
Errors
| Error | Cause |
|---|---|
| .mdvs is a symlink | Refuses to delete symlinks for safety; remove it manually |
export-jsonschema
Translate mdvs.toml’s [fields] block into a JSON Schema 2020-12 document.
Usage
mdvs export-jsonschema [path] [flags]
Flags
| Flag | Default | Description |
|---|---|---|
| path | . | Directory containing mdvs.toml |
| --format json\|toml | json | Output format. toml produces a TOML serialization of the same JSON Schema |
| --output-file FILE | (stdout) | Write to a file instead of stdout |
Global flags (-o, -v, --logs) are described in Configuration.
What it does
export-jsonschema reads mdvs.toml, takes the [fields] block, and translates it into a JSON Schema 2020-12 document. Field types, constraints, path-scoping, and preprocessor stages are all preserved. The output is a valid JSON Schema that any standards-compliant validator can consume.
Build configuration ([embedding_model], [chunking], [search]) and scan settings are not included — JSON Schema only describes the field contract.
Round-tripping with init
export-jsonschema and init --from-jsonschema are designed to round-trip losslessly:
mdvs export-jsonschema ./project --output-file fields.json
mdvs init ./reborn --from-jsonschema fields.json
The new mdvs.toml reproduces the original [[fields.field]] definitions:
- Field types (strict: String is String, not a permissive set)
- Constraints (categories, min/max, min_length/max_length, pattern)
- Path-scoping (allowed, required)
- Preprocessor arrays (preprocess = ["coerce-to-string"], etc.)
- The [fields].ignore list
mdvs-specific metadata that JSON Schema 2020-12 doesn’t model is carried in x-mdvs.* extension keys; generic JSON Schema validators ignore them, and init --from-jsonschema reads them back.
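To see what a generic validator effectively works with, the following Python sketch removes every key that starts with x-mdvs from a schema document. The sample schema and its extension key names are invented for illustration; only the x-mdvs prefix and the top-level properties object are taken from this page, and a real export may be shaped differently.

```python
def strip_mdvs_extensions(node):
    """Return a copy of a schema fragment without keys that start with 'x-mdvs'."""
    if isinstance(node, dict):
        return {
            key: strip_mdvs_extensions(value)
            for key, value in node.items()
            if not key.startswith("x-mdvs")
        }
    if isinstance(node, list):
        return [strip_mdvs_extensions(item) for item in node]
    return node


# Hypothetical schema fragment; the extension key names are made up.
schema = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "type": "object",
    "properties": {
        "tags": {
            "type": "array",
            "items": {"type": "string"},
            "x-mdvs.required": ["daily/**"],
        }
    },
    "x-mdvs.ignore": ["cssclass"],
}

portable = strip_mdvs_extensions(schema)
```

Stripping is not required for validation, since unknown keywords are ignored anyway, but a stripped copy cannot be expected to round-trip through init --from-jsonschema with the mdvs-specific metadata intact.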
Examples
Export to stdout (pipeable)
mdvs export-jsonschema example_kb | jq '.properties | keys'
When writing to stdout, the summary banner is suppressed so the output is directly pipeable.
Export to a file
mdvs export-jsonschema example_kb --output-file fields.json
Exported schema for 37 field(s) → fields.json (json)
Export as TOML
mdvs export-jsonschema example_kb --format toml --output-file fields.toml
The TOML output is the same JSON Schema, serialized via the workspace tomljson crate. It’s interchangeable with the JSON form — init --from-jsonschema fields.toml produces the same result as the JSON file.
Errors
| Error | Cause |
|---|---|
| no mdvs.toml found | Config doesn’t exist; run mdvs init first |
| mdvs.toml is invalid | TOML parsing or schema error |
| failed to write | Output file path is not writable |
Recipes
Walkthroughs for pointing mdvs at common markdown ecosystems.
- Obsidian: YAML-frontmatter vaults, .mdvsignore patterns, Dataview caveats, common validation setups
- Hugo: Mixed-format sites (YAML / TOML / JSON), native TOML date queries, forced-format mode for opinionated repos
- CI: Running mdvs check in a pipeline as a frontmatter linter
Obsidian
mdvs works well with Obsidian vaults — it validates your YAML frontmatter for consistency and provides semantic search across all your notes. Everything runs locally, no external services needed. (Obsidian emits YAML; mdvs also handles TOML and JSON if you’ve imported notes from other tools — see the Hugo recipe for the mixed-format case.)
Setup
Point mdvs at your vault:
mdvs init path/to/vault
This scans all markdown files, infers a typed schema from your frontmatter, and writes mdvs.toml. If auto-build is enabled (the default), it also downloads the embedding model and builds the search index.
Two artifacts are created:
- mdvs.toml: commit this to version control
- .mdvs/: add to .gitignore (search index, can be rebuilt)
.gitignore
mdvs respects .gitignore by default. If your vault has .obsidian/ in .gitignore (many do), those files are automatically excluded from scanning. No extra configuration needed.
.mdvsignore
For additional exclusions, create a .mdvsignore file at the vault root. It uses the same syntax as .gitignore:
# AI working directories
.claude/
.gemini/
# Template files (if using Templater)
_templates/
# Attachments (no frontmatter)
attachments/
assets/
Any directory that doesn’t contain markdown with frontmatter is a good candidate for exclusion — it speeds up scanning and avoids noise in the schema.
Common frontmatter patterns
Obsidian vaults typically use frontmatter like:
---
title: My Note
tags: [project, research]
status: active
date: 2026-03-14
draft: false
---
mdvs infers types automatically:
| Field | Inferred type | Notes |
|---|---|---|
| title | String | |
| tags | Array(String) | Array of strings |
| status | String | |
| date | String | No Date type yet; dates are stored as strings |
| draft | Boolean | |
Inconsistent types
If the same field has different types across notes (e.g., priority is an integer in some files and a string like "high" in others), mdvs widens to the broadest compatible type — usually String. See Types & Widening for the full rules.
Dataview fields
If you use the Dataview plugin, its inline fields (e.g., key:: value) are not picked up by mdvs — only YAML frontmatter between --- fences is scanned. Dataview fields that appear in the YAML block are handled normally.
Validation
Once mdvs.toml exists, use check to verify your frontmatter:
mdvs check path/to/vault
This catches:
- Wrong types — a Boolean field with a string value
- Missing required fields — a field that should be present in certain directories
- Disallowed fields — a field appearing where it shouldn’t
- Null violations — null where it’s not allowed
See Validation for the full rules.
Tightening constraints
The inferred schema is permissive by default. To enforce stricter rules, edit mdvs.toml directly. For example, to require tags in all daily notes:
[[fields.field]]
name = "tags"
type = "Array(String)"
allowed = ["**"]
required = ["daily/**"]
nullable = false
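The allowed and required lists in the definition above can be read as a small decision procedure per file. The Python sketch below illustrates that reading under stated assumptions: it uses fnmatch, which only approximates mdvs's glob matching (its * also crosses directory separators), and the helper names are hypothetical. The violation names Disallowed and MissingRequired are the ones used elsewhere on this reference.

```python
from fnmatch import fnmatch


def matches_any(path, patterns):
    """True if path matches at least one glob (fnmatch approximation)."""
    return any(fnmatch(path, pattern) for pattern in patterns)


def check_field(path, has_field, allowed, required):
    """Return a violation name for one field in one file, or None."""
    if has_field and not matches_any(path, allowed):
        return "Disallowed"
    if not has_field and matches_any(path, required):
        return "MissingRequired"
    return None


allowed = ["**"]
required = ["daily/**"]

ok = check_field("daily/2026-03-14.md", True, allowed, required)
missing = check_field("daily/2026-03-15.md", False, allowed, required)
optional = check_field("projects/plan.md", False, allowed, required)
disallowed = check_field("blog/post.md", True, ["daily/**"], [])
```

With allowed = ["**"] the field may appear anywhere, so the only way to violate this definition is to omit tags from a file under daily/.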
Updating the schema
When you introduce new frontmatter fields, run update to incorporate them:
mdvs update path/to/vault
This discovers new fields and adds them to mdvs.toml without touching existing field definitions. Use the reinfer subcommand to re-infer specific fields if you’ve reorganized your vault.
Search
Build the index and search:
mdvs build path/to/vault
mdvs search "topic of interest" path/to/vault
Filter with --where on your frontmatter:
# Only active notes
mdvs search "topic" path/to/vault --where "status = 'active'"
# Notes with a specific tag
mdvs search "topic" path/to/vault --where "array_has(tags, 'research')"
# Notes in a specific directory
mdvs search "topic" path/to/vault --where "filepath LIKE 'projects/%'"
See the Search Guide for the full --where reference.
Tips
- Incremental builds: only notes whose body changed since the last build are re-embedded. Frontmatter-only changes (updating tags, status) don’t trigger re-embedding. Run mdvs build freely; it’s fast when nothing changed.
- Alongside Obsidian search: mdvs search is semantic (finds conceptually related notes), while Obsidian’s built-in search is keyword-based. They complement each other.
- Large vaults: mdvs has been tested on vaults with 500+ files and 2000+ chunks. A full build from scratch completes in under a second. Subsequent builds are incremental, re-embedding only changed files.
- Ignore noisy fields: if some frontmatter fields are auto-generated and you don’t want to validate them, add them to the ignore list in mdvs.toml:
[fields]
ignore = ["cssclass", "kanban-plugin"]
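The incremental-build tip above rests on one idea: hash a note's body separately from its frontmatter, and re-embed only when that hash changes. A minimal Python sketch of the idea follows. SHA-256 and the YAML-only frontmatter split are assumptions made for the example; this page does not specify how mdvs computes its content hash.

```python
import hashlib


def split_body(text):
    """Return the body of a markdown file, dropping a leading YAML '---' block."""
    lines = text.splitlines(keepends=True)
    if lines and lines[0].strip() == "---":
        for i in range(1, len(lines)):
            if lines[i].strip() == "---":
                return "".join(lines[i + 1:])
    return text


def body_hash(text):
    """Hash only the body, so frontmatter edits leave the hash unchanged."""
    return hashlib.sha256(split_body(text).encode("utf-8")).hexdigest()


original = "---\nstatus: draft\n---\n# Note\nSame body.\n"
retagged = "---\nstatus: active\n---\n# Note\nSame body.\n"
rewritten = "---\nstatus: active\n---\n# Note\nNew body.\n"

needs_reembed_after_retag = body_hash(original) != body_hash(retagged)
needs_reembed_after_rewrite = body_hash(original) != body_hash(rewritten)
```

Changing status from draft to active leaves the body hash untouched, so only the rewritten note would need a new embedding.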
Hugo
mdvs works directly on a Hugo site’s content/ tree. Hugo accepts YAML (---), TOML (+++), and JSON ({...}) frontmatter; mdvs accepts the same three formats and auto-detects per file, so it doesn’t matter which convention your site uses — or whether you’ve drifted across formats over time.
Setup
Point mdvs at the content/ directory:
mdvs init path/to/site/content
This scans every markdown file, infers a typed schema from the frontmatter (across all three formats), and writes mdvs.toml alongside. If auto-build is enabled (the default), it also downloads the embedding model and builds the search index under .mdvs/.
Two artifacts are created next to content/:
- mdvs.toml: commit to version control
- .mdvs/: add to .gitignore (search index, regenerable)
Some Hugo sites prefer to keep the schema and index alongside the site root rather than inside content/. In that case, run mdvs init . from the site root and use a glob:
[scan]
glob = "content/**"
Mixed-format vaults
Hugo’s docs show all three frontmatter formats interchangeably, and real-world sites often end up with a mix — an older --- post sitting next to a newer +++ post and an occasional {...} block emitted by a content tool. mdvs handles this transparently. A single mdvs.toml is inferred across all three formats; the same title, tags, draft fields collapse into one schema regardless of where they were written.
You can verify this with mdvs check after init:
$ mdvs check
Checked 142 files — no violations
Forcing a single format
If your site is opinionated about TOML (Hugo’s default for hugo new), tell mdvs:
[scan]
frontmatter_format = "toml"
Now any file that uses --- (YAML) or { (JSON) raises a FrontmatterUnrepresentable error during check, naming both the configured and detected delimiters. Useful when you want your CI to fail loudly if someone drops in a YAML post by accident.
Native TOML dates
Hugo’s TOML frontmatter often uses native Date / DateTime literals — unquoted, e.g.:
+++
title = "Launching v2"
date = 2024-09-01
publishedAt = 2024-09-01T09:00:00Z
+++
mdvs recognizes both as typed fields: date becomes FieldType::Date, publishedAt becomes FieldType::DateTime. No special configuration. You can then filter on them in search:
mdvs search "release notes" --where "publishedAt > '2024-01-01T00:00:00Z'"
Useful queries
Once the index is built, common Hugo-site workflows become one-liners:
Find drafts that have been sitting around:
mdvs search "" --where "draft = true" --output json
Posts in a particular taxonomy:
mdvs search "machine learning roundup" --where "array_has(tags, 'ml')"
Posts authored by a specific contributor in a date range:
mdvs search "authentication" \
--where "author = 'alice' AND date >= '2024-01-01' AND date < '2024-04-01'"
The --where clause is SQL against your frontmatter — anything you can express as a column reference works. See the Search Guide for details.
Validating across an editorial workflow
Add mdvs check to your Hugo build pipeline so frontmatter drift fails CI:
# .github/workflows/build.yml (excerpt)
- name: Validate frontmatter
  run: mdvs check
- name: Build site
  run: hugo --minify
mdvs check returns exit code 1 if any file violates the schema (missing required field, wrong type, etc.), which is enough to break the build. The exact same mdvs.toml validates YAML, TOML, and JSON files uniformly — no per-format duplicate rules.
See the CI recipe for a more general-purpose CI workflow.
CI
mdvs check exits with code 1 when any file violates the schema, so it slots straight into a CI pipeline as a frontmatter linter. This page covers the GitHub Actions case, but the same shape works on GitLab CI, CircleCI, or any runner that can install a binary and run a command.
Minimal GitHub Actions workflow
# .github/workflows/check-frontmatter.yml
name: Frontmatter check
on:
  push:
    branches: [main]
  pull_request:
jobs:
  check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install mdvs
        env:
          MDVS_VERSION: vX.Y.Z # pin to a specific release (see below)
        run: |
          curl --proto '=https' --tlsv1.2 -LsSf \
            "https://github.com/edochi/mdvs/releases/download/${MDVS_VERSION}/mdvs-installer.sh" | sh
          echo "$HOME/.cargo/bin" >> $GITHUB_PATH
      - name: Validate frontmatter
        run: mdvs check --no-update
Replace vX.Y.Z with a real release tag (see the releases page). This adds a check that runs on every PR and every push to main. If a contributor introduces a file with a wrong type, missing required field, disallowed field, or unrepresentable frontmatter, the job fails and the PR is blocked until it’s fixed.
Pin the mdvs version
The installer URL above pulls a specific release tag. GitHub also exposes a releases/latest/download/... URL that always redirects to the newest release — convenient for casual use, and that’s what the README install snippet uses — but in CI you want reproducibility. Pinning a specific tag means a green check today still passes (or still fails the same way) tomorrow, regardless of what mdvs ships in the meantime.
Bump the pinned version when you’re ready to adopt new validation behavior. The mdvs release notes call out anything that affects validation output.
--no-update for deterministic CI
The --no-update flag (or [check].auto_update = false in mdvs.toml) tells check to validate strictly against the committed schema instead of re-running inference first. This matters in CI:
- With auto-update on: a PR that adds a new frontmatter field will pass because check re-infers the schema and silently includes the new field. The unintended addition slips through.
- With --no-update: the same PR fails with a Disallowed violation because the new field isn’t in the committed mdvs.toml. The contributor has to either remove the field, add it to the schema deliberately, or add it to the ignore list, all of which surface the decision.
In practice this means: in CI, always use --no-update. Run mdvs update locally when you want to add new fields, commit the resulting mdvs.toml, and the CI run will then pass.
Caching the install
The installer step downloads a small binary (~6 MB on Linux) and finishes in well under a second. There’s usually no point caching it. If you want to avoid the network call entirely on every run, use actions/cache keyed on the mdvs version string, or commit a vendored binary into the repo and skip the install step.
What check does (and doesn’t)
mdvs check covers frontmatter validation only:
- ✓ Wrong types (a Boolean field with a string value)
- ✓ Missing required fields per directory
- ✓ Disallowed fields (anything not in mdvs.toml and not in ignore)
- ✓ Null violations
- ✓ Category, length, range, and regex constraint violations
- ✓ Frontmatter that can’t be parsed at all (broken YAML, broken TOML, broken JSON)
It does not check spelling, link validity, markdown style, or anything in the body content. Pair it with a markdown linter (markdownlint, vale) for those concerns. They run independently and have no conflict — mdvs check and a body-content linter cover orthogonal parts of the file.
Other CI systems
The shape translates directly:
- GitLab CI: the same two-step install-then-run pattern in .gitlab-ci.yml. Use the install script under before_script: and run mdvs check --no-update in the job.
- CircleCI: an orb or a custom step that installs the binary and invokes the check.
- Pre-commit hook: mdvs check --no-update as a hook entry in .pre-commit-config.yaml runs the check locally on every commit, catching issues before they reach CI.
The contract is always the same: install mdvs, run mdvs check --no-update, fail on non-zero exit.
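On a runner that has no native step syntax, the same contract can be wrapped in a few lines of Python. The sketch below is one way to do it, with a hypothetical helper name; it uses stand-in commands so it runs without mdvs installed, and in a real job the command would be ["mdvs", "check", "--no-update"].

```python
import subprocess
import sys


def run_linter(command):
    """Run a command and return (passed, exit_code); any non-zero exit fails."""
    completed = subprocess.run(command)
    return completed.returncode == 0, completed.returncode


# Stand-ins for a passing and a failing check, so the sketch is runnable anywhere.
passing = run_linter([sys.executable, "-c", "raise SystemExit(0)"])
failing = run_linter([sys.executable, "-c", "raise SystemExit(1)"])
```

A wrapper script would finish by calling sys.exit with the returned code so the surrounding pipeline sees the same failure mdvs reported.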
Configuration
All configuration lives in mdvs.toml, created by init and updated by update. This page is a complete reference of every section and field.
Sections overview
mdvs.toml has two groups of sections:
Validation (always present):
- [scan]: file discovery
- [check]: check command settings
- [fields]: field definitions and ignore list
Build & search (written by init, model/chunking filled by first build):
- [embedding_model]: model identity
- [chunking]: chunk sizing
- [build]: build workflow settings
- [search]: search defaults and auto-build/update
Global flags
These flags apply to all commands:
| Flag | Values | Default | Description |
|---|---|---|---|
| -o, --output | text, json | text | Output format |
| -v, --verbose | | | Show detailed output (pipeline steps, expanded records) |
| --logs | info, debug, trace | (none) | Enable diagnostic logging to stderr |
[scan]
Controls how markdown files are discovered.
[scan]
glob = "**"
include_bare_files = true
skip_gitignore = false
frontmatter_format = "auto"
| Field | Type | Default | Description |
|---|---|---|---|
| glob | String | "**" | Glob pattern for matching markdown files |
| include_bare_files | Boolean | true | Include files without frontmatter |
| skip_gitignore | Boolean | false | Don’t read .gitignore patterns during scan |
| frontmatter_format | String | "auto" | Which frontmatter format(s) to accept (see Frontmatter format) |
When include_bare_files is true, files without frontmatter participate in inference (empty field set) and validation (can trigger MissingRequired). When false, they’re excluded from the scan entirely.
Frontmatter format
mdvs accepts YAML, TOML, and JSON frontmatter. The frontmatter_format field takes one of four values:
| Value | Behavior |
|---|---|
| "auto" (default) | Detect per file from the opening delimiter. See the probe table below. |
| "yaml" | Parse every file as YAML; reject +++ or {-opened files with a clear error. |
| "toml" | Parse every file as TOML; reject --- or {-opened files. |
| "json" | Parse every file as JSON; reject --- or +++-opened files. |
In auto mode (the default), mdvs reads the first non-empty line of each file to pick the engine:
| First non-empty line of a file | Format used |
|---|---|
| --- | YAML |
| +++ | TOML |
| starts with { | JSON (Hugo convention; the braces are part of the JSON object) |
| anything else | treated as a bare file (no frontmatter) |
The probe is one line per file. A single vault can mix all three formats freely.
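The probe table translates almost directly into code. Here is a minimal Python sketch of it, assuming that surrounding whitespace on the first non-empty line is ignored and that the --- and +++ delimiters must match exactly; the function name detect_format is hypothetical and this is not mdvs's implementation.

```python
def detect_format(text):
    """Pick a frontmatter format from the first non-empty line, per the probe table."""
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # skip leading blank lines
        if stripped == "---":
            return "yaml"
        if stripped == "+++":
            return "toml"
        if stripped.startswith("{"):
            return "json"
        return "bare"  # first real line is not a delimiter
    return "bare"  # empty file: no frontmatter


samples = {
    "yaml": "---\ntitle: A\n---\nBody\n",
    "toml": "\n\n+++\ntitle = 'A'\n+++\nBody\n",
    "json": '{\n  "title": "A"\n}\nBody\n',
    "bare": "# Just a heading\n",
}
detected = {name: detect_format(text) for name, text in samples.items()}
```

Because only one line is inspected, the cost of auto-detection stays constant per file regardless of file size.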
The forced modes ("yaml" / "toml" / "json") skip the probe and assume every scanned file uses that format. Files whose actual leading delimiter belongs to a different format produce a FrontmatterUnrepresentable error naming both the configured and detected formats. This is useful for opinionated repos (e.g., a Hugo site committed to TOML that wants mdvs check to fail loudly if someone slips in a --- file).
Naming note. frontmatter_format = "toml" controls how mdvs parses frontmatter in .md files. It has nothing to do with mdvs.toml itself — mdvs.toml is always TOML because it’s a config file. Two unrelated uses of “TOML” in the project.
[update]
Placeholder for future update-specific settings. Currently empty — this section is hidden from mdvs.toml by default.
[check]
Check command settings.
[check]
auto_update = true
| Field | Type | Default | Description |
|---|---|---|---|
| auto_update | Boolean | false | Auto-run update before validating |
When auto_update is true, check runs the update pipeline (scan, infer, write config) before validating. Set to false or use --no-update for deterministic CI validation against the committed mdvs.toml.
[embedding_model]
Specifies the embedding model for semantic search. See Embedding for available models.
[embedding_model]
provider = "model2vec"
name = "minishlab/potion-base-8M"
| Field | Type | Default | Description |
|---|---|---|---|
| provider | String | "model2vec" | Embedding provider (currently only "model2vec") |
| name | String | "minishlab/potion-base-8M" | HuggingFace model ID |
| revision | String | (none) | Pin to a specific HuggingFace commit SHA for reproducibility |
The provider field can be omitted — it defaults to "model2vec". The revision field only appears when explicitly set (e.g., via build --set-revision).
Changing the model or revision after a build requires build --force to re-embed all files.
[chunking]
Controls semantic text splitting before embedding.
[chunking]
max_chunk_size = 1024
| Field | Type | Default | Description |
|---|---|---|---|
| max_chunk_size | Integer | 1024 | Maximum chunk size in characters |
The text splitter breaks each file’s body into semantic chunks respecting markdown structure (headings, paragraphs, lists). Changing the chunk size after a build requires build --force.
[build]
Build workflow settings.
[build]
auto_update = true
| Field | Type | Default | Description |
|---|---|---|---|
| auto_update | Boolean | false | Auto-run update before building |
When auto_update is true, build runs the update pipeline before building. Use --no-update to skip.
[search]
Settings for the search command, including how internal columns are named in --where queries.
[search]
default_limit = 10
| Field | Type | Default | Description |
|---|---|---|---|
| default_limit | Integer | 10 | Maximum results when --limit is not specified |
| internal_prefix | String | "" | Prefix for internal column names in --where queries |
| aliases | Map | {} | Per-column name overrides for internal columns |
| auto_update | Boolean | false | Auto-run update before building (when auto_build is true) |
| auto_build | Boolean | false | Auto-run build before searching |
Internal column names
Beyond your frontmatter fields, the search index stores bookkeeping columns that mdvs uses internally. These internal columns are available in --where queries:
| Column | Contains |
|---|---|
| filepath | Relative file path (e.g., blog/post.md) |
| file_id | Unique identifier for each file |
| chunk_text | The plain-text body of each chunk (useful for --where "chunk_text LIKE '%foo%'") |
| content_hash | Hash of the file body |
| built_at | Timestamp of last build |
(Other columns — chunk_id, chunk_index, start_line, end_line, embedding — exist too but are rarely useful in --where.)
By default, these are referenced by their raw names:
--where "filepath LIKE 'blog/%'"
If a frontmatter field name collides with an internal column name (e.g., you have a field called filepath), the search command will error and suggest resolutions:
- Set a prefix so internal columns are addressed with a leading marker in --where:
  [search]
  internal_prefix = "_"
  Now _filepath, _file_id, etc. refer to the internal columns in --where clauses, leaving the bare filepath free to mean your frontmatter field. (The on-disk column names don’t change, only how the --where translator interprets them.)
- Set a per-column alias to rename just the colliding column in --where:
  [search.aliases]
  filepath = "path"
  Now path refers to the internal filepath column, and bare filepath refers to your frontmatter field.
- Rename the frontmatter field in your markdown files.
Aliases take precedence over the prefix. See the Search Guide for full --where reference.
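As a rough model of how the prefix and aliases interact, the sketch below computes which names address internal columns in --where and which frontmatter fields still collide. It is a hypothetical illustration: the function names are invented, only the five columns from the table above are listed, and "aliases take precedence" is read as "an aliased column is addressed by its alias alone, without the prefix".

```python
INTERNAL_COLUMNS = ["filepath", "file_id", "chunk_text", "content_hash", "built_at"]


def where_names(internal_prefix="", aliases=None):
    """Map each name usable in --where to the internal column it addresses."""
    aliases = aliases or {}
    names = {}
    for column in INTERNAL_COLUMNS:
        # An alias, when present, wins over the prefix for that column.
        visible = aliases.get(column, internal_prefix + column)
        names[visible] = column
    return names


def collisions(frontmatter_fields, internal_prefix="", aliases=None):
    """Frontmatter field names that clash with a visible internal column name."""
    visible = where_names(internal_prefix, aliases)
    return sorted(set(frontmatter_fields) & set(visible))
```

With the defaults, a frontmatter field called filepath collides; setting either internal_prefix = "_" or an alias for filepath clears the collision.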
[fields]
Defines field constraints and the ignore list. This is the largest section — it contains one [[fields.field]] entry per constrained field.
Ignore list
[fields]
ignore = ["internal_id", "temp_notes"]
Fields in the ignore list are known but unconstrained — they skip all validation and are not reported as new fields by check or update. A field cannot be in both ignore and [[fields.field]].
Field definitions
Each [[fields.field]] entry defines constraints on a frontmatter field:
[[fields.field]]
name = "title"
type = "String"
allowed = ["blog/**", "projects/**"]
required = ["blog/**", "projects/**"]
nullable = false
| Field | Type | Default | Description |
|---|---|---|---|
| name | String | (required) | Frontmatter key |
| type | FieldType | "String" | Expected value type |
| allowed | Array(String) | ["**"] | Glob patterns where the field may appear |
| required | Array(String) | [] | Glob patterns where the field must be present |
| nullable | Boolean | true | Whether null values are accepted |
| constraints | Table | (absent) | Optional value constraints (see Constraints) |
| preprocess | Array(String) | [] | Stage 2 value preprocessors (see Preprocessors) |
All fields except name have permissive defaults. A minimal entry with just a name:
[[fields.field]]
name = "title"
is equivalent to:
[[fields.field]]
name = "title"
type = "String"
allowed = ["**"]
required = []
nullable = true
This is not the same as putting the field in the ignore list. Both prevent the field from being reported as new during update, but a [[fields.field]] entry tracks the field — it appears in info output with its type and patterns, and can be targeted by update reinfer. The ignore list simply silences the field: no validation, no detail in info.
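The equivalence between the minimal and the expanded entry amounts to filling in defaults. A minimal sketch, assuming a field entry has already been parsed into a Python dict; `FIELD_DEFAULTS` and `expand_field` are hypothetical names, and only the keys shown in the expanded example are filled in:

```python
import copy

# Defaults taken from the expanded entry shown above.
FIELD_DEFAULTS = {
    "type": "String",
    "allowed": ["**"],
    "required": [],
    "nullable": True,
}


def expand_field(entry):
    """Return a copy of a field entry with every omitted key set to its default."""
    if "name" not in entry:
        raise ValueError("a [[fields.field]] entry needs a name")
    # Deep-copy so callers never share the default lists.
    return {**copy.deepcopy(FIELD_DEFAULTS), **entry}
```

Keys present in the entry always win, so a partially specified entry keeps its explicit values and only gains the missing ones.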
Type syntax
Scalar types are plain strings:
type = "String" # also: "Boolean", "Integer", "Float", "Date", "DateTime"
Date and DateTime accept RFC 3339 values only (YYYY-MM-DD for Date, YYYY-MM-DDTHH:MM:SS[.frac]<Z|±HH:MM> for DateTime). See Date and DateTime for the exact accepted shapes and storage semantics.
Arrays use a function-style string:
type = "Array(String)"
Structured types are not supported on disk. Nested Objects in frontmatter are expressed via dotted-name leaf fields — see Types for the flattening rule. Arrays of structured items (Array(Object{...})) have no first-class representation in v0; use parallel scalar arrays as a workaround:
# Instead of an unsupported Array(Object{timestamp, value}):
[[fields.field]]
name = "measurement_timestamps"
type = "Array(String)"
[[fields.field]]
name = "measurement_values"
type = "Array(Float)"
The valid type grammar is:
Type := Scalar | Array(Scalar)
Scalar := String | Integer | Float | Boolean | Date | DateTime
See Types for the full type system, including widening rules.
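The grammar is small enough to check with a few lines of code. The following is a sketch of a validator for type strings under that grammar; `parse_type` is a hypothetical helper, not part of mdvs.

```python
import re

SCALARS = ("String", "Integer", "Float", "Boolean", "Date", "DateTime")
_ARRAY = re.compile(r"Array\((\w+)\)")


def parse_type(text):
    """Return (scalar_name, is_array) for a valid type string, else raise ValueError."""
    if text in SCALARS:
        return text, False
    match = _ARRAY.fullmatch(text)
    # Only Array(Scalar) is valid: nested arrays and objects are rejected.
    if match and match.group(1) in SCALARS:
        return match.group(1), True
    raise ValueError(f"unsupported type: {text!r}")
```

For example, parse_type("Array(Float)") yields ("Float", True), while "Array(Array(String))" and "Object" both raise.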
Path patterns
allowed and required are lists of glob patterns matched against relative file paths:
allowed = ["blog/**", "projects/alpha/**"]
required = ["blog/published/**"]
Patterns must end with /* (direct children) or /** (full subtree), or be exactly * or **. Bare paths like blog or file names like blog/post.md are not valid.
The invariant required ⊆ allowed is enforced — every required glob must be covered by some allowed glob. For example, allowed = ["meetings/**"] covers required = ["meetings/all-hands/**"] because any path matching the required pattern also matches the allowed one.
See Schema Inference for how these patterns are computed.
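To make the pattern rules and the required ⊆ allowed invariant concrete, here is a sketch of a coverage check. It assumes the directory part of a pattern is a literal path with no wildcards of its own, and all function names are hypothetical.

```python
def parse_pattern(pattern):
    """Split a path pattern into (directory, deep); deep means '**' (full subtree)."""
    if pattern in ("*", "**"):
        return "", pattern == "**"
    if pattern.endswith("/**"):
        return pattern[:-3], True
    if pattern.endswith("/*"):
        return pattern[:-2], False
    # Bare paths ("blog") and file names ("blog/post.md") are not valid patterns.
    raise ValueError(f"invalid path pattern: {pattern!r}")


def covers(allowed, required):
    """True if every path matching `required` also matches `allowed`."""
    a_dir, a_deep = parse_pattern(allowed)
    r_dir, r_deep = parse_pattern(required)
    if a_deep:
        # A subtree pattern covers its own directory and everything below it.
        return a_dir == "" or r_dir == a_dir or r_dir.startswith(a_dir + "/")
    # A direct-children pattern only covers the same direct-children pattern.
    return not r_deep and r_dir == a_dir


def required_within_allowed(allowed, required):
    """Check the invariant: each required glob is covered by some allowed glob."""
    return all(any(covers(a, r) for a in allowed) for r in required)
```

Under these rules, meetings/** covers meetings/all-hands/**, but blog/* does not cover blog/** because the subtree reaches files the direct-children pattern never matches.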
Constraints
The optional [fields.field.constraints] sub-table adds value constraints beyond type checking.
categories — restricts values to an enumerated set (String, Integer, or arrays of either):
[[fields.field]]
name = "status"
type = "String"
[fields.field.constraints]
categories = ["active", "archived", "completed", "draft", "published"]
min / max — restricts numeric values to an inclusive range (Integer, Float, or arrays of either). Both bounds are optional:
[[fields.field]]
name = "rating"
type = "Integer"
[fields.field.constraints]
min = 1
max = 5
min_length / max_length — bounds string length (Unicode scalar count) or array length:
[[fields.field]]
name = "slug"
type = "String"
[fields.field.constraints]
min_length = 3
max_length = 64
pattern — regex applied to string values, compiled at config load:
[[fields.field]]
name = "version"
type = "String"
[fields.field.constraints]
pattern = '^v\d+\.\d+\.\d+$'
Categories are auto-inferred during init and update reinfer. Range constraints are not auto-inferred but can be inferred on demand with update reinfer <field> --with=range. Length and pattern are not auto-inferred — add them by hand. See Constraints for the full reference.
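The four constraint kinds can be read as simple predicates over a value. The sketch below checks one value against a constraints table parsed into a dict. It is an illustration only: the function names are hypothetical, the regex is applied with an unanchored search (an assumption; the example pattern above carries its own anchors), and for arrays the length bounds are applied to the array while the other constraints are applied per element.

```python
import re


def length_problems(length, constraints, what):
    """Check min_length / max_length against an already-computed length."""
    problems = []
    if "min_length" in constraints and length < constraints["min_length"]:
        problems.append(f"{what} {length} is below min_length {constraints['min_length']}")
    if "max_length" in constraints and length > constraints["max_length"]:
        problems.append(f"{what} {length} is above max_length {constraints['max_length']}")
    return problems


def scalar_problems(value, constraints):
    """Check one scalar value; returns a list of violation messages."""
    problems = []
    if "categories" in constraints and value not in constraints["categories"]:
        problems.append(f"{value!r} is not one of the allowed categories")
    if "min" in constraints and value < constraints["min"]:
        problems.append(f"{value!r} is below min {constraints['min']}")
    if "max" in constraints and value > constraints["max"]:
        problems.append(f"{value!r} is above max {constraints['max']}")
    if isinstance(value, str):
        # len() on a Python str counts code points.
        problems += length_problems(len(value), constraints, "string length")
        pattern = constraints.get("pattern")
        if pattern is not None and not re.search(pattern, value):
            problems.append(f"{value!r} does not match pattern {pattern!r}")
    return problems


def value_problems(value, constraints):
    """Check a scalar or an array value against a constraints table."""
    if isinstance(value, list):
        problems = length_problems(len(value), constraints, "array length")
        per_item = {k: v for k, v in constraints.items()
                    if k not in ("min_length", "max_length")}
        for item in value:
            problems += scalar_problems(item, per_item)
        return problems
    return scalar_problems(value, constraints)
```

An empty result means the value satisfies every constraint; both range bounds are inclusive, matching the min / max description above.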
Preprocessors
The optional preprocess array on a field declares value transformations that run before validation. Two built-in stages:
| Stage | Applies to | Effect |
|---|---|---|
| coerce-to-string | String, Array(String) | Serialize non-string JSON values to their JSON string form before validation |
| widen-int-to-float | Float, Array(Float) | Treat integer values as their float equivalent |
[[fields.field]]
name = "priority"
type = "String"
preprocess = ["coerce-to-string"]
Preprocessors are auto-inferred during init and update reinfer based on observed type-widening events: a field that widened to String because of mixed-type observations gets coerce-to-string; a Float field that observed integers gets widen-int-to-float. An empty preprocess array means strict validation — no coercion.
Each entry must be applicable to the field’s type, and duplicates are rejected at config load. See Types & Widening for the full rules.
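The effect of the two stages on individual values can be sketched as follows. This is an approximation in Python rather than mdvs's own logic: the function names and the `is_array` flag are hypothetical, and "JSON string form" is taken to mean the output of a standard JSON serializer.

```python
import json


def coerce_to_string(value):
    """Leave strings alone; serialize any other JSON value to its JSON text."""
    if isinstance(value, str):
        return value
    return json.dumps(value)


def widen_int_to_float(value):
    """Turn integers into floats; leave everything else untouched."""
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, int) and not isinstance(value, bool):
        return float(value)
    return value


STAGES = {
    "coerce-to-string": coerce_to_string,
    "widen-int-to-float": widen_int_to_float,
}


def preprocess(value, stages, is_array=False):
    """Run the named stages in order; for array fields, apply them per element."""
    for name in stages:
        stage = STAGES[name]
        value = [stage(item) for item in value] if is_array else stage(value)
    return value
```

With an empty stage list the value comes back unchanged, which corresponds to strict validation.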
Inference thresholds
Two optional fields in [fields] control categorical auto-inference:
[fields]
max_categories = 10
min_category_repetition = 3
| Field | Type | Default | Description |
|---|---|---|---|
| max_categories | Integer | 10 | Max distinct values for a field to be inferred as categorical |
| min_category_repetition | Integer | 3 | Min average repetition (occurrences / distinct) for categorical inference |
These are hidden from mdvs.toml when set to their defaults. They only affect auto-inference — manually written categories are unaffected.
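Read together, the two thresholds describe a simple test on the observed values of a field. A minimal sketch, assuming the values have been collected into a flat list and that both thresholds are inclusive; `is_categorical` is a hypothetical name:

```python
from collections import Counter


def is_categorical(values, max_categories=10, min_category_repetition=3):
    """Decide whether a field's observed values look like a closed set of categories."""
    distinct = len(Counter(values))
    if distinct == 0 or distinct > max_categories:
        return False
    # Average repetition = occurrences / distinct values.
    return len(values) / distinct >= min_category_repetition
```

A status field with two distinct values seen nine times in total averages 4.5 repetitions and qualifies; a title field with twenty unique values fails the max_categories test.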
Example
A representative subset from example_kb/mdvs.toml (37 fields total, 4 shown):
[scan]
glob = "**"
include_bare_files = true
skip_gitignore = false
[embedding_model]
provider = "model2vec"
name = "minishlab/potion-base-8M"
[chunking]
max_chunk_size = 1024
[search]
default_limit = 10
[fields]
ignore = []
[[fields.field]]
name = "title"
type = "String"
allowed = ["blog/**", "meetings/**", "people/**", "projects/**", "reference/protocols/**"]
required = ["blog/**", "meetings/**", "people/**", "projects/**", "reference/protocols/**"]
nullable = false
[[fields.field]]
name = "tags"
type = "Array(String)"
allowed = ["blog/**", "projects/alpha/*", "projects/alpha/notes/**", "projects/archived/**", "projects/beta/*", "projects/beta/notes/**"]
required = ["blog/published/**", "projects/alpha/notes/**", "projects/archived/**", "projects/beta/notes/**"]
nullable = false
[[fields.field]]
name = "drift_rate"
type = "Float"
allowed = ["projects/alpha/notes/**"]
required = ["projects/alpha/notes/**"]
nullable = true
# Nested YAML (calibration.baseline.wavelength, etc.) is expressed as
# one [[fields.field]] per leaf — see Types.
[[fields.field]]
name = "calibration.baseline.wavelength"
type = "Float"
allowed = ["projects/alpha/notes/**"]
required = []
nullable = false
Search Guide
The --where flag on search lets you filter results by frontmatter fields using SQL syntax. The filter is combined with similarity ranking in a single query — files that don’t match are excluded before results are returned.
Under the hood, mdvs hands the clause to LanceDB’s SQL filter, which is built on top of DataFusion — so any expression valid in DataFusion’s SQL dialect works in --where.
Limitation. --where clauses that reference an Array(Float) field (e.g. measurement_values) are rejected up front, because the underlying search engine can’t safely decode them and crashes on read. mdvs catches this before the query runs and returns a clear error. Filter on a scalar field, or store the data as a parallel array of strings, instead.
Scalar fields
Use bare field names for simple comparisons:
String
mdvs search "experiment" --where "status = 'active'"
mdvs search "experiment" --where "author = 'Giulia Ferretti'"
mdvs search "experiment" --where "status IN ('active', 'archived')"
mdvs search "experiment" --where "title LIKE '%sensor%'"
Numeric
mdvs search "experiment" --where "sample_count > 20"
mdvs search "experiment" --where "drift_rate >= 0.01 AND drift_rate <= 0.05"
mdvs search "experiment" --where "wavelength_nm BETWEEN 600 AND 800"
Searched "experiment" — 2 hits
┌──────────────────────────┬───────────────────────────────────────────────────┐
│ query │ experiment │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ model │ minishlab/potion-base-8M │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ limit │ 10 │
└──────────────────────────┴───────────────────────────────────────────────────┘
┌ #1 ──────────────────────┬───────────────────────────────────────────────────┐
│ file │ projects/alpha/notes/experiment-3.md │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ score │ 0.420 │
...
...
Boolean
mdvs search "announcement" --where "draft = false"
mdvs search "ideas" --where "draft = true"
Null checks
mdvs search "notes" --where "drift_rate IS NOT NULL"
mdvs search "notes" --where "review_score IS NULL"
Combining conditions
Use AND, OR, and NOT to build compound filters:
mdvs search "experiment" --where "status = 'active' AND priority = 1"
mdvs search "notes" --where "author = 'REMO' OR author = 'Marco Bianchi'"
mdvs search "notes" --where "NOT status = 'archived'"
Date and DateTime
Fields typed as Date (Arrow Date32) and DateTime (Arrow Timestamp(Millisecond, UTC)) support native date arithmetic, comparisons, and the usual SQL date functions. Auto-inferred from RFC 3339 strings — see Date and DateTime for the type itself.
Direct comparison
mdvs search "researcher" --where "joined > '2024-01-01'"
mdvs search "meeting" --where "date < '2032-01-01'"
mdvs search "calibration" --where "synced_at >= '2024-04-01T00:00:00Z'"
DateTime offsets are normalized to UTC at storage time, so 2024-04-02T16:14:30+02:00 (in a YAML file) and 2024-04-02T14:14:30Z (in a --where clause) compare as the same absolute moment.
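The equivalence of the two timestamps can be checked with Python's standard library. This only illustrates UTC normalization in general; `to_utc` is a hypothetical helper and says nothing about how mdvs stores values internally.

```python
from datetime import datetime, timezone


def to_utc(text):
    """Parse an RFC 3339 timestamp and express it in UTC."""
    # A trailing "Z" is only accepted by fromisoformat() on Python 3.11+,
    # so rewrite it as an explicit offset first.
    parsed = datetime.fromisoformat(text.replace("Z", "+00:00"))
    return parsed.astimezone(timezone.utc)


stored = to_utc("2024-04-02T16:14:30+02:00")   # value from a YAML file
literal = to_utc("2024-04-02T14:14:30Z")       # value from a --where clause
same_moment = stored == literal
```

Both values normalize to 14:14:30 UTC on 2024-04-02, so a comparison sees them as the same moment.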
Range filters (BETWEEN)
mdvs search "meeting" --where "date BETWEEN '2031-09-01' AND '2031-11-30'"
mdvs search "report" --where "joined BETWEEN '2023-01-01' AND '2024-12-31'"
Date functions (EXTRACT, date_part)
Both extract numeric components from Date and DateTime. Two equivalent syntaxes:
mdvs search "meeting" --where "EXTRACT(YEAR FROM date) = 2031"
mdvs search "meeting" --where "date_part('year', date) = 2031"
mdvs search "meeting" --where "EXTRACT(MONTH FROM date) = 10"
mdvs search "calibration" --where "EXTRACT(YEAR FROM synced_at) = 2024 AND EXTRACT(MONTH FROM synced_at) <= 3"
Date arithmetic with INTERVAL
The SQL engine supports adding/subtracting intervals to dates and datetimes.
# Joined within the last 2 years (relative to a cutoff date)
mdvs search "researcher" --where "joined > CAST('2032-01-01' AS DATE) - INTERVAL '2 years'"
# Datetime offset by days
mdvs search "experiment" \
--where "synced_at < CAST('2024-04-15T00:00:00Z' AS TIMESTAMP) - INTERVAL '7 days'"
CAST('...' AS DATE) and CAST('...' AS TIMESTAMP) are usually needed for string literals on the right side of the arithmetic — the SQL type inference doesn’t always pick the date/timestamp type automatically.
Date subtraction (days between)
Subtracting two Date values returns a number of days (an integer):
# People who joined more than 365 days before a cutoff
mdvs search "researcher" --where "CAST('2032-01-01' AS DATE) - joined > 365"
Null checks
Date and DateTime columns support standard null predicates, including for fields scoped to a subset of directories (rows outside the scope have null values for that column):
mdvs search "protocol" --where "last_reviewed IS NOT NULL"
mdvs search "experiment" \
--where "drift_rate IS NULL AND filepath LIKE 'projects/alpha/notes/%'"
Combining with other filters
Date filters compose freely with the rest of the language — string compare, IN, LIKE, dotted-leaf access, array operations, and search ranking:
# Blog posts in 2031 H2 by specific authors
mdvs search "research" \
--where "filepath LIKE 'blog/published/%' AND author IN ('Marco Bianchi', 'Giulia Ferretti') AND date BETWEEN '2031-07-01' AND '2031-12-31'"
# High-or-medium priority experiments with baseline > 700nm synced in 2024
mdvs search "experiment SPR" \
--where "(priority = 'high' OR priority = 'medium') AND calibration.baseline.wavelength > 700 AND EXTRACT(YEAR FROM synced_at) = 2024"
Array fields
Fields typed as Array(String) (like tags, attendees, action_items) support array functions.
Containment
mdvs search "calibration" --where "array_has(tags, 'calibration')"
Searched "calibration" — 4 hits
┌──────────────────────────┬───────────────────────────────────────────────────┐
│ query │ calibration │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ model │ minishlab/potion-base-8M │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ limit │ 10 │
└──────────────────────────┴───────────────────────────────────────────────────┘
┌ #1 ──────────────────────┬───────────────────────────────────────────────────┐
│ file │ projects/alpha/notes/experiment-1.md │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ score │ 0.478 │
...
...
The SQL-standard ANY syntax also works:
mdvs search "calibration" --where "'calibration' = ANY(tags)"
Multiple tags
Combine with AND to require multiple values:
mdvs search "calibration" --where "array_has(tags, 'calibration') AND array_has(tags, 'SPR-A1')"
Array length
mdvs search "meeting" --where "array_length(action_items) > 2"
Filtering by file path
Filter results by file path using the filepath column:
mdvs search "experiment" --where "filepath LIKE 'projects/alpha/%'"
Searched "experiment" — 8 hits
┌──────────────────────────┬───────────────────────────────────────────────────┐
│ query │ experiment │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ model │ minishlab/potion-base-8M │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ limit │ 10 │
└──────────────────────────┴───────────────────────────────────────────────────┘
┌ #1 ──────────────────────┬───────────────────────────────────────────────────┐
│ file │ projects/alpha/notes/experiment-3.md │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ score │ 0.420 │
...
...
File paths are stored as relative paths (e.g., projects/alpha/notes/experiment-1.md), so use LIKE with % for path prefix matching:
# All blog posts
--where "filepath LIKE 'blog/%'"
# Only published blog posts
--where "filepath LIKE 'blog/published/%'"
# Files in any meetings directory
--where "filepath LIKE '%/meetings/%'"
Nested objects
Fields typed as Object (like calibration in example_kb) are stored as nested Struct columns. Access nested values with bracket notation:
mdvs search "sensor" --where "calibration['baseline']['wavelength'] > 600"
Searched "sensor" — 2 hits
┌──────────────────────────┬───────────────────────────────────────────────────┐
│ query │ sensor │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ model │ minishlab/potion-base-8M │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ limit │ 10 │
└──────────────────────────┴───────────────────────────────────────────────────┘
┌ #1 ──────────────────────┬───────────────────────────────────────────────────┐
│ file │ projects/alpha/notes/experiment-2.md │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ score │ 0.414 │
...
...
The top-level field name (calibration) can be used bare. Only the nested access needs brackets:
# These are equivalent:
--where "calibration['baseline']['wavelength'] > 600"
--where "_data['calibration']['baseline']['wavelength'] > 600"
Field names with special characters
Some field names need quoting in SQL. The init, update, and info commands show hints in their output when this applies.
Spaces
Double-quote the field name:
mdvs search "query" --where "\"lab section\" = 'optics'"
Single quotes in field names
Also use double-quoting:
mdvs search "query" --where "\"author's_note\" IS NOT NULL"
Double quotes in field names
Double the double quotes inside the identifier:
mdvs search "query" --where "\"notes\"\"v2\"\" = true"
String values with special characters
To include a literal single quote inside a string value, double it:
mdvs search "query" --where "title = 'What''s New?'"
mdvs validates quote balance before running the query. If you see “unmatched single quote”, check that every ' in a value is doubled.
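When a --where clause is assembled by a script, the two doubling rules are easy to apply mechanically. The helpers below are a hypothetical sketch (they are not part of mdvs) that build a quoted identifier and a quoted string literal:

```python
def sql_identifier(name):
    """Double-quote a field name, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def sql_string(value):
    """Single-quote a string value, doubling any embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


# Build the clause from the examples above.
clause = f"{sql_identifier('lab section')} = {sql_string('optics')}"
```

If the clause is then handed to the CLI from Python as a separate argument in an argument list (for instance with subprocess.run), no shell is involved and the backslash-escaping shown in the shell examples above is not needed.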
Tips
- Case sensitivity: field names and string values are case-sensitive. Use LOWER() for case-insensitive matching:
  --where "LOWER(author) = 'giulia ferretti'"
- LIKE patterns: % matches any sequence, _ matches a single character:
  --where "title LIKE 'Project%'" # starts with "Project"
  --where "title LIKE '%sensor%'" # contains "sensor"
- NULL semantics: a comparison against NULL evaluates to NULL rather than true, so it never matches a row. Use IS NULL / IS NOT NULL, not = NULL.
- No aggregates in --where: functions like COUNT() or SUM() don’t work in --where, because the filter applies per file, not across results.
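The NULL behaviour is standard SQL three-valued logic and can be observed in any SQL engine. The sketch below uses SQLite from Python's standard library purely as a stand-in; it does not run mdvs or its search engine, and the table and column names are made up for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (name TEXT, review_score INTEGER)")
conn.executemany(
    "INSERT INTO notes VALUES (?, ?)",
    [("a", 4), ("b", None)],  # note "b" has no review_score
)


def names(where):
    """Return the names of rows that pass the given filter."""
    query = f"SELECT name FROM notes WHERE {where} ORDER BY name"
    return [row[0] for row in conn.execute(query)]


equals_null = names("review_score = NULL")    # comparison yields NULL: no rows
is_null = names("review_score IS NULL")       # the intended predicate
not_equal = names("review_score <> 4")        # NULL row is excluded here too
```

The last query shows the less obvious consequence: a row whose value is NULL is dropped by <> as well as by =, so add an explicit OR ... IS NULL when those rows should be kept.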