Introduction

mdvs treats your markdown directory like a database. It scans your files, infers a typed schema from frontmatter, validates it, and builds a local search index — all in a single binary with no external services.

Not a document database. A database for documents.

The challenge

Markdown directories grow organically. You start with a few notes, add frontmatter when it’s useful, and eventually have hundreds of files with inconsistent metadata. Tags are misspelled. Required fields are missing. You can’t find anything without grep.

mdvs gives you structure without forcing you to change how you write.

Frontmatter

Frontmatter is the YAML block between --- fences at the top of a markdown file. It stores structured metadata alongside your content:

---
title: "Experiment A-017: SPR-A1 baseline calibration"    # String
status: completed                                         # String
author: Federica Bianchi                                  # String
draft: false                                              # Boolean
priority: 2                                               # Integer
drift_rate: 0.023                                         # Float
tags:                                                     # Array(String)
  - calibration
  - SPR-A1
  - baseline
---
# Your markdown content starts here...

mdvs recognizes these types automatically. When it scans your files, it infers the type of each field from the values it finds — no configuration needed.
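As a rough illustration of per-value inference, here is a minimal Python sketch. It is not how mdvs is implemented (mdvs ships as a compiled binary); the infer_type helper is hypothetical, and the sketch assumes the frontmatter has already been parsed into plain Python values.

```python
def infer_type(value):
    """Map one parsed frontmatter value to a type name (illustrative only)."""
    # bool must be tested before int: in Python, True is also an int.
    if isinstance(value, bool):
        return "Boolean"
    if isinstance(value, int):
        return "Integer"
    if isinstance(value, float):
        return "Float"
    if isinstance(value, str):
        return "String"
    if isinstance(value, list):
        # Assumption for this sketch: look at the first element only,
        # and treat an empty list as Array(String).
        element = infer_type(value[0]) if value else "String"
        return f"Array({element})"
    raise TypeError(f"unsupported value: {value!r}")


# Values taken from the frontmatter example above.
frontmatter = {
    "title": "Experiment A-017: SPR-A1 baseline calibration",
    "status": "completed",
    "draft": False,
    "priority": 2,
    "drift_rate": 0.023,
    "tags": ["calibration", "SPR-A1", "baseline"],
}
inferred = {name: infer_type(value) for name, value in frontmatter.items()}
```

The only subtle point is ordering: checking for Boolean before Integer keeps draft: false from being read as a number.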

TOML (+++) and JSON ({...}) frontmatter are also supported, auto-detected per file. This guide uses YAML throughout; see [scan].frontmatter_format for the format knob and the Hugo recipe for mixed-format vaults.

Directory-aware schema

mdvs infers a three-dimensional schema from your files:

Types — boolean, integer, float, string, arrays, nested objects. Inferred automatically, with widening when files disagree.
Paths — which fields belong in which directories. draft only in blog/, sensor_type only in projects/alpha/notes/. Captured as allowed and required glob patterns.
Nullability — whether a field can be null. Tracked per field.

This means different directories can have different fields with different constraints — all inferred automatically from your existing files.

Tightest fit: mdvs init infers the strictest schema that’s consistent with your existing files. A field is inferred as allowed in a directory if at least one file there has it. It’s inferred as required if every file there has it. These rules propagate up — if every subdirectory requires a field, the parent directory does too. The result is the tightest set of constraints where check still returns zero violations. You can always loosen them later.
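The allowed and required rules can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not mdvs's algorithm: it looks only at the files directly inside each directory, skips the upward propagation step, and does not merge sibling globs. The infer_paths helper, the file names, and their field sets are all hypothetical.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def infer_paths(files):
    """files: {relative path: set of field names} -> {field: allowed/required globs}."""
    by_dir = defaultdict(list)
    for path, fields in files.items():
        by_dir[str(PurePosixPath(path).parent)].append(fields)

    schema = defaultdict(lambda: {"allowed": [], "required": []})
    for directory in sorted(by_dir):
        field_sets = by_dir[directory]
        glob = "**" if directory == "." else f"{directory}/**"
        for field in sorted(set().union(*field_sets)):
            # Allowed: at least one file in this directory has the field.
            schema[field]["allowed"].append(glob)
            # Required: every file in this directory has the field.
            if all(field in fields for fields in field_sets):
                schema[field]["required"].append(glob)
    return dict(schema)


# Hypothetical vault contents, for illustration only.
files = {
    "blog/post-a.md": {"title", "draft"},
    "blog/post-b.md": {"title", "draft"},
    "people/remo.md": {"title", "firmware_version"},
    "people/other.md": {"title"},
}
schema = infer_paths(files)
```

In this toy vault, draft comes out as both allowed and required in blog/**, while firmware_version is allowed in people/** but not required there, because one file in that directory lacks it.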

Two layers

mdvs has two distinct capabilities that work independently:

Validation — Scan your files, infer what frontmatter fields exist, which directories they appear in, and what types they have. Write the result to mdvs.toml. Then validate files against that schema. No model, no index, nothing to download.

Search — Chunk your markdown, embed it with a lightweight local model, store the chunks and vectors in a Lance dataset under .mdvs/, and query with natural language. Choose semantic (vector), full-text (BM25), or hybrid (both, reranked) — and filter results on any frontmatter field using standard SQL.

You need validation without search? Run mdvs init, customize the fields in mdvs.toml, and run mdvs check.

You want search without validation? Just run mdvs init and mdvs search. The inferred schema is used to extract metadata for search results, but you don’t have to worry about it if you don’t want to.

Use them together for the best experience, or separately if that’s what you need.

Using a nested directory of markdown files as a database

You can think of mdvs as a layer on top of your markdown files that gives you database-like capabilities. Here’s a rough mapping of concepts and commands:

Concept	Database	mdvs
Define structure	`CREATE TABLE`	`mdvs init`
Per-table columns	Different columns per table	Per-directory fields via `allowed`/`required` globs
Enforce constraints	Constraint validation	`mdvs check`
Evolve structure	`ALTER TABLE`	`mdvs update`
Create an index	`CREATE INDEX`	`mdvs build`
Query	`SELECT ... WHERE ... ORDER BY`	`mdvs search --where`

Two artifacts: mdvs.toml (your schema, to be committed) and .mdvs/ (the search index, can be ignored by version control).

What this book covers

This book uses a fictional research lab knowledge base (example_kb) as a running example. Every command, every output, every query is real and reproducible.

Getting Started — Install mdvs and run it on the example vault
Concepts — How schema inference, types, and validation work
Commands — Full reference for all 8 commands
Configuration — The mdvs.toml file explained
Search Guide — SQL filtering, array queries, and ranking
Recipes — Obsidian setup, CI integration

Getting Started

Install mdvs, point it at a real directory, and run your first search, all in under five minutes.

Install

cargo install mdvs

You need a working Rust toolchain. Prebuilt binaries will be available once the crate is published.

Get the example files

This book uses a fixture called example_kb — a fictional research lab’s knowledge base with ~46 markdown files, varied frontmatter, and a few deliberate inconsistencies. Clone the repo to follow along:

git clone https://github.com/edochi/mdvs.git
cd mdvs

Initialize

Run mdvs init on the example directory:

mdvs init example_kb

mdvs scans every markdown file, extracts frontmatter, and infers a typed schema. Each discovered field is shown as its own key-value table:

Initialized 43 files — 37 field(s)

┌ draft ───────────────────┬───────────────────────────────────────────────────┐
│ type                     │ Boolean                                           │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ files                    │ 8 out of 43                                       │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ nullable                 │ false                                             │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ required                 │ blog/**                                           │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ allowed                  │ blog/**                                           │
└──────────────────────────┴───────────────────────────────────────────────────┘

...

┌ sensor_type ─────────────┬───────────────────────────────────────────────────┐
│ type                     │ String                                            │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ files                    │ 3 out of 43                                       │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ nullable                 │ false                                             │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ required                 │ projects/alpha/notes/**                           │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ allowed                  │ projects/alpha/notes/**                           │
└──────────────────────────┴───────────────────────────────────────────────────┘

...

┌ title ───────────────────┬───────────────────────────────────────────────────┐
│ type                     │ String                                            │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ files                    │ 37 out of 43                                      │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ nullable                 │ false                                             │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ required                 │ blog/**                                           │
│                          │ meetings/**                                       │
│                          │ people/**                                         │
│                          │ projects/**                                       │
│                          │ reference/protocols/**                            │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ allowed                  │ blog/**                                           │
│                          │ meetings/**                                       │
│                          │ people/**                                         │
│                          │ projects/**                                       │
│                          │ reference/protocols/**                            │
└──────────────────────────┴───────────────────────────────────────────────────┘

Initialized mdvs in 'example_kb'

That command did three things:

Scanned 43 markdown files and extracted their YAML frontmatter
Inferred 37 typed fields — strings, integers, floats, booleans, arrays, even a nested object (calibration)
Wrote mdvs.toml with the inferred schema

Notice the files row: draft appears in 8 out of 43 files — all in blog/. sensor_type in 3 out of 43 — all in projects/alpha/notes/. mdvs captured not just the types, but where each field belongs, via the required and allowed glob patterns.

Here’s what a field definition looks like in mdvs.toml:

[[fields.field]]
name = "sensor_type"
type = "String"
allowed = ["projects/alpha/notes/**"]
required = ["projects/alpha/notes/**"]
nullable = false

This means sensor_type is allowed only in experiment notes, and required there. If it appears in a blog post, check will flag it. If it’s missing from an experiment note, check will flag that too.
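To make the two path rules concrete, here is a small Python sketch of the allowed/required check for a single field. It is an illustration only: the path_violations helper is hypothetical, blog/post.md is a made-up file, and Python's fnmatch is used as a stand-in for glob matching (its * also matches /, which is looser than most glob implementations).

```python
from fnmatch import fnmatchcase


def path_violations(field, rule, files):
    """rule: {"allowed": [...], "required": [...]}; files: {path: frontmatter dict}."""
    found = []
    for path, frontmatter in sorted(files.items()):
        present = field in frontmatter
        in_allowed = any(fnmatchcase(path, glob) for glob in rule["allowed"])
        in_required = any(fnmatchcase(path, glob) for glob in rule["required"])
        if present and not in_allowed:
            found.append((path, "Not allowed"))
        elif not present and in_required:
            found.append((path, "Missing required"))
    return found


rule = {
    "allowed": ["projects/alpha/notes/**"],
    "required": ["projects/alpha/notes/**"],
}
# Hypothetical frontmatter, for illustration only.
files = {
    "projects/alpha/notes/experiment-1.md": {"sensor_type": "SPR-A1"},
    "projects/alpha/notes/experiment-2.md": {"title": "no sensor_type here"},
    "blog/post.md": {"sensor_type": "SPR-A1"},
}
violations = path_violations("sensor_type", rule, files)
```

Here the blog post is reported as Not allowed and the second experiment note as Missing required; the first experiment note passes.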

One artifact is created by init: mdvs.toml — the schema file. Commit this to version control. The .mdvs/ directory (search index) is created later on first build or search.

Validate

Check that every file conforms to the schema:

mdvs check example_kb

Checked 43 files — no violations

Since mdvs init just inferred the schema from these same files, everything passes. The power of check comes after you tighten the schema, or when files drift from it. Try adding sensor_type: SPR-A1 to a blog post: mdvs will flag it as Not allowed because that field doesn’t belong there.

What violations look like

Open mdvs.toml and make a few changes to tighten the constraints:

Require observation_notes in all experiment files (currently optional)
Change convergence_ms type from Integer to Boolean (simulating a type mismatch)
Set drift_rate to non-nullable (one file has drift_rate: null)
Restrict firmware_version to only appear in people/interns/** (it currently appears in people/*)

Run check again:

mdvs check example_kb

Checked 43 files — 4 violation(s)

Violations (4):
┌ convergence_ms ──────────┬───────────────────────────────────────────────────┐
│ kind                     │ Wrong type                                        │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ rule                     │ type Boolean                                      │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ files                    │ projects/beta/notes/initial-findings.md (got Inte │
│                          │ ger)                                              │
└──────────────────────────┴───────────────────────────────────────────────────┘

┌ drift_rate ──────────────┬───────────────────────────────────────────────────┐
│ kind                     │ Null value not allowed                            │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ rule                     │ not nullable                                      │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ files                    │ projects/alpha/notes/experiment-2.md              │
└──────────────────────────┴───────────────────────────────────────────────────┘

┌ firmware_version ────────┬───────────────────────────────────────────────────┐
│ kind                     │ Not allowed                                       │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ rule                     │ allowed in ["people/interns/**"]                  │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ files                    │ people/remo.md                                    │
└──────────────────────────┴───────────────────────────────────────────────────┘

┌ observation_notes ───────┬───────────────────────────────────────────────────┐
│ kind                     │ Missing required                                  │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ rule                     │ required in ["projects/alpha/notes/**"]           │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ files                    │ projects/alpha/notes/experiment-1.md              │
│                          │ projects/alpha/notes/experiment-2.md              │
└──────────────────────────┴───────────────────────────────────────────────────┘

The four violation kinds shown above each catch a different kind of problem:

Violation	Meaning
`Missing required`	A file in a required path is missing the field
`Wrong type`	The value doesn’t match the declared type
`Null value not allowed`	The field is present but `null`, and `nullable` is `false`
`Not allowed`	The field appears in a file outside its `allowed` paths

Each violation table shows the field name, the kind of violation, the violated rule, and the affected files. See check for the full reference.
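The two value-level kinds (Wrong type and Null value not allowed) can be sketched as a single check per value. This is a hedged illustration rather than mdvs's code: the value_violation helper is hypothetical, only four scalar types are covered, and the sketch assumes strict matching (an integer does not satisfy Float unless a preprocessor has widened it first).

```python
def value_violation(value, declared_type, nullable):
    """Return a violation kind for one value, or None if the value is fine."""
    if value is None:
        return None if nullable else "Null value not allowed"
    checks = {
        "Boolean": lambda v: isinstance(v, bool),
        # bool is a subclass of int in Python, so exclude it explicitly.
        "Integer": lambda v: isinstance(v, int) and not isinstance(v, bool),
        "Float": lambda v: isinstance(v, float),
        "String": lambda v: isinstance(v, str),
    }
    return None if checks[declared_type](value) else "Wrong type"


# An Integer value checked against a field declared as Boolean.
wrong_type = value_violation(120, "Boolean", nullable=False)
# A null value in a field that was tightened to non-nullable.
null_value = value_violation(None, "Float", nullable=False)
```

The sample value 120 is made up; any integer checked against a Boolean field behaves the same way.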

Revert your changes to mdvs.toml before continuing (or re-run mdvs init example_kb --force to regenerate it).

Search

Query the index with natural language. On first run, search auto-builds the index:

Note: The first search or build downloads the embedding model from Hugging Face (~30 MB for the default model). This is a one-time download; subsequent runs use the cached model and start instantly.

mdvs search "calibration" example_kb

Searched "calibration" — 10 hits

┌──────────────────────────┬───────────────────────────────────────────────────┐
│ query                    │ calibration                                       │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ model                    │ minishlab/potion-multilingual-128M                │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ limit                    │ 10                                                │
└──────────────────────────┴───────────────────────────────────────────────────┘

┌ #1 ──────────────────────┬───────────────────────────────────────────────────┐
│ file                     │ projects/alpha/meetings/2031-06-15.md             │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ score                    │ 0.585                                             │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ lines                    │ 14-22                                             │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ text                     │ # Alpha Kickoff — Calibration Campaign ...        │
└──────────────────────────┴───────────────────────────────────────────────────┘

┌ #2 ──────────────────────┬───────────────────────────────────────────────────┐
│ file                     │ projects/alpha/meetings/2031-10-10.md             │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ score                    │ 0.501                                             │
...

...

By default, mdvs search runs in hybrid mode: it combines a semantic (vector) match with a full-text (BM25) match and reranks the results, so both a loosely phrased, typo-tolerant natural-language query and an exact-keyword query work. The score is the reranker’s relevance score (higher is better). Pass --mode semantic or --mode fulltext to use one signal alone. The text row shows the best-matching chunk from each file.
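This guide does not say how mdvs reranks hybrid results. As a stand-in, the sketch below uses reciprocal rank fusion (RRF), a common way to merge two ranked lists. Treat it as an assumption for illustration, not a description of mdvs internals; the file names and the constant k = 60 are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document ids into one list, best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            # Each list contributes more for documents it ranks higher.
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    # Sort by descending score, breaking ties by name for determinism.
    return sorted(scores, key=lambda doc: (-scores[doc], doc))


semantic = ["a.md", "b.md", "c.md"]   # hypothetical vector-search order
fulltext = ["b.md", "d.md", "a.md"]   # hypothetical BM25 order
fused = reciprocal_rank_fusion([semantic, fulltext])
```

Documents that rank well in both lists (here b.md and a.md) rise to the top, which is the general intuition behind hybrid search regardless of the exact reranker.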

Filtering with `--where`

Add a SQL filter on any frontmatter field:

mdvs search "quantum" example_kb --where "status = 'active'"

Searched "quantum" — 3 hits

┌──────────────────────────┬───────────────────────────────────────────────────┐
│ query                    │ quantum                                           │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ model                    │ minishlab/potion-multilingual-128M                │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ limit                    │ 10                                                │
└──────────────────────────┴───────────────────────────────────────────────────┘

┌ #1 ──────────────────────┬───────────────────────────────────────────────────┐
│ file                     │ projects/beta/overview.md                         │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ score                    │ 0.123                                             │
...

...

Only files with status: active in their frontmatter are included. The --where clause supports any SQL expression — boolean logic, comparisons, array functions, and more. See the Search Guide for the full syntax.
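If it helps to picture what a --where filter does, here is a small Python sketch that runs the same predicate against an in-memory SQLite table. This is only an analogy: mdvs stores its index in a Lance dataset, not SQLite, and the table name, column names, and all rows except projects/beta/overview.md are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (path TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO files VALUES (?, ?)",
    [
        ("projects/beta/overview.md", "active"),
        ("notes/current.md", "active"),      # hypothetical file
        ("notes/archived.md", "archived"),   # hypothetical file
    ],
)

# The same expression you would pass to --where.
# Interpolating it into the query is fine for a demo, not for untrusted input.
where = "status = 'active'"
rows = conn.execute(f"SELECT path FROM files WHERE {where} ORDER BY path").fetchall()
matching = [path for (path,) in rows]
conn.close()
```

Only the rows whose status column equals 'active' survive the filter; ranking then happens among those files.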

What’s next

Concepts — How schema inference, types, and validation work under the hood
Commands — Full reference for every command and flag
Configuration — Customize mdvs.toml to tighten your schema
Search Guide — Complex queries: arrays, nested objects, combined filters

Concepts

mdvs has two layers — validation and search — each with its own set of concepts. These pages explain how things work under the hood.

Types & Widening — The type system, how types are inferred from values, and what happens when files disagree
Schema Inference — How mdvs scans your directory and computes field paths, requirements, and constraints
Validation — What check verifies, the five violation types, and how to read the output
Constraints — Categorical constraints, auto-inference heuristics, and manual overrides
Search & Indexing — Chunking, embeddings, incremental builds, and how results are ranked

Types & Widening

mdvs infers a type for every frontmatter field it encounters. When the same field appears with different types across files, mdvs resolves the conflict automatically through type widening.

The supported types

Type	YAML example	example_kb field
Boolean	`draft: false`	`draft` in blog posts
Integer	`sample_count: 24`	`sample_count` in experiments
Float	`drift_rate: 0.023`	`drift_rate` in experiments
String	`author: Giulia Ferretti`	`author` across many files
Date	`joined: 2023-02-01`	`joined`, `date`, `commission_date`, `last_reviewed`
DateTime	`synced_at: "2024-04-02T16:14:30+02:00"`	`synced_at` in experiments
Array(Scalar)	`tags: [calibration, SPR-A1]`	`tags` in projects and blog

The on-disk type grammar is tight:

Type   := Scalar | Array(Scalar)
Scalar := String | Integer | Float | Boolean | Date | DateTime

Array(Array(...)) and Array(Object{...}) are not representable on disk — see Arrays of structured items below for the workaround.

Date and DateTime are described in detail in Date and DateTime below.

Nested Objects in YAML are expressed as dotted-name leaf fields in mdvs.toml. A frontmatter shape like:

calibration:
  baseline:
    wavelength: 632.8
    intensity: 0.95
  adjusted:
    wavelength: 633.1
    intensity: 0.97

infers as four separate leaf fields, one per nested path:

calibration.baseline.wavelength → Float
calibration.baseline.intensity → Float
calibration.adjusted.wavelength → Float
calibration.adjusted.intensity → Float

Each leaf gets its own nullability and allowed/required glob set. This avoids the readability and per-leaf-validation problems of monolithic Object types. Top-level Object types are not supported in mdvs.toml, and neither are Objects nested inside Array fields — see Arrays of structured items below.
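The flattening itself is easy to picture. The following Python sketch shows one way to turn a nested mapping into dotted-name leaves; the flatten helper is hypothetical and only illustrates the naming scheme, not how mdvs does it internally.

```python
def flatten(mapping, prefix=""):
    """Turn nested dicts into {dotted.path: leaf value}."""
    leaves = {}
    for key, value in mapping.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested objects, extending the dotted path.
            leaves.update(flatten(value, name))
        else:
            leaves[name] = value
    return leaves


# The calibration example from above, as parsed Python values.
frontmatter = {
    "calibration": {
        "baseline": {"wavelength": 632.8, "intensity": 0.95},
        "adjusted": {"wavelength": 633.1, "intensity": 0.97},
    }
}
leaves = flatten(frontmatter)
```

Each key in leaves corresponds to one field entry in mdvs.toml, so each leaf can carry its own type, nullability, and glob set.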

Arrays of structured items

A YAML field like:

measurements:
  - timestamp: "14:02:11"
    value: 0.612
  - timestamp: "14:03:00"
    value: 0.598

has no first-class representation on disk in v0. Inference detects the Array(Object{...}) shape, skips the field, and emits a warning to stderr:

warning: skipped field 'measurements' — Array(Object{...}) isn't representable on disk.
  Consider parallel scalar arrays (see TODO-0156). (first observed in projects/alpha/notes/experiment-2.md)

The recommended workaround is parallel scalar arrays — one field per element-leaf. Replace the YAML above with:

measurement_timestamps: ["14:02:11", "14:03:00"]
measurement_values: [0.612, 0.598]

and the corresponding mdvs.toml:

[[fields.field]]
name = "measurement_timestamps"
type = "Array(String)"

[[fields.field]]
name = "measurement_values"
type = "Array(Float)"

The downside is the loss of per-element grouping — there’s no schema-level guarantee that measurement_timestamps[3] and measurement_values[3] belong to the same record. A first-class Array-of-structured-item representation is tracked in TODO-0156.
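If you have many files to migrate, the conversion to parallel arrays is mechanical. The Python sketch below shows the idea; the to_parallel_arrays helper and its naming scheme (stem, underscore, key, plural s) are assumptions chosen to reproduce the field names used above.

```python
def to_parallel_arrays(stem, items):
    """Split a list of same-shaped dicts into one list per key."""
    keys = list(items[0]) if items else []
    for item in items:
        # Parallel arrays only make sense if every record has the same shape.
        if list(item) != keys:
            raise ValueError("every item must have the same keys in the same order")
    return {f"{stem}_{key}s": [item[key] for item in items] for key in keys}


measurements = [
    {"timestamp": "14:02:11", "value": 0.612},
    {"timestamp": "14:03:00", "value": 0.598},
]
parallel = to_parallel_arrays("measurement", measurements)
```

Because index i in each output list comes from the same input record, the grouping is preserved by position even though the schema cannot enforce it.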

Date and DateTime

Both types use RFC 3339 as the canonical wire format. RFC 3339 is a profile of ISO 8601 designed for machine interoperability.

`Date` — calendar date, no time

Date   = YYYY-MM-DD

Rules:

Exactly 4-digit year, 2-digit month, 2-digit day (no 2024-1-1 shorthand).
Hyphen separators.
Calendar-valid: 2024-13-01 (month 13) and 2024-02-30 (no Feb 30th) are rejected.
No time component, no timezone.

Accepted:

2023-02-01
1990-05-12
2024-02-29        ← valid leap-year date

Rejected:

2024-1-1          ← single-digit components not allowed
2024-13-01        ← month must be 01-12
"see 2024-01-15"  ← must be the whole string
2024/01/15        ← only hyphens

Stored as Arrow Date32 (days since 1970-01-01). Native date arithmetic works in --where queries, e.g. WHERE date > '2024-01-01', WHERE date_part('year', published) = 2024, WHERE date BETWEEN '2024-06-01' AND '2024-06-30'. See Date and DateTime in --where queries for worked examples including EXTRACT, INTERVAL, date subtraction, and compound filters.
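The Date rules above can be sketched as a shape check followed by a calendar check. This Python version is an illustration under those stated rules, not mdvs's validator; parse_date and days_since_epoch are hypothetical helper names.

```python
import re
from datetime import date

# Exactly YYYY-MM-DD with ASCII digits and hyphen separators.
DATE_RE = re.compile(r"([0-9]{4})-([0-9]{2})-([0-9]{2})")


def parse_date(text):
    """Return a date if text is a valid YYYY-MM-DD string, else None."""
    match = DATE_RE.fullmatch(text)  # the whole string must match
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        return date(year, month, day)  # rejects month 13, Feb 30, and so on
    except ValueError:
        return None


def days_since_epoch(value):
    """Days since 1970-01-01, the quantity a Date32 column stores."""
    return (value - date(1970, 1, 1)).days
```

Splitting the check in two keeps the shape rule (digits and separators) apart from the calendar rule (real dates only), which mirrors the accepted and rejected lists above.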

`DateTime` — date + time, mandatory timezone

DateTime = YYYY-MM-DDTHH:MM:SS[.frac]<tz>

<tz>     = 'Z'                        ← UTC shorthand
         | '+HH:MM'                   ← positive offset
         | '-HH:MM'                   ← negative offset

Rules:

Date part: same as Date above.
T separator between date and time is mandatory — no space alternative.
HH:MM:SS (24-hour, all two digits). Seconds are required.
Fractional seconds optional, any number of digits.
Timezone is mandatory — naive 2024-01-15T14:30:00 is rejected (not valid RFC 3339).

Accepted:

2024-01-15T14:30:00Z              ← Zulu = UTC
2024-01-15T14:30:00+00:00         ← same moment, explicit offset
2024-04-02T16:14:30+02:00         ← positive offset
2024-01-15T14:30:00-08:00         ← negative offset
2024-01-15T14:30:00.123Z          ← fractional seconds
2024-01-15T14:30:00.123456789Z    ← nanosecond precision

Rejected:

2024-01-15T14:30:00               ← no timezone
2024-01-15 14:30:00Z              ← space instead of T
2024-01-15T14:30                  ← seconds required
2024-13-01T14:30:00Z              ← invalid month
2024-01-15T25:30:00Z              ← invalid hour

Stored as Arrow Timestamp(Millisecond, "UTC"). Offsets are normalized to UTC at storage time — 2024-04-02T16:14:30+02:00 and 2024-04-02T14:14:30Z are the same absolute moment and store identically. The original offset is intentionally not preserved.
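To see the parsing rules and the UTC normalization together, here is a minimal Python sketch. It is not mdvs's parser: parse_datetime_utc is a hypothetical helper, fractional seconds are truncated (not rounded) to milliseconds as an assumption, and leap seconds (:60) are rejected because Python's datetime does not support them.

```python
import re
from datetime import datetime, timedelta, timezone

DATETIME_RE = re.compile(
    r"([0-9]{4})-([0-9]{2})-([0-9]{2})T([0-9]{2}):([0-9]{2}):([0-9]{2})"
    r"(\.[0-9]+)?(Z|[+-][0-9]{2}:[0-9]{2})"
)


def parse_datetime_utc(text):
    """Return an aware UTC datetime for a valid RFC 3339 string, else None."""
    match = DATETIME_RE.fullmatch(text)
    if match is None:
        return None
    year, month, day, hour, minute, second = (int(g) for g in match.groups()[:6])
    frac, tz = match.group(7), match.group(8)
    # Keep the first three fractional digits (milliseconds), padding if shorter.
    micro = int((frac[1:] + "00")[:3]) * 1000 if frac else 0
    if tz == "Z":
        offset = timedelta(0)
    else:
        sign = 1 if tz[0] == "+" else -1
        offset = sign * timedelta(hours=int(tz[1:3]), minutes=int(tz[4:6]))
    try:
        local = datetime(year, month, day, hour, minute, second, micro,
                         tzinfo=timezone(offset))
    except ValueError:  # invalid month, hour, offset, and so on
        return None
    return local.astimezone(timezone.utc)


a = parse_datetime_utc("2024-04-02T16:14:30+02:00")
b = parse_datetime_utc("2024-04-02T14:14:30Z")
```

Both a and b describe the same instant, so they compare equal after normalization; the +02:00 offset of the first input is gone, which is the behavior described above.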

example_kb demonstration

Both types are auto-inferred in the example vault:

Field	Type	Files
`joined`	Date	`people/**`
`date`	Date	meetings + blog/published
`commission_date`	Date	`people/*`
`last_reviewed`	Date	`reference/protocols/**`
`synced_at`	DateTime	`experiment-1.md` uses `Z`, `experiment-2.md` uses `+02:00`

No manual configuration was needed for any of these — inference detects the RFC 3339 shape and assigns the appropriate type. See Type widening in practice below for the inference rule.

Validation

JSON Schema’s format: date and format: date-time keywords validate values at check time. Bad shapes (invalid calendar dates, missing timezones, wrong separators) produce WrongType violations with a rule like format date or format date-time.

Constraints

categories applies (e.g. categories = ["2024-01-01", "2024-12-31"] on a Date field; values are strings, the runtime format validator catches malformed entries).
pattern, min, max, min_length, max_length do not apply — the type’s format is itself the pattern. Bounded date ranges (e.g. “published in 2024”) are tracked as a future feature.

Preprocessors

No preprocessor applies to Date or DateTime in v1. Unlike String (which can opt in to coerce-to-string) or Float (which can opt in to widen-int-to-float), date types are strict — either the string parses as RFC 3339 or it doesn’t.

Type hierarchy

When two values have different types, mdvs widens to a common type. The hierarchy looks like this:

graph BT
    Integer --> Float
    Float --> String
    Boolean --> String
    Date --> String
    DateTime --> String
    Array["Array(T)"] --> String

Each arrow means “widens to.” String is the top type — every type eventually reaches it.

The one special case is Integer → Float: integers widen to floats (not directly to String) because the conversion is lossless. Date and DateTime have no internal cross-promotion — mixed Date + DateTime observations widen to String (the two shapes are disjoint).

Two same-category combinations widen internally instead of jumping to String:

Array + Array — element types are widened recursively (e.g., Array(Integer) + Array(String) → Array(String))
Object + Object — at the leaf level: each dotted path’s type is widened independently across files. A file with cal.wave = 850 (Integer) and another with cal.wave = 632.8 (Float) yields cal.wave: Float. New leaf paths in some files are added to the schema; leaves absent from some files affect nullability/required-globs naturally.

Everything else (Boolean + any other type, Array + scalar, Object + scalar) widens to String. The one exception is Array containing Object — Array(Object{...}) isn’t representable on disk, so inference drops the field with a warning instead of widening to String (see Arrays of structured items).

Type widening in practice

When mdvs scans your files and the same field has different types, it picks the least upper bound — the most specific type that covers all observed values.

Integer + Float → Float

In example_kb, the wavelength_nm field appears in three experiment notes:

# experiment-1.md
wavelength_nm: 850       # Integer

# experiment-2.md
wavelength_nm: 632.8     # Float

# experiment-3.md
wavelength_nm: 780.0     # Float

Result: wavelength_nm is inferred as Float. The integer 850 is safely represented as a float.

Integer + String → String

The priority field uses numbers in one project and text in another:

# projects/alpha/overview.md
priority: 1              # Integer

# projects/beta/overview.md
priority: high           # String

Result: priority is inferred as String. There’s no numeric type that can hold "high", so mdvs widens to String.

Boolean + any non-Boolean → String

If the same field is true in one file and 3 in another, there’s no numeric or boolean type that can hold both. The result is String.

This doesn’t happen in example_kb because booleans (draft) are used consistently — but it’s a common mistake in organically grown vaults where someone writes draft: yes (String) instead of draft: true (Boolean).

Date and DateTime inference

A string is inferred as Date or DateTime when every observation across all files matches the RFC 3339 shape AND parses as a real value. A single non-matching value downgrades the whole field to String.

Pure-date observations across files:

# people/alice.md
joined: 2023-02-01

# people/bob.md
joined: 2024-09-15

Result: joined is inferred as Date.

One non-date value forces String:

# people/alice.md
joined: 2023-02-01

# people/carol.md
joined: "see HR records"        # not a date

Result: joined widens to String — the second observation can’t be typed as Date, and Date + String → String is the widening rule.

Same logic for invalid calendar dates:

# fileA.md
published: 2024-06-01

# fileB.md
published: 2024-13-01           # invalid month — typed String per-value

Result: published widens to String. The typo gets silently absorbed into String typing; the user only catches it via a WrongType violation if they manually set type = "Date" in mdvs.toml.

Date + DateTime are cross-shape — never auto-promote:

# meeting/a.md
when: 2024-01-15                # Date

# meeting/b.md
when: 2024-01-15T14:30:00Z      # DateTime

Result: when widens to String. Pick one shape consistently to get a typed field.
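The all-or-nothing rule in this section can be sketched as: classify each observation on its own, then keep Date or DateTime only if every observation agrees. The Python below is an illustration of that rule, not mdvs's inference code; classify and infer_field are hypothetical names, observations are assumed to arrive as strings, and the timezone offset range is not validated.

```python
import re
from datetime import datetime

DATE_RE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")
DATETIME_RE = re.compile(
    r"[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}"
    r"(\.[0-9]+)?(Z|[+-][0-9]{2}:[0-9]{2})"
)


def classify(text):
    """Type of a single string observation: Date, DateTime, or String."""
    try:
        if DATE_RE.fullmatch(text):
            datetime.strptime(text, "%Y-%m-%d")  # must be a real calendar date
            return "Date"
        if DATETIME_RE.fullmatch(text):
            datetime.strptime(text[:19], "%Y-%m-%dT%H:%M:%S")
            return "DateTime"
    except ValueError:
        pass  # right shape, impossible value: fall through to String
    return "String"


def infer_field(observations):
    """Keep a date type only if every observation has that same type."""
    kinds = {classify(text) for text in observations}
    return kinds.pop() if len(kinds) == 1 else "String"
```

Running infer_field on the four examples above gives Date for the two valid joined values and String for the other three cases, including the mixed Date and DateTime pair.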

Array element widening

The tags field is a string array in most files, but one file accidentally used integers:

# projects/alpha/overview.md
tags:
  - biosensor
  - metamaterial          # Array(String)

# projects/beta/notes/replication.md
tags:
  - 1
  - 2
  - 3                     # Array(Integer)

Result: tags is inferred as Array(String). The array element types (String vs Integer) are widened to String, giving Array(String).

Object leaf merging (dotted-name flattening)

When two files have nested keys at the same paths, each leaf is inferred independently. New leaves seen in one file but not another are added to the schema; their required glob naturally narrows to just the files that contain them.

In example_kb, the calibration object appears in two experiment files with different structures:

# experiment-1.md (simpler calibration, integer values)
calibration:
  baseline:
    wavelength: 850            # Integer
    intensity: 1               # Integer
    notes: "initial reference" # only in this file

# experiment-2.md (full calibration, float values)
calibration:
  baseline:
    wavelength: 632.8          # Float
    intensity: 0.95            # Float
  adjusted:                    # only in this file
    wavelength: 633.1
    intensity: 0.97

Result: five dotted-name leaf fields are inferred in mdvs.toml:

[[fields.field]]
name = "calibration.adjusted.intensity"
type = "Float"

[[fields.field]]
name = "calibration.adjusted.wavelength"
type = "Float"

[[fields.field]]
name = "calibration.baseline.intensity"
type = "Float"
preprocess = ["widen-int-to-float"]   # Integer + Float mix → opted in

[[fields.field]]
name = "calibration.baseline.notes"
type = "String"

[[fields.field]]
name = "calibration.baseline.wavelength"
type = "Float"
preprocess = ["widen-int-to-float"]

What happened:

calibration.baseline.wavelength seen as both Integer (850) and Float (632.8) → widened to Float with widen-int-to-float preprocessor recording the mix
calibration.baseline.intensity similar: Integer (1) + Float (0.95) → Float with the preprocessor
calibration.baseline.notes only in experiment-1 → still inferred as String (with a required glob narrowed to just the files that have it)
calibration.adjusted.* only in experiment-2 → inferred from that file alone

The user-facing schema is flat, but its semantics still match the YAML’s nested shape. Validation, storage, and --where queries all operate on the natural nested structure; the dotted-name form is purely an mdvs.toml UX choice.

The full widening matrix

Every possible combination of types and its result:

	Boolean	Integer	Float	String	Date	DateTime	Array	Object
Boolean	Boolean	String	String	String	String	String	String	String
Integer	String	Integer	Float	String	String	String	String	String
Float	String	Float	Float	String	String	String	String	String
String	String	String	String	String	String	String	String	String
Date	String	String	String	String	Date	String	String	String
DateTime	String	String	String	String	String	DateTime	String	String
Array	String	String	String	String	String	String	Array*	dropped**
Object	String	String	String	String	String	String	dropped**	Object*

* Array + Array: element types are widened recursively.

* Object + Object: not a top-level on-disk type. Nested Objects in YAML flatten to dotted-name leaves before widening; each leaf path is widened independently.

** Inference observed Array(Object{…}) — not representable on disk in v0. The field is dropped from the schema and a warning is emitted (see Arrays of structured items).

Date and DateTime are cross-shape — they never auto-promote into each other. The single non-trivial pair is Date + DateTime → String.

The matrix is symmetric — widen(A, B) always equals widen(B, A).
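For the six scalar types, the matrix reduces to a very small rule. The sketch below is an illustration in Python, not mdvs's implementation; the `widen` name and the string type labels are assumptions, and Array and Object handling (recursive element widening, dropped fields) is left out.

```python
SCALARS = ["Boolean", "Integer", "Float", "String", "Date", "DateTime"]


def widen(a, b):
    """Widen two scalar type names according to the matrix above."""
    if a == b:
        return a
    if {a, b} == {"Integer", "Float"}:
        return "Float"  # the only mixed pair that does not fall back to String
    return "String"


print(widen("Integer", "Float"))
print(widen("Date", "DateTime"))
```

Because the rule only looks at the unordered pair, symmetry holds by construction.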

Nullable

Separately from the type, mdvs tracks whether null was observed for a field. This is shown as a ? suffix in output — e.g., Float? means “Float, but sometimes null.”

How it works

In example_kb, the drift_rate field is Float in two experiment files but null in a third:

# experiment-1.md
drift_rate: 0.023        # Float

# experiment-2.md
drift_rate: null          # sensor malfunction — Giulia discarded the data

# experiment-3.md
drift_rate: 0.012         # Float

Result: drift_rate is inferred as Float? — the type is Float (null doesn’t affect the type), and nullable is set to true.

Null-only fields

If the only value ever observed is null, the type defaults to String:

# blog/drafts/grant-ideas.md
review_score: null        # no real values seen

Result: review_score is inferred as String?.

Key rules

Null is transparent in widening — it doesn’t affect the inferred type
Null-only fields default to String (the safest fallback)
nullable is a separate boolean, not part of the type itself
In validation: null values skip type checks, but a null value on a non-nullable field triggers a NullNotAllowed violation, whether or not the field is required (see Validation)
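These rules can be sketched in a few lines of Python. The helper names are hypothetical, and the sketch covers only Boolean, Integer, Float, and String scalars (dates and arrays are omitted for brevity).

```python
def infer_scalar(value):
    # bool is tested first because, in Python, bool is a subclass of int
    if isinstance(value, bool):
        return "Boolean"
    if isinstance(value, int):
        return "Integer"
    if isinstance(value, float):
        return "Float"
    return "String"


def infer_field(values):
    """Return (type, nullable) for the values observed across files."""
    nullable = any(v is None for v in values)
    types = {infer_scalar(v) for v in values if v is not None}  # null is transparent
    if not types:
        return "String", nullable  # null-only fields fall back to String
    if len(types) == 1:
        return types.pop(), nullable
    if types == {"Integer", "Float"}:
        return "Float", nullable
    return "String", nullable


print(infer_field([0.023, None, 0.012]))  # drift_rate
print(infer_field([None]))                # review_score
```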

Widening and preprocessors

Widening picks the type. Preprocessors are how the schema declares what coercions were needed to get there. Inference auto-populates them — you rarely write them by hand.

When inference observes a field as a mix of types (some files have priority: 1, others priority: high), it widens to String and writes:

[[fields.field]]
name = "priority"
type = "String"
preprocess = ["coerce-to-string"]

The coerce-to-string entry tells validation: “before checking this value is a string, serialize whatever you find to its JSON representation.” Without it, the field is strict — integers and booleans fail validation.

Same for Float: a mix of 5 and 5.0 widens to Float with preprocess = ["widen-int-to-float"]. Without it, integers fail the float check.

The two built-in Stage 2 preprocessors:

Preprocessor	Applies to	Effect
`coerce-to-string`	`String`, `Array(String)`	Serialize non-strings to their JSON string representation before validation
`widen-int-to-float`	`Float`, `Array(Float)`	Treat integer values as their float equivalent

preprocess = [] means strict. If you delete a preprocessor from mdvs.toml, the field rejects values that would have been coerced. Conversely, you can hand-add a preprocessor to a strict-inferred field if you want to accept type variation.

No preprocessor applies to Date or DateTime. Those types are strict by design — values either parse as RFC 3339 or they don’t. There is no parse-loose-date opt-in; non-ISO formats fall back to String (and the user can add a pattern constraint if they want a custom shape).

In storage — when validation accepts a coerced value, the coerced form is what gets stored. A priority: 1 value with coerce-to-string becomes "1" in the search index. No data is silently dropped.
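As a rough illustration of what the two preprocessors do to a single scalar value, here is a Python sketch. The function names are hypothetical, element-wise handling for Array(String) and Array(Float) is omitted, and passing null through untouched is an assumption (consistent with null skipping type checks).

```python
import json


def coerce_to_string(value):
    """Serialize non-strings to their JSON text; leave strings untouched."""
    if value is None or isinstance(value, str):
        return value  # assumption: null is not coerced
    return json.dumps(value)


def widen_int_to_float(value):
    """Treat integers (but not booleans) as the equivalent float."""
    if isinstance(value, int) and not isinstance(value, bool):
        return float(value)
    return value


print(coerce_to_string(1))     # priority: 1
print(widen_int_to_float(5))   # a Float field that saw the integer 5
```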

Re-run mdvs update reinfer <field> to refresh both the inferred type and the inferred preprocessors after editing source files.

Edge cases

Empty arrays [] default to Array(String) — if real values are added later, the field must be re-inferred with mdvs update reinfer <field> to pick up the new element type
Empty frontmatter (--- followed immediately by ---) is a file with zero fields — not a bare file. It still counts as “having frontmatter” for inference purposes.
Bare files (no --- fences at all) are handled differently — see Schema Inference

Schema Inference

mdvs infers a typed schema from your files automatically — no manual schema definition needed. Run mdvs init, and it scans every markdown file, extracts frontmatter, infers types, and computes path patterns that describe where each field appears. The result is mdvs.toml, which you can then tighten by hand.

What gets scanned

mdvs walks your directory and includes every .md and .markdown file that matches the glob pattern in [scan]:

[scan]
glob = "**"
include_bare_files = true
skip_gitignore = false

Three settings control what’s included:

Setting	Default	Effect
`glob`	`"**"`	Which files to scan. Use narrower globs to exclude subtrees.
`include_bare_files`	`true`	Whether to include files without any YAML frontmatter
`skip_gitignore`	`false`	Whether to ignore `.gitignore` patterns during scan

mdvs also respects .mdvsignore files (same syntax as .gitignore) for excluding paths from scanning without touching your .gitignore.

Bare files vs empty frontmatter

These look similar but are different:

Bare file — no frontmatter fences at all:

This file has no frontmatter. Just content.

Empty frontmatter — fences with nothing between them:

---
---
This file has frontmatter, but zero fields.

In example_kb, four files are bare (scratch.md, lab-values.md, reference/tools.md, reference/glossary.md) and one has empty frontmatter (reference/quick-start.md).

Both types contribute zero fields to inference. The difference matters for validation: a bare file is excluded entirely when include_bare_files = false, while an empty-frontmatter file is always included (it has frontmatter — just none with fields).
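The distinction can be made concrete with a small Python sketch. `classify_frontmatter` is a hypothetical helper that only understands YAML `---` fences; treating an unclosed opening fence as a bare file is an assumption of this sketch, not documented mdvs behavior.

```python
def classify_frontmatter(text):
    """Return 'bare', 'empty', or 'fields' for a markdown file's text."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return "bare"  # no opening fence at all
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            body = [l for l in lines[1:i] if l.strip()]
            return "fields" if body else "empty"
    return "bare"  # assumption: an unclosed fence is not frontmatter


print(classify_frontmatter("This file has no frontmatter. Just content.\n"))
print(classify_frontmatter("---\n---\nThis file has frontmatter, but zero fields.\n"))
```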

From files to fields

For each scanned file, mdvs extracts the YAML frontmatter and infers a type for every key. When the same field appears across multiple files, its type is widened to a common type (see Types & Widening for the full rules).

In example_kb, scanning 43 files produces 37 distinct field names. Some fields like title appear in 37 files. Others like unit_id appear in just one.

The output of this step is a list of fields, each with:

A name
A type (widened across all files where it appears)
A nullable flag (true if any file had a null value)
The set of files where it was found

Path patterns

The most interesting part of inference is how mdvs computes where each field belongs. It produces two sets of glob patterns per field:

allowed — where the field may appear. Any file matching these patterns can have the field without triggering a violation.
required — where the field must appear. Any file matching these patterns that’s missing the field triggers a MissingRequired violation.

How patterns are computed

mdvs builds a directory tree from the scanned files and works bottom-up:

For each directory, it tracks which fields appear in all files (intersection) and which appear in any file (union)
When a field appears in every file under a directory and its subdirectories, it collapses into a recursive glob (dir/**)
When a field appears in some but not all files, only allowed gets the glob — required does not

The result is a minimal set of globs that describes the field’s distribution.
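The required half of this computation can be approximated in Python. This is a simplified sketch under stated assumptions, not mdvs's algorithm: it always emits recursive `dir/**` globs, ignores the allowed side, and the file names in the usage are hypothetical.

```python
from pathlib import PurePosixPath


def required_globs(all_files, files_with_field):
    """Top-most directories in which every file (recursively) has the field."""
    have = set(files_with_field)
    under = {}  # directory -> every scanned file beneath it
    for f in all_files:
        for parent in PurePosixPath(f).parents:
            under.setdefault(str(parent), []).append(f)
    full = {d for d, fs in under.items() if all(f in have for f in fs)}
    # Collapse: drop a directory when its parent already covers it.
    top = [d for d in full if d == "." or str(PurePosixPath(d).parent) not in full]
    return sorted("**" if d == "." else f"{d}/**" for d in top)


files = [
    "scratch.md",
    "people/remo.md",
    "people/giulia.md",
    "people/interns/a.md",
    "people/interns/b.md",
]
with_email = ["people/giulia.md", "people/interns/a.md", "people/interns/b.md"]
print(required_globs(files, with_email))
```

With one non-intern profile missing the field, only the interns subdirectory qualifies, which mirrors the email example in the next section.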

Examples from `example_kb`

Narrow and consistent — sensor_type appears in all three experiment notes and nowhere else:

[[fields.field]]
name = "sensor_type"
type = "String"
allowed = ["projects/alpha/notes/**"]
required = ["projects/alpha/notes/**"]

allowed and required are the same — every file that has this field is in the same directory, and every file in that directory has it.

Broad and consistent — title appears in 37 of 43 files across many directories:

[[fields.field]]
name = "title"
type = "String"
allowed = ["blog/**", "meetings/**", "people/**", "projects/**", "reference/protocols/**"]
required = ["blog/**", "meetings/**", "people/**", "projects/**", "reference/protocols/**"]

Again, allowed equals required: every file in those directories has a title. The six files without title (43 − 37) sit at the root and directly in reference/, where the bare files live.

Allowed broader than required — email exists in all people/ files except one:

[[fields.field]]
name = "email"
type = "String"
allowed = ["people/**"]
required = ["people/interns/**"]

allowed is people/** — the field may appear anywhere under people/. But required is only people/interns/** — the one subdirectory where every file happens to have it. In people/* (the non-intern profiles), some have email and some don’t, so it can’t be required there.

Present but never required — ambient_humidity appears in only one of three experiment notes:

[[fields.field]]
name = "ambient_humidity"
type = "Float"
allowed = ["projects/alpha/notes/**"]
required = []

required is empty — the field never appears in every file under any directory, so mdvs can’t require it anywhere.

The pattern

The general rule is required ⊆ allowed — you can’t require a field somewhere it’s not allowed. Within that:

required = allowed when every file in a directory has the field
required ⊂ allowed when the field is consistent in some directories but sporadic in others
required = [] when the field is sporadic — present in some files but not consistently in any directory

The three field states

Every field in mdvs.toml is in one of three states:

Constrained

Listed under [[fields.field]]. Validation enforces type, allowed paths, required paths, and nullable. mdvs update preserves constrained fields unless you explicitly use update reinfer.

[[fields.field]]
name = "draft"
type = "Boolean"
allowed = ["blog/**"]
required = ["blog/**"]
nullable = false

Only name is required — properties you omit use permissive defaults:

Property	Default	Meaning
`type`	`String`	Strict string check (add `preprocess = ["coerce-to-string"]` to accept any JSON value)
`allowed`	`["**"]`	Allowed in every file
`required`	`[]`	Not required anywhere
`nullable`	`true`	Null values accepted
`preprocess`	`[]`	No value coercion before validation

A [[fields.field]] with just a name is effectively unconstrained, but still known — useful when you want to acknowledge a field without committing to specific constraints yet.

Ignored

Listed in the ignore array. The field is known but not validated — no type checks, no path checks. mdvs update skips ignored fields entirely.

[fields]
ignore = ["internal_notes", "scratch_data"]

Use this for fields you don’t want to enforce — temporary fields, fields in flux, or fields you’ve decided aren’t worth constraining.

Unknown

Not mentioned in mdvs.toml at all. When mdvs update finds a field that isn’t constrained or ignored, it reports it as a new field and adds it to the schema.

A field can be in exactly one state. Moving a field from constrained to ignored means removing its [[fields.field]] entry and adding its name to ignore. Moving it back means the reverse.
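A tiny Python sketch of the three-way classification, assuming the schema has already been loaded into two collections of names. `field_state` is a hypothetical helper, and raising an error on overlap is an assumption about how the "exactly one state" rule might be enforced.

```python
def field_state(name, constrained, ignored):
    """Classify a field name against the [[fields.field]] names and the ignore list."""
    if name in constrained and name in ignored:
        raise ValueError(f"{name!r} is both constrained and ignored")
    if name in constrained:
        return "constrained"
    if name in ignored:
        return "ignored"
    return "unknown"  # reported as a new field


constrained = {"draft", "title"}
ignored = {"internal_notes", "scratch_data"}
print(field_state("draft", constrained, ignored))
print(field_state("algorithm", constrained, ignored))
```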

Keeping the schema current

After initial inference with mdvs init, the schema is a snapshot of your files at that moment. As files change — new fields appear, old ones shift — use mdvs update to bring the schema up to date.

Default mode

mdvs update example_kb

Only new fields are added. Existing fields are left untouched, even if their types or paths have changed. This is conservative by design — your manual edits to mdvs.toml are preserved.

Fields that disappear from all files still stay in the toml. This prevents accidental removal when files are temporarily missing.
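The default merge behaves roughly like the following Python sketch. `update_default` and the dict-based schema shape are assumptions made for illustration; ignored fields are not modeled.

```python
def update_default(schema, observed):
    """Merge freshly inferred fields into an existing schema, adding new names only.

    schema / observed: {field_name: definition}. Returns (merged, added_names).
    """
    merged = dict(schema)  # existing entries are kept verbatim, even if stale or unseen
    added = []
    for name, definition in observed.items():
        if name not in merged:
            merged[name] = definition
            added.append(name)
    return merged, added


schema = {"tags": {"type": "Array(String)"}, "old_field": {"type": "String"}}
observed = {"tags": {"type": "Array(Integer)"}, "algorithm": {"type": "String"}}
merged, added = update_default(schema, observed)
print(added)
```

Here tags keeps its hand-edited definition and old_field survives even though no file has it any more; only algorithm is added.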

Re-inferring specific fields

mdvs update example_kb reinfer tags

Treats tags as if it had never been seen — removes it from the schema, re-scans, and infers it fresh. Use this when you’ve fixed bad data (like a tags: [1, 2, 3] that should have been strings) and want the type or paths to update.

Re-inferring everything

mdvs update example_kb reinfer

When no fields are named, every field is reinferred. The entire [[fields.field]] section is rebuilt from scratch, but all other config ([scan], [embedding_model], etc.) is preserved.

This is different from mdvs init --force, which overwrites the entire mdvs.toml including non-field config.

Edge cases

Fields in a single file — get a narrow allowed glob matching just that file’s directory. Example: unit_id only in people/remo.md → allowed = ["people/*"].
Null-only fields — type defaults to String (see Types & Widening). Example: review_score is always null → String?.
Special characters in field names — names with spaces (lab section), single quotes (author's_note), or double quotes (notes"v2") are preserved as-is. They need quoting in --where clauses (see Search Guide).
Empty arrays [] — element type defaults to String, giving Array(String). If real values appear later, use update reinfer to pick up the correct element type.
Nested objects in frontmatter — flattened into dotted-name leaf fields. A YAML key like calibration: { baseline: { wavelength: 850.0 } } becomes a [[fields.field]] entry named calibration.baseline.wavelength with type Float. Each leaf gets its own nullability and allowed/required glob set. Top-level Object types are not supported in mdvs.toml; only nested Objects inside Array fields keep their inline shape (see Types & Widening).

Validation

mdvs check validates every file’s frontmatter against the schema in mdvs.toml. It’s read-only and produces no side effects — it just tells you what’s wrong. The output is byte-stable across runs: violations are sorted by (field, kind, rule) and the files within each violation are sorted by path, so CI tools that diff mdvs check output across runs get a clean comparison regardless of file-walking order.
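The ordering guarantee amounts to a two-level sort. In the Python sketch below, the record shape, the rule strings, and the plain string comparison of kinds are all assumptions for illustration; only the sort keys come from the description above.

```python
def stable_order(violations):
    """Sort violations by (field, kind, rule) and each violation's files by path."""
    ordered = sorted(violations, key=lambda v: (v["field"], v["kind"], v["rule"]))
    return [{**v, "files": sorted(v["files"])} for v in ordered]


# Hypothetical violation records, listed in arbitrary walk order.
found = [
    {"field": "title", "kind": "MissingRequired", "rule": "required",
     "files": ["scratch.md", "lab-values.md"]},
    {"field": "drift_rate", "kind": "WrongType", "rule": "type Float",
     "files": ["projects/alpha/notes/experiment-2.md"]},
    {"field": "drift_rate", "kind": "NullNotAllowed", "rule": "nullable",
     "files": ["projects/alpha/notes/experiment-3.md"]},
]
report = stable_order(found)
for v in report:
    print(v["field"], v["kind"], v["files"])
```

Feeding the same records in any other order yields the same report, which is what makes diffs across CI runs clean.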

The seven violations

Violation	Meaning
`WrongType`	The value doesn’t match the declared `type` (or fails a `pattern` regex)
`Disallowed`	The field appears in a file outside its `allowed` paths
`MissingRequired`	A file matches a `required` glob but doesn’t have the field
`NullNotAllowed`	The field is present but `null`, and `nullable` is `false`
`InvalidCategory`	The value is not in the field’s declared `categories`
`OutOfRange`	A numeric value violates `min`/`max`, or a length violates `min_length`/`max_length`
`FrontmatterUnrepresentable`	The file’s frontmatter can’t be represented as JSON (NaN/inf, non-string keys, non-object top-level)

WrongType

Fires when a value doesn’t match the declared type. If convergence_ms is declared as Boolean but a file has convergence_ms: 42, the integer value fails the boolean check.

Type checking is strict by default; two opt-in preprocessors can relax it. See Type checking rules below.

Disallowed

Fires when a field appears in a file whose path doesn’t match any of the field’s allowed globs. For example, if firmware_version has allowed = ["people/interns/**"] but appears in people/remo.md, that file is outside the allowed paths.

MissingRequired

Fires when a file’s path matches one of the field’s required globs, but the file doesn’t contain that field at all.

For example, if observation_notes has required = ["projects/alpha/notes/**"], then every file under projects/alpha/notes/ must have it. Files that don’t → MissingRequired.

NullNotAllowed

Fires when a field is present with an explicit null value, but nullable is false. For example, if drift_rate has nullable = false and a file has drift_rate: null.

This is distinct from a missing field — see Null vs absent below.

InvalidCategory

Fires when a field has a categories constraint and the value is not in the declared list. For example, if status has categories = ["draft", "published", "archived"] and a file has status: pending, the value "pending" is not in the list.

For array fields, each element is checked individually. The violation detail lists the specific offending elements.

This check only runs on non-null values that pass the type check. If the value has the wrong type, only WrongType fires — InvalidCategory is skipped. If the value is null and the field is nullable, the category check is skipped entirely.

See Constraints for how categories are configured and auto-inferred.

OutOfRange

Fires when a value violates a numeric or length bound:

min / max on numeric fields — rating: 7 with min = 1, max = 5 is above max.
min_length / max_length on string fields — slug: "a" with min_length = 3 is too short.
min_items / max_items on array fields (when emitted by inference) — applies to the array’s length.

For array fields, numeric-element bounds are checked individually. The violation detail lists the specific offending elements or, for length checks, the actual length.

This check only runs on non-null values that pass the type check, same as InvalidCategory.

See Constraints for how bounds are configured.

FrontmatterUnrepresentable

Fires when a file’s YAML frontmatter parses successfully but can’t be represented as JSON. Causes include NaN / inf floats, non-string mapping keys, or a top-level value that isn’t a mapping. The violation is reported at the document level with the sentinel field name <frontmatter>.

Pre-Wave-B mdvs silently dropped these files; they’re now surfaced explicitly so the schema can’t lie about what’s actually in your vault.

Type checking rules

Type checking is strict — a String field rejects integers, a Boolean field rejects strings, and so on. Two opt-in adjustments cover the common YAML pain points:

Preprocessors normalize before validation. A field’s preprocess array runs before jsonschema sees the value. Two built-ins:

coerce-to-string — non-string values (booleans, integers, arrays) are serialized to their JSON string representation, then validated as strings. Auto-inferred when the inferred type widened to String because of mixed-type observations.
widen-int-to-float — integers are widened to equivalent floats. Auto-inferred when the inferred type widened to Float because some files used 5 and others 5.0. Without it, a Float field rejects integer values.

Fields with empty preprocess arrays are validated strictly — there are no implicit leniencies. See Types & Widening for how inference picks the preprocessors.

Recursion. Arrays check element types recursively — an Array(Integer) field rejects ["a", "b"] because the string elements fail the Integer check. Nested frontmatter structure is validated per leaf: a config entry named calibration.baseline.wavelength is checked against the value at the corresponding nested path in the YAML. Missing intermediate Objects mean the leaf is absent — handled by the MissingRequired check.

Pattern. A pattern constraint on a String field is enforced as a regex; pattern failures surface as WrongType (with detail naming the offending value).

Date and DateTime format validation. Date and DateTime fields use JSON Schema’s format: date / format: date-time keywords. Non-conforming values (invalid calendar dates, missing timezones, wrong separators) fire WrongType with a rule like format date or format date-time. See Date and DateTime for the exact accepted shapes.

Engine

Per-value validation runs through the jsonschema crate. mdvs translates mdvs.toml’s [fields] block into a JSON Schema 2020-12 document, compiles one validator per field, runs Stage 2 preprocessors, then validates each value. Errors from jsonschema are mapped exhaustively into the seven ViolationKinds above.

One subtype check runs in Rust ahead of jsonschema: a Float field without widen-int-to-float rejects integer-backed values (5 is rejected, 5.0 is accepted). JSON Schema’s "number" accepts both — but YAML and TOML preserve the int/float distinction at parse time, and so does mdvs.
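The int/float distinction can be illustrated in Python, whose YAML and JSON parsers also keep `5` and `5.0` apart. `check_float` is a hypothetical stand-in for the subtype check, not the Rust code; accepting null here follows the rule that null passes any type check.

```python
def check_float(value, preprocess=()):
    """Return True if the value passes a Float type check."""
    if value is None:
        return True  # null is accepted by any type; nullability is checked separately
    if isinstance(value, bool):
        return False  # bool is a subclass of int in Python, so rule it out first
    if isinstance(value, int):
        return "widen-int-to-float" in preprocess  # integers need the opt-in
    return isinstance(value, float)


print(check_float(5))
print(check_float(5.0))
print(check_float(5, preprocess=["widen-int-to-float"]))
```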

Null handling

Null interacts with validation in specific ways:

The checks are independent. A null value is checked like any other value — each violation type is evaluated separately:

WrongType — null is accepted by any type, so this never fires on null.
Disallowed — the field is present (the key exists), so Disallowed fires if the path isn’t in allowed.
MissingRequired — null counts as “present”, so this never fires on null.
NullNotAllowed — fires when the value is null and nullable = false.
InvalidCategory — null skips the category check (same as WrongType), so this never fires on null.
OutOfRange — null skips the range check (same as InvalidCategory), so this never fires on null.

A single null field can trigger both Disallowed and NullNotAllowed at the same time.

Null vs absent. These are different situations with different outcomes:

Situation	Example	Result
Field is absent	File has no `drift_rate` key at all	`MissingRequired` (if path matches `required`)
Field is null, `nullable = true`	`drift_rate: null`	Passes
Field is null, `nullable = false`	`drift_rate: null`	`NullNotAllowed`

A null value counts as “present” — the field key exists in the frontmatter, it just has no value. So null never triggers MissingRequired. An absent field is genuinely missing — it can trigger MissingRequired but never NullNotAllowed.
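The presence-related checks can be sketched in Python as follows. `presence_violations` is hypothetical, the field is a plain dict using the defaults from the Constrained table, and `fnmatchcase` only approximates mdvs's glob matching (its `*` also crosses `/`).

```python
from fnmatch import fnmatchcase


def presence_violations(path, frontmatter, field):
    """Evaluate MissingRequired, Disallowed, and NullNotAllowed for one field."""
    def matches(globs):
        return any(fnmatchcase(path, g) for g in globs)

    name = field["name"]
    if name not in frontmatter:  # absent: only MissingRequired can fire
        return ["MissingRequired"] if matches(field.get("required", [])) else []
    violations = []
    if not matches(field.get("allowed", ["**"])):
        violations.append("Disallowed")  # the key exists, even if its value is null
    if frontmatter[name] is None and not field.get("nullable", True):
        violations.append("NullNotAllowed")
    return violations


drift_rate = {
    "name": "drift_rate",
    "allowed": ["projects/alpha/notes/**"],
    "required": ["projects/alpha/notes/**"],
    "nullable": False,
}
note = "projects/alpha/notes/experiment-2.md"
print(presence_violations(note, {}, drift_rate))
print(presence_violations(note, {"drift_rate": None}, drift_rate))
print(presence_violations("people/remo.md", {"drift_rate": None}, drift_rate))
```

The last call shows a single null value triggering two independent violations at once.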

Note: In YAML, unquoted null is a null value, not the string "null". To store the literal string, write drift_rate: "null" (with quotes).

New fields

When mdvs check encounters a frontmatter field that isn’t in mdvs.toml — neither constrained under [[fields.field]] nor listed in ignore — it reports it as a new field.

New fields are informational only. They don’t count as violations and don’t affect the exit code:

Checked 43 files — no violations, 1 new field(s)

╭──────────────────────────────┬─────────────────────┬─────────────────────────╮
│ "algorithm"                  │ new                 │ 2 files                 │
╰──────────────────────────────┴─────────────────────┴─────────────────────────╯

They’re shown in the output so you know to either run mdvs update to add them to the schema, or add them to the ignore list.

Bare files

When include_bare_files = true in [scan], bare files (no frontmatter at all) are included in validation. Since they have no fields, they trigger MissingRequired for any required glob matching their path.

For example, if title has required = ["**"] and scratch.md is a bare file, it triggers MissingRequired for title. This is often why the inferred schema uses narrower required globs — bare files at the root prevent required = ["**"] from being inferred for fields that don’t appear in them.

Check and build

mdvs build runs the same validation internally before embedding. If any violations are found, build aborts — no dirty data reaches the index. The violations are the same ones check would report.

This means you can use check as a dry run before building, but you don’t have to — build will catch the same problems.

Exit codes

Exit code	Meaning
0	No violations (new fields don’t count)
1	One or more violations found
2	Scan or config error (couldn’t run validation)

Constraints

Constraints are validation rules that go beyond type checking. While types ensure a value is a String or Integer, constraints refine what values are actually valid — for example, restricting a String field to a specific set of allowed values.

Constraints are not a new type. They’re an optional layer on top of the existing type system. A field without constraints is validated by type alone; a field with constraints gets an additional check.

Auto-inference

During init and update reinfer, mdvs automatically detects categorical fields using a heuristic with two conditions (both must hold):

Max distinct values — the field has at most max_categories distinct values (default: 10)
Minimum repetition — total occurrences / distinct values >= min_category_repetition (default: 3)

For array fields, distinct values and occurrences are counted at the element level.

Examples

status with 3 distinct values across 30 files: distinct=3, repetition=10 — categorical
title with 28 distinct values across 30 files: distinct=28 (exceeds cap) — not categorical
author with 5 distinct values across 5 files: repetition=1 (below threshold) — not categorical
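The heuristic is easy to reproduce. The Python sketch below uses the two thresholds described above; `is_categorical` is a hypothetical name, and the value lists are made up to match the counts in the examples.

```python
def is_categorical(values, max_categories=10, min_repetition=3):
    """Apply both conditions: few distinct values, each repeated often enough."""
    distinct = len(set(values))
    if distinct == 0:
        return False
    return distinct <= max_categories and len(values) / distinct >= min_repetition


status = ["draft", "published", "archived"] * 10            # 3 distinct, 30 occurrences
title = [f"title-{i}" for i in range(28)] + ["title-0"] * 2  # 28 distinct, 30 occurrences
author = ["a", "b", "c", "d", "e"]                           # 5 distinct, 5 occurrences

print(is_categorical(status), is_categorical(title), is_categorical(author))
```

For array fields, the same function would be fed the flattened list of elements rather than one value per file.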

Configurable thresholds

The thresholds are configurable in [fields]:

[fields]
max_categories = 10
min_category_repetition = 3

These control automatic inference only. Manually written categories in the TOML are unaffected by thresholds.

CLI flags on update reinfer override the TOML values per-invocation:

mdvs update example_kb reinfer --max-categories 15 --min-repetition 3

Range

The range constraint restricts a numeric field’s value to an inclusive [min, max] interval. It applies to:

Integer — value must satisfy min <= value <= max
Float — same, with float comparison
Array(Integer) — each element must satisfy the range
Array(Float) — same, element-wise

Both min and max are optional — you can specify just one bound. Boolean, String, Date, DateTime, and Object fields don’t support range. Date / DateTime bounds (e.g. “published after 2024-01-01”) aren’t supported in v1 — they require JSON Schema’s formatMinimum/formatMaximum vocab and are tracked as a follow-up.

TOML representation

[[fields.field]]
name = "rating"
type = "Integer"

[fields.field.constraints]
min = 1
max = 5

A Float field also accepts integer bounds, which widen to f64 for comparison:

[[fields.field]]
name = "score"
type = "Float"

[fields.field.constraints]
min = 0
max = 100

Array example — each element checked against the bounds:

[[fields.field]]
name = "ratings"
type = "Array(Integer)"

[fields.field.constraints]
min = 1
max = 10

Validation

When a value is out of bounds, check reports an OutOfRange violation with the rule (min = N, max = N) and the offending value. For arrays, the violation lists the specific elements that are out of range.

Null values follow the existing nullable logic — if nullable = true, null skips the range check.

Type rules

Bound types must match the field type:

Integer fields require integer bounds. Float bounds (e.g., min = 0.5) are rejected at config load — likely a mistake; an integer can never equal 0.5.
Float fields accept both integer and float bounds (integer bounds widen to f64).

If both bounds are present, min must be <= max — otherwise rejected at config load.
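A Python sketch of both halves, the config-load checks and the per-value check, is shown here. The function names and error messages are hypothetical; only the rules themselves come from this section.

```python
def validate_range_config(field_type, lo=None, hi=None):
    """Config-load checks: supported type, matching bound types, min <= max."""
    base = field_type[6:-1] if field_type.startswith("Array(") else field_type
    if base not in ("Integer", "Float"):
        raise ValueError(f"{field_type} does not support range")
    for bound in (lo, hi):
        if base == "Integer" and isinstance(bound, float):
            raise ValueError("Integer fields require integer bounds")
    if lo is not None and hi is not None and lo > hi:
        raise ValueError("min must be <= max")


def out_of_range(value, lo=None, hi=None):
    """Return the offending elements; a scalar is treated as a one-element list."""
    items = value if isinstance(value, list) else [value]
    return [v for v in items
            if (lo is not None and v < lo) or (hi is not None and v > hi)]


validate_range_config("Integer", 1, 5)
print(out_of_range(7, 1, 5))               # rating: 7
print(out_of_range([1, 11, 5, 0], 1, 10))  # an Array(Integer) field
```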

Manual overrides

Use the --with flag on update reinfer to override the default heuristic for specific fields:

# Force categorical (skip heuristic threshold)
mdvs update example_kb reinfer title --with=categorical

# Infer min/max from observed numeric values
mdvs update example_kb reinfer sample_count --with=range

# Strip all constraints
mdvs update example_kb reinfer status --with=none

--with takes a comma-separated list of constraint kinds: categorical, range, or none. Incompatible kinds (e.g., range,categorical on the same field) are rejected at parse time. --with=none cannot be combined with other kinds. The flag requires named fields.

Manual TOML edit — you can also add or remove constraints by hand. Running update (without reinfer) preserves existing constraints as-is. Only update reinfer re-evaluates them.

Length

The length constraint bounds string length or array length. It applies to:

String — min_length <= len(value) <= max_length, where length is the Unicode scalar count
Array(T) — min_length <= array length <= max_length

[[fields.field]]
name = "slug"
type = "String"

[fields.field.constraints]
min_length = 3
max_length = 64

Both bounds are optional. Integer fields, Float fields, and Boolean fields don’t support length. Length violations surface as OutOfRange. If both bounds are present, min_length <= max_length is enforced at config load.
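In Python terms, the check looks roughly like this. `length_violation` and its rule strings are hypothetical; Python's `len` counts code points for strings, which matches a Unicode scalar count for ordinary text.

```python
def length_violation(value, min_length=None, max_length=None):
    """Return a rule string if the length is out of bounds, else None."""
    n = len(value)  # code points for a string, item count for a list
    if min_length is not None and n < min_length:
        return f"min_length = {min_length} (actual {n})"
    if max_length is not None and n > max_length:
        return f"max_length = {max_length} (actual {n})"
    return None


print(length_violation("a", min_length=3, max_length=64))
print(length_violation("spr-a1-baseline", min_length=3, max_length=64))
```

Note that a precomposed "é" counts as one, while "e" followed by a combining accent counts as two.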

Pattern

The pattern constraint runs a regular expression against String values:

[[fields.field]]
name = "version"
type = "String"

[fields.field.constraints]
pattern = '^v\d+\.\d+\.\d+$'

The regex is compiled at config load time — invalid syntax fails fast. Pattern is currently String-only. Pattern violations surface as WrongType (with detail naming the offending value). Categorical fields can’t also have a pattern — categories already enumerate the legal forms. Date and DateTime fields don’t accept pattern either — the type’s format is itself the pattern (see Date and DateTime).
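A minimal Python sketch of the compile-early, check-later flow follows. It uses Python's `re` module, whose syntax may differ in places from the regex engine mdvs uses; the helper names and the tuple returned for a violation are assumptions.

```python
import re


def compile_pattern(pattern):
    """Compile at load time so invalid syntax fails before any file is checked."""
    try:
        return re.compile(pattern)
    except re.error as exc:
        raise ValueError(f"invalid pattern {pattern!r}: {exc}") from exc


def pattern_violation(value, compiled):
    """Pattern failures are reported as WrongType, naming the offending value."""
    if compiled.search(value):
        return None
    return ("WrongType", value)


version = compile_pattern(r"^v\d+\.\d+\.\d+$")
print(pattern_violation("v1.2.3", version))
print(pattern_violation("1.2.3", version))
```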

Conflicts between constraint kinds

Some combinations are mutually exclusive on the same field:

categories + anything else — categories enumerate the legal values; other constraints would be redundant or contradictory. Rejected at config load.
range + length — range bounds numeric values; length bounds size. They apply to different field types (numeric vs. String/Array), so they should never collide in practice; the check is still enforced.

Compatible combinations: min/max together; min_length/max_length together; pattern with min_length/max_length.

Constraint kinds summary

Constraint	Field types	Violation
`categories`	String, Integer, Array(String), Array(Integer)	`InvalidCategory`
`min` / `max`	Integer, Float, Array(Integer), Array(Float)	`OutOfRange`
`min_length` / `max_length`	String, Array(T)	`OutOfRange`
`pattern`	String	`WrongType`

Each constraint kind is a key in the [fields.field.constraints] sub-table. Compatibility is checked at config load time.

Search & Indexing

mdvs builds a search index by chunking your markdown content, embedding it with a local model, and storing chunks + vectors + frontmatter in a single LanceDB dataset. Queries are served by LanceDB natively — semantic (vector), full-text (BM25), or hybrid (both, reranked) — with optional SQL filtering on frontmatter fields.

Building the index

mdvs build (or mdvs init with auto-build) creates the search index in three steps: chunk, embed, store.

Chunking

Each file’s markdown body is split into semantic chunks — respecting headings, paragraphs, and code blocks rather than cutting at arbitrary character boundaries. The maximum chunk size is configurable (default 1024 characters) via the [chunking] section in mdvs.toml:

[chunking]
max_chunk_size = 1024

Each chunk tracks its start and end line numbers in the original file, so search results can point to the exact location.

Embedding

Chunks are embedded into dense vectors using a local Model2Vec model by Minish — static embeddings that run on CPU with no external services or GPU required. The model is downloaded from HuggingFace to the local cache on first use.

[embedding_model]
provider = "model2vec"
name = "minishlab/potion-multilingual-128M"

The default is potion-multilingual-128M — 101 languages, ~480 MB on disk. The full POTION family:

Model	Parameters	Notes
`minishlab/potion-base-2M`	2M	Smallest, fastest
`minishlab/potion-base-8M`	8M	English-only, ~60 MB — good balance for English vaults
`minishlab/potion-base-32M`	32M	English-only, higher quality, slower
`minishlab/potion-retrieval-32M`	32M	English-only, optimized for retrieval tasks
`minishlab/potion-multilingual-128M`	128M	Default — 101 languages

Any Model2Vec-compatible model on HuggingFace works — set the name to its model ID. You can pin a specific revision for reproducibility.

Storage

A single Lance dataset is written to .mdvs/index.lance/ — one row per chunk, with everything you need on the same row:

Column	Purpose
`chunk_id`, `file_id`, `chunk_index`, `start_line`, `end_line`	Chunk identity and source location
`chunk_text`	The plain-text chunk body — used by the full-text index and shown as the snippet in verbose results
`embedding`	Dense vector for semantic search (`FixedSizeList<Float32>`)
`filepath`, `content_hash`, `built_at`	Per-file metadata (duplicated on each of that file’s chunks)
`data`	Frontmatter as an Arrow Struct (nested for dotted-name fields) — this is what `--where` filters query against

Inside the dataset, two indexes are built at mdvs build time:

A full-text BM25 index on chunk_text, always built.
A cosine IVF-PQ vector index on embedding, only built when the index has at least ~10,000 chunks. Smaller vaults use LanceDB’s exact flat scan, which is plenty fast at that scale.

Incremental builds

Build only re-embeds what changed. Each file’s markdown body (excluding frontmatter) is hashed, and the hash is compared against the existing index:

Classification	Condition	Action
New	File not in index	Chunk, embed, add
Edited	Hash changed	Re-chunk, re-embed, replace chunks
Unchanged	Hash matches	Keep existing chunks
Removed	In index but not on disk	Drop file and its chunks

Frontmatter-only changes (adding a tag, fixing a typo in author) rewrite the data column on every chunk row without re-embedding — the body hash hasn’t changed, so the vectors are still valid.
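The classification table can be sketched in a few lines. This is an illustration only: the hash function (SHA-256 here) and the function names are assumptions, since the text does not say which hash mdvs uses.

```python
import hashlib


def body_hash(body: str) -> str:
    # Assumption: SHA-256 over the UTF-8 body; the real hash may differ.
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def classify(on_disk: dict[str, str], indexed: dict[str, str]) -> dict[str, list[str]]:
    """on_disk maps path -> markdown body; indexed maps path -> stored body hash."""
    result = {"new": [], "edited": [], "unchanged": [], "removed": []}
    for path, body in on_disk.items():
        if path not in indexed:
            result["new"].append(path)
        elif indexed[path] != body_hash(body):
            result["edited"].append(path)
        else:
            result["unchanged"].append(path)
    # Anything the index knows about that is no longer on disk was removed.
    result["removed"] = [path for path in indexed if path not in on_disk]
    for paths in result.values():
        paths.sort()
    return result
```

Only the `new` and `edited` lists need chunking and embedding. Since the hash covers the body alone, a frontmatter-only edit lands in `unchanged`.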

When nothing needs embedding, the model isn’t even loaded. When the change set is also empty (no new, edited, or removed files), the index write itself is skipped — mdvs build on an unchanged corpus does no Lance work at all. A --force flag bypasses both skips and triggers a full overwrite regardless of hashes. The non-force path that does need to persist a change is incremental: the rows for new, edited, and removed files are deleted and the freshly embedded chunks are appended, avoiding a full table rewrite.

How search works

When you run mdvs search "query" example_kb, LanceDB does the heavy lifting. The shape of the work depends on --mode (default hybrid):

semantic — the query is embedded with the same model used during build, and chunks are ranked by cosine similarity against embedding. Up to ~10,000 chunks, LanceDB does an exact flat scan; above that, the IVF-PQ vector index narrows the candidate set first.
fulltext — the query is tokenized and scored against the BM25 full-text index on chunk_text. No model load needed.
hybrid — both of the above run in parallel and their result lists are combined by LanceDB’s Reciprocal Rank Fusion reranker. Default mode because it tolerates queries that are either keyword-y or fuzzy.

For guidance on which mode to reach for, see Search Modes.

After LanceDB returns ranked chunk rows, mdvs deduplicates to the best chunk per file (a file with one highly relevant section ranks above a file with uniformly mediocre content) and then trims to --limit (default 10). LanceDB is asked for limit × 3 candidates to make sure dedupe has enough material to work with.
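The dedupe step can be pictured as follows. This is a sketch, assuming the rows arrive already sorted best-first and carry a `filepath` key as in the storage table above; the function names are hypothetical.

```python
def candidate_count(limit: int = 10) -> int:
    # Over-fetch so that several chunks from one file do not starve the final list.
    return limit * 3


def best_chunk_per_file(rows: list[dict], limit: int = 10) -> list[dict]:
    """Keep the first (best-ranked) chunk per file, then trim to limit."""
    seen = set()
    results = []
    for row in rows:
        if row["filepath"] in seen:
            continue  # a better chunk from this file is already in the results
        seen.add(row["filepath"])
        results.append(row)
        if len(results) == limit:
            break
    return results
```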

Scores

The score column in search output depends on the mode:

Semantic — cosine similarity, a value in roughly [0, 1] (higher = more similar).
Fulltext — BM25 relevance score, unbounded above (higher = better match).
Hybrid — RRF score, also unbounded above.

Scores depend on the mode, the model, and the content, so there’s no universal threshold for “relevant.” Compare scores relative to each other within a single query.

Filtering with `--where`

Add a SQL filter to narrow results by frontmatter fields:

mdvs search "calibration" example_kb --where "status = 'active'"

The --where clause filters on frontmatter fields — only chunks whose file matches the filter are included in the results. The filter and similarity ranking are combined in a single LanceDB query, so non-matching rows are excluded efficiently.

You can use any SQL expression that LanceDB’s filter supports:

--where "draft = false"
--where "status = 'active' AND author = 'Giulia Ferretti'"
--where "sample_count > 10"

Array fields, nested objects, and field names with special characters require specific syntax — see the Search Guide for the full reference.

Model identity

Search refuses to run if the model configured in mdvs.toml doesn’t match the model that was used to build the index. This is a hard error, not a warning.

Embeddings from different models are incompatible — cosine similarity between vectors from different models produces meaningless scores. If you change the model, rebuild the index with mdvs build --force.

Search Modes

mdvs search runs in one of three modes, controlled by --mode:

mdvs search "<query>" [path] --mode {semantic|fulltext|hybrid}

The default is hybrid. The right mode depends on what kind of question you’re asking and how confident you are about the wording.

TL;DR — which mode when

You want to find…	Pick
Something whose wording you can paraphrase but not quote	`semantic`
An exact identifier, acronym, error message, or filename	`fulltext`
Anything — let mdvs combine both signals	`hybrid` (default)

If you’re not sure, leave it on hybrid. It tends to do at least as well as either alone, at the cost of one extra index lookup that’s effectively free.

What each mode actually does

`semantic` — meaning, not words

The query is embedded into a vector with the same Model2Vec model used to build the index, and chunks are ranked by cosine similarity to that vector. Two chunks score similarly when they’re about similar things, even if they share no words.

This is the mode that finds paraphrases:

mdvs search "how to get in touch" --mode semantic
# matches a chunk that says "reach out via Slack" with no shared words

It’s also the mode that has nothing to fall back on when your query is an acronym or a unique string that the model doesn’t have a meaningful embedding for.
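For reference, cosine similarity itself is a short computation. The sketch below uses plain Python lists as stand-ins for embeddings; the vectors are made up for illustration and do not come from a real model. Mathematically the value lies in [-1, 1], which is why the documented range is only approximate.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for zero vectors in this sketch
    return dot / (norm_a * norm_b)


def rank_by_similarity(query: list[float], chunks: dict[str, list[float]]) -> list[str]:
    """Return chunk ids ordered from most to least similar to the query vector."""
    return sorted(chunks, key=lambda cid: cosine_similarity(query, chunks[cid]), reverse=True)
```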

`fulltext` — words, not meaning

The query is tokenized and scored against the BM25 inverted index on the persisted chunk_text column. No embedding model is loaded; this mode also works when no model has been downloaded yet.

Use it when you know the exact term you’re after:

mdvs search "SPR-A1" --mode fulltext           # exact equipment ID
mdvs search "calibration.toml" --mode fulltext # exact filename
mdvs search "TODO-0159" --mode fulltext        # exact ticket reference

BM25 doesn’t care about meaning at all. A search for "how to get in touch" in fulltext mode will only match chunks that contain some of those exact words.
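A toy BM25 scorer shows why paraphrases score zero in this mode. It is a sketch only: the whitespace tokenizer and the parameters `k1` and `b` are assumptions chosen for illustration, and the real index uses its own tokenizer and settings.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Assumption: lowercase whitespace tokens; real tokenizers are more elaborate.
    return text.lower().split()


def bm25_scores(query: str, docs: list[str], k1: float = 1.2, b: float = 0.75) -> list[float]:
    """Score every document against the query with the classic BM25 formula."""
    tokenized = [tokenize(doc) for doc in docs]
    n = len(tokenized)
    avgdl = sum(len(tokens) for tokens in tokenized) / n if n else 0.0
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue  # no shared word, no contribution
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            denom = tf[term] + k1 * (1 - b + b * len(tokens) / avgdl)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores
```

A chunk that says "reach out via Slack" shares no token with "how to get in touch", so its score is exactly zero regardless of meaning.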

`hybrid` — both, reranked

Hybrid runs both semantic and fulltext queries, then merges the two ranked lists with LanceDB’s Reciprocal Rank Fusion reranker. The result is a single ranking that promotes documents which scored well on either signal.

In practice this means:

A natural-language query that has no exact lexical matches still ranks the semantically-closest chunks at the top.
An exact-identifier query still surfaces the chunk that contains it verbatim, even if its surrounding context is semantically unremarkable.
Queries that are both — a phrase that mixes a concept with a specific term — get the best of both rankings.

Hybrid is the default because it makes the system tolerate vague queries and precise queries with the same flag.
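Reciprocal Rank Fusion is simple enough to sketch directly. The version below is the textbook formula, where each list contributes 1 / (k + rank) for every item it contains. The constant k = 60 is the value commonly used in the literature; whether LanceDB's reranker uses the same constant is an assumption here.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several best-first rankings into one list of (item, score) pairs."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by item id for stable output.
    return sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))


semantic_hits = ["a.md", "b.md", "c.md"]
fulltext_hits = ["c.md", "a.md", "d.md"]
fused = reciprocal_rank_fusion([semantic_hits, fulltext_hits])
```

Here `a.md` and `c.md` appear in both lists and rise above `b.md` and `d.md`, which appear in only one. Only ranks enter the formula, so the raw cosine and BM25 scores never need to be put on a common scale.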

Scores aren’t comparable across modes

The score column in the output means something different in each mode:

Mode	Score
`semantic`	Cosine similarity. Roughly `[0, 1]`. Higher = more similar in meaning.
`fulltext`	BM25 relevance score. Unbounded; depends on corpus size and term rarity. Higher = better lexical match.
`hybrid`	RRF relevance score. Unbounded but small. Higher = better.

Don’t compare scores across runs that used different modes. Within a single run, the ordering of the hits is what matters.

Performance and indexing

semantic needs the embedding model loaded. On the first run that’s a one-time download (~480 MB for the default model). Subsequent runs reuse the cached model.
fulltext doesn’t need the model at all and works as soon as mdvs build has been run.
hybrid does the semantic + fulltext work in parallel; the only extra cost over semantic alone is the BM25 lookup, which is negligible at most vault sizes.

All three modes use the same Lance dataset under .mdvs/. The BM25 full-text index on chunk_text is built every time mdvs build runs; the cosine IVF-PQ vector index on embedding is built only when the index exceeds 10,000 chunks (smaller vaults rely on LanceDB’s exact flat scan, which is plenty fast at that scale). See Search & Indexing for the storage layout.

Combining with `--where`

Mode is independent of --where. Any mode can be paired with any SQL filter:

mdvs search "drift" --mode fulltext --where "status = 'active'"
mdvs search "how the project ended" --mode semantic --where "joined < '2025-01-01'"
mdvs search "calibration" --where "draft = false"          # default mode is hybrid

The filter narrows which chunks LanceDB considers; the mode decides how they’re ranked within that narrowed set. See the Search Guide for the full --where reference.

Commands

mdvs provides eight commands covering the full workflow — from schema setup to search.

Schema & validation:

init — Scan a directory, infer a typed schema, and write mdvs.toml (or import via --from-jsonschema)
check — Validate frontmatter against the schema (optionally --jsonschema to override)
update — Re-scan files, infer new fields, and update the schema
export-jsonschema — Translate mdvs.toml’s [fields] into a JSON Schema 2020-12 document

Search index:

build — Validate, embed, and write the search index
search — Query the index with natural language

Utilities:

info — Show config and index status
clean — Delete the search index

init

Scan a directory, infer a typed schema, and write mdvs.toml.

Usage

mdvs init [path] [flags]

Flags

Flag	Default	Description
`path`	`.`	Directory to scan
`--glob`	`**`	Glob pattern for matching markdown files
`--force`		Overwrite existing `mdvs.toml`
`--dry-run`		Preview the inferred schema without writing anything
`--ignore-bare-files`		Exclude files without frontmatter
`--skip-gitignore`		Don’t read `.gitignore` patterns during scan
`--from-jsonschema PATH`		Import a JSON Schema file (`.json` or `.toml`) as the source of fields instead of scanning

Global flags (-o, -v, --logs) are described in Configuration.

Flags persist to `mdvs.toml`

Any flag passed to init that has a corresponding config field is persisted into the generated mdvs.toml — init is the only command where you don’t yet have a config file, so flag values become the project’s starting state. Persisted today:

Flag	Field written
`--glob "<pattern>"`	`[scan].glob`
`--ignore-bare-files`	`[scan].include_bare_files = false`
`--skip-gitignore`	`[scan].skip_gitignore = true`
`-o`, `--output <format>`	top-level `default_output_format`

So mdvs --output markdown init . produces a mdvs.toml that starts with default_output_format = "markdown" — every subsequent command in that project gets the markdown default without anyone passing -o again. Flags that don’t map to a config field (--force, --dry-run, --from-jsonschema, --verbose, --logs) remain one-shot modifiers. When a flag is absent, the corresponding field is omitted from the file (it stays at the global default).

init --force overwrites any persisted field with the new flag value.
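The flag-to-field mapping can be summarized as a small function. This is a sketch of the table above, not the actual implementation: the function name and its parameters are hypothetical, and it models only the persisted fields, not the rest of the generated file.

```python
def persisted_config(glob=None, ignore_bare_files=False, skip_gitignore=False, output=None) -> dict:
    """Return the config fields that the given init flags would persist."""
    config = {}
    scan = {}
    if glob is not None:
        scan["glob"] = glob
    if ignore_bare_files:
        scan["include_bare_files"] = False
    if skip_gitignore:
        scan["skip_gitignore"] = True
    if output is not None:
        config["default_output_format"] = output
    if scan:
        config["scan"] = scan
    # Absent flags leave their fields out, so the global defaults apply.
    return config
```

For example, `persisted_config(output="markdown")` yields only `default_output_format`, mirroring the `mdvs --output markdown init .` case described above.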

What it does

init scans every markdown file, extracts frontmatter, infers a typed schema with path patterns, and writes mdvs.toml. It does not build the search index; run build or search for that.

See Getting Started for a full walkthrough with output, and Schema Inference for how types and path patterns are computed.

One artifact is created: mdvs.toml — the schema file. Commit this to version control.

If mdvs.toml or .mdvs/ already exists, init refuses to run unless you pass --force. With --force, both mdvs.toml and .mdvs/ are deleted before proceeding. To update an existing schema without overwriting it, use update instead.

`init --force` vs `update reinfer`

Both re-infer the schema from scratch, but they differ in scope:

init --force overwrites the entire mdvs.toml — all sections, including [scan], [fields], and any build sections. Any manual edits are lost. .mdvs/ is also deleted.
update reinfer re-infers only the [fields] section. All other config is preserved.

Output

Compact (default)

mdvs init example_kb

Each discovered field is shown as its own key-value table with the field name on the top border. Only a few fields are shown here — the full output includes all 43:

Initialized 43 files — 43 field(s)

┌ action_items ────────────┬───────────────────────────────────────────────────┐
│ type                     │ Array(String)                                     │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ files                    │ 9 out of 43                                       │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ nullable                 │ false                                             │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ required                 │ meetings/all-hands/**                             │
│                          │ projects/alpha/meetings/**                        │
│                          │ projects/beta/meetings/**                         │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ allowed                  │ meetings/**                                       │
│                          │ projects/alpha/meetings/**                        │
│                          │ projects/beta/meetings/**                         │
└──────────────────────────┴───────────────────────────────────────────────────┘

...

┌ drift_rate ──────────────┬───────────────────────────────────────────────────┐
│ type                     │ Float                                             │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ files                    │ 3 out of 43                                       │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ nullable                 │ true                                              │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ required                 │ projects/alpha/notes/**                           │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ allowed                  │ projects/alpha/notes/**                           │
└──────────────────────────┴───────────────────────────────────────────────────┘

...

┌ title ───────────────────┬───────────────────────────────────────────────────┐
│ type                     │ String                                            │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ files                    │ 37 out of 43                                      │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ nullable                 │ false                                             │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ required                 │ blog/**                                           │
│                          │ meetings/**                                       │
│                          │ people/**                                         │
│                          │ projects/**                                       │
│                          │ reference/protocols/**                            │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ allowed                  │ blog/**                                           │
│                          │ meetings/**                                       │
│                          │ people/**                                         │
│                          │ projects/**                                       │
│                          │ reference/protocols/**                            │
└──────────────────────────┴───────────────────────────────────────────────────┘

Initialized mdvs in 'example_kb'

Each table shows the inferred type, file count, nullable status, and inferred required/allowed glob patterns. Fields with special characters in their name (e.g., lab section) include a hints row with --where syntax advice (see Search Guide).

Verbose (`-v`)

Verbose output adds pipeline timing lines before the result:

mdvs init example_kb -v

Scan: 43 files (5ms)
Infer: 43 field(s) (0ms)
Write config: example_kb/mdvs.toml (0ms)
Initialized 43 files — 43 field(s)

┌ action_items ────────────┬───────────────────────────────────────────────────┐
│ type                     │ Array(String)                                     │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ files                    │ 9 out of 43                                       │
...

The field tables are identical in both modes — verbose only adds the step lines showing processing times.

Examples

Preview the schema

Use --dry-run to see what init would infer without writing anything:

mdvs init example_kb --dry-run --force

Nothing is written — the output shows the same discovery table, followed by (dry run, nothing written).

Exclude bare files

By default, files without frontmatter are included in the scan. This affects field counts — a bare file at the root means title appears in 37 out of 43 files instead of 37 out of 37:

mdvs init example_kb --dry-run --force --ignore-bare-files

With --ignore-bare-files, only 37 files are scanned. The files row for title becomes 37 out of 37. This also affects the inferred required patterns — without bare files diluting the counts, more fields can be required in broader paths.

Import a JSON Schema (no scan)

--from-jsonschema PATH skips scanning and infers nothing. The file at PATH (.json or .toml) is the source of fields:

mdvs init example_kb --from-jsonschema fields.json

The schema is gated against mdvs’s supported keyword set before translation — unsupported features (oneOf, $ref, format, etc.) error out with an explanation. Path-scoping (allowed / required) and preprocessor stages are read from x-mdvs.* extension keys, so files exported via export-jsonschema round-trip losslessly.

The [scan], [embedding_model], [chunking], and [search] sections are not populated by this flow — the imported file only describes fields. Add build sections by hand or via a subsequent build.
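A keyword gate of the kind described above could look like the following sketch. The unsupported set contains only the three keywords named in the text (the real list is longer, as the "etc." indicates), and the function name and JSON-pointer-style paths are assumptions.

```python
# Assumption: a partial list, taken from the examples in the text.
UNSUPPORTED_KEYWORDS = {"oneOf", "$ref", "format"}


def find_unsupported(node, path: str = "#") -> list[str]:
    """Return the locations of unsupported keywords inside a JSON Schema document."""
    hits = []
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}/{key}"
            if key in UNSUPPORTED_KEYWORDS:
                hits.append(child)
            elif key == "properties" and isinstance(value, dict):
                # Keys under "properties" are field names, not schema keywords.
                for name, subschema in value.items():
                    hits.extend(find_unsupported(subschema, f"{child}/{name}"))
            else:
                hits.extend(find_unsupported(value, child))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            hits.extend(find_unsupported(item, f"{path}/{index}"))
    return hits
```

An import would proceed only when the returned list is empty; otherwise each location can be reported alongside the explanation.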

Errors

Error	Cause
`mdvs.toml already exists`	Config exists and `--force` not passed
`is not a directory`	Path doesn’t exist or isn’t a directory
`no markdown files found`	No `.md` files match the glob pattern

check

Validate frontmatter against the schema.

Usage

mdvs check [path] [flags]

Flags

Flag	Default	Description
`path`	`.`	Directory containing `mdvs.toml`
`--no-update`		Skip auto-update before validating
`--jsonschema PATH`		Override the `[fields]` block of `mdvs.toml` with an external JSON Schema file (`.json` or `.toml`) for this run only

Global flags (-o, -v, --logs) are described in Configuration.

What it does

check reads mdvs.toml, scans every markdown file, and validates each field value against the declared constraints.

By default, check auto-updates the schema before validating (see [check].auto_update). Use --no-update to skip the update pass and validate against the committed mdvs.toml as-is — what you want in CI.

It reports seven kinds of violations:

WrongType — value doesn’t match the declared type (or fails a pattern regex)
Disallowed — field appears in a file whose path doesn’t match any allowed glob
MissingRequired — file matches a required glob but the field is absent
NullNotAllowed — field is null but nullable = false
InvalidCategory — value is not in the field’s declared categories (see Constraints)
OutOfRange — numeric value violates min/max, or length violates min_length/max_length
FrontmatterUnrepresentable — file’s YAML can’t be represented as JSON (NaN/inf, non-string keys, non-object top-level)

Fields not in mdvs.toml (and not in the ignore list) are reported as new fields — these are informational and don’t count as violations.
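The two path-based kinds, Disallowed and MissingRequired, can be illustrated with a short sketch. It approximates glob matching with Python's `fnmatchcase`, whose `*` also crosses directory separators, so it is not a faithful model of the glob semantics mdvs uses. The function name and tuple layout are hypothetical.

```python
from fnmatch import fnmatchcase


def path_violations(field: str, allowed: list[str], required: list[str],
                    files: dict[str, dict]) -> list[tuple[str, str, str]]:
    """files maps path -> parsed frontmatter; returns (field, kind, path) tuples."""
    violations = []
    for path in sorted(files):  # sorted paths keep the output stable across runs
        present = field in files[path]
        if present and not any(fnmatchcase(path, glob) for glob in allowed):
            violations.append((field, "Disallowed", path))
        if not present and any(fnmatchcase(path, glob) for glob in required):
            violations.append((field, "MissingRequired", path))
    return violations
```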

check is read-only — it never modifies mdvs.toml or any files. Violations are sorted deterministically by (field, kind, rule), and files within each violation are sorted by path — output is byte-stable across runs regardless of file-walking order. See Validation for the full rules, including preprocessor handling and null behavior.

Validate against an external schema

--jsonschema PATH replaces the [fields] block for this run only. Useful for one-off validation against a contract, or for cross-checking a vault against someone else’s schema:

mdvs check example_kb --jsonschema partner-contract.json

mdvs.toml is not modified. If no mdvs.toml exists, a minimal config is synthesized in memory so the rest of the pipeline runs normally.

Output

Compact (default)

When everything passes:

mdvs check example_kb

Checked 43 files — no violations

When violations are found, each violation is shown as a key-value table with the field name, violation kind, the violated rule, and the affected files:

Checked 43 files — 3 violation(s)

Violations (3):
┌ drift_rate ──────────────┬───────────────────────────────────────────────────┐
│ kind                     │ Null value not allowed                            │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ rule                     │ not nullable                                      │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ files                    │ projects/alpha/notes/experiment-2.md              │
└──────────────────────────┴───────────────────────────────────────────────────┘

┌ priority ────────────────┬───────────────────────────────────────────────────┐
│ kind                     │ Wrong type                                        │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ rule                     │ type Integer                                      │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ files                    │ projects/beta/notes/initial-findings.md (got Stri │
│                          │ ng)                                               │
│                          │ projects/beta/overview.md (got String)            │
└──────────────────────────┴───────────────────────────────────────────────────┘

┌ title ───────────────────┬───────────────────────────────────────────────────┐
│ kind                     │ Missing required                                  │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ rule                     │ required in ["**"]                                │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ files                    │ README.md                                         │
│                          │ lab-values.md                                     │
│                          │ reference/glossary.md                             │
│                          │ reference/quick-start.md                          │
│                          │ reference/tools.md                                │
│                          │ scratch.md                                        │
└──────────────────────────┴───────────────────────────────────────────────────┘

WrongType violations include the actual type in parentheses (e.g., got String).

Verbose (`-v`)

Verbose output adds pipeline timing lines before the result:

Read config: example_kb/mdvs.toml (3ms)
Scan: 43 files (2ms)
Validate: 43 files — 3 violation(s) (78ms)
Checked 43 files — 3 violation(s)

Violations (3):
┌ drift_rate ──────────────┬───────────────────────────────────────────────────┐
│ kind                     │ Null value not allowed                            │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ rule                     │ not nullable                                      │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ files                    │ projects/alpha/notes/experiment-2.md              │
└──────────────────────────┴───────────────────────────────────────────────────┘

...

The violation tables are identical in both modes — verbose only adds the step lines showing processing times.

Exit codes

Code	Meaning
`0`	All files valid — no violations
`1`	Violations found
`2`	Pipeline error (missing `mdvs.toml`, invalid config, scan failure)

New fields don’t affect the exit code — they’re informational only.
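In a CI script, the exit codes above can be mapped to outcomes as in this sketch. The helper names are hypothetical, and `run_check` assumes an `mdvs` binary on PATH; only the command, the `--no-update` flag, and the three codes come from this page.

```python
import subprocess

EXIT_MEANINGS = {
    0: "valid",           # all files valid, no violations
    1: "violations",      # at least one violation found
    2: "pipeline error",  # missing mdvs.toml, invalid config, scan failure
}


def describe_check_exit(code: int) -> str:
    return EXIT_MEANINGS.get(code, "unknown")


def run_check(path: str = ".") -> tuple[str, str]:
    """Run `mdvs check` against the committed schema and interpret the result."""
    proc = subprocess.run(
        ["mdvs", "check", path, "--no-update"],
        capture_output=True,
        text=True,
    )
    return describe_check_exit(proc.returncode), proc.stdout
```

A pipeline would typically fail the job on anything other than `"valid"`, while printing the captured output so the violation tables show up in the build log.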

Errors

Error	Cause
`no mdvs.toml found`	Config doesn’t exist — run `mdvs init` first
`mdvs.toml is invalid`	TOML parsing or schema error — fix the file or run `mdvs init --force`

update

Re-scan files, infer new fields, and update the schema.

Usage

mdvs update [path] [--dry-run]
mdvs update [path] reinfer [fields..] [flags]

Flags

Flag	Default	Description
`path`	`.`	Directory containing `mdvs.toml`
`--dry-run`		Preview changes without writing anything

Global flags (-o, -v, --logs) are described in Configuration.

What it does

update re-scans the directory using the existing [scan] config, infers types and path patterns from the current files, and merges the results into mdvs.toml. Unlike init, it preserves all existing configuration — only the [fields] section changes.

Default mode

By default, update only discovers new fields — fields that appear in frontmatter but aren’t yet in mdvs.toml (either as [[fields.field]] entries or in the ignore list). Existing fields are protected: their types, allowed/required patterns, nullable flags, and constraints don’t change.

Fields that disappear (no longer in any file) are kept in mdvs.toml by default. This is conservative — removing a field from the schema is an explicit action.
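The default merge behavior amounts to "add what is new, touch nothing else". A minimal sketch, with hypothetical names and plain dicts standing in for field definitions:

```python
def merge_new_fields(existing: dict, ignored: set, inferred: dict) -> tuple[dict, list]:
    """Return the merged schema and the sorted names of newly added fields."""
    merged = dict(existing)  # existing definitions are kept exactly as they are
    added = []
    for name, definition in inferred.items():
        if name in existing or name in ignored:
            continue  # protected or explicitly ignored
        merged[name] = definition
        added.append(name)
    # Fields absent from `inferred` stay in `merged`: nothing is removed here.
    return merged, sorted(added)
```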

`reinfer` subcommand

Re-infer field definitions from scratch. This is a subcommand of update with its own flags:

Flag	Description
`fields..`	Fields to reinfer (all if none specified)
`--with <kinds>`	Comma-separated constraint kinds to apply (`categorical`, `range`, `none`). Requires named fields.
`--max-categories <N>`	Override max distinct values for categorical inference
`--min-repetition <N>`	Override min average repetition for categorical inference
`--dry-run`	Preview changes without writing anything

Reinfer specific fields:

mdvs update example_kb reinfer drift_rate priority

The named fields are removed from mdvs.toml and re-inferred from scratch, as if they’d never been seen. All other fields stay protected. Fails if a named field isn’t in mdvs.toml.

Without --with, reinfer applies the default heuristic (categorical detection — see Constraints). Use --with to override:

# Force categorical (skip heuristic threshold)
mdvs update example_kb reinfer title --with=categorical

# Infer min/max from observed numeric values
mdvs update example_kb reinfer sample_count --with=range

# Strip all constraints
mdvs update example_kb reinfer status --with=none

--with takes a comma-separated list. Incompatible kinds (e.g., range,categorical on the same field) are rejected at parse time. --with=none cannot be combined with other kinds. --with requires named fields.
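These rules can be sketched as a small parser. The function name is hypothetical and the error strings follow the Errors table on this page; the exact order of the two kinds in the mutual-exclusion message is an assumption.

```python
VALID_KINDS = {"categorical", "range", "none"}


def parse_with(value: str, fields: list[str]) -> list[str]:
    """Validate a --with value against the rules described above."""
    if not fields:
        raise ValueError("--with requires named fields")
    kinds = [kind.strip() for kind in value.split(",") if kind.strip()]
    for kind in kinds:
        if kind not in VALID_KINDS:
            raise ValueError(f"--with: unknown kind '{kind}'")  # wording is an assumption
    if "none" in kinds and len(kinds) > 1:
        raise ValueError("--with=none cannot be combined with other kinds")
    if "categorical" in kinds and "range" in kinds:
        raise ValueError("--with: categorical and range are mutually exclusive")
    return kinds
```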

Reinfer all fields:

mdvs update example_kb reinfer

When no fields are specified, all [[fields.field]] entries are removed and rebuilt from the current files. Fields that no longer exist in any file are reported as removed.

All other config sections ([scan], [embedding_model], [chunking], [search], [update]) are preserved. This is the key difference from init --force, which rewrites the entire mdvs.toml.

Output

Compact (default)

When the schema is already up to date:

Scanned 43 files — no changes (37 unchanged) (dry run)

When new fields are discovered, they appear in an “Added” section with the same key-value format as init:

Scanned 44 files — 1 field(s) changed (37 unchanged) (dry run)

Added (1):
┌ category ────────────────┬───────────────────────────────────────────────────┐
│ type                     │ String                                            │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ files                    │ 3 out of 44                                       │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ nullable                 │ false                                             │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ required                 │ (none)                                            │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ allowed                  │ projects/alpha/notes/**                           │
└──────────────────────────┴───────────────────────────────────────────────────┘

When reinfer detects a type change, the “Changed” section shows old and new values with an arrow:

Scanned 43 files — 1 field(s) changed (36 unchanged)

Changed (1):
┌ drift_rate ──────────────┬───────────────────────────────────────────────────┐
│ type                     │ Float → String                                    │
└──────────────────────────┴───────────────────────────────────────────────────┘

When a reinferred field no longer exists in any file:

Scanned 43 files — 1 field(s) changed (36 unchanged)

Removed (1):
┌ category ────────────────┬───────────────────────────────────────────────────┐
│ previously allowed       │ projects/alpha/notes/**                           │
└──────────────────────────┴───────────────────────────────────────────────────┘

Verbose (`-v`)

Verbose output adds pipeline timing lines before the result:

Read config: example_kb/mdvs.toml (2ms)
Scan: 44 files (3ms)
Infer: 38 field(s) (0ms)
Write config: example_kb/mdvs.toml (1ms)
Scanned 44 files — 1 field(s) changed (37 unchanged)

Added (1):
┌ category ────────────────┬───────────────────────────────────────────────────┐
│ type                     │ String                                            │
...

The field tables are identical in both modes — verbose only adds the step lines showing processing times.

Exit codes

Code	Meaning
`0`	Success (changes written, or no changes needed)
`2`	Pipeline error (missing config, scan failure, build failure)

Errors

Error	Cause
`no mdvs.toml found`	Config doesn’t exist — run `mdvs init` first
`field '<name>' is not in mdvs.toml`	`reinfer` names a field that doesn’t exist
`--with requires named fields`	`--with` flag used without specifying fields
`--with: <X> and <Y> are mutually exclusive`	Incompatible constraint kinds in the same `--with` list
`--with=none cannot be combined with other kinds`	`none` mixed with other kinds in `--with`
`field name conflicts with internal column`	New field name collides with reserved names

build

Validate, embed, and write the search index.

Usage

mdvs build [path] [flags]

Flags

Flag	Default	Description
`path`	`.`	Directory containing `mdvs.toml`
`--set-model`		Change embedding model (requires `--force`)
`--set-revision`		Pin model to a specific HuggingFace revision (requires `--force`)
`--set-chunk-size`		Change max chunk size in characters (requires `--force`)
`--force`		Confirm config changes or trigger a full rebuild
`--no-update`		Skip auto-update before building

Global flags (-o, -v, --logs) are described in Configuration.

What it does

build creates (or updates) the search index in .mdvs/. The pipeline:

Read config — parse mdvs.toml. If [embedding_model], [chunking], or [search] sections are missing, they’re added with defaults and written back.

By default, build auto-updates the schema before building (see [build].auto_update). Use --no-update to validate against the committed schema (deterministic CI). The auto chain is cheap on unchanged corpora — no model load, no Lance write.

Scan — walk the directory and extract frontmatter.
Validate — check frontmatter against the schema (same as check). If violations are found, the build aborts.
Classify — compare scanned files against the existing index to determine what needs embedding, what to retain, and what to drop.
Load model — download or load the cached embedding model. Skipped if nothing needs embedding.
Embed — chunk and embed new/edited files.
Write index — branches on the change set: skip (nothing changed and not a full rebuild), full overwrite (first build or --force), or incremental (delete the rows for new/edited/removed files, append the new chunks, refresh metadata, optimize). The Lance dataset is always at .mdvs/index.lance/ with one row per chunk; a full-text BM25 index on chunk_text is rebuilt with the table; a cosine IVF-PQ vector index on embedding is created only above 10,000 chunks (smaller vaults rely on LanceDB’s exact flat scan).

See Search & Indexing for details on chunking, embedding, and how the index is structured.

Incremental builds

Build is incremental by default. It classifies each file by comparing its content hash against the existing index:

Status	Condition	Action
new	file not in existing index	chunk + embed
edited	file in index, content changed	chunk + re-embed
unchanged	file in index, content matches	keep existing chunks
removed	file in index, no longer on disk	drop from index

Content hash covers the file body only (after frontmatter extraction). Frontmatter-only changes don’t trigger re-embedding — but every chunk row is rewritten with fresh frontmatter from the current scan.

When nothing needs embedding, the model is never loaded. When the change set is empty (no new/edited/removed files), the index write itself is also skipped — mdvs build on an unchanged corpus does no Lance work at all.
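
A minimal sketch of the classification rule, assuming SHA-256 as the content hash and YAML `---` fences only (the actual hash function and frontmatter parser are not specified here, so both are assumptions). `body_hash` and `classify` are hypothetical helper names.

```python
import hashlib


def body_hash(text: str) -> str:
    """Hash the body only, skipping a leading YAML frontmatter block if present."""
    lines = text.splitlines(keepends=True)
    if lines and lines[0].strip() == "---":
        for i in range(1, len(lines)):
            if lines[i].strip() == "---":
                lines = lines[i + 1:]   # keep everything after the closing fence
                break
    return hashlib.sha256("".join(lines).encode("utf-8")).hexdigest()


def classify(on_disk: dict, indexed: dict) -> dict:
    """on_disk maps path -> file text; indexed maps path -> stored body hash."""
    status = {}
    for path, text in on_disk.items():
        if path not in indexed:
            status[path] = "new"
        elif indexed[path] != body_hash(text):
            status[path] = "edited"
        else:
            status[path] = "unchanged"
    for path in indexed:
        if path not in on_disk:
            status[path] = "removed"
    return status
```

Because only the body is hashed, two versions of a file that differ only in frontmatter produce the same hash and are classified as unchanged.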

Config changes

build detects when the embedding configuration has changed since the last build by comparing mdvs.toml against metadata stored on the Lance dataset. If a mismatch is found, the build refuses to proceed unless you pass --force:

config changed since last build:
  model: 'minishlab/potion-multilingual-128M' → 'minishlab/potion-base-8M'
Use --force to rebuild with new config

The same check covers schema changes. A hash of the post-translation JSON Schema is stored on the Lance dataset; if the current schema doesn’t match, the build refuses with:

schema: fields, types, constraints, path-scoping, or preprocessors have changed
Use --force to rebuild with new schema

This catches edits to [[fields.field]] definitions, constraint changes, preprocessor changes, and path-scoping changes — anything that affects what gets stored in the data column of the index.

The --set-model, --set-revision, and --set-chunk-size flags update mdvs.toml and require --force (since they change the config and trigger a full re-embed). For example, to switch to a smaller English-only model:

mdvs build --set-model minishlab/potion-base-8M --force

--set-revision pins the model to a specific HuggingFace commit SHA, ensuring reproducible embeddings even if the model is updated upstream:

mdvs build --set-revision abc123def --force

The revision is stored in mdvs.toml under [embedding_model].revision and checked against the Lance dataset metadata on subsequent builds. See Embedding for the full list of available models.

On the first build (no existing .mdvs/), --force is never needed.
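
The gate described in this section can be illustrated as follows. This is a sketch under stated assumptions: the dictionary keys (`model`, `chunk_size`) and the function names are hypothetical, and the real metadata layout on the Lance dataset is not shown in this reference.

```python
def config_diff(stored: dict, current: dict) -> list:
    """Return one readable line for every key whose value differs."""
    keys = sorted(set(stored) | set(current))
    return [
        f"{k}: {stored.get(k)!r} -> {current.get(k)!r}"
        for k in keys
        if stored.get(k) != current.get(k)
    ]


def check_build_allowed(stored, current: dict, force: bool) -> list:
    """Raise unless the configs match, force is set, or there is no index yet."""
    if stored is None:      # first build: nothing to compare against
        return []
    diff = config_diff(stored, current)
    if diff and not force:
        raise RuntimeError("config changed since last build:\n  " + "\n  ".join(diff))
    return diff
```

Passing `stored=None` models the first build, where no force flag is required.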

Output

Compact (default)

When nothing needs embedding (incremental build, all files unchanged):

Built index — 43 files, 59 chunks

┌──────────────────────────┬───────────────────────────────────────────────────┐
│ full rebuild             │ false                                             │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ files total              │ 43                                                │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ files embedded           │ 0                                                 │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ files unchanged          │ 43                                                │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ files removed            │ 0                                                 │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ chunks total             │ 59                                                │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ chunks embedded          │ 0                                                 │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ chunks unchanged         │ 59                                                │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ chunks removed           │ 0                                                 │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ new fields               │ (none)                                            │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ embedded files           │ (none)                                            │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ removed files            │ (none)                                            │
└──────────────────────────┴───────────────────────────────────────────────────┘

When violations are found, the build aborts:

Build aborted — 6 violation(s) found. Run `mdvs check` for details.

Verbose (`-v`)

Verbose output adds pipeline timing lines before the result. Steps that didn’t need to run (model load on an unchanged corpus, the index write itself when nothing changed) are silently elided from the text output, but appear as "status": "skipped" in --output json. A full-rebuild verbose run:

Read config: example_kb/mdvs.toml (4ms)
Scan: 43 files (4ms)
Infer: 37 field(s) (0ms)
Validate: 43 files — no violations (87ms)
Classify: 43 to embed, 0 unchanged, 0 removed (0ms)
Load model: minishlab/potion-multilingual-128M (24ms)
Embed: 43 files, 59 chunks (12ms)
Write index: 43 files, 59 chunks (1ms)
Built index — 43 files, 59 chunks (full rebuild)

┌──────────────────────────┬───────────────────────────────────────────────────┐
│ full rebuild             │ true                                              │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ files total              │ 43                                                │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ files embedded           │ 43                                                │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ files unchanged          │ 0                                                 │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ ...                                                                          │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ embedded files           │ README.md (7 chunks)                              │
│                          │ blog/drafts/grant-ideas.md (2 chunks)             │
│                          │ ...                                               │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ removed files            │ (none)                                            │
└──────────────────────────┴───────────────────────────────────────────────────┘

The key-value table is identical in both modes — verbose only adds the step lines showing processing times. When files are embedded, the embedded files row lists each file with its chunk count.

Exit codes

Code	Meaning
`0`	Build completed successfully
`1`	Violations found — build aborted
`2`	Pipeline error (missing config, scan failure, config mismatch, model failure)
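
For scripting, the exit codes above can be mapped to messages in a small wrapper. This is a sketch, assuming `mdvs` is on `PATH`; `describe_build_exit` and `run_build` are hypothetical helper names, and only the flags documented on this page are used.

```python
import subprocess

# Meanings copied from the exit-code table above.
BUILD_EXIT_MEANINGS = {
    0: "build completed successfully",
    1: "violations found, build aborted",
    2: "pipeline error",
}


def describe_build_exit(code: int) -> str:
    """Translate a build exit code into a short message."""
    return BUILD_EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_build(path: str = ".") -> int:
    """Run `mdvs build <path> --no-update` without a shell and report the outcome."""
    result = subprocess.run(["mdvs", "build", path, "--no-update"])
    print(describe_build_exit(result.returncode))
    return result.returncode
```

A CI job could call `run_build("example_kb")` and fail the step on any non-zero return value.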

Errors

Error	Cause
`no mdvs.toml found`	Config doesn’t exist — run `mdvs init` first
`config changed since last build`	Config differs from Lance dataset metadata — use `--force`
`--set-model requires --force`	Changing model triggers full re-embed
`--set-chunk-size requires --force`	Changing chunk size triggers full re-embed
`dimension mismatch`	Model produces different dimensions than existing index (incremental build only — `--force` bypasses this)

search

Query the index with natural language.

Usage

mdvs search <query> [path] [flags]

Flags

Flag	Default	Description
`query`	(required)	Natural language search query
`path`	`.`	Directory containing `mdvs.toml`
`--mode`	`hybrid`	Search mode: `semantic`, `fulltext`, or `hybrid`
`--limit` / `-n`	`10`	Maximum number of results
`--where`		SQL WHERE clause — filter on frontmatter fields or on the `filepath` column
`--no-update`		Skip auto-update
`--no-build`		Skip auto-build before searching

The default limit can be changed in mdvs.toml via [search].default_limit.

Global flags (-o, -v, --logs) are described in Configuration.

What it does

search loads the Lance index from .mdvs/, runs the query through LanceDB, and ranks files by their best-matching chunk. The exact ranking depends on --mode:

semantic — embed the query with the same model that built the index, cosine-rank chunks against embedding.
fulltext — BM25 rank chunks against the persisted chunk_text (no model load needed).
hybrid (default) — run both and combine with LanceDB’s Reciprocal Rank Fusion reranker.

Each file’s score is the best chunk match across all its chunks (see scoring). Results are sorted descending (higher = better match).
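
To make the hybrid ranking concrete, here is a generic sketch of Reciprocal Rank Fusion followed by best-chunk-per-file aggregation. It is not LanceDB's implementation: the constant `k=60` is the value conventionally used for RRF and is an assumption here, and `rrf_fuse` and `best_chunk_per_file` are hypothetical names.

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: score(item) = sum over lists of 1 / (k + rank), rank starting at 1."""
    scores = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    return scores


def best_chunk_per_file(chunk_scores):
    """Collapse (file, chunk) scores to one entry per file, keeping the best chunk."""
    best = {}
    for (path, chunk_id), score in chunk_scores.items():
        if path not in best or score > best[path][1]:
            best[path] = (chunk_id, score)
    # Sort descending: a higher score means a better match.
    return sorted(
        ((path, chunk, score) for path, (chunk, score) in best.items()),
        key=lambda row: row[2],
        reverse=True,
    )
```

Each ranking is a list of `(file, chunk_id)` pairs ordered best first, one list from the semantic pass and one from the full-text pass. A chunk that appears in both lists accumulates two reciprocal-rank terms, which is why agreement between the two modes pushes a file up.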

By default, search auto-builds the index before querying, which includes auto-updating the schema (see [search].auto_build). The chain is cheap on unchanged corpora — update is fast, classify sees no work, and the Lance write is skipped. Use --no-build to query a pre-built index without touching it (deterministic CI search), or --no-update to build against the committed schema.

See Search & Indexing for details on chunking, embedding, scoring, and model identity.

First run

Note: The very first time search (or build) runs, mdvs downloads the embedding model from HuggingFace to a local cache. This is a one-time download — subsequent runs use the cached model and start instantly.

Download size depends on the model:

Model	Size
`potion-base-2M`	~8 MB
`potion-base-8M`	~30 MB
`potion-base-32M`	~120 MB
`potion-multilingual-128M` (default)	~480 MB

After the model is cached, a full build of 500+ files completes in under a second.


`--where`

Filter results using SQL syntax. The filter and similarity ranking are combined in a single query, so files that don’t match are excluded efficiently. --where operates on any column in the Lance index — frontmatter fields (auto-discovered from mdvs.toml) and the always-present filepath column.

Scalar frontmatter comparisons:

mdvs search "experiment" --where "status = 'active'"
mdvs search "experiment" --where "sample_count > 20"
mdvs search "experiment" --where "status = 'active' AND priority = 1"

Array fields — = / != / IN / NOT IN are auto-rewritten to array_has(...):

mdvs search "calibration" --where "tags = 'biosensor'"               # auto-rewritten
mdvs search "calibration" --where "tags IN ('biosensor', 'optics')"  # OR-chain of array_has

The translation note at the top of the result shows the rewrite. --where clauses that reference Array(Float) fields are rejected up front with a clear error — see the Search Guide for the workaround.
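
The array rewrite can be pictured with a small regex-based sketch. It handles only a single simple predicate, and the exact text mdvs emits (in particular for `!=` and `NOT IN`) is an assumption; `rewrite_array_predicate` is a hypothetical name, not part of mdvs.

```python
import re

_VALUE = r"'(?:[^']|'')*'"   # one single-quoted SQL string literal
_EQ = re.compile(rf"^\s*(\w+)\s*(!=|=)\s*({_VALUE})\s*$")
_IN = re.compile(
    rf"^\s*(\w+)\s+(NOT\s+IN|IN)\s*\(\s*({_VALUE}(?:\s*,\s*{_VALUE})*)\s*\)\s*$",
    re.IGNORECASE,
)


def rewrite_array_predicate(predicate: str, array_fields: set) -> str:
    """Rewrite one simple predicate on an array field; return anything else unchanged."""
    m = _EQ.match(predicate)
    if m and m.group(1) in array_fields:
        call = f"array_has({m.group(1)}, {m.group(3)})"
        return call if m.group(2) == "=" else f"NOT {call}"
    m = _IN.match(predicate)
    if m and m.group(1) in array_fields:
        values = re.findall(_VALUE, m.group(3))
        chain = " OR ".join(f"array_has({m.group(1)}, {v})" for v in values)
        negated = m.group(2).upper().startswith("NOT")
        return f"NOT ({chain})" if negated else f"({chain})"
    return predicate
```

Scalar fields pass through untouched, so `status = 'active'` stays as written while `tags = 'biosensor'` becomes an `array_has` call.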

Path filtering via the always-present filepath column (its last component is the filename):

mdvs search "race condition" --where "filepath LIKE 'logs/%'"        # everything under logs/
mdvs search "review" --where "filepath LIKE '%-postmortem.md'"       # filename suffix
mdvs search "alpha" --where "filepath = 'projects/alpha/overview.md'" # exact path
mdvs search "deploy" --where "filepath LIKE 'logs/%' AND status = 'published'"  # combine

Field names with spaces need double-quoting:

mdvs search "query" --where "\"lab section\" = 'optics'"
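
When calling search from a script, passing arguments as a list avoids the extra layer of shell escaping seen in the last example. The sketch below assumes standard SQL quoting for string literals (a single quote inside a value is doubled); the Search Guide remains the authority on escaping rules. `sql_quote`, `sql_ident`, and `search_argv` are hypothetical helpers.

```python
def sql_quote(value: str) -> str:
    """Quote a string literal: wrap in single quotes and double any inside (assumed SQL rule)."""
    return "'" + value.replace("'", "''") + "'"


def sql_ident(name: str) -> str:
    """Double-quote a field name so that names containing spaces work."""
    if '"' in name:
        raise ValueError("field names containing double quotes are not handled in this sketch")
    return '"' + name + '"'


def search_argv(query, path=".", where=None, limit=None, mode=None):
    """Build the argument list for `mdvs search`; no shell is involved."""
    argv = ["mdvs", "search", query, path]
    if where is not None:
        argv += ["--where", where]
    if limit is not None:
        argv += ["-n", str(limit)]
    if mode is not None:
        argv += ["--mode", mode]
    return argv

# Usage (requires mdvs on PATH):
#   import subprocess
#   subprocess.run(search_argv("query", where='"lab section" = \'optics\''))
```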

See Search Guide for the full --where reference, including nested objects, escaping rules, and more examples.

Output

Compact (default)

mdvs search "experiment" example_kb -n 3

A header table shows the query metadata, followed by one key-value table per hit numbered #1, #2, etc. Each hit includes the file, similarity score, line range, and the best-matching chunk text:

Searched "experiment" — 3 hits

┌──────────────────────────┬───────────────────────────────────────────────────┐
│ query                    │ experiment                                        │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ model                    │ minishlab/potion-multilingual-128M                │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ limit                    │ 3                                                 │
└──────────────────────────┴───────────────────────────────────────────────────┘

┌ #1 ──────────────────────┬───────────────────────────────────────────────────┐
│ file                     │ projects/archived/gamma/lessons-learned.md        │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ score                    │ 0.487                                             │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ lines                    │ 26-28                                             │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ text                     │ ## On REMO                                        │
│                          │                                                   │
│                          │ REMO's environmental monitoring data from the out │
│                          │ door tests was the most useful output of the enti │
│                          │ re project. ...                                   │
└──────────────────────────┴───────────────────────────────────────────────────┘

┌ #2 ──────────────────────┬───────────────────────────────────────────────────┐
│ file                     │ blog/published/2031/founding-story.md             │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ score                    │ 0.470                                             │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ lines                    │ 21-21                                             │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ text                     │ We are a small lab and we intend to stay small... │
└──────────────────────────┴───────────────────────────────────────────────────┘

┌ #3 ──────────────────────┬───────────────────────────────────────────────────┐
│ file                     │ projects/archived/gamma/post-mortem.md            │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ score                    │ 0.457                                             │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ lines                    │ 11-21                                             │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ text                     │ # Project Gamma — Post-Mortem ...                 │
└──────────────────────────┴───────────────────────────────────────────────────┘

With --where filtering, only files matching the SQL clause are included:

mdvs search "experiment" example_kb --where "status = 'active'" -n 5

Searched "experiment" — 3 hits

┌──────────────────────────┬───────────────────────────────────────────────────┐
│ query                    │ experiment                                        │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ model                    │ minishlab/potion-multilingual-128M                │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ limit                    │ 5                                                 │
└──────────────────────────┴───────────────────────────────────────────────────┘

┌ #1 ──────────────────────┬───────────────────────────────────────────────────┐
│ file                     │ projects/alpha/overview.md                        │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ score                    │ 0.391                                             │
...

Verbose (`-v`)

Verbose output adds pipeline timing lines before the result:

mdvs search "experiment" example_kb -v -n 3

Read config: example_kb/mdvs.toml (2ms)
Scan: 43 files (2ms)
...
Load model: minishlab/potion-multilingual-128M (22ms)
Embed query: "experiment" (0ms)
Execute search: 3 hits (5ms)
Searched "experiment" — 3 hits

┌──────────────────────────┬───────────────────────────────────────────────────┐
│ query                    │ experiment                                        │
...

The hit tables are identical in both modes — verbose only adds the step lines showing processing times.

Exit codes

Code	Meaning
`0`	Search completed (even with 0 results)
`2`	Pipeline error (missing config, missing index, model mismatch, invalid `--where`)

Errors

Error	Cause
`no mdvs.toml found`	Config doesn’t exist — run `mdvs init` first
`index not found`	`.mdvs/` doesn’t exist — run `mdvs build` first
`model mismatch`	Config model differs from index — run `mdvs build` to rebuild
Invalid `--where`	SQL syntax error or unknown field name

info

Show config and index status.

Usage

mdvs info [path]

Flags

Flag	Default	Description
`path`	`.`	Directory containing `mdvs.toml`

Global flags (-o, -v, --logs) are described in Configuration.

What it does

info reads mdvs.toml, counts files on disk, and reads the index metadata from .mdvs/ (if it exists). It displays the current schema and index status without modifying anything.

Use it to check which fields are configured, whether the index is up to date, or if the config has changed since the last build.

Output

Compact (default)

mdvs info example_kb

The output is organized into sections: Config, Index (if built), and one key-value table per field. Only a few fields are shown here:

43 files, 43 fields, 59 chunks

Config:
┌──────────────────────────┬───────────────────────────────────────────────────┐
│ scan glob                │ **                                                │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ ignored fields           │ (none)                                            │
└──────────────────────────┴───────────────────────────────────────────────────┘

Index:
┌──────────────────────────┬───────────────────────────────────────────────────┐
│ model                    │ minishlab/potion-multilingual-128M                │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ revision                 │ none                                              │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ chunk size               │ 1024                                              │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ built                    │ 2026-03-29T15:22:21.347671+00:00                  │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ config                   │ match                                             │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ files                    │ 43 out of 43                                      │
└──────────────────────────┴───────────────────────────────────────────────────┘

43 fields:
┌ action_items ────────────┬───────────────────────────────────────────────────┐
│ type                     │ Array(String)                                     │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ files                    │ 9 out of 43                                       │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ nullable                 │ false                                             │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ required                 │ meetings/all-hands/**                             │
│                          │ projects/alpha/meetings/**                        │
│                          │ projects/beta/meetings/**                         │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ allowed                  │ meetings/**                                       │
│                          │ projects/alpha/meetings/**                        │
│                          │ projects/beta/meetings/**                         │
└──────────────────────────┴───────────────────────────────────────────────────┘

...

┌ drift_rate ──────────────┬───────────────────────────────────────────────────┐
│ type                     │ Float                                             │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ files                    │ 3 out of 43                                       │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ nullable                 │ true                                              │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ required                 │ projects/alpha/notes/**                           │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ allowed                  │ projects/alpha/notes/**                           │
└──────────────────────────┴───────────────────────────────────────────────────┘

...

The config row shows match when mdvs.toml matches the index metadata, or changed when the config has been modified since the last build. The files row shows indexed files vs files on disk.

When no index has been built:

43 files, 43 fields

Config:
┌──────────────────────────┬───────────────────────────────────────────────────┐
│ scan glob                │ **                                                │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ ignored fields           │ (none)                                            │
└──────────────────────────┴───────────────────────────────────────────────────┘

43 fields:
...

The Index section is omitted and the summary shows only files and fields (no chunk count).

Verbose (`-v`)

Verbose output adds pipeline timing lines before the result:

Read config: example_kb/mdvs.toml (2ms)
Scan: 43 files (3ms)
Read index: 43 files, 59 chunks (2ms)
43 files, 43 fields, 59 chunks

Config:
...

The tables are identical in both modes — verbose only adds the step lines showing processing times.

Exit codes

Code	Meaning
`0`	Success (including when no index exists)
`2`	Pipeline error (missing config, Lance dataset read failure)

Errors

Error	Cause
`no mdvs.toml found`	Config doesn’t exist — run `mdvs init` first

clean

Delete the search index.

Usage

mdvs clean [path]

Flags

Flag	Default	Description
`path`	`.`	Directory containing `mdvs.toml`

Global flags (-o, -v, --logs) are described in Configuration.

What it does

clean deletes the .mdvs/ directory, which contains the Lance dataset that makes up the search index (plus the cached embedding model). The mdvs.toml configuration file is never touched — you can rebuild the index at any time with build.

The command is idempotent — running it when .mdvs/ doesn’t exist is a no-op. It also refuses to delete if .mdvs/ is a symlink, as a safety measure.
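
The three behaviors described here (delete, no-op when missing, refuse symlinks) can be sketched as below. This is an illustration, not the actual implementation; `clean_index` and the keys of the returned dictionary are hypothetical.

```python
import shutil
from pathlib import Path


def clean_index(vault: Path) -> dict:
    """Delete <vault>/.mdvs, refusing symlinks and treating a missing directory as a no-op."""
    target = vault / ".mdvs"
    # Check the symlink case first: exists() is False for a dangling symlink.
    if target.is_symlink():
        raise RuntimeError(".mdvs is a symlink")
    if not target.exists():
        return {"removed": False, "files_removed": 0, "bytes": 0}
    files = [p for p in target.rglob("*") if p.is_file()]
    size = sum(p.stat().st_size for p in files)
    shutil.rmtree(target)
    return {"removed": True, "files_removed": len(files), "bytes": size}
```

Calling the function twice in a row is safe: the second call finds nothing and reports `removed` as false, and `mdvs.toml` is never touched.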

Output

Compact (default)

mdvs clean example_kb

Cleaned "example_kb/.mdvs"

┌──────────────────────────┬───────────────────────────────────────────────────┐
│ removed                  │ true                                              │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ path                     │ example_kb/.mdvs                                  │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ files removed            │ 2                                                 │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ size                     │ 113.7 KB                                          │
└──────────────────────────┴───────────────────────────────────────────────────┘

When there’s nothing to clean:

Nothing to clean — "example_kb/.mdvs" does not exist

┌──────────────────────────┬───────────────────────────────────────────────────┐
│ removed                  │ false                                             │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ path                     │ example_kb/.mdvs                                  │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ files removed            │ 0                                                 │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ size                     │ 0 B                                               │
└──────────────────────────┴───────────────────────────────────────────────────┘

Verbose (`-v`)

Verbose output adds pipeline timing lines before the result:

Delete index: example_kb/.mdvs (2 files, 113.8 KB) (0ms)
Cleaned "example_kb/.mdvs"

┌──────────────────────────┬───────────────────────────────────────────────────┐
│ removed                  │ true                                              │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ path                     │ example_kb/.mdvs                                  │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ files removed            │ 2                                                 │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ size                     │ 113.8 KB                                          │
└──────────────────────────┴───────────────────────────────────────────────────┘

Exit codes

Code	Meaning
`0`	Success (including when nothing to clean)
`2`	Pipeline error (symlink detected, I/O failure)

Errors

Error	Cause
`.mdvs is a symlink`	Refuses to delete symlinks for safety — remove it manually

export-jsonschema

Translate mdvs.toml’s [fields] block into a JSON Schema 2020-12 document.

Usage

mdvs export-jsonschema [path] [flags]

Flags

Flag	Default	Description
`path`	`.`	Directory containing `mdvs.toml`
`--format json|toml`	`json`	Output format. `toml` produces a TOML serialization of the same JSON Schema
`--output-file FILE`	(stdout)	Write to a file instead of stdout

Global flags (-o, -v, --logs) are described in Configuration.

What it does

export-jsonschema reads mdvs.toml, takes the [fields] block, and translates it into a JSON Schema 2020-12 document. Field types, constraints, path-scoping, and preprocessor stages are all preserved. The output is a valid JSON Schema that any standards-compliant validator can consume.

Build configuration ([embedding_model], [chunking], [search]) and scan settings are not included — JSON Schema only describes the field contract.

Round-tripping with init

export-jsonschema and init --from-jsonschema are designed to round-trip losslessly:

mdvs export-jsonschema ./project --output-file fields.json
mdvs init ./reborn --from-jsonschema fields.json

The new mdvs.toml reproduces the original [[fields.field]] definitions:

Field types (strict — String is String, not a permissive set)
Constraints (categories, min/max, min_length/max_length, pattern)
Path-scoping (allowed, required)
Preprocessor arrays (preprocess = ["coerce-to-string"], etc.)
The [fields].ignore list

mdvs-specific metadata that JSON Schema 2020-12 doesn’t model is carried in x-mdvs.* extension keys; generic JSON Schema validators ignore them, and init --from-jsonschema reads them back.

Examples

Export to stdout (pipeable)

mdvs export-jsonschema example_kb | jq '.properties | keys'

When writing to stdout, the summary banner is suppressed so the output is directly pipeable.

Export to a file

mdvs export-jsonschema example_kb --output-file fields.json

Exported schema for 37 field(s) → fields.json (json)

Export as TOML

mdvs export-jsonschema example_kb --format toml --output-file fields.toml

The TOML output is the same JSON Schema, serialized via the workspace tomljson crate. It’s interchangeable with the JSON form — init --from-jsonschema fields.toml produces the same result as the JSON file.

Errors

Error	Cause
`no mdvs.toml found`	Config doesn’t exist — run `mdvs init` first
`mdvs.toml is invalid`	TOML parsing or schema error
`failed to write`	Output file path is not writable

Recipes

Walkthroughs for pointing mdvs at common markdown ecosystems and the agents that work in them.

Agent harnesses — Wiring mdvs into Claude Code, Codex, Cursor, OpenCode, Antigravity. Skill file, project-rules snippet, validate-on-write hook, search-nudge hook. Per-platform pages for copy-paste install; overview page for the architecture and how to extend to a new harness.
Obsidian — YAML-frontmatter vaults, .mdvsignore patterns, Dataview caveats, common validation setups
Hugo — Mixed-format sites (YAML / TOML / JSON), native TOML date queries, forced-format mode for opinionated repos
CI — Running mdvs check in a pipeline as a frontmatter linter

Agent harnesses

mdvs ships agent integration in three pieces:

A skill (the Agent Skills standard) — works in any harness that loads .md skills.
A project-rules snippet — works in any harness that reads AGENTS.md / CLAUDE.md / .cursor/rules.
A PostToolUse hook that calls mdvs hook handle — only verified end-to-end on Claude Code today.

Install steps for each harness are on the per-platform pages.

How violations reach the agent (Claude Code)

When the agent edits a markdown file in your vault:

The harness’s PostToolUse hook fires the configured mdvs hook handle command.
mdvs reads the tool-call payload and walks up the directory tree to find mdvs.toml. If the edit happened outside any vault, the hook stays silent.
mdvs runs check on the vault. If the file is clean, the hook stays silent (no noise on the happy path).
If there are violations, mdvs writes a Claude-Code-shaped envelope JSON to stdout. The harness reads it and surfaces the markdown body to the agent through additionalContext and the pretty render to the user through systemMessage.
The agent sees the violation and reacts on its next turn — per the schema-evolution loop: if it’s a mistake, fix the file; if it’s intentional (KB evolving), surface the deviation to the user and propose updating mdvs.toml.
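
The "walk up to find mdvs.toml" step in the flow above can be sketched as follows. `find_vault_root` is a hypothetical helper written for illustration; how mdvs performs the lookup internally is not documented here.

```python
from pathlib import Path
from typing import Optional


def find_vault_root(edited_file: Path) -> Optional[Path]:
    """Walk up from the edited file until a directory containing mdvs.toml is found."""
    start = edited_file.resolve().parent
    for directory in (start, *start.parents):
        if (directory / "mdvs.toml").is_file():
            return directory
    return None   # edit happened outside any vault: the hook would stay silent
```

The nearest enclosing directory wins, so a file at `vault/notes/a.md` resolves to `vault/` when that is where `mdvs.toml` lives.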

A separate search-nudge hook fires after every Bash command that runs grep / rg / find / ag / ack / fd / git grep. If the agent’s cwd is inside an mdvs vault, the hook surfaces a one-line tip suggesting mdvs search. Like validate, it’s non-blocking — the agent decides whether to switch tools.
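
As a rough picture of what "runs grep / rg / find / ag / ack / fd / git grep" could mean, the sketch below checks only the first word (or first two words) of a command line. It is deliberately simple and does not inspect pipelines or subshells; the real hook's matching logic may differ, and `is_search_command` is a hypothetical name.

```python
import shlex

# Tool names taken from the list in the paragraph above.
SEARCH_TOOLS = {"grep", "rg", "find", "ag", "ack", "fd"}


def is_search_command(command: str) -> bool:
    """True if the command line starts with a listed search tool or with `git grep`."""
    try:
        words = shlex.split(command)
    except ValueError:      # e.g. unbalanced quotes: do not guess
        return False
    if not words:
        return False
    if words[0] in SEARCH_TOOLS:
        return True
    return words[:2] == ["git", "grep"]
```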

Per-platform support

Platform	Skill	Snippet	Hooks
Claude Code	✓	✓	✓
Codex	✓	✓	see Codex hooks docs
Cursor	✓	✓	see Cursor hooks docs
OpenCode	✓	✓	see OpenCode docs
Antigravity	✓	✓	see Gemini CLI hooks docs

Pre-commit hook

A pre-commit hook is a script git runs locally before each git commit — if it exits non-zero, the commit is blocked. The community pre-commit tool manages hooks declaratively per-repo via a YAML config; mdvs plugs into it as a one-line entry.

Running mdvs check as a pre-commit hook catches frontmatter violations before they reach the repo, regardless of how the file was edited — agent, IDE, or by hand. It’s the simplest harness-independent safety net, and the recommended fallback for harnesses where the post-edit hook isn’t wired up.

Install

To install the pre-commit tool on your machine, follow the installation instructions in the pre-commit documentation.

It’s also possible to install pre-commit using uv:

uv tool install pre-commit

Configure

In your mdvs vault, create .pre-commit-config.yaml:

repos:
  - repo: local
    hooks:
      - id: mdvs-check
        name: mdvs check
        entry: mdvs check --no-update
        language: system
        pass_filenames: false

Activate the hook in this repo (writes .git/hooks/pre-commit):

pre-commit install

That’s it. The next git commit runs mdvs check; if there are violations the commit aborts and the violation report is printed. To run the check manually without committing:

pre-commit run --all-files

Notes

Works with any install method. language: system just runs the mdvs already on your PATH — it doesn’t matter whether you installed via cargo install mdvs, the release shell installer, Homebrew, or a manually-placed binary. The only requirement is that mdvs is invocable from git’s environment.
PATH gotcha for GUI git clients. git pre-commit hooks fire under git’s environment, which isn’t always the same as your interactive shell’s PATH. If mdvs lives in ~/.cargo/bin/ and you commit from a GUI client that doesn’t inherit your shell PATH, the hook fails with mdvs: command not found. Either commit from the terminal, or use the absolute path in entry: (entry: /Users/you/.cargo/bin/mdvs check --no-update).
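Version-pinned alternative. To have pre-commit fetch mdvs into its own isolated environment (slower per-repo install, but reproducible across machines and CI), swap to language: rust and additional_dependencies: ["cli:mdvs:X.Y.Z"], replacing X.Y.Z with the release you want to pin. The cli: prefix is how pre-commit marks a crate that should be installed as a binary rather than linked as a library.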
--no-update tells mdvs check not to auto-update mdvs.toml from inferred new fields. The hook validates against the committed schema; schema evolution stays an explicit user action.
pass_filenames: false because mdvs check runs against the whole vault, not file-by-file. The same validation pass covers every staged change in one shot.

For CI-side validation (catches violations even if a contributor skipped the local hook), see the CI recipe.

Antigravity

Install

mkdir -p .agents/skills/mdvs
mdvs scaffold skill > .agents/skills/mdvs/SKILL.md
mdvs scaffold snippet --platform antigravity >> AGENTS.md

Antigravity reads skills from .agents/skills/<name>/SKILL.md (the cross-harness Agent Skills convention — same path Codex uses) and reads project rules from AGENTS.md at the workspace root.

What you get

Skill: agent learns when to call which mdvs command, how to interpret violations, and the schema-evolution loop. Loaded by Antigravity on session start.
Snippet: always-on project-rules block telling the agent to prefer mdvs search over Grep for KB lookups.

Hooks

mdvs doesn’t ship a verified Antigravity hook config. Antigravity inherits parts of its configuration sources from Gemini CLI, so the hooks system should be compatible with what’s documented at the Gemini CLI hooks reference.

As a harness-independent fallback, the pre-commit hook runs mdvs check on every commit.

Per-platform notes

Skill install path: .agents/skills/mdvs/SKILL.md. Project-scoped path; the agent picks it up when working in the directory.
AGENTS.md: documented as the post-rebrand convention. Legacy Gemini CLI sessions also recognized GEMINI.md, and the two conventions still appear to be in flux.

Claude Code

Install

Four commands. Run each in your project root:

mkdir -p .claude/skills/mdvs
mdvs scaffold skill > .claude/skills/mdvs/SKILL.md
mdvs scaffold snippet --platform claude-code >> CLAUDE.md
mdvs scaffold hook --platform claude-code >> .claude/settings.json   # merge into existing hooks

The last command emits a JSON snippet — if .claude/settings.json already exists with other settings, merge by hand instead of appending blindly: the hooks.PostToolUse array should be unioned with anything you already have. mdvs’s emitted snippet self-documents the merge target in a _comment field at the top.
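
The merge described above can be sketched in a few lines of Python. This is a minimal illustration, not something mdvs ships: the function name merge_hooks is hypothetical, and the settings shape (a top-level hooks object keyed by event name, each holding an array of entries) is assumed from the hooks.PostToolUse path mentioned in the text.

```python
import json


def merge_hooks(existing: dict, snippet: dict) -> dict:
    """Return a copy of `existing` with the snippet's hook entries unioned in.

    Only the snippet's "hooks" object is read, so helper keys such as
    "_comment" never reach the merged settings.
    """
    merged = json.loads(json.dumps(existing))  # deep copy via a JSON round-trip
    hooks = merged.setdefault("hooks", {})
    for event, entries in snippet.get("hooks", {}).items():
        current = hooks.setdefault(event, [])
        for entry in entries:
            if entry not in current:  # skip exact duplicates so re-running is safe
                current.append(entry)
    return merged


if __name__ == "__main__":
    # Hypothetical inputs: an existing settings file and an emitted snippet.
    existing = {"hooks": {"PostToolUse": [{"matcher": "Bash", "hooks": []}]}}
    snippet = {"_comment": "merge target", "hooks": {"PostToolUse": [{"matcher": "Edit", "hooks": []}]}}
    print(json.dumps(merge_hooks(existing, snippet), indent=2))
```

Because duplicates are skipped, running the merge twice leaves the file unchanged, which a blind >> append does not guarantee.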

What you get

Skill: agent learns when to call which mdvs command, how to interpret violations, and the schema-evolution loop. Loaded on session start; activated by description-match or directly via /mdvs.
Snippet: always-on CLAUDE.md block telling the agent to prefer mdvs search over Grep for KB lookups.
Validate hook: after every Edit / Write / MultiEdit on a markdown file inside an mdvs vault, mdvs hook handle runs check and surfaces violations through additionalContext (agent-visible) and systemMessage (user-visible, capped at 15 lines). Hook always exits 0 — never blocks.
Search-nudge hook: after every Bash command that runs grep / rg / find / etc., if the agent’s cwd is in an mdvs vault, surfaces a one-line tip pointing at mdvs search.

Per-platform notes

Skill path: .claude/skills/mdvs/SKILL.md (Claude Code reads only from .claude/skills/, not the cross-harness .agents/skills/).
Project rules: CLAUDE.md at workspace root.
Hook envelope: Claude Code’s hookSpecificOutput.additionalContext + systemMessage shape. PascalCase event name (PostToolUse).
mdvs on PATH: the hook command is mdvs hook handle --platform claude-code --kind <kind>, so mdvs must be resolvable on the PATH that Claude Code gives its hook subprocesses.
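
To make the envelope shape concrete, here is a rough Python sketch of how such a payload could be assembled. It is an assumption-laden illustration, not mdvs’s code: build_envelope is a hypothetical name, the hookEventName key follows Claude Code’s documented PostToolUse output shape, and plain truncation stands in for whatever capping strategy mdvs actually uses.

```python
import json

MAX_SYSTEM_MESSAGE_LINES = 15  # the cap stated in the text above


def build_envelope(markdown_report: str, pretty_report: str) -> str:
    """Build a Claude-Code-shaped hook envelope as a JSON string."""
    # Assumption: the user-visible render is simply cut to the first 15 lines.
    capped = "\n".join(pretty_report.splitlines()[:MAX_SYSTEM_MESSAGE_LINES])
    envelope = {
        "hookSpecificOutput": {
            "hookEventName": "PostToolUse",  # PascalCase event name
            "additionalContext": markdown_report,  # agent-visible
        },
        "systemMessage": capped,  # user-visible
    }
    return json.dumps(envelope, indent=2)


if __name__ == "__main__":
    print(build_envelope("## Violations\n- placeholder", "placeholder render"))
```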

Codex

Install

mkdir -p .agents/skills/mdvs
mdvs scaffold skill > .agents/skills/mdvs/SKILL.md
mdvs scaffold snippet --platform codex >> AGENTS.md

What you get

Skill: agent learns when to call which mdvs command, how to interpret violations, and the schema-evolution loop. Loaded from .agents/skills/mdvs/SKILL.md (the cross-harness Agent Skills convention).
Snippet: always-on AGENTS.md block telling the agent to prefer mdvs search over Grep.

Hooks

mdvs doesn’t ship a verified Codex hook config. To wire mdvs hook handle into Codex’s PostToolUse mechanism, follow the Codex hooks docs.

As a harness-independent fallback, the pre-commit hook runs mdvs check on every commit.

Per-platform notes

Skill path: .agents/skills/mdvs/SKILL.md (Codex’s canonical path per their skills docs — same path Cursor and Antigravity also honor).
Project rules: AGENTS.md at workspace root. AGENTS.override.md takes precedence if present.
mdvs on PATH: mdvs must be available to any subprocess Codex runs.

Cursor

Install

mkdir -p .cursor/skills/mdvs .cursor/rules
mdvs scaffold skill > .cursor/skills/mdvs/SKILL.md
mdvs scaffold snippet --platform cursor > .cursor/rules/mdvs.mdc

What you get

Skill: agent learns when to call which mdvs command, how to interpret violations, and the schema-evolution loop. Cursor reads skills from .cursor/skills/, .agents/skills/, .claude/skills/, and .codex/skills/; the install above writes to .cursor/skills/ as the native path.
Snippet: .cursor/rules/mdvs.mdc with alwaysApply: true. Cursor includes it in every conversation automatically.

Hooks

mdvs doesn’t ship a verified Cursor hook config. To wire mdvs hook handle into Cursor’s PostToolUse mechanism, follow the Cursor hooks docs.

As a harness-independent fallback, the pre-commit hook runs mdvs check on every commit.

Per-platform notes

Project rules: Cursor also honors AGENTS.md at workspace root. If you’d rather paste the universal snippet there: mdvs scaffold snippet >> AGENTS.md (without --platform).
mdvs on PATH: mdvs must be available to any subprocess Cursor runs. On macOS, Cursor launched from Spotlight may not see your shell PATH — symlink mdvs into /usr/local/bin/ or install via cargo install --path crates/mdvs.

OpenCode

Install

mkdir -p .opencode/skills/mdvs
mdvs scaffold skill > .opencode/skills/mdvs/SKILL.md
mdvs scaffold snippet --platform opencode >> AGENTS.md

What you get

Skill: agent learns when to call which mdvs command, how to interpret violations, and the schema-evolution loop. Loaded by OpenCode on session start.
Snippet: always-on AGENTS.md block telling the agent to prefer mdvs search over Grep.

Hooks

OpenCode handles tool events through a TypeScript plugin API rather than shell-command hooks. At the moment, mdvs doesn’t ship a verified plugin or hook config for OpenCode. To wire mdvs hook handle into OpenCode’s plugin events, follow the OpenCode docs.

As a harness-independent fallback, the pre-commit hook runs mdvs check on every commit.

Per-platform notes

Skill path: .opencode/skills/mdvs/SKILL.md (native). OpenCode also reads .claude/skills/ and .agents/skills/ — you can symlink across if you want a single source of truth shared with another harness.
Project rules: AGENTS.md at workspace root. OpenCode also reads CLAUDE.md as a Claude Code-compat fallback.

Obsidian

mdvs works well with Obsidian vaults — it validates your YAML frontmatter for consistency and provides semantic search across all your notes. Everything runs locally, no external services needed. (Obsidian emits YAML; mdvs also handles TOML and JSON if you’ve imported notes from other tools — see the Hugo recipe for the mixed-format case.)

Setup

Point mdvs at your vault:

mdvs init path/to/vault

This scans all markdown files, infers a typed schema from your frontmatter, and writes mdvs.toml. If auto-build is enabled (the default), it also downloads the embedding model and builds the search index.

Two artifacts are created:

mdvs.toml — commit this to version control
.mdvs/ — add to .gitignore (search index, can be rebuilt)

.gitignore

mdvs respects .gitignore by default. If your vault has .obsidian/ in .gitignore (many do), those files are automatically excluded from scanning. No extra configuration needed.

.mdvsignore

For additional exclusions, create a .mdvsignore file at the vault root. It uses the same syntax as .gitignore:

# AI working directories
.claude/
.gemini/

# Template files (if using Templater)
_templates/

# Attachments (no frontmatter)
attachments/
assets/

Any directory that doesn’t contain markdown with frontmatter is a good candidate for exclusion — it speeds up scanning and avoids noise in the schema.

Common frontmatter patterns

Obsidian vaults typically use frontmatter like:

---
title: My Note
tags: [project, research]
status: active
date: 2026-03-14
draft: false
---

mdvs infers types automatically:

Field	Inferred type	Notes
`title`	String
`tags`	Array(String)	Array of strings
`status`	String
`date`	Date	RFC 3339 `YYYY-MM-DD` strings auto-promote to `Date`; mixed shapes fall back to `String`
`draft`	Boolean

Inconsistent types

If the same field has different types across notes (e.g., priority is an integer in some files and a string like "high" in others), mdvs widens to the broadest compatible type — usually String. See Types & Widening for the full rules.

Dataview fields

If you use the Dataview plugin, its inline fields (e.g., key:: value) are not picked up by mdvs — only YAML frontmatter between --- fences is scanned. Dataview fields that appear in the YAML block are handled normally.

Validation

Once mdvs.toml exists, use check to verify your frontmatter:

mdvs check path/to/vault

This catches:

Wrong types — a Boolean field with a string value
Missing required fields — a field that should be present in certain directories
Disallowed fields — a field appearing where it shouldn’t
Null violations — null where it’s not allowed

See Validation for the full rules.

Tightening constraints

The inferred schema is permissive by default. To enforce stricter rules, edit mdvs.toml directly. For example, to require tags in all daily notes:

[[fields.field]]
name = "tags"
type = "Array(String)"
allowed = ["**"]
required = ["daily/**"]
nullable = false

Updating the schema

When you introduce new frontmatter fields, run update to incorporate them:

mdvs update path/to/vault

This discovers new fields and adds them to mdvs.toml without touching existing field definitions. Use the reinfer subcommand to re-infer specific fields if you’ve reorganized your vault.

Search

Build the index and search:

mdvs build path/to/vault
mdvs search "topic of interest" path/to/vault

Filter with --where on your frontmatter:

# Only active notes
mdvs search "topic" path/to/vault --where "status = 'active'"

# Notes with a specific tag
mdvs search "topic" path/to/vault --where "array_has(tags, 'research')"

# Notes in a specific directory
mdvs search "topic" path/to/vault --where "filepath LIKE 'projects/%'"

See the Search Guide for the full --where reference.

Tips

Incremental builds — only notes whose body changed since the last build are re-embedded. Frontmatter-only changes (updating tags, status) don’t trigger re-embedding. Run mdvs build freely — on an unchanged vault the index write itself is skipped, so it’s effectively a no-op.
Alongside Obsidian search — mdvs search is semantic (finds conceptually related notes), while Obsidian’s built-in search is keyword-based. They complement each other.
Large vaults — mdvs has been tested on vaults of over 1,500 files; full build from scratch finishes in single-digit seconds, and incremental builds touching one or two files complete in tens of milliseconds. See docs/benchmarks/ for measured numbers.
Ignore noisy fields — if some frontmatter fields are auto-generated and you don’t want to validate them, add them to the ignore list in mdvs.toml:
```
[fields]
ignore = ["cssclass", "kanban-plugin"]
```
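
The incremental-build tip above comes down to comparing a hash of the note body and ignoring the frontmatter. The Python sketch below illustrates that idea only: the function names are hypothetical, SHA-256 is an assumed hash (the guide does not say which one mdvs uses), and only ----fenced YAML frontmatter is handled.

```python
import hashlib
from typing import Optional, Tuple


def split_frontmatter(text: str) -> Tuple[str, str]:
    """Split a `---`-fenced note into (frontmatter, body); unfenced notes are all body."""
    lines = text.splitlines(keepends=True)
    if lines and lines[0].strip() == "---":
        for i in range(1, len(lines)):
            if lines[i].strip() == "---":
                return "".join(lines[1:i]), "".join(lines[i + 1:])
    return "", text


def body_hash(text: str) -> str:
    """Hash only the body, so frontmatter edits leave the hash unchanged."""
    _, body = split_frontmatter(text)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def needs_reembedding(text: str, stored_hash: Optional[str]) -> bool:
    """A note is re-embedded when it is new or its body hash has changed."""
    return body_hash(text) != stored_hash
```

Under these assumptions, changing status or tags keeps the hash stable, while any edit below the closing fence marks the note for re-embedding.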

Hugo

mdvs works directly on a Hugo site’s content/ tree. Hugo accepts YAML (---), TOML (+++), and JSON ({...}) frontmatter; mdvs accepts the same three formats and auto-detects per file, so it doesn’t matter which convention your site uses — or whether you’ve drifted across formats over time.

Setup

Point mdvs at the content/ directory:

mdvs init path/to/site/content

This scans every markdown file, infers a typed schema from the frontmatter (across all three formats), and writes mdvs.toml alongside. If auto-build is enabled (the default), it also downloads the embedding model and builds the search index under .mdvs/.

Two artifacts are created next to content/:

mdvs.toml — commit to version control
.mdvs/ — add to .gitignore (search index, regenerable)

Some Hugo sites prefer to keep the schema and index alongside the site root rather than inside content/. In that case, run mdvs init . from the site root and use a glob:

[scan]
glob = "content/**"

Mixed-format vaults

Hugo’s docs show all three frontmatter formats interchangeably, and real-world sites often end up with a mix — an older --- post sitting next to a newer +++ post and an occasional {...} block emitted by a content tool. mdvs handles this transparently. A single mdvs.toml is inferred across all three formats; the same title, tags, draft fields collapse into one schema regardless of where they were written.

You can verify this with mdvs check after init:

$ mdvs check
Checked 142 files — no violations

Forcing a single format

If your site is opinionated about TOML (Hugo’s default for hugo new), tell mdvs:

[scan]
frontmatter_format = "toml"

Now any file that uses --- (YAML) or { (JSON) raises a FrontmatterUnrepresentable error during check, naming both the configured and detected delimiters. Useful when you want your CI to fail loudly if someone drops in a YAML post by accident.

Native TOML dates

Hugo’s TOML frontmatter often uses native Date / DateTime literals — unquoted, e.g.:

+++
title = "Launching v2"
date = 2024-09-01
publishedAt = 2024-09-01T09:00:00Z
+++

mdvs recognizes both as typed fields: date becomes FieldType::Date, publishedAt becomes FieldType::DateTime. No special configuration. You can then filter on them in search:

mdvs search "release notes" --where "publishedAt > '2024-01-01T00:00:00Z'"

Useful queries

Once the index is built, common Hugo-site workflows become one-liners:

Find drafts that have been sitting around:

mdvs search "" --where "draft = true" --output json

Posts in a particular taxonomy:

mdvs search "machine learning roundup" --where "'ml' = ANY(tags)"

Posts authored by a specific contributor in a date range:

mdvs search "authentication" \
  --where "author = 'alice' AND date >= '2024-01-01' AND date < '2024-04-01'"

The --where clause is SQL against your frontmatter — anything you can express as a column reference works. See the Search Guide for details.

Validating across an editorial workflow

Add mdvs check to your Hugo build pipeline so frontmatter drift fails CI:

# .github/workflows/build.yml (excerpt)
- name: Validate frontmatter
  run: mdvs check
- name: Build site
  run: hugo --minify

mdvs check returns exit code 1 if any file violates the schema (missing required field, wrong type, etc.), which is enough to break the build. The exact same mdvs.toml validates YAML, TOML, and JSON files uniformly — no per-format duplicate rules.

See the CI recipe for a more general-purpose CI workflow.

CI

mdvs check exits with code 1 when any file violates the schema, so it slots straight into a CI pipeline as a frontmatter linter. This page covers the GitHub Actions case, but the same shape works on GitLab CI, CircleCI, or any runner that can install a binary and run a command.

Minimal GitHub Actions workflow

# .github/workflows/check-frontmatter.yml
name: Frontmatter check

on:
  push:
    branches: [main]
  pull_request:

jobs:
  check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install mdvs
        env:
          MDVS_VERSION: vX.Y.Z   # pin to a specific release — see below
        run: |
          curl --proto '=https' --tlsv1.2 -LsSf \
            "https://github.com/edochi/mdvs/releases/download/${MDVS_VERSION}/mdvs-installer.sh" | sh
          echo "$HOME/.cargo/bin" >> "$GITHUB_PATH"

      - name: Validate frontmatter
        run: mdvs check --no-update

Replace vX.Y.Z with a real release tag (see the releases page). This adds a check that runs on every PR and every push to main. If a contributor introduces a file with a wrong type, missing required field, disallowed field, or unrepresentable frontmatter, the job fails and the PR is blocked until it’s fixed.

Pin the mdvs version

The installer URL above pulls a specific release tag. GitHub also exposes a releases/latest/download/... URL that always redirects to the newest release — convenient for casual use, and that’s what the README install snippet uses — but in CI you want reproducibility. Pinning a specific tag means a green check today still passes (or still fails the same way) tomorrow, regardless of what mdvs ships in the meantime.

Bump the pinned version when you’re ready to adopt new validation behavior. The mdvs release notes call out anything that affects validation output.

`--no-update` for deterministic CI

The --no-update flag (or [check].auto_update = false in mdvs.toml) tells check to validate strictly against the committed schema instead of re-running inference first. This matters in CI:

With auto-update on: a PR that adds a new frontmatter field will pass because check re-infers the schema and silently includes the new field. The unintended addition slips through.
With --no-update: the same PR fails with a Disallowed violation because the new field isn’t in the committed mdvs.toml. The contributor has to either remove the field, add it to the schema deliberately, or add it to the ignore list — all of which surface the decision.

In practice this means: in CI, always use --no-update. Run mdvs update locally when you want to add new fields, commit the resulting mdvs.toml, and the CI run will then pass.
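
The difference between the two modes can be modelled with a small Python sketch. It is deliberately simplified and hypothetical: it looks only at field names (no types, paths, or constraints), and disallowed_fields is an illustrative name rather than an mdvs function.

```python
def disallowed_fields(file_fields, schema_fields, ignore, auto_update):
    """Return the field names that would be reported as Disallowed.

    Simplified model: a field is acceptable if it is in the schema or the
    ignore list. With auto_update on, re-inference first folds every
    non-ignored field it sees into the schema.
    """
    schema = set(schema_fields)
    ignored = set(ignore)
    if auto_update:
        schema |= set(file_fields) - ignored
    return sorted(set(file_fields) - schema - ignored)


if __name__ == "__main__":
    pr_file = ["title", "reviewer"]  # "reviewer" is new in this PR
    committed = ["title"]
    print(disallowed_fields(pr_file, committed, [], auto_update=True))
    print(disallowed_fields(pr_file, committed, [], auto_update=False))
```

In this model the new field only surfaces as a violation when auto-update is off, which is why the CI recommendation is to pass --no-update.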

Caching the install

The installer step downloads a small binary (~6 MB on Linux) and finishes in well under a second. There’s usually no point caching it. If you want to avoid the network call entirely on every run, use actions/cache keyed on the mdvs version string, or commit a vendored binary into the repo and skip the install step.

What `check` does (and doesn’t)

mdvs check covers frontmatter validation only:

✓ Wrong types (a Boolean field with a string value)
✓ Missing required fields per directory
✓ Disallowed fields (anything not in mdvs.toml and not in ignore)
✓ Null violations
✓ Category, length, range, and regex constraint violations
✓ Frontmatter that can’t be parsed at all (broken YAML, broken TOML, broken JSON)

It does not check spelling, link validity, markdown style, or anything in the body content. Pair it with a markdown linter (markdownlint, vale) for those concerns. They run independently and have no conflict — mdvs check and a body-content linter cover orthogonal parts of the file.

Other CI systems

The shape translates directly:

GitLab CI: the same two-step install-then-run pattern in .gitlab-ci.yml. Use the install script under before_script: and run mdvs check --no-update in the job.
CircleCI: an orb or a custom step that installs the binary and invokes the check.
Pre-commit hook: mdvs check --no-update as a hook entry in .pre-commit-config.yaml runs the check locally on every commit, catching issues before they reach CI.

The contract is always the same: install mdvs, run mdvs check --no-update, fail on non-zero exit.

Configuration

All configuration lives in mdvs.toml, created by init and updated by update. This page is a complete reference of every section and field.

Sections overview

mdvs.toml has two groups of sections:

Validation (always present):

[scan] — file discovery
[check] — check command settings
[fields] — field definitions and ignore list

Build & search (written by init, model/chunking filled by first build):

[embedding_model] — model identity
[chunking] — chunk sizing
[build] — build workflow settings
[search] — search defaults and auto-build/update

Global flags

These flags apply to all commands:

Flag	Values	Default	Description
`-o`, `--output`	`pretty`, `markdown`, `json`	`pretty`	Output format. See Output format selection for the resolution chain when `-o` is omitted.
`-v`, `--verbose`			Show detailed output (pipeline steps, expanded records)
`--logs`	`info`, `debug`, `trace`	(none)	Enable diagnostic logging to stderr

Output format selection

When --output / -o isn’t given, mdvs picks a format using this priority chain:

CLI flag — --output pretty|markdown|json always wins when set.
default_output_format in mdvs.toml — a project-level override at the top of the file (default_output_format = "markdown").
Hard fallback — pretty.

Same command → same output. The default does not depend on whether stdout is a terminal, a pipe, or a captured handle. Projects that want a different default (e.g. agent-curated KBs that prefer markdown on every invocation) set default_output_format in mdvs.toml.
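
The priority chain is short enough to write out. The following Python sketch mirrors the three steps listed above; resolve_output_format is a hypothetical name, and rejecting unknown values with ValueError is an assumption about error handling rather than documented behavior.

```python
from typing import Optional

VALID_FORMATS = ("pretty", "markdown", "json")


def resolve_output_format(cli_flag: Optional[str], config_default: Optional[str]) -> str:
    """Pick the output format: CLI flag, then mdvs.toml default, then "pretty"."""
    for candidate in (cli_flag, config_default):
        if candidate is not None:
            if candidate not in VALID_FORMATS:
                raise ValueError(f"unknown output format: {candidate!r}")
            return candidate
    return "pretty"  # hard fallback, independent of whether stdout is a terminal
```

Note that nothing in the function inspects the terminal, which is what makes the same command produce the same output everywhere.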

The three formats target different consumers:

pretty — box-drawing tables for interactive terminal use. Adapts to terminal width.
markdown — GFM pipe tables and ## section headers. Use this when piping into docs, pasting into a PR description or issue, or when an LLM agent is reading mdvs output into its context — Markdown is the most token-efficient format that LLMs parse fluently.
json — structured JSON for jq pipelines or programmatic consumers that want a strict contract.

Top-level fields

`default_output_format`

Optional. Overrides the hard pretty default for this project. Values: "pretty", "markdown", "json". Always loses to an explicit --output flag.

default_output_format = "markdown"

[scan]
# ...

Useful for vaults where the same default makes sense for every contributor — for example, an agent-curated KB that should produce Markdown for the agent’s context on every invocation without anyone having to remember the flag.

`[scan]`

Controls how markdown files are discovered.

[scan]
glob = "**"
include_bare_files = true
skip_gitignore = false
frontmatter_format = "auto"

Field	Type	Default	Description
`glob`	String	`"**"`	Glob pattern for matching markdown files
`include_bare_files`	Boolean	`true`	Include files without frontmatter
`skip_gitignore`	Boolean	`false`	Don’t read `.gitignore` patterns during scan
`frontmatter_format`	String	`"auto"`	Which frontmatter format(s) to accept — see Frontmatter format

When include_bare_files is true, files without frontmatter participate in inference (empty field set) and validation (can trigger MissingRequired). When false, they’re excluded from the scan entirely.

Frontmatter format

mdvs accepts YAML, TOML, and JSON frontmatter. The frontmatter_format field takes one of four values:

Value	Behavior
`"auto"` (default)	Detect per file from the opening delimiter. See the probe table below.
`"yaml"`	Parse every file as YAML; reject `+++` or `{`-opened files with a clear error.
`"toml"`	Parse every file as TOML; reject `---` or `{`-opened files.
`"json"`	Parse every file as JSON; reject `---` or `+++`-opened files.

In auto mode (the default), mdvs reads the first non-empty line of each file to pick the engine:

First non-empty line of a file	Format used
`---`	YAML
`+++`	TOML
starts with `{`	JSON (Hugo convention — the braces are part of the JSON object)
anything else	treated as a bare file (no frontmatter)

The probe is one line per file. A single vault can mix all three formats freely.
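
The probe table above translates into a small function. This Python sketch is an illustration under stated assumptions: probe_frontmatter_format is a hypothetical name, and trimming surrounding whitespace before comparing delimiters is a guess at how strictly the first line is matched.

```python
def probe_frontmatter_format(text: str) -> str:
    """Classify a file by its first non-empty line: yaml, toml, json, or bare."""
    for line in text.splitlines():
        stripped = line.strip()  # assumption: surrounding whitespace is ignored
        if not stripped:
            continue  # skip leading empty lines
        if stripped == "---":
            return "yaml"
        if stripped == "+++":
            return "toml"
        if stripped.startswith("{"):
            return "json"
        return "bare"  # first real line is not a known delimiter
    return "bare"  # empty file


if __name__ == "__main__":
    print(probe_frontmatter_format('+++\ntitle = "Launching v2"\n+++\n'))
```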

The forced modes ("yaml" / "toml" / "json") skip the probe and assume every scanned file uses that format. Files whose actual leading delimiter belongs to a different format produce a FrontmatterUnrepresentable error naming both the configured and detected formats. This is useful for opinionated repos (e.g., a Hugo site committed to TOML that wants mdvs check to fail loudly if someone slips in a --- file).

Naming note. frontmatter_format = "toml" controls how mdvs parses frontmatter in .md files. It has nothing to do with mdvs.toml itself — mdvs.toml is always TOML because it’s a config file. Two unrelated uses of “TOML” in the project.

`[update]`

Placeholder for future update-specific settings. Currently empty — this section is hidden from mdvs.toml by default.

`[check]`

Check command settings.

[check]
auto_update = true

Field	Type	Default	Description
`auto_update`	Boolean	`true` (written by `init`)	Auto-run update before validating

When auto_update is true, check runs the update pipeline (scan, infer, write config) before validating. init writes true so interactive runs pick up new fields automatically. Set to false or pass --no-update for deterministic CI validation against the committed mdvs.toml — the only reason to opt out. The chain is cheap on unchanged corpora, so there’s no performance argument either way for local use.

`[embedding_model]`

Specifies the embedding model for semantic search. See Embedding for available models.

[embedding_model]
provider = "model2vec"
name = "minishlab/potion-multilingual-128M"

Field	Type	Default	Description
`provider`	String	`"model2vec"`	Embedding provider (currently only `"model2vec"`)
`name`	String	`"minishlab/potion-multilingual-128M"`	HuggingFace model ID
`revision`	String	(none)	Pin to a specific HuggingFace commit SHA for reproducibility

The provider field can be omitted — it defaults to "model2vec". The revision field only appears when explicitly set (e.g., via build --set-revision).

Changing the model or revision after a build requires build --force to re-embed all files.

`[chunking]`

Controls semantic text splitting before embedding.

[chunking]
max_chunk_size = 1024

Field	Type	Default	Description
`max_chunk_size`	Integer	`1024`	Maximum chunk size in characters

The text splitter breaks each file’s body into semantic chunks respecting markdown structure (headings, paragraphs, lists). Changing the chunk size after a build requires build --force.

`[build]`

Build workflow settings.

[build]
auto_update = true

Field	Type	Default	Description
`auto_update`	Boolean	`true` (written by `init`)	Auto-run update before building

When auto_update is true, build runs the update pipeline before building. Set to false or pass --no-update for deterministic CI builds against the committed mdvs.toml — the only reason to opt out. The chain is cheap on unchanged corpora (no model load, no Lance write when nothing changed).

`[search]`

Settings for the search command, including how internal columns are named in --where queries.

[search]
default_limit = 10

Field	Type	Default	Description
`default_limit`	Integer	`10`	Maximum results when `--limit` is not specified
`internal_prefix`	String	`""`	Prefix for internal column names in `--where` queries
`aliases`	Map	`{}`	Per-column name overrides for internal columns
`auto_update`	Boolean	`true` (written by `init`)	Auto-run update before building (when `auto_build` is true)
`auto_build`	Boolean	`true` (written by `init`)	Auto-run build before searching

The two [search] auto-flags are what make a bare mdvs search "query" a one-shot operation: it re-infers, validates, embeds, and queries in a single command. Set them to false (or use the --no-update / --no-build flags) for deterministic CI search against an already-built index, or in airgapped environments where the embedding model can’t be re-downloaded. Locally there’s no performance argument: the auto chain is a no-op when nothing changed.

Internal column names

Beyond your frontmatter fields, the search index stores bookkeeping columns that mdvs uses internally. These internal columns are available in --where queries:

Column	Contains
`filepath`	Relative file path (e.g., `blog/post.md`)
`file_id`	Unique identifier for each file
`chunk_text`	The plain-text body of each chunk (useful for `--where "chunk_text LIKE '%foo%'"`)
`content_hash`	Hash of the file body
`built_at`	Timestamp of last build

(Other columns — chunk_id, chunk_index, start_line, end_line, embedding — exist too but are rarely useful in --where.)

By default, these are referenced by their raw names:

--where "filepath LIKE 'blog/%'"

If a frontmatter field name collides with an internal column name (e.g., you have a field called filepath), the search command will error and suggest resolutions:

Set a prefix so internal columns are addressed with a leading marker in --where:
```
[search]
internal_prefix = "_"
```
Now _filepath, _file_id, etc. refer to the internal columns in --where clauses, leaving the bare filepath free to mean your frontmatter field. (The on-disk column names don’t change — only how the --where translator interprets them.)
Set a per-column alias to rename just the colliding column in --where:
```
[search.aliases]
filepath = "path"
```
Now path refers to the internal filepath column, and bare filepath refers to your frontmatter field.
Rename the frontmatter field in your markdown files.

Aliases take precedence over the prefix. See the Search Guide for the full --where reference.
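
One way to picture how the prefix and aliases interact is as a lookup from the name written in --where to the internal column it addresses. The Python sketch below is an interpretation, not mdvs’s translator: the function names are hypothetical, only the five columns from the table are modelled, and it assumes an aliased column is reachable through its alias alone.

```python
from typing import Dict, Iterable, List, Optional

INTERNAL_COLUMNS = ("filepath", "file_id", "chunk_text", "content_hash", "built_at")


def exposed_names(prefix: str = "", aliases: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Map each name usable in --where to the internal column it refers to."""
    aliases = aliases or {}
    mapping = {}
    for column in INTERNAL_COLUMNS:
        # Assumption: an alias, when present, wins over the prefixed name.
        mapping[aliases.get(column, prefix + column)] = column
    return mapping


def find_collisions(frontmatter_fields: Iterable[str], prefix: str = "",
                    aliases: Optional[Dict[str, str]] = None) -> List[str]:
    """Return frontmatter field names that clash with an exposed internal name."""
    return sorted(set(frontmatter_fields) & set(exposed_names(prefix, aliases)))
```

With default settings a frontmatter field called filepath collides; setting either a prefix or an alias for that column clears the collision in this model.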

`[fields]`

Defines field constraints and the ignore list. This is the largest section — it contains one [[fields.field]] entry per constrained field.

Ignore list

[fields]
ignore = ["internal_id", "temp_notes"]

Fields in the ignore list are known but unconstrained — they skip all validation and are not reported as new fields by check or update. A field cannot be in both ignore and [[fields.field]].

Field definitions

Each [[fields.field]] entry defines constraints on a frontmatter field:

[[fields.field]]
name = "title"
type = "String"
allowed = ["blog/**", "projects/**"]
required = ["blog/**", "projects/**"]
nullable = false

Field	Type	Default	Description
`name`	String	(required)	Frontmatter key
`type`	FieldType	`"String"`	Expected value type
`allowed`	Array(String)	`["**"]`	Glob patterns where the field may appear
`required`	Array(String)	`[]`	Glob patterns where the field must be present
`nullable`	Boolean	`true`	Whether null values are accepted
`constraints`	Table	(absent)	Optional value constraints (see Constraints)
`preprocess`	Array(String)	`[]`	Stage 2 value preprocessors — see Preprocessors

All fields except name have permissive defaults. A minimal entry with just a name:

[[fields.field]]
name = "title"

is equivalent to:

[[fields.field]]
name = "title"
type = "String"
allowed = ["**"]
required = []
nullable = true

This is not the same as putting the field in the ignore list. Both prevent the field from being reported as new during update, but a [[fields.field]] entry tracks the field — it appears in info output with its type and patterns, and can be targeted by update reinfer. The ignore list simply silences the field: no validation, no detail in info.

Type syntax

Scalar types are plain strings:

type = "String"    # also: "Boolean", "Integer", "Float", "Date", "DateTime"

Date and DateTime accept RFC 3339 values only (YYYY-MM-DD for Date, YYYY-MM-DDTHH:MM:SS[.frac]<Z|±HH:MM> for DateTime). See Date and DateTime for the exact accepted shapes and storage semantics.
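
The two accepted shapes can be expressed as regular expressions. The Python sketch below is an approximation for illustration, with a hypothetical classify function: it requires an uppercase T and Z exactly as written in the shapes above, and it does not model leap seconds or whatever extra checks mdvs applies.

```python
import re
from datetime import date

DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}", re.ASCII)
DATETIME_RE = re.compile(
    r"(\d{4}-\d{2}-\d{2})T(\d{2}):(\d{2}):(\d{2})(\.\d+)?(Z|[+-]\d{2}:\d{2})",
    re.ASCII,
)


def _is_calendar_date(text: str) -> bool:
    """True when YYYY-MM-DD names a real calendar day."""
    try:
        date.fromisoformat(text)
        return True
    except ValueError:
        return False


def classify(value: str) -> str:
    """Return "Date", "DateTime", or "String" for a frontmatter string value."""
    if DATE_RE.fullmatch(value) and _is_calendar_date(value):
        return "Date"
    match = DATETIME_RE.fullmatch(value)
    if match and _is_calendar_date(match.group(1)):
        hours, minutes, seconds = (int(match.group(i)) for i in (2, 3, 4))
        if hours < 24 and minutes < 60 and seconds < 60:
            return "DateTime"
    return "String"  # anything else stays a plain string
```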

Arrays use a function-style string:

type = "Array(String)"

Structured types are not supported on disk. Nested Objects in frontmatter are expressed via dotted-name leaf fields — see Types for the flattening rule. Arrays of structured items (Array(Object{...})) have no first-class representation in v0; use parallel scalar arrays as a workaround:

# Instead of an unsupported Array(Object{timestamp, value}):
[[fields.field]]
name = "measurement_timestamps"
type = "Array(String)"

[[fields.field]]
name = "measurement_values"
type = "Array(Float)"

The valid type grammar is:

Type   := Scalar | Array(Scalar)
Scalar := String | Integer | Float | Boolean | Date | DateTime

See Types for the full type system, including widening rules.
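The grammar is small enough to check by hand. The Python sketch below accepts exactly the strings the grammar describes. `parse_type` is a hypothetical helper, and its strictness about whitespace is an assumption, not documented mdvs behavior.

```python
import re

SCALARS = {"String", "Integer", "Float", "Boolean", "Date", "DateTime"}
_ARRAY = re.compile(r"^Array\((\w+)\)$")


def parse_type(text: str) -> tuple[str, str]:
    """Parse a type string into ("scalar", name) or ("array", element name).

    Assumption: no surrounding or inner whitespace is tolerated.
    """
    if text in SCALARS:
        return ("scalar", text)
    match = _ARRAY.match(text)
    if match and match.group(1) in SCALARS:
        return ("array", match.group(1))
    raise ValueError(f"not a valid field type: {text!r}")


print(parse_type("DateTime"))
print(parse_type("Array(Float)"))
```

Nested arrays such as `Array(Array(String))` and structured types such as `Object` fall through to the `ValueError`, matching the rule that only scalars and arrays of scalars are valid.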

Path patterns

allowed and required are lists of glob patterns matched against relative file paths:

allowed = ["blog/**", "projects/alpha/**"]
required = ["blog/published/**"]

Patterns must end with /* (direct children) or /** (full subtree), or be exactly * or **. Bare paths like blog or file names like blog/post.md are not valid.

The invariant required ⊆ allowed is enforced — every required glob must be covered by some allowed glob. For example, allowed = ["meetings/**"] covers required = ["meetings/all-hands/**"] because any path matching the required pattern also matches the allowed one.

See Schema Inference for how these patterns are computed.
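The shape rule and the required ⊆ allowed invariant can be sketched in a few lines of Python. This is an illustration only: `split_pattern`, `covers`, and `required_within_allowed` are hypothetical names, and the sketch assumes the directory part of each pattern is a literal path with no wildcards inside it.

```python
def split_pattern(pattern: str) -> tuple[str, bool]:
    """Split a pattern into (directory prefix, is_subtree)."""
    if pattern in ("*", "**"):
        return ("", pattern == "**")
    if pattern.endswith("/**"):
        prefix, subtree = pattern[:-3], True
    elif pattern.endswith("/*"):
        prefix, subtree = pattern[:-2], False
    else:
        raise ValueError(f"pattern must end with /* or /**: {pattern!r}")
    if not prefix:
        raise ValueError(f"pattern has an empty directory: {pattern!r}")
    return (prefix, subtree)


def covers(allowed: str, required: str) -> bool:
    """True if every path matching `required` also matches `allowed`."""
    a_dir, a_subtree = split_pattern(allowed)
    r_dir, r_subtree = split_pattern(required)
    if a_subtree:
        # A subtree pattern covers its own directory and everything below it.
        return a_dir == "" or r_dir == a_dir or r_dir.startswith(a_dir + "/")
    # A direct-children pattern only covers the same direct-children pattern.
    return not r_subtree and r_dir == a_dir


def required_within_allowed(allowed: list[str], required: list[str]) -> bool:
    """Check that every required pattern is covered by some allowed pattern."""
    return all(any(covers(a, r) for a in allowed) for r in required)


print(required_within_allowed(["meetings/**"], ["meetings/all-hands/**"]))
print(required_within_allowed(["meetings/*"], ["meetings/all-hands/**"]))
```

The first call prints `True` and the second prints `False`: a direct-children pattern cannot cover a subtree below it.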

Constraints

The optional [fields.field.constraints] sub-table adds value constraints beyond type checking.

categories — restricts values to an enumerated set (String, Integer, or arrays of either):

[[fields.field]]
name = "status"
type = "String"

[fields.field.constraints]
categories = ["active", "archived", "completed", "draft", "published"]

min / max — restricts numeric values to an inclusive range (Integer, Float, or arrays of either). Both bounds are optional:

[[fields.field]]
name = "rating"
type = "Integer"

[fields.field.constraints]
min = 1
max = 5

min_length / max_length — bounds string length (Unicode scalar count) or array length:

[[fields.field]]
name = "slug"
type = "String"

[fields.field.constraints]
min_length = 3
max_length = 64

pattern — regex applied to string values, compiled at config load:

[[fields.field]]
name = "version"
type = "String"

[fields.field.constraints]
pattern = '^v\d+\.\d+\.\d+$'

Categories are auto-inferred during init and update reinfer. Range constraints are not auto-inferred but can be inferred on demand with update reinfer <field> --with=range. Length and pattern are not auto-inferred — add them by hand. See Constraints for the full reference.
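As a rough model of how these constraints read, the Python sketch below checks one scalar value against a constraints table. `violations` is a hypothetical function, Python's `re` module stands in for whatever regex engine mdvs uses, and array values are deliberately out of scope.

```python
import re


def violations(value, constraints: dict) -> list[str]:
    """Return human-readable constraint violations for one scalar value."""
    problems = []
    if "categories" in constraints and value not in constraints["categories"]:
        problems.append(f"{value!r} is not one of the allowed categories")
    # min and max are inclusive bounds.
    if "min" in constraints and value < constraints["min"]:
        problems.append(f"{value!r} is below min {constraints['min']}")
    if "max" in constraints and value > constraints["max"]:
        problems.append(f"{value!r} is above max {constraints['max']}")
    # len() on a Python str counts code points, which matches a
    # Unicode scalar count for well-formed text.
    if "min_length" in constraints and len(value) < constraints["min_length"]:
        problems.append(f"{value!r} is shorter than {constraints['min_length']}")
    if "max_length" in constraints and len(value) > constraints["max_length"]:
        problems.append(f"{value!r} is longer than {constraints['max_length']}")
    if "pattern" in constraints and re.search(constraints["pattern"], value) is None:
        problems.append(f"{value!r} does not match {constraints['pattern']!r}")
    return problems


print(violations(5, {"min": 1, "max": 5}))
print(violations("1.2.3", {"pattern": r"^v\d+\.\d+\.\d+$"}))
```

The first call returns an empty list because the range is inclusive; the second reports one violation because the value lacks the leading `v`.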

Preprocessors

The optional preprocess array on a field declares value transformations that run before validation. Two built-in stages:

Stage	Applies to	Effect
`coerce-to-string`	`String`, `Array(String)`	Serialize non-string JSON values to their JSON string form before validation
`widen-int-to-float`	`Float`, `Array(Float)`	Treat integer values as their float equivalent

[[fields.field]]
name = "priority"
type = "String"
preprocess = ["coerce-to-string"]

Preprocessors are auto-inferred during init and update reinfer based on observed type-widening events: a field that widened to String because of mixed-type observations gets coerce-to-string; a Float field that observed integers gets widen-int-to-float. An empty preprocess array means strict validation — no coercion.

Each entry must be applicable to the field’s type, and duplicates are rejected at config load. See Types & Widening for the full rules.
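The effect of the two stages can be approximated in Python. The functions below are hypothetical stand-ins named after the stages, not mdvs code. Two assumptions are baked in: a list is treated as an array value and processed element by element, and null is left untouched because `nullable` governs it separately.

```python
import json


def coerce_to_string(value):
    """Sketch of coerce-to-string: non-strings become their JSON text."""
    if isinstance(value, list):
        return [coerce_to_string(item) for item in value]
    if value is None or isinstance(value, str):
        return value
    return json.dumps(value)


def widen_int_to_float(value):
    """Sketch of widen-int-to-float: integers become the equal float."""
    if isinstance(value, list):
        return [widen_int_to_float(item) for item in value]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, int) and not isinstance(value, bool):
        return float(value)
    return value


print(coerce_to_string(2))
print(coerce_to_string(["calibration", 17, True]))
print(widen_int_to_float([1, 2.5]))
```

With `preprocess = ["coerce-to-string"]` on the `priority` field above, a file that writes `priority: 2` would validate as the string `"2"` under this reading.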

Inference thresholds

Two optional fields in [fields] control categorical auto-inference:

[fields]
max_categories = 10
min_category_repetition = 3

Field	Type	Default	Description
`max_categories`	Integer	`10`	Max distinct values for a field to be inferred as categorical
`min_category_repetition`	Integer	`3`	Min average repetition (occurrences / distinct) for categorical inference

mdvs omits these keys from mdvs.toml while they are set to their defaults. They only affect auto-inference; manually written categories are unaffected.
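The two thresholds combine into a simple test. The Python sketch below is one plausible reading of the table, not the mdvs implementation: `looks_categorical` is a hypothetical name, and treating both bounds as inclusive is an assumption.

```python
from collections import Counter


def looks_categorical(values, max_categories=10, min_category_repetition=3):
    """Decide whether observed values suggest a categorical field."""
    values = list(values)
    counts = Counter(values)
    if not counts:
        return False
    distinct = len(counts)
    # Average repetition = occurrences / distinct values.
    average_repetition = len(values) / distinct
    return (
        distinct <= max_categories
        and average_repetition >= min_category_repetition
    )


statuses = ["active"] * 6 + ["archived"] * 3 + ["draft"] * 3
titles = [f"Note {i}" for i in range(12)]
print(looks_categorical(statuses))  # 3 distinct values, repeated 4 times on average
print(looks_categorical(titles))    # 12 distinct values, each seen once
```

Under these assumptions the status values qualify and the titles do not, which is the behavior the thresholds are meant to produce.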

Example

A representative subset from example_kb/mdvs.toml (37 fields total, 4 shown):

[scan]
glob = "**"
include_bare_files = true
skip_gitignore = false
frontmatter_format = "auto"

[embedding_model]
provider = "model2vec"
name = "minishlab/potion-multilingual-128M"

[chunking]
max_chunk_size = 1024

[search]
default_limit = 10

[fields]
ignore = []

[[fields.field]]
name = "title"
type = "String"
allowed = ["blog/**", "meetings/**", "people/**", "projects/**", "reference/protocols/**"]
required = ["blog/**", "meetings/**", "people/**", "projects/**", "reference/protocols/**"]
nullable = false

[[fields.field]]
name = "tags"
type = "Array(String)"
allowed = ["blog/**", "projects/alpha/*", "projects/alpha/notes/**", "projects/archived/**", "projects/beta/*", "projects/beta/notes/**"]
required = ["blog/published/**", "projects/alpha/notes/**", "projects/archived/**", "projects/beta/notes/**"]
nullable = false

[[fields.field]]
name = "drift_rate"
type = "Float"
allowed = ["projects/alpha/notes/**"]
required = ["projects/alpha/notes/**"]
nullable = true

# Nested YAML (calibration.baseline.wavelength, etc.) is expressed as
# one [[fields.field]] per leaf — see Types.
[[fields.field]]
name = "calibration.baseline.wavelength"
type = "Float"
allowed = ["projects/alpha/notes/**"]
required = []
nullable = false

Search Guide

The --where flag on search lets you filter results using SQL syntax. The filter is combined with similarity ranking in a single query — files that don’t match are excluded before results are returned. --where operates on any column in the Lance index: frontmatter fields (auto-discovered from mdvs.toml) and the always-present filepath column (see Filtering by file path).

Under the hood, mdvs hands the clause to LanceDB’s SQL filter, which is built on top of DataFusion — so any expression valid in DataFusion’s SQL dialect works in --where.

Limitation. --where clauses that reference an Array(Float) field (e.g. measurement_values) are rejected up front, because the underlying search engine can’t safely decode them and crashes on read. mdvs catches this before the query runs and returns a clear error. Filter on a scalar field, or store the data as a parallel array of strings, instead.

Scalar fields

Use bare field names for simple comparisons:

String

mdvs search "experiment" --where "status = 'active'"
mdvs search "experiment" --where "author = 'Giulia Ferretti'"
mdvs search "experiment" --where "status IN ('active', 'archived')"
mdvs search "experiment" --where "title LIKE '%sensor%'"

Numeric

mdvs search "experiment" --where "sample_count > 20"
mdvs search "experiment" --where "drift_rate >= 0.01 AND drift_rate <= 0.05"
mdvs search "experiment" --where "wavelength_nm BETWEEN 600 AND 800"

Searched "experiment" — 2 hits

┌──────────────────────────┬───────────────────────────────────────────────────┐
│ query                    │ experiment                                        │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ model                    │ minishlab/potion-multilingual-128M               │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ limit                    │ 10                                                │
└──────────────────────────┴───────────────────────────────────────────────────┘

┌ #1 ──────────────────────┬───────────────────────────────────────────────────┐
│ file                     │ projects/alpha/notes/experiment-3.md              │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ score                    │ 0.420                                             │
...

...

Boolean

mdvs search "announcement" --where "draft = false"
mdvs search "ideas" --where "draft = true"

Null checks

mdvs search "notes" --where "drift_rate IS NOT NULL"
mdvs search "notes" --where "review_score IS NULL"

Combining conditions

Use AND, OR, and NOT to build compound filters:

mdvs search "experiment" --where "status = 'active' AND priority = 1"
mdvs search "notes" --where "author = 'REMO' OR author = 'Marco Bianchi'"
mdvs search "notes" --where "NOT status = 'archived'"

Date and DateTime

Fields typed as Date (Arrow Date32) and DateTime (Arrow Timestamp(Millisecond, UTC)) support native date arithmetic, comparisons, and the usual SQL date functions. Both types are auto-inferred from RFC 3339 strings; see Date and DateTime for the types themselves.

Direct comparison

mdvs search "researcher" --where "joined > '2024-01-01'"
mdvs search "meeting" --where "date < '2032-01-01'"
mdvs search "calibration" --where "synced_at >= '2024-04-01T00:00:00Z'"

DateTime offsets are normalized to UTC at storage time, so 2024-04-02T16:14:30+02:00 (in a YAML file) and 2024-04-02T14:14:30Z (in a --where clause) compare as the same absolute moment.
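The equivalence of those two timestamps is easy to verify outside mdvs. The Python sketch below uses only the standard library; `to_utc` is a hypothetical helper for this illustration and says nothing about how mdvs performs the normalization internally.

```python
from datetime import datetime, timezone


def to_utc(text: str) -> datetime:
    """Parse an RFC 3339 timestamp and normalize it to UTC."""
    # fromisoformat() only accepts a trailing "Z" on Python 3.11+, so swap it.
    parsed = datetime.fromisoformat(text.replace("Z", "+00:00"))
    return parsed.astimezone(timezone.utc)


in_yaml = to_utc("2024-04-02T16:14:30+02:00")
in_where = to_utc("2024-04-02T14:14:30Z")
print(in_yaml.isoformat())  # 2024-04-02T14:14:30+00:00
print(in_yaml == in_where)  # True
```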

Range filters (`BETWEEN`)

mdvs search "meeting" --where "date BETWEEN '2031-09-01' AND '2031-11-30'"
mdvs search "report" --where "joined BETWEEN '2023-01-01' AND '2024-12-31'"

Date functions (`EXTRACT`, `date_part`)

EXTRACT and date_part both return a numeric component of a Date or DateTime value. The two syntaxes are equivalent:

mdvs search "meeting" --where "EXTRACT(YEAR FROM date) = 2031"
mdvs search "meeting" --where "date_part('year', date) = 2031"
mdvs search "meeting" --where "EXTRACT(MONTH FROM date) = 10"
mdvs search "calibration" --where "EXTRACT(YEAR FROM synced_at) = 2024 AND EXTRACT(MONTH FROM synced_at) <= 3"

Date arithmetic with `INTERVAL`

The SQL engine supports adding intervals to, and subtracting them from, dates and datetimes.

# Joined within the last 2 years (relative to a cutoff date)
mdvs search "researcher" --where "joined > CAST('2032-01-01' AS DATE) - INTERVAL '2 years'"

# Datetime offset by days
mdvs search "experiment" \
  --where "synced_at < CAST('2024-04-15T00:00:00Z' AS TIMESTAMP) - INTERVAL '7 days'"

CAST('...' AS DATE) and CAST('...' AS TIMESTAMP) are usually needed for string literals on the right side of the arithmetic — the SQL type inference doesn’t always pick the date/timestamp type automatically.

Date subtraction (days between)

Subtracting two Date values returns a number of days (an integer):

# People who joined more than 365 days before a cutoff
mdvs search "researcher" --where "CAST('2032-01-01' AS DATE) - joined > 365"
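To see where the `> 365` boundary falls, the same subtraction can be reproduced with Python's `datetime.date`. This is only an arithmetic check: `days_before_cutoff` is a hypothetical helper and the two join dates are made-up inputs.

```python
from datetime import date

CUTOFF = date(2032, 1, 1)


def days_before_cutoff(joined: date) -> int:
    """Whole days between a join date and the cutoff."""
    return (CUTOFF - joined).days


print(days_before_cutoff(date(2031, 1, 1)))    # 365, so "> 365" excludes it
print(days_before_cutoff(date(2030, 12, 31)))  # 366, so "> 365" includes it
```

Someone who joined exactly one non-leap year before the cutoff is not matched by `> 365`; use `>= 365` if that row should be included.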

Null checks

Date and DateTime columns support standard null predicates, including for fields scoped to a subset of directories (rows outside the scope have null values for that column):

mdvs search "protocol" --where "last_reviewed IS NOT NULL"
mdvs search "experiment" \
  --where "drift_rate IS NULL AND filepath LIKE 'projects/alpha/notes/%'"

Combining with other filters

Date filters compose freely with the rest of the language — string compare, IN, LIKE, dotted-leaf access, array operations, and search ranking:

# Blog posts in 2031 H2 by specific authors
mdvs search "research" \
  --where "filepath LIKE 'blog/published/%' AND author IN ('Marco Bianchi', 'Giulia Ferretti') AND date BETWEEN '2031-07-01' AND '2031-12-31'"

# High-or-medium priority experiments with baseline > 700nm synced in 2024
mdvs search "experiment SPR" \
  --where "(priority = 'high' OR priority = 'medium') AND calibration.baseline.wavelength > 700 AND EXTRACT(YEAR FROM synced_at) = 2024"

Array fields

Fields typed as Array(String) (like tags, attendees, action_items) support array functions.

Containment

mdvs search "calibration" --where "array_has(tags, 'calibration')"

Searched "calibration" — 4 hits

┌──────────────────────────┬───────────────────────────────────────────────────┐
│ query                    │ calibration                                       │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ model                    │ minishlab/potion-multilingual-128M               │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ limit                    │ 10                                                │
└──────────────────────────┴───────────────────────────────────────────────────┘

┌ #1 ──────────────────────┬───────────────────────────────────────────────────┐
│ file                     │ projects/alpha/notes/experiment-1.md              │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ score                    │ 0.478                                             │
...

...

The SQL-standard ANY syntax also works:

mdvs search "calibration" --where "'calibration' = ANY(tags)"

Multiple tags

Combine with AND to require multiple values:

mdvs search "calibration" --where "array_has(tags, 'calibration') AND array_has(tags, 'SPR-A1')"
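When the list of required tags comes from a script, the clause can be assembled programmatically. The Python sketch below builds the same `array_has(...) AND array_has(...)` string; `sql_literal` and `has_all` are hypothetical helpers, not part of mdvs.

```python
def sql_literal(text: str) -> str:
    """Wrap text in single quotes, doubling any embedded single quote."""
    return "'" + text.replace("'", "''") + "'"


def has_all(field: str, values: list[str]) -> str:
    """Build a clause that requires every value to be present in an array field."""
    if not values:
        raise ValueError("at least one value is required")
    return " AND ".join(f"array_has({field}, {sql_literal(v)})" for v in values)


print(has_all("tags", ["calibration", "SPR-A1"]))
# array_has(tags, 'calibration') AND array_has(tags, 'SPR-A1')
```

The resulting string is what goes inside the `--where` argument.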

Array length

mdvs search "meeting" --where "array_length(action_items) > 2"

Filtering by file path

Filter results by file path using the filepath column:

mdvs search "experiment" --where "filepath LIKE 'projects/alpha/%'"

Searched "experiment" — 8 hits

┌──────────────────────────┬───────────────────────────────────────────────────┐
│ query                    │ experiment                                        │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ model                    │ minishlab/potion-multilingual-128M               │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ limit                    │ 10                                                │
└──────────────────────────┴───────────────────────────────────────────────────┘

┌ #1 ──────────────────────┬───────────────────────────────────────────────────┐
│ file                     │ projects/alpha/notes/experiment-3.md              │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ score                    │ 0.420                                             │
...

...

File paths are stored as relative paths (e.g., projects/alpha/notes/experiment-1.md). The last component is the filename, so you can match by directory, by filename, or by both:

# All blog posts (directory prefix)
--where "filepath LIKE 'blog/%'"

# Only published blog posts (deeper directory prefix)
--where "filepath LIKE 'blog/published/%'"

# Files in any meetings directory (mid-path match)
--where "filepath LIKE '%/meetings/%'"

# Match by filename suffix (last component)
--where "filepath LIKE '%-postmortem.md'"
--where "filepath LIKE '%/README.md'"

# Exact file
--where "filepath = 'projects/alpha/overview.md'"

# Combine with frontmatter fields
--where "filepath LIKE 'blog/%' AND status = 'published'"
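To preview which paths a LIKE pattern would select before running a search, the pattern can be translated to a regular expression. The Python sketch below follows standard SQL LIKE semantics (`%` for any sequence, `_` for one character, no escape character); `like` is a hypothetical helper, and every path other than the two quoted in this guide is made up.

```python
import re


def like_to_regex(pattern: str) -> re.Pattern:
    """Translate a SQL LIKE pattern into an anchored regular expression."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


def like(path: str, pattern: str) -> bool:
    """True if the whole path matches the LIKE pattern."""
    return like_to_regex(pattern).fullmatch(path) is not None


paths = [
    "projects/alpha/notes/experiment-1.md",
    "projects/alpha/overview.md",
    "projects/alpha/meetings/kickoff.md",  # hypothetical
    "meetings/all-hands/q1.md",            # hypothetical
]
print([p for p in paths if like(p, "projects/alpha/%")])
print([p for p in paths if like(p, "%/meetings/%")])
```

The second query returns only the nested `projects/alpha/meetings/` path: `%/meetings/%` requires a slash before `meetings`, so a top-level `meetings/` directory needs `meetings/%` instead.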

Nested objects

Fields typed as Object (like calibration in example_kb) are stored as nested Struct columns. Access nested values with bracket notation:

mdvs search "sensor" --where "calibration['baseline']['wavelength'] > 600"

Searched "sensor" — 2 hits

┌──────────────────────────┬───────────────────────────────────────────────────┐
│ query                    │ sensor                                            │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ model                    │ minishlab/potion-multilingual-128M               │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ limit                    │ 10                                                │
└──────────────────────────┴───────────────────────────────────────────────────┘

┌ #1 ──────────────────────┬───────────────────────────────────────────────────┐
│ file                     │ projects/alpha/notes/experiment-2.md              │
├──────────────────────────┼───────────────────────────────────────────────────┤
│ score                    │ 0.414                                             │
...

...

The top-level field name (calibration) can be used bare. Only the nested access needs brackets:

# These are equivalent:
--where "calibration['baseline']['wavelength'] > 600"
--where "_data['calibration']['baseline']['wavelength'] > 600"

Field names with special characters

Some field names need quoting in SQL. The init, update, and info commands show hints in their output when this applies.

Spaces

Double-quote the field name:

mdvs search "query" --where "\"lab section\" = 'optics'"

Single quotes in field names

Also use double-quoting:

mdvs search "query" --where "\"author's_note\" IS NOT NULL"

Double quotes in field names

Double each double quote inside the name, then wrap the whole identifier in double quotes. For a field named notes"v2":

mdvs search "query" --where "\"notes\"\"v2\"\"\" = true"

String values with special characters

To include a literal single quote inside a string value, double it:

mdvs search "query" --where "title = 'What''s New?'"

mdvs validates quote balance before running the query. If you see “unmatched single quote”, check that every ' in a value is doubled.
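The quoting rules in the last two sections are mechanical, so a script can apply them. The Python sketch below shows one way to do that; `quote_ident` and `quote_literal` are hypothetical helpers, and `shlex.join` is used only to produce a shell-safe command line. It picks its own quoting style, which differs from the backslash escapes above but is equivalent.

```python
import shlex


def quote_ident(name: str) -> str:
    """Quote a field name: wrap in double quotes, doubling embedded ones."""
    return '"' + name.replace('"', '""') + '"'


def quote_literal(value: str) -> str:
    """Quote a string value: wrap in single quotes, doubling embedded ones."""
    return "'" + value.replace("'", "''") + "'"


clause = f"{quote_ident('lab section')} = {quote_literal('optics')}"
print(clause)  # "lab section" = 'optics'

title_clause = f"title = {quote_literal(chr(87) + 'hat' + chr(39) + 's New?')}"
print(title_clause)  # title = 'What''s New?'

# Let shlex handle the shell layer so the SQL quotes survive intact.
args = ["mdvs", "search", "query", "--where", clause]
print(shlex.join(args))
```

Keeping the SQL layer and the shell layer separate avoids most "unmatched single quote" errors, because each helper only has to get one kind of escaping right.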

Tips

Case sensitivity: field names and string values are case-sensitive. Use LOWER() for case-insensitive matching:
```
--where "LOWER(author) = 'giulia ferretti'"
```

LIKE patterns: % matches any sequence, _ matches a single character:

--where "title LIKE 'Project%'"       # starts with "Project"
--where "title LIKE '%sensor%'"       # contains "sensor"

NULL semantics: a comparison against NULL evaluates to NULL rather than true, so the row never matches. Use IS NULL / IS NOT NULL, not = NULL.
No aggregates in --where: functions like COUNT() or SUM() don’t work in --where because the filter applies per file, not across results.
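The NULL behavior is standard SQL three-valued logic, so it can be tried in any SQL engine. The Python sketch below uses the standard-library `sqlite3` module as a stand-in for the engine behind `--where`; the table and its three rows are made up for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (filepath TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO files VALUES (?, ?)",
    [("a.md", "active"), ("b.md", "archived"), ("c.md", None)],
)


def matching(where: str) -> list[str]:
    """Return the file paths whose row satisfies the WHERE clause."""
    rows = conn.execute(
        f"SELECT filepath FROM files WHERE {where} ORDER BY filepath"
    )
    return [row[0] for row in rows]


print(matching("status = NULL"))            # []
print(matching("status IS NULL"))           # ['c.md']
print(matching("NOT status = 'archived'"))  # ['a.md']
```

The last query is the one that tends to surprise: the row with no status is dropped by `NOT status = 'archived'` as well. Add `OR status IS NULL` when files without the field should be kept.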
