Introduction

mdvs treats your markdown directory like a database. It scans your files, infers a typed schema from frontmatter, validates it, and builds a local search index — all in a single binary with no external services.

Not a document database. A database for documents.

The problem

Markdown directories grow organically. You start with a few notes, add frontmatter when it’s useful, and eventually have hundreds of files with inconsistent metadata. Tags are misspelled. Required fields are missing. You can’t find anything without grep.

mdvs gives you structure without forcing you to change how you write.

Frontmatter

Frontmatter is the YAML block between --- fences at the top of a markdown file. It stores structured metadata alongside your content:

---
title: "Experiment A-017: SPR-A1 baseline calibration"    # String
status: completed                                         # String
author: Giulia Ferretti                                   # String
draft: false                                              # Boolean
priority: 2                                               # Integer
drift_rate: 0.023                                         # Float
tags:                                                     # String[]
  - calibration
  - SPR-A1
  - baseline
---
# Your markdown content starts here...

mdvs recognizes these types automatically. When it scans your files, it infers the type of each field from the values it finds — no configuration needed.
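As a rough illustration of what type inference means here, the sketch below maps already-parsed frontmatter values to the type names used in this book. It is a simplified model, not mdvs's implementation: `infer_type` is a hypothetical helper, nulls are not handled, and array element types are taken from the first item only.

```python
def infer_type(value):
    """Map an already-parsed frontmatter value to a type name (illustrative only)."""
    # bool must be tested before int: in Python, bool is a subclass of int.
    if isinstance(value, bool):
        return "Boolean"
    if isinstance(value, int):
        return "Integer"
    if isinstance(value, float):
        return "Float"
    if isinstance(value, str):
        return "String"
    if isinstance(value, list):
        # Simplification: element type comes from the first item; [] falls back to String[].
        elem = infer_type(value[0]) if value else "String"
        return elem + "[]"
    if isinstance(value, dict):
        return "Object"
    raise TypeError(f"unsupported value: {value!r}")


# A subset of the frontmatter shown above, as a YAML parser would return it.
frontmatter = {
    "title": "Experiment A-017: SPR-A1 baseline calibration",
    "draft": False,
    "priority": 2,
    "drift_rate": 0.023,
    "tags": ["calibration", "SPR-A1", "baseline"],
}
inferred = {key: infer_type(value) for key, value in frontmatter.items()}
```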

Directory-aware schema

mdvs infers a three-dimensional schema from your files:

  • Types — boolean, integer, float, string, arrays, nested objects. Inferred automatically, with widening when files disagree.
  • Paths — which fields belong in which directories. draft only in blog/, sensor_type only in projects/alpha/notes/. Captured as allowed and required glob patterns.
  • Nullability — whether a field can be null. Tracked per field.

This means different directories can have different fields with different constraints — all inferred automatically from your existing files.

Tightest fit: mdvs init infers the strictest schema that’s consistent with your existing files. A field is inferred as allowed in a directory if at least one file there has it. It’s inferred as required if every file there has it. These rules propagate up — if every subdirectory requires a field, the parent directory does too. The result is the tightest set of constraints where check still returns zero violations. You can always loosen them later.

Two layers

mdvs has two distinct capabilities that work independently:

Validation — Scan your files, infer what frontmatter fields exist, which directories they appear in, and what types they have. Write the result to mdvs.toml. Then validate files against that schema. No model, no index, nothing to download.

Search — Chunk your markdown, embed it with a lightweight local model, store the vectors in Parquet files in .mdvs/, and query with natural language. Filter results on any frontmatter field using standard SQL.

You need validation without search? Run mdvs init, customize the fields in mdvs.toml, and run mdvs check.

You want search without validation? Just run mdvs init and mdvs search. The inferred schema is used to extract metadata for search results, but you don’t have to worry about it if you don’t want to.

Use them together for the best experience, or separately if that’s what you need.

Using a nested directory of markdown files as a database

You can think of mdvs as a layer on top of your markdown files that gives you database-like capabilities. Here’s a rough mapping of concepts and commands:

| Concept | Database | mdvs |
|---|---|---|
| Define structure | CREATE TABLE | mdvs init |
| Per-table columns | Different columns per table | Per-directory fields via allowed/required globs |
| Enforce constraints | Constraint validation | mdvs check |
| Evolve structure | ALTER TABLE | mdvs update |
| Create an index | CREATE INDEX | mdvs build |
| Query | SELECT ... WHERE ... ORDER BY | mdvs search --where |

Two artifacts: mdvs.toml (your schema, to be committed) and .mdvs/ (the search index, can be ignored by version control).

What this book covers

This book uses a fictional research lab knowledge base (example_kb) as a running example. Every command, every output, every query is real and reproducible.

  • Getting Started — Install mdvs and run it on the example vault
  • Concepts — How schema inference, types, and validation work
  • Commands — Full reference for all 7 commands
  • Configuration — The mdvs.toml file explained
  • Search Guide — SQL filtering, array queries, and ranking
  • Recipes — Obsidian setup, CI integration

Getting Started

Install mdvs, run it on a real directory, and run your first search, all in under five minutes.

Install

cargo install mdvs

You need a working Rust toolchain. Prebuilt binaries will be available once the crate is published.

Get the example files

This book uses a fixture called example_kb — a fictional research lab’s knowledge base with ~46 markdown files, varied frontmatter, and a few deliberate inconsistencies. Clone the repo to follow along:

git clone https://github.com/edochi/mdvs.git
cd mdvs

Initialize

Run mdvs init on the example directory:

mdvs init example_kb

mdvs scans every markdown file, extracts frontmatter, and infers a typed schema:

Initialized 43 files — 37 field(s)

╭─────────────────────┬───────────────────────┬───────┬────────────────────────╮
│ "action_items"      │ String[]              │ 9/43  │                        │
│ "algorithm"         │ String                │ 2/43  │                        │
│ "ambient_humidity"  │ Float                 │ 1/43  │                        │
│ "approved_by"       │ String                │ 4/43  │                        │
│ "attendees"         │ String[]              │ 10/43 │                        │
│ "author"            │ String                │ 18/43 │                        │
│ "author's_note"     │ String                │ 3/43  │ ' → '' in --where      │
│ "calibration"       │ {adjusted: {intensity │ 2/43  │                        │
│                     │ : Float, wavelength:  │       │                        │
│                     │ Float}, baseline: {in │       │                        │
│                     │ tensity: Float, notes │       │                        │
│                     │ : String, wavelength: │       │                        │
│                     │  Float}}              │       │                        │
│ "commission_date"   │ String                │ 1/43  │                        │
│ "convergence_ms"    │ Integer               │ 1/43  │                        │
│ "dataset"           │ String                │ 2/43  │                        │
│ "date"              │ String                │ 17/43 │                        │
│ "draft"             │ Boolean               │ 8/43  │                        │
│ "drift_rate"        │ Float?                │ 3/43  │                        │
│ "duration_minutes"  │ Integer               │ 10/43 │                        │
│ "email"             │ String                │ 4/43  │                        │
│ "equipment_id"      │ String                │ 2/43  │                        │
│ "firmware_version"  │ String                │ 1/43  │                        │
│ "joined"            │ String                │ 5/43  │                        │
│ "lab section"       │ String                │ 4/43  │ use "field name" in -- │
│                     │                       │       │ where                  │
│ "last_reviewed"     │ String                │ 4/43  │                        │
│ "notes"v2""         │ Boolean               │ 1/43  │ " → "" in --where      │
│ "observation_notes" │ String                │ 1/43  │                        │
│ "priority"          │ String                │ 7/43  │                        │
│ "project"           │ String                │ 4/43  │                        │
│ "publications"      │ Integer               │ 2/43  │                        │
│ "review_score"      │ String?               │ 1/43  │                        │
│ "role"              │ String                │ 5/43  │                        │
│ "sample_count"      │ Integer               │ 3/43  │                        │
│ "sensor_type"       │ String                │ 3/43  │                        │
│ "specialization"    │ String                │ 2/43  │                        │
│ "status"            │ String                │ 17/43 │                        │
│ "tags"              │ String[]              │ 16/43 │                        │
│ "title"             │ String                │ 37/43 │                        │
│ "unit_id"           │ String                │ 1/43  │                        │
│ "version"           │ String                │ 4/43  │                        │
│ "wavelength_nm"     │ Float                 │ 3/43  │                        │
╰─────────────────────┴───────────────────────┴───────┴────────────────────────╯

Initialized mdvs in 'example_kb'

That command did three things:

  1. Scanned 43 markdown files and extracted their YAML frontmatter
  2. Inferred 37 typed fields — strings, integers, floats, booleans, arrays, even a nested object (calibration)
  3. Wrote mdvs.toml with the inferred schema

Notice the third column: draft appears in 8/43 files — all in blog/. sensor_type in 3/43 — all in projects/alpha/notes/. mdvs captured not just the types, but where each field belongs. Run mdvs init example_kb -v to see the full path patterns.

Here’s what a field definition looks like in mdvs.toml:

[[fields.field]]
name = "sensor_type"
type = "String"
allowed = ["projects/alpha/notes/**"]
required = ["projects/alpha/notes/**"]
nullable = false

This means sensor_type is allowed only in experiment notes, and required there. If it appears in a blog post, check will flag it. If it’s missing from an experiment note, check will flag that too.

One artifact is created by init: mdvs.toml — the schema file. Commit this to version control. The .mdvs/ directory (search index) is created later on first build or search.

Validate

Check that every file conforms to the schema:

mdvs check example_kb
Checked 43 files — no violations

Since mdvs init just inferred the schema from these same files, everything passes. The power of check comes after you tighten the schema — or when files drift from it. Try adding sensor_type: SPR-A1 to a blog post — mdvs will flag it as Disallowed because that field doesn’t belong there.

What violations look like

Open mdvs.toml and make a few changes to tighten the constraints:

  • Require observation_notes in all experiment files (currently optional)
  • Change convergence_ms type from Integer to Boolean (simulating a type mismatch)
  • Set drift_rate to non-nullable (one file has drift_rate: null)
  • Restrict firmware_version to only appear in people/interns/** (it currently appears in people/*)

Run check again:

mdvs check example_kb
Checked 43 files — 4 violation(s)

╭───────────────────────────────┬───────────────────────────┬──────────────────╮
│ "convergence_ms"              │ WrongType                 │ 1 file           │
│ "drift_rate"                  │ NullNotAllowed            │ 1 file           │
│ "firmware_version"            │ Disallowed                │ 1 file           │
│ "observation_notes"           │ MissingRequired           │ 2 files          │
╰───────────────────────────────┴───────────────────────────┴──────────────────╯

Four violation types, each catching a different kind of problem:

| Violation | Meaning |
|---|---|
| MissingRequired | A file in a required path is missing the field |
| WrongType | The value doesn’t match the declared type |
| NullNotAllowed | The field is present but null, and nullable is false |
| Disallowed | The field appears in a file outside its allowed paths |

This is the compact output — it groups violations by field. Add -v for verbose output showing every affected file and the specific value that caused the violation. See check for the full reference.

Revert your changes to mdvs.toml before continuing (or re-run mdvs init example_kb --force to regenerate it).

Search

Query the index with natural language. On first run, search auto-builds the index:

Note: The first search or build downloads the embedding model from HuggingFace (~30 MB for the default model). This is a one-time download — subsequent runs use the cached model and start instantly.

mdvs search "calibration" example_kb
Built index — 43 files, 59 chunks (full rebuild)

╭─────────────────────────┬─────────────────────────┬──────────────────────────╮
│ embedded                │ 43 files                │ 59 chunks                │
╰─────────────────────────┴─────────────────────────┴──────────────────────────╯

Searched "calibration" — 10 hits

╭────────────┬──────────────────────────────────────────────────┬──────────────╮
│ 1          │ "projects/alpha/meetings/2031-06-15.md"          │ 0.585        │
│ 2          │ "projects/alpha/meetings/2031-10-10.md"          │ 0.501        │
│ 3          │ "projects/alpha/notes/experiment-1.md"           │ 0.478        │
│ 4          │ "blog/drafts/upcoming-talk.md"                   │ 0.470        │
│ 5          │ "blog/published/2032/q1/new-equipment.md"        │ 0.466        │
│ 6          │ "meetings/all-hands/2032-01.md"                  │ 0.465        │
│ 7          │ "projects/alpha/overview.md"                     │ 0.462        │
│ 8          │ "projects/beta/overview.md"                      │ 0.449        │
│ 9          │ "reference/tools.md"                             │ 0.445        │
│ 10         │ "people/remo.md"                                 │ 0.437        │
╰────────────┴──────────────────────────────────────────────────┴──────────────╯

Results are ranked by semantic similarity — not keyword matching. The score column is cosine similarity (higher means more similar).
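For readers unfamiliar with the score, cosine similarity compares the direction of two embedding vectors. The sketch below is a plain-Python version of the formula, with a hypothetical `cosine_similarity` helper; it says nothing about how mdvs computes the score internally.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Vectors pointing the same way score 1, and orthogonal vectors score 0.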

Filtering with --where

Add a SQL filter on any frontmatter field:

mdvs search "quantum" example_kb --where "status = 'active'"
Searched "quantum" — 3 hits

╭───────────────┬──────────────────────────────────────────┬───────────────────╮
│ 1             │ "projects/beta/overview.md"              │ 0.123             │
│ 2             │ "projects/alpha/overview.md"             │ 0.101             │
│ 3             │ "projects/alpha/budget.md"               │ 0.055             │
╰───────────────┴──────────────────────────────────────────┴───────────────────╯

Only files with status: active in their frontmatter are included. The --where clause supports any SQL expression — boolean logic, comparisons, array functions, and more. See the Search Guide for the full syntax.

What’s next

  • Concepts — How schema inference, types, and validation work under the hood
  • Commands — Full reference for every command and flag
  • Configuration — Customize mdvs.toml to tighten your schema
  • Search Guide — Complex queries: arrays, nested objects, combined filters

Concepts

mdvs has two layers — validation and search — each with its own set of concepts. These pages explain how things work under the hood.

  • Types & Widening — The type system, how types are inferred from values, and what happens when files disagree
  • Schema Inference — How mdvs scans your directory and computes field paths, requirements, and constraints
  • Validation — What check verifies, the four violation types, and how to read the output
  • Search & Indexing — Chunking, embeddings, incremental builds, and how results are ranked

Types & Widening

mdvs infers a type for every frontmatter field it encounters. When the same field appears with different types across files, mdvs resolves the conflict automatically through type widening.

The six types

| Type | YAML example | example_kb field |
|---|---|---|
| Boolean | draft: false | draft in blog posts |
| Integer | sample_count: 24 | sample_count in experiments |
| Float | drift_rate: 0.023 | drift_rate in experiments |
| String | author: Giulia Ferretti | author across many files |
| Array | tags: [calibration, SPR-A1] | tags in projects and blog |
| Object | calibration: {baseline: ...} | calibration in experiment-2 |

Arrays carry an element type — String[], Integer[], etc. Objects carry named sub-fields, and can nest arbitrarily deep:

calibration:
  baseline:
    wavelength: 632.8
    intensity: 0.95
  adjusted:
    wavelength: 633.1
    intensity: 0.97

This infers as {baseline: {wavelength: Float, intensity: Float}, adjusted: {wavelength: Float, intensity: Float}}.

Type hierarchy

When two values have different types, mdvs widens to a common type. The hierarchy looks like this:

graph BT
    Integer --> Float
    Float --> String
    Boolean --> String
    Array["Array(T)"] --> String
    Object["Object({...})"] --> String

Each arrow means “widens to.” String is the top type — every type eventually reaches it.

The one special case is Integer → Float: integers widen to floats (not directly to String) because the conversion is lossless.

Two same-category combinations widen internally instead of jumping to String:

  • Array + Array — element types are widened recursively (e.g., Integer[] + String[] → String[])
  • Object + Object — keys are merged, and shared keys have their values widened recursively

Everything else (Boolean + any other type, Array + scalar, Object + scalar, Array + Object) widens to String.
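The rules above can be sketched as a small recursive function. This is a minimal model under stated assumptions, not mdvs's code: scalar types are plain strings, an array is the tuple `("Array", element_type)`, an object is a dict from key to type, and `widen` is a hypothetical name.

```python
def widen(a, b):
    """Least upper bound of two types in the hierarchy described above."""
    if isinstance(a, dict) and isinstance(b, dict):
        # Object + Object: merge keys, widen shared keys recursively.
        merged = dict(a)
        for key, t in b.items():
            merged[key] = widen(a[key], t) if key in a else t
        return merged
    if isinstance(a, tuple) and isinstance(b, tuple):
        # Array + Array: widen the element types recursively.
        return ("Array", widen(a[1], b[1]))
    if a == b:
        return a
    if isinstance(a, str) and isinstance(b, str) and {a, b} == {"Integer", "Float"}:
        # The one scalar pair that does not jump straight to String.
        return "Float"
    # Everything else widens to the top type.
    return "String"
```

For example, `widen("Integer", "Float")` gives `"Float"`, while `widen("Boolean", "Integer")` and `widen(("Array", "Integer"), "Integer")` both give `"String"`.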

Type widening in practice

When mdvs scans your files and the same field has different types, it picks the least upper bound — the most specific type that covers all observed values.

Integer + Float → Float

In example_kb, the wavelength_nm field appears in three experiment notes:

# experiment-1.md
wavelength_nm: 850       # Integer

# experiment-2.md
wavelength_nm: 632.8     # Float

# experiment-3.md
wavelength_nm: 780.0     # Float

Result: wavelength_nm is inferred as Float. The integer 850 is safely represented as a float.

Integer + String → String

The priority field uses numbers in one project and text in another:

# projects/alpha/overview.md
priority: 1              # Integer

# projects/beta/overview.md
priority: high           # String

Result: priority is inferred as String. There’s no numeric type that can hold "high", so mdvs widens to String.

Boolean + any non-Boolean → String

If the same field is true in one file and 3 in another, there’s no numeric or boolean type that can hold both. The result is String.

This doesn’t happen in example_kb because booleans (draft) are used consistently — but it’s a common mistake in organically grown vaults where someone writes draft: yes (String) instead of draft: true (Boolean).

Array element widening

The tags field is a string array in most files, but one file accidentally used integers:

# projects/alpha/overview.md
tags:
  - biosensor
  - metamaterial          # String[]

# projects/beta/notes/replication.md
tags:
  - 1
  - 2
  - 3                     # Integer[]

Result: tags is inferred as String[]. The array element types (String vs Integer) are widened to String, giving String[].

Object key merging

When two files have the same Object field with different keys, mdvs merges all keys. If a key appears in both files with different value types, the value is widened.

In example_kb, the calibration object appears in two experiment files with different structures:

# experiment-1.md (simpler calibration, integer values)
calibration:
  baseline:
    wavelength: 850            # Integer
    intensity: 1               # Integer
    notes: "initial reference" # only in this file

# experiment-2.md (full calibration, float values)
calibration:
  baseline:
    wavelength: 632.8          # Float
    intensity: 0.95            # Float
  adjusted:                    # only in this file
    wavelength: 633.1
    intensity: 0.97

Result: calibration is inferred as:

{
  "adjusted": {
    "intensity": "Float",
    "wavelength": "Float"
  },
  "baseline": {
    "intensity": "Float",
    "notes": "String",
    "wavelength": "Float"
  }
}

What happened:

  • baseline appears in both → keys merged, values widened: wavelength Integer + Float → Float, intensity Integer + Float → Float, notes only in experiment-1 → kept as String
  • adjusted only in experiment-2 → kept as-is

The full widening matrix

Every possible combination of types and its result:

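| | Boolean | Integer | Float | String | Array | Object |
|---|---|---|---|---|---|---|
| Boolean | Boolean | String | String | String | String | String |
| Integer | String | Integer | Float | String | String | String |
| Float | String | Float | Float | String | String | String |
| String | String | String | String | String | String | String |
| Array | String | String | String | String | Array* | String |
| Object | String | String | String | String | String | Object* |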

* Array + Array: element types are widened recursively.

* Object + Object: keys are merged; shared keys are widened recursively.

The matrix is symmetric — widen(A, B) always equals widen(B, A).

Nullable

Separately from the type, mdvs tracks whether null was observed for a field. This is shown as a ? suffix in output — e.g., Float? means “Float, but sometimes null.”

How it works

In example_kb, the drift_rate field is Float in two experiment files but null in a third:

# experiment-1.md
drift_rate: 0.023        # Float

# experiment-2.md
drift_rate: null          # sensor malfunction — Giulia discarded the data

# experiment-3.md
drift_rate: 0.012         # Float

Result: drift_rate is inferred as Float? — the type is Float (null doesn’t affect the type), and nullable is set to true.

Null-only fields

If the only value ever observed is null, the type defaults to String:

# blog/drafts/grant-ideas.md
review_score: null        # no real values seen

Result: review_score is inferred as String?.

Key rules

  • Null is transparent in widening — it doesn’t affect the inferred type
  • Null-only fields default to String (the safest fallback)
  • nullable is a separate boolean, not part of the type itself
  • In validation: null values skip type checks, but a null value on a field with nullable = false triggers a NullNotAllowed violation (see Validation)
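The first three rules can be sketched for scalar fields as follows. This is an illustrative model only: `scalar_type`, `widen`, and `infer_field` are hypothetical helpers, and arrays and objects are left out to keep the example short.

```python
def scalar_type(value):
    """Type name for a non-null scalar value."""
    if isinstance(value, bool):  # test bool first: bool is a subclass of int
        return "Boolean"
    if isinstance(value, int):
        return "Integer"
    if isinstance(value, float):
        return "Float"
    return "String"


def widen(a, b):
    """Scalar-only widening: Integer + Float is Float, any other mismatch is String."""
    if a == b:
        return a
    if {a, b} == {"Integer", "Float"}:
        return "Float"
    return "String"


def infer_field(values):
    """Return (type, nullable) for the values one field takes across files."""
    nullable = any(v is None for v in values)
    field_type = None
    for v in values:
        if v is None:
            continue  # null is transparent: it never changes the type
        t = scalar_type(v)
        field_type = t if field_type is None else widen(field_type, t)
    # Null-only fields fall back to String.
    return (field_type or "String", nullable)
```

With the drift_rate values from above, `infer_field([0.023, None, 0.012])` yields `("Float", True)`, and a null-only field such as review_score yields `("String", True)`.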

String is the top type

Because every other type widens to String, two consequences follow:

In validation — a field typed as String never triggers a WrongType violation. If priority is String, then priority: 1, priority: true, and priority: [a, b] all pass. The value is stored as-is.

In storage — when building the search index, non-string values in String-typed fields are serialized to JSON. So priority: 1 in a String field is stored as "1", not silently dropped as NULL. No data is ever lost.

There’s also a leniency for Float fields: integer values like 5 pass as Float (since every integer is a valid float). YAML parses 5 as an integer and 5.0 as a float, so without this rule a hand-written 5 in a Float field would be flagged.

Edge cases

  • Empty arrays [] default to String[] — if real values are added later, the field must be re-inferred with mdvs update --reinfer <field> to pick up the new element type
  • Empty frontmatter (--- followed immediately by ---) is a file with zero fields — not a bare file. It still counts as “having frontmatter” for inference purposes.
  • Bare files (no --- fences at all) are handled differently — see Schema Inference

Schema Inference

mdvs infers a typed schema from your files automatically — no manual schema definition needed. Run mdvs init, and it scans every markdown file, extracts frontmatter, infers types, and computes path patterns that describe where each field appears. The result is mdvs.toml, which you can then tighten by hand.

What gets scanned

mdvs walks your directory and includes every .md and .markdown file that matches the glob pattern in [scan]:

[scan]
glob = "**"
include_bare_files = true
skip_gitignore = false

Three settings control what’s included:

| Setting | Default | Effect |
|---|---|---|
| glob | "**" | Which files to scan. Use narrower globs to exclude subtrees. |
| include_bare_files | true | Whether to include files without any YAML frontmatter |
| skip_gitignore | false | Whether to ignore .gitignore patterns during scan |

mdvs also respects .mdvsignore files (same syntax as .gitignore) for excluding paths from scanning without touching your .gitignore.

Bare files vs empty frontmatter

These look similar but are different:

Bare file — no frontmatter fences at all:

This file has no frontmatter. Just content.

Empty frontmatter — fences with nothing between them:

---
---
This file has frontmatter, but zero fields.

In example_kb, four files are bare (scratch.md, lab-values.md, reference/tools.md, reference/glossary.md) and one has empty frontmatter (reference/quick-start.md).

Both types contribute zero fields to inference. The difference matters for validation: a bare file is excluded entirely when include_bare_files = false, while an empty-frontmatter file is always included (it has frontmatter — just none with fields).
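To make the distinction concrete, here is a small sketch that classifies a file's text. It is not how mdvs parses files: `classify` is a hypothetical helper, any non-blank line between the fences is counted as a field, and an unterminated opening fence is assumed to mean "no frontmatter".

```python
def classify(text):
    """Return 'bare', 'empty frontmatter', or 'has fields' for a markdown file's text."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return "bare"
    # Find the closing fence after the opening one.
    end = next(
        (i for i, line in enumerate(lines[1:], start=1) if line.strip() == "---"),
        None,
    )
    if end is None:
        return "bare"  # assumption: an unterminated fence is not frontmatter
    body = [line for line in lines[1:end] if line.strip()]
    return "has fields" if body else "empty frontmatter"
```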

From files to fields

For each scanned file, mdvs extracts the YAML frontmatter and infers a type for every key. When the same field appears across multiple files, its type is widened to a common type (see Types & Widening for the full rules).

In example_kb, scanning 43 files produces 37 distinct field names. Some fields like title appear in 37 files. Others like unit_id appear in just one.

The output of this step is a list of fields, each with:

  • A name
  • A type (widened across all files where it appears)
  • A nullable flag (true if any file had a null value)
  • The set of files where it was found

Path patterns

The most interesting part of inference is how mdvs computes where each field belongs. It produces two sets of glob patterns per field:

  • allowed — where the field may appear. Any file matching these patterns can have the field without triggering a violation.
  • required — where the field must appear. Any file matching these patterns that’s missing the field triggers a MissingRequired violation.

How patterns are computed

mdvs builds a directory tree from the scanned files and works bottom-up:

  1. For each directory, it tracks which fields appear in all files (intersection) and which appear in any file (union)
  2. When a field appears in every file under a directory and its subdirectories, it collapses into a recursive glob (dir/**)
  3. When a field appears in some but not all files, only allowed gets the glob — required does not

The result is a minimal set of globs that describes the field’s distribution.
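Step 1 can be illustrated in a few lines. The sketch below only computes the per-directory union and intersection for files directly inside each directory; it does not perform the bottom-up collapse into `dir/**` globs, and it is not mdvs's implementation. The function name and the file paths are hypothetical.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def per_directory_fields(files):
    """files: {path: set of field names}. Returns {directory: (union, intersection)}."""
    by_dir = defaultdict(list)
    for path, fields in files.items():
        by_dir[str(PurePosixPath(path).parent)].append(set(fields))
    return {
        directory: (set.union(*sets), set.intersection(*sets))
        for directory, sets in by_dir.items()
    }


# Hypothetical files: email is present in some people/ files but in every intern file.
files = {
    "people/a.md": {"title", "email"},
    "people/b.md": {"title"},
    "people/interns/c.md": {"title", "email"},
}
stats = per_directory_fields(files)
```

Fields in the union are candidates for allowed in that directory; fields in the intersection are candidates for required.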

Examples from example_kb

Narrow and consistent. sensor_type appears in all three experiment notes and nowhere else:

[[fields.field]]
name = "sensor_type"
type = "String"
allowed = ["projects/alpha/notes/**"]
required = ["projects/alpha/notes/**"]

allowed and required are the same — every file that has this field is in the same directory, and every file in that directory has it.

Broad and consistent. title appears in 37 of 43 files across many directories:

[[fields.field]]
name = "title"
type = "String"
allowed = ["blog/**", "meetings/**", "people/**", "projects/**", "reference/protocols/**"]
required = ["blog/**", "meetings/**", "people/**", "projects/**", "reference/protocols/**"]

Again, allowed equals required: every file in those directories has a title. The remaining six files, which include the bare files at the root and in reference/, fall outside these patterns.

Allowed broader than required. email exists in all people/ files except one:

[[fields.field]]
name = "email"
type = "String"
allowed = ["people/**"]
required = ["people/interns/**"]

allowed is people/** — the field may appear anywhere under people/. But required is only people/interns/** — the one subdirectory where every file happens to have it. In people/* (the non-intern profiles), some have email and some don’t, so it can’t be required there.

Present but never required. ambient_humidity appears in only one of three experiment notes:

[[fields.field]]
name = "ambient_humidity"
type = "Float"
allowed = ["projects/alpha/notes/**"]
required = []

required is empty — the field never appears in every file under any directory, so mdvs can’t require it anywhere.

The pattern

The general rule is required ⊆ allowed — you can’t require a field somewhere it’s not allowed. Within that:

  • required = allowed when every file in a directory has the field
  • required ⊂ allowed when the field is consistent in some directories but sporadic in others
  • required = [] when the field is sporadic — present in some files but not consistently in any directory

The three field states

Every field in mdvs.toml is in one of three states:

Constrained

Listed under [[fields.field]]. Validation enforces type, allowed paths, required paths, and nullable. mdvs update preserves constrained fields unless you explicitly pass --reinfer.

[[fields.field]]
name = "draft"
type = "Boolean"
allowed = ["blog/**"]
required = ["blog/**"]
nullable = false

Only name is required — properties you omit use permissive defaults:

| Property | Default | Meaning |
|---|---|---|
| type | String | Accepts any value (String is the top type) |
| allowed | ["**"] | Allowed in every file |
| required | [] | Not required anywhere |
| nullable | true | Null values accepted |

A [[fields.field]] with just a name is effectively unconstrained, but still known — useful when you want to acknowledge a field without committing to specific constraints yet.

Ignored

Listed in the ignore array. The field is known but not validated — no type checks, no path checks. mdvs update skips ignored fields entirely.

[fields]
ignore = ["internal_notes", "scratch_data"]

Use this for fields you don’t want to enforce — temporary fields, fields in flux, or fields you’ve decided aren’t worth constraining.

Unknown

Not mentioned in mdvs.toml at all. When mdvs update finds a field that isn’t constrained or ignored, it reports it as a new field and adds it to the schema.

A field can be in exactly one state. Moving a field from constrained to ignored means removing its [[fields.field]] entry and adding its name to ignore. Moving it back means the reverse.

Keeping the schema current

After initial inference with mdvs init, the schema is a snapshot of your files at that moment. As files change — new fields appear, old ones shift — use mdvs update to bring the schema up to date.

Default mode

mdvs update example_kb

Only new fields are added. Existing fields are left untouched, even if their types or paths have changed. This is conservative by design — your manual edits to mdvs.toml are preserved.

Fields that disappear from all files remain in mdvs.toml. This prevents accidental removal when files are temporarily missing.
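The default-mode behaviour amounts to an additive merge. The sketch below models it with plain dicts; `update_schema` and the `reviewer` field are hypothetical, and the real command works on mdvs.toml rather than on in-memory dicts.

```python
def update_schema(existing, inferred, ignore=()):
    """Add fields that are new; leave every existing definition untouched.

    existing, inferred: {field name: definition}. Returns (merged, new_names).
    """
    merged = dict(existing)  # existing fields stay, even if no file has them anymore
    new_names = [
        name for name in inferred
        if name not in existing and name not in ignore
    ]
    for name in new_names:
        merged[name] = inferred[name]
    return merged, new_names


existing = {"tags": "Integer[]", "old_field": "String"}
inferred = {"tags": "String[]", "reviewer": "String", "scratch_data": "String"}
merged, new_names = update_schema(existing, inferred, ignore=["scratch_data"])
```

Here tags keeps its existing definition even though a fresh scan disagrees, old_field survives although no file has it, and only reviewer is reported as new.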

Re-inferring specific fields

mdvs update example_kb --reinfer tags

Treats tags as if it had never been seen — removes it from the schema, re-scans, and infers it fresh. Use this when you’ve fixed bad data (like a tags: [1, 2, 3] that should have been strings) and want the type or paths to update.

Re-inferring everything

mdvs update example_kb --reinfer-all

Equivalent to running --reinfer on every field. The entire [[fields.field]] section is rebuilt from scratch, but all other config ([scan], [embedding_model], etc.) is preserved.

This is different from mdvs init --force, which overwrites the entire mdvs.toml including non-field config.

Edge cases

  • Fields in a single file: get a narrow allowed glob matching just that file’s directory. Example: unit_id only in people/remo.md → allowed = ["people/*"].
  • Null-only fields: type defaults to String (see Types & Widening). Example: review_score is always null → String?.
  • Special characters in field names: names with spaces (lab section), single quotes (author's_note), or double quotes (notes"v2") are preserved as-is. They need quoting in --where clauses (see Search Guide).
  • Empty arrays []: element type defaults to String, giving String[]. If real values appear later, use --reinfer to pick up the correct element type.

Validation

mdvs check validates every file’s frontmatter against the schema in mdvs.toml. It’s read-only, deterministic, and produces no side effects — it just tells you what’s wrong.

The four violations

| Violation | Meaning |
|---|---|
| WrongType | The value doesn’t match the declared type |
| Disallowed | The field appears in a file outside its allowed paths |
| MissingRequired | A file matches a required glob but doesn’t have the field |
| NullNotAllowed | The field is present but null, and nullable is false |

WrongType

Fires when a value doesn’t match the declared type. If convergence_ms is declared as Boolean but a file has convergence_ms: 42, the integer value fails the boolean check.

This violation has two important leniencies — see Type checking rules below.

Disallowed

Fires when a field appears in a file whose path doesn’t match any of the field’s allowed globs. For example, if firmware_version has allowed = ["people/interns/**"] but appears in people/remo.md, that file is outside the allowed paths.

MissingRequired

Fires when a file’s path matches one of the field’s required globs, but the file doesn’t contain that field at all.

For example, if observation_notes has required = ["projects/alpha/notes/**"], then every file under projects/alpha/notes/ must have it. Files that don’t → MissingRequired.

NullNotAllowed

Fires when a field is present with an explicit null value, but nullable is false. For example, if drift_rate has nullable = false and a file has drift_rate: null.

This is distinct from a missing field — see Null vs absent below.

Type checking rules

Two leniencies make validation practical for real-world YAML:

String accepts any value. Since String is the top type (see Types & Widening), a String-typed field never triggers a WrongType violation. Booleans, integers, arrays — everything is accepted. This is by design: when types are widened to String during inference, the field should accept whatever values caused the widening.

Float accepts integers. An integer value like 5 passes validation for a Float field. YAML doesn’t distinguish 5 from 5.0, and many editors strip trailing .0. Rejecting integers from Float fields would cause constant false positives.

Arrays check element types recursively — an Integer[] field rejects ["a", "b"] because the string elements fail the Integer check.

Objects just check that the value is an object — individual keys are not validated against the inferred structure.
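The type rules above fit in a few lines. This is a rough sketch over already-parsed YAML values, not the checker mdvs ships; `matches_type` is a hypothetical helper, and the type name "Object" and the treatment of booleans as distinct from integers are assumptions.

```python
def matches_type(value, declared: str) -> bool:
    """Return True when a parsed YAML value passes the declared type."""
    if value is None:
        return True  # null never triggers WrongType; nullability is a separate check
    if declared == "String":
        return True  # top type: accepts booleans, numbers, arrays, anything
    if declared.endswith("[]"):
        element = declared[:-2]
        return isinstance(value, list) and all(matches_type(v, element) for v in value)
    if declared == "Boolean":
        return isinstance(value, bool)
    if declared == "Integer":
        # bool is a subclass of int in Python, so exclude it explicitly
        return isinstance(value, int) and not isinstance(value, bool)
    if declared == "Float":
        # leniency: an integer such as 5 is accepted for a Float field
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if declared == "Object":
        return isinstance(value, dict)  # keys are not validated individually
    return False


print(matches_type(42, "Boolean"))
print(matches_type(5, "Float"))
print(matches_type(["a", "b"], "Integer[]"))
```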

Null handling

Null interacts with validation in specific ways:

All four checks are independent. A null value is checked like any other value — each violation type is evaluated separately:

  • WrongType — null is accepted by any type, so this never fires on null.
  • Disallowed — the field is present (the key exists), so Disallowed fires if the path isn’t in allowed.
  • MissingRequired — null counts as “present”, so this never fires on null.
  • NullNotAllowed — fires when the value is null and nullable = false.

A single null field can trigger both Disallowed and NullNotAllowed at the same time.

Null vs absent. These are different situations with different outcomes:

| Situation | Example | Result |
| --- | --- | --- |
| Field is absent | File has no drift_rate key at all | MissingRequired (if path matches required) |
| Field is null, nullable = true | drift_rate: null | Passes |
| Field is null, nullable = false | drift_rate: null | NullNotAllowed |

A null value counts as “present” — the field key exists in the frontmatter, it just has no value. So null never triggers MissingRequired. An absent field is genuinely missing — it can trigger MissingRequired but never NullNotAllowed.

Note: In YAML, unquoted null is a null value, not the string "null". To store the literal string, write drift_rate: "null" (with quotes).
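The interaction between the four checks may be easier to follow as code. The sketch below evaluates a single field in a single file under simplifying assumptions: `violations` and the spec keys are hypothetical, `pytype` is a stand-in for the declared type, and glob matching uses Python's `fnmatchcase`, whose `*` also crosses `/` and therefore only approximates `**`.

```python
from fnmatch import fnmatchcase


def matches_any(path: str, globs) -> bool:
    # Approximation: fnmatch-style "*" matches "/" too, close enough to "**" here.
    return any(fnmatchcase(path, g) for g in globs)


def violations(path: str, frontmatter: dict, name: str, spec: dict) -> list:
    """Evaluate the four checks independently for one field in one file."""
    if name not in frontmatter:
        # Absent: only MissingRequired can fire.
        return ["MissingRequired"] if matches_any(path, spec["required"]) else []
    found = []
    value = frontmatter[name]
    if not matches_any(path, spec["allowed"]):
        found.append("Disallowed")  # the key exists, even if the value is null
    if value is None:
        if not spec["nullable"]:
            found.append("NullNotAllowed")
    elif not isinstance(value, spec["pytype"]):
        found.append("WrongType")
    return found


spec = {
    "allowed": ["projects/alpha/notes/**"],
    "required": ["projects/alpha/notes/**"],
    "nullable": False,
    "pytype": (int, float),  # stand-in for a Float field
}
print(violations("projects/alpha/notes/e1.md", {}, "drift_rate", spec))
print(violations("people/remo.md", {"drift_rate": None}, "drift_rate", spec))
```

The second call shows a single null value raising both Disallowed and NullNotAllowed, while the first shows that an absent field can only raise MissingRequired.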

New fields

When mdvs check encounters a frontmatter field that isn’t in mdvs.toml — neither constrained under [[fields.field]] nor listed in ignore — it reports it as a new field.

New fields are informational only. They don’t count as violations and don’t affect the exit code:

Checked 43 files — no violations, 1 new field(s)

╭──────────────────────────────┬─────────────────────┬─────────────────────────╮
│ "algorithm"                  │ new                 │ 2 files                 │
╰──────────────────────────────┴─────────────────────┴─────────────────────────╯

They’re shown in the output so you know to either run mdvs update to add them to the schema, or add them to the ignore list.

Bare files

When include_bare_files = true in [scan], bare files (no frontmatter at all) are included in validation. Since they have no fields, they trigger MissingRequired for any required glob matching their path.

For example, if title has required = ["**"] and scratch.md is a bare file, it triggers MissingRequired for title. This is often why the inferred schema uses narrower required globs — bare files at the root prevent required = ["**"] from being inferred for fields that don’t appear in them.

Check and build

mdvs build runs the same validation internally before embedding. If any violations are found, build aborts — no dirty data reaches the index. The violations are the same ones check would report.

This means you can use check as a dry run before building, but you don’t have to — build will catch the same problems.

Exit codes

| Exit code | Meaning |
| --- | --- |
| 0 | No violations (new fields don’t count) |
| 1 | One or more violations found |
| 2 | Scan or config error (couldn’t run validation) |
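In a CI script, the exit code is usually all you need. Here is a small sketch of a wrapper, assuming the mdvs binary is on PATH; `describe_check_exit` and `run_check` are hypothetical helper names, and only the three documented codes are mapped.

```python
import subprocess

# Meanings taken from the exit code table above.
CHECK_EXIT_MEANINGS = {
    0: "no violations",
    1: "violations found",
    2: "scan or config error",
}


def describe_check_exit(code: int) -> str:
    """Translate an mdvs check exit code into a short message."""
    return CHECK_EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_check(directory: str) -> str:
    """Run `mdvs check <directory>` and describe the outcome (needs mdvs installed)."""
    result = subprocess.run(["mdvs", "check", directory])
    return describe_check_exit(result.returncode)


print(describe_check_exit(1))
```

Calling `run_check("example_kb")` would return one of the three messages, which a pipeline could use to decide whether to fail the job or only warn.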

Search & Indexing

mdvs builds a search index by chunking your markdown content, embedding it with a local model, and storing everything in Parquet files. Queries are embedded with the same model and ranked by cosine similarity, with optional SQL filtering on frontmatter fields.

Building the index

mdvs build (or mdvs search, which auto-builds the index by default) creates the search index in three steps: chunk, embed, store.

Chunking

Each file’s markdown body is split into semantic chunks — respecting headings, paragraphs, and code blocks rather than cutting at arbitrary character boundaries. The maximum chunk size is configurable (default 1024 characters) via the [chunking] section in mdvs.toml:

[chunking]
max_chunk_size = 1024

Each chunk tracks its start and end line numbers in the original file, so search results can point to the exact location.
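To illustrate the idea of size-bounded chunks that remember their line ranges, here is a deliberately simplified sketch. It is not the chunker mdvs uses: it splits only on blank lines, ignores headings and code blocks, leaves an oversized paragraph whole, and numbers lines relative to the body it is given. The function name `chunk_body` is hypothetical.

```python
def chunk_body(body: str, max_chunk_size: int = 1024) -> list:
    """Greedily pack blank-line-separated paragraphs into chunks.

    Returns (start_line, end_line, text) tuples with 1-based line numbers.
    """
    paragraphs = []
    current, start = [], None
    for lineno, line in enumerate(body.splitlines(), start=1):
        if line.strip():
            if start is None:
                start = lineno
            current.append(line)
        elif current:
            paragraphs.append((start, lineno - 1, "\n".join(current)))
            current, start = [], None
    if current:
        paragraphs.append((start, start + len(current) - 1, "\n".join(current)))

    chunks = []
    for p_start, p_end, text in paragraphs:
        if chunks:
            c_start, _, c_text = chunks[-1]
            candidate = c_text + "\n\n" + text
            if len(candidate) <= max_chunk_size:
                chunks[-1] = (c_start, p_end, candidate)  # extend the last chunk
                continue
        chunks.append((p_start, p_end, text))  # start a new chunk
    return chunks


body = "# Title\n\nfirst para\nsecond line\n\nthird para"
small = chunk_body(body, max_chunk_size=25)
large = chunk_body(body)
print(len(small), len(large))
```

With a small limit each paragraph lands in its own chunk; with the default limit the whole body fits in one chunk spanning lines 1 to 6.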

Embedding

Chunks are embedded into dense vectors using a local Model2Vec model by Minish — static embeddings that run on CPU with no external services or GPU required. The model is downloaded from HuggingFace to the local cache on first use.

[embedding_model]
provider = "model2vec"
name = "minishlab/potion-base-8M"

The default is potion-base-8M, a good balance of size and quality. The full POTION family:

| Model | Parameters | Notes |
| --- | --- | --- |
| minishlab/potion-base-2M | 2M | Smallest, fastest |
| minishlab/potion-base-8M | 8M | Default — good balance |
| minishlab/potion-base-32M | 32M | Higher quality, slower |
| minishlab/potion-retrieval-32M | 32M | Optimized for retrieval tasks |
| minishlab/potion-multilingual-128M | 128M | 101 languages |

Any Model2Vec-compatible model on HuggingFace works — set the name to its model ID. You can pin a specific revision for reproducibility.

Storage

Two Parquet files are written to .mdvs/:

  • files.parquet — one row per file. Contains the filename, all frontmatter fields (in a single Struct column), a content hash, and a build timestamp.
  • chunks.parquet — one row per chunk. Contains the chunk’s position (file, index, line range) and its embedding vector.

files.parquet holds your frontmatter as structured data, which is what --where filters query against. chunks.parquet holds the vectors that similarity search operates on. The two are joined by file ID at query time.

Incremental builds

Build only re-embeds what changed. Each file’s markdown body (excluding frontmatter) is hashed, and the hash is compared against the existing index:

| Classification | Condition | Action |
| --- | --- | --- |
| New | File not in index | Chunk, embed, add |
| Edited | Hash changed | Re-chunk, re-embed, replace chunks |
| Unchanged | Hash matches | Keep existing chunks |
| Removed | In index but not on disk | Drop file and its chunks |

Frontmatter-only changes (adding a tag, fixing a typo in author) update files.parquet without re-embedding — the body hash hasn’t changed, so the vectors are still valid.

When nothing needs embedding, the model isn’t even loaded. A --force flag triggers a full rebuild regardless of hashes.
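The classification can be sketched as a comparison of body hashes. The code below is illustrative only: the hash algorithm (SHA-256 here), the frontmatter stripping, and the names `body_hash` and `classify` are assumptions, not details taken from mdvs.

```python
import hashlib


def body_hash(text: str) -> str:
    """Hash the markdown body only, skipping a leading ----fenced frontmatter block."""
    lines = text.splitlines(keepends=True)
    if lines and lines[0].strip() == "---":
        for i in range(1, len(lines)):
            if lines[i].strip() == "---":
                lines = lines[i + 1:]  # keep everything after the closing fence
                break
    return hashlib.sha256("".join(lines).encode("utf-8")).hexdigest()


def classify(on_disk: dict, indexed: dict) -> dict:
    """on_disk maps path -> file text; indexed maps path -> stored body hash."""
    result = {}
    for path, text in on_disk.items():
        if path not in indexed:
            result[path] = "new"
        elif indexed[path] != body_hash(text):
            result[path] = "edited"
        else:
            result[path] = "unchanged"
    for path in indexed:
        if path not in on_disk:
            result[path] = "removed"
    return result


before = "---\ntags: [a]\n---\nBody text\n"
after = "---\ntags: [a, b]\n---\nBody text\n"  # frontmatter-only change
indexed = {"a.md": body_hash(before), "b.md": body_hash("x"), "gone.md": "deadbeef"}
on_disk = {"a.md": after, "b.md": "y", "c.md": "z"}
status = classify(on_disk, indexed)
print(status)
```

Because only the body is hashed, `a.md` is reported as unchanged even though a tag was added, which matches the frontmatter-only case described above.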

How search works

When you run mdvs search "query" example_kb:

  1. The query text is embedded with the same model used during build
  2. Every chunk’s embedding is compared to the query via cosine similarity
  3. For each file, only the best chunk score is kept — a file with one highly relevant section ranks above a file with uniformly mediocre content
  4. Results are sorted by score (highest first) and limited by --limit (default 10)

This is brute-force search — every chunk is compared. For the typical vault size (hundreds to low thousands of files), this is fast enough. The entire search runs in-process with no external services.
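The ranking steps above can be written out in plain Python. This is a sketch of the technique (brute-force cosine similarity with best-chunk-per-file aggregation), not mdvs's code; `cosine` and `rank_files` are hypothetical names and the vectors are toy two-dimensional examples.

```python
import math


def cosine(a, b) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_files(query_vec, chunks, limit: int = 10) -> list:
    """chunks is an iterable of (file, vector); returns [(file, best_score), ...]."""
    best = {}
    for file, vec in chunks:  # brute force: every chunk is scored
        score = cosine(query_vec, vec)
        if file not in best or score > best[file]:
            best[file] = score  # keep only the best chunk per file
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]


chunks = [
    ("a.md", [1.0, 0.0]),  # one highly relevant chunk
    ("a.md", [0.0, 1.0]),  # and one irrelevant chunk
    ("b.md", [0.6, 0.8]),  # uniformly mediocre
]
results = rank_files([1.0, 0.0], chunks)
print([file for file, _ in results])
```

Here `a.md` ranks first because its best chunk matches the query exactly, even though its other chunk scores zero.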

Scores

The score column in search output is cosine similarity, where higher means more similar. Cosine similarity can in principle range from -1 to 1, though scores for related text typically fall between 0 and 1. Scores depend on the model and the content, so there’s no universal threshold for “relevant.” Compare scores relative to each other within a single query.

Filtering with --where

Add a SQL filter to narrow results by frontmatter fields:

mdvs search "calibration" example_kb --where "status = 'active'"

The --where clause filters on frontmatter fields — only files that match the filter are included in the results. The filter and similarity ranking are combined in a single query, so files that don’t match are excluded efficiently.

You can use any SQL expression that DataFusion supports:

--where "draft = false"
--where "status = 'active' AND author = 'Giulia Ferretti'"
--where "sample_count > 10"

Array fields, nested objects, and field names with special characters require specific syntax — see the Search Guide for the full reference.

Model identity

Search refuses to run if the model configured in mdvs.toml doesn’t match the model that was used to build the index. This is a hard error, not a warning.

Embeddings from different models are incompatible — cosine similarity between vectors from different models produces meaningless scores. If you change the model, rebuild the index with mdvs build --force.

Commands

mdvs provides seven commands covering the full workflow — from schema setup to search.

Schema & validation:

  • init — Scan a directory, infer a typed schema, and write mdvs.toml
  • check — Validate frontmatter against the schema
  • update — Re-scan files, infer new fields, and update the schema

Search index:

  • build — Validate, embed, and write the search index
  • search — Query the index with natural language

Utilities:

  • info — Show config and index status
  • clean — Delete the search index

init

Scan a directory, infer a typed schema, and write mdvs.toml.

Usage

mdvs init [path] [flags]

Flags

| Flag | Default | Description |
| --- | --- | --- |
| path | . | Directory to scan |
| --glob | ** | Glob pattern for matching markdown files |
| --force | | Overwrite existing mdvs.toml |
| --dry-run | | Preview the inferred schema without writing anything |
| --ignore-bare-files | | Exclude files without YAML frontmatter |
| --skip-gitignore | | Don’t read .gitignore patterns during scan |

Global flags (-o, -v, --logs) are described in Configuration.

What it does

init scans every markdown file, extracts YAML frontmatter, infers a typed schema with path patterns, and writes mdvs.toml. It does not build the search index — run build or search for that.

See Getting Started for a full walkthrough with output, and Schema Inference for how types and path patterns are computed.

One artifact is created: mdvs.toml — the schema file. Commit this to version control.

If mdvs.toml or .mdvs/ already exists, init refuses to run unless you pass --force. With --force, both mdvs.toml and .mdvs/ are deleted before proceeding. To update an existing schema without overwriting it, use update instead.

init --force vs update --reinfer-all

Both re-infer the schema from scratch, but they differ in scope:

  • init --force overwrites the entire mdvs.toml — all sections, including [scan], [fields], and any build sections. Any manual edits are lost. .mdvs/ is also deleted.
  • update --reinfer-all re-infers only the [fields] section. All other config is preserved.

Output

Compact (default)

mdvs init example_kb
Initialized 43 files — 37 field(s)

╭─────────────────────┬───────────────────────┬───────┬────────────────────────╮
│ "action_items"      │ String[]              │ 9/43  │                        │
│ "algorithm"         │ String                │ 2/43  │                        │
│ "ambient_humidity"  │ Float                 │ 1/43  │                        │
│ ...                 │                       │       │                        │
│ "drift_rate"        │ Float?                │ 3/43  │                        │
│ ...                 │                       │       │                        │
│ "lab section"       │ String                │ 4/43  │ use "field name" in -- │
│                     │                       │       │ where                  │
│ ...                 │                       │       │                        │
│ "title"             │ String                │ 37/43 │                        │
│ "wavelength_nm"     │ Float                 │ 3/43  │                        │
╰─────────────────────┴───────────────────────┴───────┴────────────────────────╯

Initialized mdvs in 'example_kb'

Each row shows the field name, inferred type, how many files contain it (e.g., 9/43), and optional hints for --where syntax (see Search Guide for details on quoting and escaping). The ? suffix on a type (e.g., Float?) means the field is nullable.

Verbose (-v)

mdvs init example_kb -v
Initialized 43 files — 37 field(s)

╭────────────────────────────────┬────────────────────────┬────────────────────╮
│ "action_items"                 │ String[]               │ 9/43               │
├────────────────────────────────┴────────────────────────┴────────────────────┤
│   required:                                                                  │
│     - "meetings/all-hands/**"                                                │
│     - "projects/alpha/meetings/**"                                           │
│     - "projects/beta/meetings/**"                                            │
│   allowed:                                                                   │
│     - "meetings/**"                                                          │
│     - "projects/alpha/meetings/**"                                           │
│     - "projects/beta/meetings/**"                                            │
╰──────────────────────────────────────────────────────────────────────────────╯
╭───────────────────────────────────┬─────────────────────┬────────────────────╮
│ "ambient_humidity"                │ Float               │ 1/43               │
├───────────────────────────────────┴─────────────────────┴────────────────────┤
│   allowed:                                                                   │
│     - "projects/alpha/notes/**"                                              │
╰──────────────────────────────────────────────────────────────────────────────╯
╭──────────────────────────────┬──────────────────────────┬────────────────────╮
│ "drift_rate"                 │ Float?                   │ 3/43               │
├──────────────────────────────┴──────────────────────────┴────────────────────┤
│   required:                                                                  │
│     - "projects/alpha/notes/**"                                              │
│   allowed:                                                                   │
│     - "projects/alpha/notes/**"                                              │
│   nullable: true                                                             │
╰──────────────────────────────────────────────────────────────────────────────╯
...

Verbose output shows each field as a record with its required and allowed glob patterns. Fields with required = [] omit the required line. Nullable fields show nullable: true.

Examples

Preview the schema

Use --dry-run to see what init would infer without writing anything:

mdvs init example_kb --dry-run --force

Nothing is written — the output shows the same discovery table, followed by (dry run, nothing written).

Exclude bare files

By default, files without frontmatter are included in the scan. This affects field counts — a bare file at the root means title appears in 37/43 files instead of 37/37:

mdvs init example_kb --dry-run --force --ignore-bare-files
Initialized 37 files — 37 field(s) (dry run)

╭─────────────────────┬───────────────────────┬───────┬────────────────────────╮
│ ...                 │                       │       │                        │
│ "title"             │ String                │ 37/37 │                        │
│ ...                 │                       │       │                        │
╰─────────────────────┴───────────────────────┴───────┴────────────────────────╯

With --ignore-bare-files, only 37 files are scanned and title becomes 37/37. This also affects the inferred required patterns — without bare files diluting the counts, more fields can be required in broader paths.

Errors

| Error | Cause |
| --- | --- |
| mdvs.toml already exists | Config exists and --force not passed |
| is not a directory | Path doesn’t exist or isn’t a directory |
| no markdown files found | No .md files match the glob pattern |

check

Validate frontmatter against the schema.

Usage

mdvs check [path] [flags]

Flags

| Flag | Default | Description |
| --- | --- | --- |
| path | . | Directory containing mdvs.toml |
| --no-update | | Skip auto-update before validating |

Global flags (-o, -v, --logs) are described in Configuration.

What it does

check reads mdvs.toml, scans every markdown file, and validates each field value against the declared constraints.

By default, check auto-updates the schema before validating (see [check].auto_update). Use --no-update to skip this and validate against the current mdvs.toml as-is.

It reports four kinds of violations:

  • WrongType — value doesn’t match the declared type
  • Disallowed — field appears in a file whose path doesn’t match any allowed glob
  • MissingRequired — file matches a required glob but the field is absent
  • NullNotAllowed — field is null but nullable = false

Fields not in mdvs.toml (and not in the ignore list) are reported as new fields — these are informational and don’t count as violations.

Validation itself is read-only: check never modifies your markdown files, and with --no-update it leaves mdvs.toml untouched as well. See Validation for the full rules, including type leniency and null handling.

Output

Compact (default)

When everything passes:

mdvs check example_kb
Checked 43 files — no violations

When violations are found:

Checked 43 files — 3 violation(s), 1 new field(s)

╭──────────────────────────┬─────────────────────────────┬─────────────────────╮
│ "drift_rate"             │ NullNotAllowed              │ 1 file              │
│ "priority"               │ WrongType                   │ 2 files             │
│ "title"                  │ MissingRequired             │ 6 files             │
╰──────────────────────────┴─────────────────────────────┴─────────────────────╯

╭──────────────────────────────┬─────────────────────┬─────────────────────────╮
│ "algorithm"                  │ new                 │ 2 files                 │
╰──────────────────────────────┴─────────────────────┴─────────────────────────╯

Each violation row shows the field name, violation kind, and how many files are affected. New fields appear in a separate table below.

Verbose (-v)

Checked 43 files — 3 violation(s), 1 new field(s)

╭────────────────────────────┬────────────────────────────┬────────────────────╮
│ "drift_rate"               │ NullNotAllowed             │ 1 file             │
├────────────────────────────┴────────────────────────────┴────────────────────┤
│   - "projects/alpha/notes/experiment-2.md"                                   │
╰──────────────────────────────────────────────────────────────────────────────╯
╭────────────────────────────┬─────────────────────────┬───────────────────────╮
│ "priority"                 │ WrongType               │ 2 files               │
├────────────────────────────┴─────────────────────────┴───────────────────────┤
│   - "projects/beta/notes/initial-findings.md" (got String)                   │
│   - "projects/beta/overview.md" (got String)                                 │
╰──────────────────────────────────────────────────────────────────────────────╯
╭───────────────────────┬───────────────────────────────┬──────────────────────╮
│ "title"               │ MissingRequired               │ 6 files              │
├───────────────────────┴───────────────────────────────┴──────────────────────┤
│   - "README.md"                                                              │
│   - "lab-values.md"                                                          │
│   - "reference/glossary.md"                                                  │
│   - "reference/quick-start.md"                                               │
│   - "reference/tools.md"                                                     │
│   - "scratch.md"                                                             │
╰──────────────────────────────────────────────────────────────────────────────╯

╭──────────────────────────────┬─────────────────────┬─────────────────────────╮
│ "algorithm"                  │ new                 │ 2 files                 │
├──────────────────────────────┴─────────────────────┴─────────────────────────┤
│   - "projects/beta/notes/initial-findings.md"                                │
│   - "projects/beta/notes/replication.md"                                     │
╰──────────────────────────────────────────────────────────────────────────────╯

Verbose output expands each violation into a record with the offending file paths. WrongType violations include the actual type in parentheses (e.g., got String).

Exit codes

| Code | Meaning |
| --- | --- |
| 0 | All files valid — no violations |
| 1 | Violations found |
| 2 | Pipeline error (missing mdvs.toml, invalid config, scan failure) |

New fields don’t affect the exit code — they’re informational only.

Errors

| Error | Cause |
| --- | --- |
| no mdvs.toml found | Config doesn’t exist — run mdvs init first |
| mdvs.toml is invalid | TOML parsing or schema error — fix the file or run mdvs init --force |

update

Re-scan files, infer new fields, and update the schema.

Usage

mdvs update [path] [flags]

Flags

| Flag | Default | Description |
| --- | --- | --- |
| path | . | Directory containing mdvs.toml |
| --reinfer <field> | | Re-infer a specific field (repeatable) |
| --reinfer-all | | Re-infer all fields from scratch |
| --dry-run | | Preview changes without writing anything |

--reinfer and --reinfer-all cannot be used together.

Global flags (-o, -v, --logs) are described in Configuration.

What it does

update re-scans the directory using the existing [scan] config, infers types and path patterns from the current files, and merges the results into mdvs.toml. Unlike init, it preserves all existing configuration — only the [fields] section changes.

Default mode

By default, update only discovers new fields — fields that appear in frontmatter but aren’t yet in mdvs.toml (either as [[fields.field]] entries or in the ignore list). Existing fields are protected: their types, allowed/required patterns, and nullable flags don’t change.

Fields that disappear (no longer in any file) are kept in mdvs.toml by default. This is conservative — removing a field from the schema is an explicit action.

--reinfer

Re-infer one or more specific fields. The named fields are removed from mdvs.toml and re-inferred from scratch, as if they’d never been seen. All other fields stay protected.

mdvs update example_kb --reinfer drift_rate --reinfer priority

Fails if a named field isn’t in mdvs.toml.

--reinfer-all

Re-infer every field from scratch. All [[fields.field]] entries are removed and rebuilt from the current files. Fields that no longer exist in any file are reported as removed.

All other config sections ([scan], [embedding_model], [chunking], [search], [update]) are preserved. This is the key difference from init --force, which rewrites the entire mdvs.toml.

Output

Compact (default)

When the schema is already up to date:

Scanned 43 files — no changes (dry run)

When new fields are discovered:

Scanned 44 files — 1 field(s) changed (dry run)

╭────────────────────────┬───────────────────┬───────────────────┬─────────────╮
│ "category"             │ added             │ String            │             │
╰────────────────────────┴───────────────────┴───────────────────┴─────────────╯

When --reinfer detects a type change:

Scanned 44 files — 2 field(s) changed (dry run)

╭────────────────────────┬───────────────────┬───────────────────┬─────────────╮
│ "category"             │ added             │ String            │             │
╰────────────────────────┴───────────────────┴───────────────────┴─────────────╯
╭───────────────────────────────────────────┬──────────────────────────────────╮
│ "drift_rate"                              │ type                             │
╰───────────────────────────────────────────┴──────────────────────────────────╯

When a reinferred field no longer exists:

Scanned 43 files — 1 field(s) changed (dry run)

╭────────────────────────────────────────┬─────────────────────────────────────╮
│ "category"                             │ removed                             │
╰────────────────────────────────────────┴─────────────────────────────────────╯

Verbose (-v)

Added fields show the inferred path patterns:

Scanned 44 files — 1 field(s) changed (dry run)

╭─────────────────────────────┬───────────────────────┬────────────────────────╮
│ "category"                  │ added                 │ String                 │
├─────────────────────────────┴───────────────────────┴────────────────────────┤
│   found in:                                                                  │
│     - "projects/alpha/notes/**"                                              │
╰──────────────────────────────────────────────────────────────────────────────╯

Changed fields show old and new values for each aspect that differs:

╭────────────────────────┬──────────────────┬────────────────┬─────────────────╮
│ field                  │ aspect           │ old            │ new             │
│ "drift_rate"           │ type             │ Float          │ String          │
╰────────────────────────┴──────────────────┴────────────────┴─────────────────╯

Removed fields show where they were previously allowed:

╭──────────────────────────────┬───────────────────────────┬───────────────────╮
│ "category"                   │ removed                   │                   │
├──────────────────────────────┴───────────────────────────┴───────────────────┤
│   previously in:                                                             │
│     - "projects/**"                                                          │
╰──────────────────────────────────────────────────────────────────────────────╯

Verbose output also shows the pipeline steps before the result (Read config, Scan, Infer, Write config, etc.).

Exit codes

| Code | Meaning |
| --- | --- |
| 0 | Success (changes written, or no changes needed) |
| 2 | Pipeline error (missing config, scan failure, build failure) |

Errors

| Error | Cause |
| --- | --- |
| no mdvs.toml found | Config doesn’t exist — run mdvs init first |
| field '<name>' is not in mdvs.toml | --reinfer names a field that doesn’t exist |
| cannot use --reinfer and --reinfer-all together | Conflicting flags |
| field name conflicts with internal column | New field name collides with reserved names |

build

Validate, embed, and write the search index.

Usage

mdvs build [path] [flags]

Flags

| Flag | Default | Description |
| --- | --- | --- |
| path | . | Directory containing mdvs.toml |
| --set-model | | Change embedding model (requires --force) |
| --set-revision | | Pin model to a specific HuggingFace revision (requires --force) |
| --set-chunk-size | | Change max chunk size in characters (requires --force) |
| --force | | Confirm config changes or trigger a full rebuild |
| --no-update | | Skip auto-update before building |

Global flags (-o, -v, --logs) are described in Configuration.

What it does

build creates (or updates) the search index in .mdvs/. The pipeline:

  1. Read config — parse mdvs.toml. If [embedding_model], [chunking], or [search] sections are missing, they’re added with defaults and written back.
  2. Scan — walk the directory and extract frontmatter.
  3. Validate — check frontmatter against the schema (same as check). If violations are found, the build aborts.
  4. Classify — compare scanned files against the existing index to determine what needs embedding.
  5. Load model — download or load the cached embedding model. Skipped if nothing needs embedding.
  6. Embed — chunk and embed new/edited files.
  7. Write index — write files.parquet and chunks.parquet to .mdvs/.

By default, build auto-updates the schema before building (see [build].auto_update). Use --no-update to skip this.

See Search & Indexing for details on chunking, embedding, and how the index is structured.

Incremental builds

Build is incremental by default. It classifies each file by comparing its content hash against the existing index:

| Status | Condition | Action |
| --- | --- | --- |
| new | file not in existing index | chunk + embed |
| edited | file in index, content changed | chunk + re-embed |
| unchanged | file in index, content matches | keep existing chunks |
| removed | file in index, no longer on disk | drop from index |

Content hash covers the file body only (after frontmatter extraction). Frontmatter-only changes don’t trigger re-embedding — but files.parquet is always rewritten with fresh frontmatter from the current scan.

When nothing needs embedding, the model is never loaded.

Config changes

build detects when the embedding configuration has changed since the last build by comparing mdvs.toml against metadata stored in the parquet files. If a mismatch is found, the build refuses to proceed unless you pass --force:

config changed since last build:
  model: 'minishlab/potion-base-8M' → 'minishlab/potion-base-32M'
Use --force to rebuild with new config

The --set-model, --set-revision, and --set-chunk-size flags update mdvs.toml and require --force (since they change the config and trigger a full re-embed). For example, to switch to a larger model:

mdvs build --set-model minishlab/potion-base-32M --force

--set-revision pins the model to a specific HuggingFace commit SHA, ensuring reproducible embeddings even if the model is updated upstream:

mdvs build --set-revision abc123def --force

The revision is stored in mdvs.toml under [embedding_model].revision and checked against the parquet metadata on subsequent builds. See Embedding for the full list of available models.

On the first build (no existing .mdvs/), --force is never needed.
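The config guard described in this section amounts to a diff between the current settings and the ones recorded at build time. The sketch below assumes both are available as flat dictionaries; the key names (`model`, `chunk_size`) and the helpers `config_changes` and `ensure_buildable` are hypothetical, and `stored=None` stands for a first build with no existing .mdvs/.

```python
def config_changes(current: dict, stored: dict) -> list:
    """List settings that differ between mdvs.toml and the stored index metadata."""
    changes = []
    for key in sorted(current.keys() | stored.keys()):
        if current.get(key) != stored.get(key):
            changes.append(f"{key}: {stored.get(key)!r} → {current.get(key)!r}")
    return changes


def ensure_buildable(current: dict, stored, force: bool = False) -> bool:
    """Return True when a full rebuild is needed; refuse a changed config without force."""
    if stored is None:
        return True  # first build: nothing to compare against, no force needed
    changes = config_changes(current, stored)
    if changes and not force:
        raise RuntimeError(
            "config changed since last build:\n  "
            + "\n  ".join(changes)
            + "\nUse --force to rebuild with new config"
        )
    return bool(changes)


stored = {"model": "minishlab/potion-base-8M", "chunk_size": 1024}
current = {"model": "minishlab/potion-base-32M", "chunk_size": 1024}
changes = config_changes(current, stored)
print(changes)
```

Calling `ensure_buildable(current, stored)` in this situation raises, while passing `force=True` returns True to signal a full re-embed.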

Output

Compact (default)

Incremental build with one new file:

Built index — 44 files, 60 chunks

╭──────────────────────────┬─────────────────────────┬─────────────────────────╮
│ embedded                 │ 1 file                  │ 1 chunk                 │
│ unchanged                │ 43 files                │ 59 chunks               │
╰──────────────────────────┴─────────────────────────┴─────────────────────────╯

When nothing needs embedding:

Built index — 43 files, 59 chunks

╭──────────────────────────┬─────────────────────────┬─────────────────────────╮
│ unchanged                │ 43 files                │ 59 chunks               │
╰──────────────────────────┴─────────────────────────┴─────────────────────────╯

When violations are found, the build aborts:

Build aborted — 6 violation(s) found. Run `mdvs check` for details.

Verbose (-v)

Read config: example_kb/mdvs.toml
Scan: 44 files
Validate: 44 files — no violations
Classify: 44 files (full rebuild)
Load model: "minishlab/potion-base-8M" (256d)
Embed: 44 files (60 chunks)
Write index: 44 files, 60 chunks

Built index — 44 files, 60 chunks (full rebuild)

╭─────────────────────────┬─────────────────────────┬──────────────────────────╮
│ embedded                │ 44 files                │ 60 chunks                │
├─────────────────────────┴─────────────────────────┴──────────────────────────┤
│   - "README.md" (7 chunks)                                                   │
│   - "blog/drafts/grant-ideas.md" (2 chunks)                                  │
│   - "blog/drafts/upcoming-talk.md" (1 chunk)                                 │
│   ...                                                                        │
│   - "scratch.md" (1 chunk)                                                   │
╰──────────────────────────────────────────────────────────────────────────────╯

Verbose output shows each pipeline step with its result, and expands embedded files with per-file chunk counts.
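
The Classify step above decides which files land in the embedded and unchanged rows. As a rough sketch of that decision, assuming it compares a hash of each file's body with the hash stored at the last build (the index does store a content_hash of the file body; the SHA-256 choice, the frontmatter splitting, and all function names here are illustrative assumptions, not mdvs internals):

```python
import hashlib


def split_body(text: str) -> str:
    """Return the markdown body, dropping a leading --- fenced YAML block."""
    lines = text.splitlines(keepends=True)
    if lines and lines[0].strip() == "---":
        for i in range(1, len(lines)):
            if lines[i].strip() == "---":
                return "".join(lines[i + 1:])
    return text


def body_hash(text: str) -> str:
    """Hash only the body, so frontmatter edits do not change the result."""
    # SHA-256 is an assumption; the real hashing algorithm is not documented here.
    return hashlib.sha256(split_body(text).encode("utf-8")).hexdigest()


def classify(files: dict, previous: dict) -> dict:
    """files: {path: current text}; previous: {path: hash from the last build}."""
    result = {"embedded": [], "unchanged": []}
    for path, text in sorted(files.items()):
        state = "unchanged" if previous.get(path) == body_hash(text) else "embedded"
        result[state].append(path)
    return result
```

Under this assumption, editing only the frontmatter of a note leaves it in the unchanged row, while a new file or an edited body is embedded.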

Exit codes

| Code | Meaning |
|------|---------|
| 0 | Build completed successfully |
| 1 | Violations found; build aborted |
| 2 | Pipeline error (missing config, scan failure, config mismatch, model failure) |
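
In a script, the exit codes above can be turned into readable outcomes. A minimal sketch, assuming mdvs is on PATH; the function names are hypothetical helpers, not part of mdvs:

```python
import subprocess

# Exit codes documented for `mdvs build`.
BUILD_EXIT_MEANINGS = {
    0: "build completed",
    1: "violations found, build aborted",
    2: "pipeline error",
}


def describe_build_exit(code: int) -> str:
    """Translate an `mdvs build` exit code into a short label."""
    return BUILD_EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_build(path: str = ".") -> str:
    """Run `mdvs build <path>` and return the label for its exit code."""
    result = subprocess.run(["mdvs", "build", path])
    return describe_build_exit(result.returncode)
```

Distinguishing 1 from 2 lets a wrapper treat schema violations (fix the notes, then rerun) differently from pipeline failures (fix the setup).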

Errors

| Error | Cause |
|-------|-------|
| no mdvs.toml found | Config doesn’t exist; run mdvs init first |
| config changed since last build | Config differs from parquet metadata; use --force |
| --set-model requires --force | Changing model triggers full re-embed |
| --set-chunk-size requires --force | Changing chunk size triggers full re-embed |
| dimension mismatch | Model produces different dimensions than existing index (incremental build only; --force bypasses this) |

search

Query the index with natural language.

Usage

mdvs search <query> [path] [flags]

Flags

| Flag | Default | Description |
|------|---------|-------------|
| query | (required) | Natural language search query |
| path | . | Directory containing mdvs.toml |
| --limit / -n | 10 | Maximum number of results |
| --where | | SQL WHERE clause for filtering on frontmatter fields |
| --no-update | | Skip auto-update |
| --no-build | | Skip auto-build before searching |

The default limit can be changed in mdvs.toml via [search].default_limit.

Global flags (-o, -v, --logs) are described in Configuration.

What it does

search loads the index from .mdvs/, embeds the query into a vector using the same model that built the index, and ranks files by cosine similarity. Each file’s score is the best chunk match — the highest similarity across all its chunks. Results are sorted descending (higher = more similar).
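
The best-chunk scoring described above can be sketched in a few lines. This is an illustration of the ranking rule only, with toy two-dimensional vectors; the function names are hypothetical and the real search runs inside the index query rather than in Python:

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_files(query_vec, chunks, limit=10):
    """chunks: iterable of (filepath, chunk_vector) pairs.

    Each file is scored by its best-matching chunk; results are sorted
    by score, highest first, and truncated to `limit`.
    """
    best = {}
    for path, vec in chunks:
        score = cosine(query_vec, vec)
        if path not in best or score > best[path]:
            best[path] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]


# Toy 2-d vectors; real embeddings have hundreds of dimensions.
chunks = [
    ("a.md", [1.0, 0.0]),
    ("a.md", [0.0, 1.0]),
    ("b.md", [1.0, 1.0]),
]
results = rank_files([0.0, 1.0], chunks)
```

Here a.md ranks first because one of its chunks matches the query exactly, even though its other chunk does not match at all.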

By default, search auto-builds the index before querying, which includes auto-updating the schema (see [search].auto_build). Use --no-build to query the existing index as-is, or --no-update to build without updating the schema first.

See Search & Indexing for details on chunking, embedding, scoring, and model identity.

First run

Note: The very first time search (or build) runs, mdvs downloads the embedding model from HuggingFace to a local cache. This is a one-time download — subsequent runs use the cached model and start instantly.

Download size depends on the model:

| Model | Size |
|-------|------|
| potion-base-2M | ~8 MB |
| potion-base-8M (default) | ~30 MB |
| potion-base-32M | ~120 MB |
| potion-multilingual-128M | ~480 MB |

After the model is cached, a full build of 500+ files completes in under a second.

--where

Filter results by frontmatter fields using SQL syntax. The filter and similarity ranking are combined in a single query, so files that don’t match are excluded efficiently.

Scalar comparisons:

mdvs search "experiment" --where "status = 'active'"
mdvs search "experiment" --where "sample_count > 20"
mdvs search "experiment" --where "status = 'active' AND priority = 1"

Array fields (via DataFusion array functions):

mdvs search "calibration" --where "array_has(tags, 'biosensor')"

Field names with spaces need double-quoting:

mdvs search "query" --where "\"lab section\" = 'optics'"

See Search Guide for the full --where reference, including nested objects, escaping rules, and more examples.

Output

Compact (default)

mdvs search "experiment" example_kb
Searched "experiment" — 10 hits

╭───────────┬────────────────────────────────────────────────────┬─────────────╮
│ 1         │ "projects/archived/gamma/lessons-learned.md"       │ 0.487       │
│ 2         │ "blog/published/2031/founding-story.md"            │ 0.470       │
│ 3         │ "projects/archived/gamma/post-mortem.md"           │ 0.457       │
│ 4         │ "projects/alpha/notes/experiment-3.md"             │ 0.420       │
│ 5         │ "blog/drafts/grant-ideas.md"                       │ 0.406       │
│ ...       │                                                    │             │
╰───────────┴────────────────────────────────────────────────────┴─────────────╯

Each row shows rank, file path (relative to the directory containing mdvs.toml), and cosine similarity score.

With --where filtering:

mdvs search "experiment" example_kb --where "status = 'active'" -n 5
Searched "experiment" — 3 hits

╭───────────────┬──────────────────────────────────────────┬───────────────────╮
│ 1             │ "projects/alpha/overview.md"             │ 0.391             │
│ 2             │ "projects/beta/overview.md"              │ 0.358             │
│ 3             │ "projects/alpha/budget.md"               │ 0.001             │
╰───────────────┴──────────────────────────────────────────┴───────────────────╯

Verbose (-v)

mdvs search "experiment" example_kb -v -n 3
Searched "experiment" — 3 hits

╭──────────┬─────────────────────────────────────────────────────┬─────────────╮
│ 1        │ "projects/archived/gamma/lessons-learned.md"        │ 0.487       │
├──────────┴─────────────────────────────────────────────────────┴─────────────┤
│   lines 17-19:                                                               │
│                                                                              │
│     ## On Timelines                                                          │
╰──────────────────────────────────────────────────────────────────────────────╯
╭────────────┬─────────────────────────────────────────────────┬───────────────╮
│ 2          │ "blog/published/2031/founding-story.md"         │ 0.470         │
├────────────┴─────────────────────────────────────────────────┴───────────────┤
│   lines 11-11:                                                               │
│     # How Prismatiq Started                                                  │
╰──────────────────────────────────────────────────────────────────────────────╯
╭───────────┬──────────────────────────────────────────────────┬───────────────╮
│ 3         │ "projects/archived/gamma/post-mortem.md"         │ 0.457         │
├───────────┴──────────────────────────────────────────────────┴───────────────┤
│   lines 1-11:                                                                │
│     ---                                                                      │
│     title: "Project Gamma — Post-Mortem"                                     │
│     ...                                                                      │
╰──────────────────────────────────────────────────────────────────────────────╯
3 hits | model: "minishlab/potion-base-8M" | limit: 3

Verbose output expands each result into a record showing the best-matching chunk text with its line range. The footer shows total hits, model name, and limit.

Exit codes

| Code | Meaning |
|------|---------|
| 0 | Search completed (even with 0 results) |
| 2 | Pipeline error (missing config, missing index, model mismatch, invalid --where) |

Errors

| Error | Cause |
|-------|-------|
| no mdvs.toml found | Config doesn’t exist; run mdvs init first |
| index not found | .mdvs/ doesn’t exist; run mdvs build first |
| model mismatch | Config model differs from index; run mdvs build to rebuild |
| Invalid --where | SQL syntax error or unknown field name |

info

Show config and index status.

Usage

mdvs info [path]

Flags

| Flag | Default | Description |
|------|---------|-------------|
| path | . | Directory containing mdvs.toml |

Global flags (-o, -v, --logs) are described in Configuration.

What it does

info reads mdvs.toml, counts files on disk, and reads the index metadata from .mdvs/ (if it exists). It displays the current schema and index status without modifying anything.

Use it to check which fields are configured, whether the index is up to date, or if the config has changed since the last build.

Output

Compact (default)

mdvs info example_kb
43 files, 37 fields, 59 chunks

╭──────────────────────────────┬───────────────────────────────────────────────╮
│ model:                       │ minishlab/potion-base-8M                      │
│ config:                      │ match                                         │
│ files:                       │ 43/43                                         │
╰──────────────────────────────┴───────────────────────────────────────────────╯

╭──────────────┬───────────────┬───────────────┬───────────────┬───────────────╮
│ "title"      │ String        │ required: "bl │ allowed: "blo │               │
│              │               │ og/**", ...   │ g/**", ...    │               │
│ "tags"       │ String[]      │ required: "bl │ allowed: "blo │               │
│              │               │ og/published/ │ g/**", ...    │               │
│              │               │ **", ...      │               │               │
│ "draft"      │ Boolean       │ required: "bl │ allowed: "blo │               │
│              │               │ og/**"        │ g/**"         │               │
│ "drift_rate" │ Float?        │ required: "pr │ allowed: "pro │               │
│              │               │ ojects/alpha/ │ jects/alpha/n │               │
│              │               │ notes/**"     │ otes/**"      │               │
│ ...          │               │               │               │               │
╰──────────────┴───────────────┴───────────────┴───────────────┴───────────────╯

The summary line shows files on disk, field count, and chunk count. The index block shows the embedding model, whether the config matches the index (match or changed), and how many files are indexed vs on disk. The field table lists every [[fields.field]] entry with its type, required patterns, and allowed patterns.

When no index has been built:

43 files, 37 fields

The index block is omitted and the summary shows only files and fields.

Verbose (-v)

Read config: example_kb/mdvs.toml
Scan: 43 files
Read index: 43 files, 59 chunks

43 files, 37 fields, 59 chunks

╭────────────────────────────┬─────────────────────────────────────────────────╮
│ model:                     │ minishlab/potion-base-8M                        │
│ revision:                  │ none                                            │
│ chunk size:                │ 1024                                            │
│ built:                     │ 2026-03-13T22:46:02.902129+00:00                │
│ config:                    │ match                                           │
│ files:                     │ 43/43                                           │
╰────────────────────────────┴─────────────────────────────────────────────────╯

╭────────────────────────────────┬────────────────────────┬────────────────────╮
│ "action_items"                 │ String[]               │ 9/43               │
├────────────────────────────────┴────────────────────────┴────────────────────┤
│   required:                                                                  │
│     - "meetings/all-hands/**"                                                │
│     - "projects/alpha/meetings/**"                                           │
│     - "projects/beta/meetings/**"                                            │
│   allowed:                                                                   │
│     - "meetings/**"                                                          │
│     - "projects/alpha/meetings/**"                                           │
│     - "projects/beta/meetings/**"                                            │
╰──────────────────────────────────────────────────────────────────────────────╯
╭──────────────────────────────┬────────────────────────┬──────────────────────╮
│ "drift_rate"                 │ Float?                 │ 3/43                 │
├──────────────────────────────┴────────────────────────┴──────────────────────┤
│   required:                                                                  │
│     - "projects/alpha/notes/**"                                              │
│   allowed:                                                                   │
│     - "projects/alpha/notes/**"                                              │
│   nullable: true                                                             │
╰──────────────────────────────────────────────────────────────────────────────╯
...

Verbose output adds pipeline steps, the full index details (revision, chunk size, build timestamp), and expands each field into a record showing its glob patterns. The count column (e.g., 9/43) shows how many scanned files contain the field.

Exit codes

| Code | Meaning |
|------|---------|
| 0 | Success (including when no index exists) |
| 2 | Pipeline error (missing config, parquet read failure) |

Errors

| Error | Cause |
|-------|-------|
| no mdvs.toml found | Config doesn’t exist; run mdvs init first |

clean

Delete the search index.

Usage

mdvs clean [path]

Flags

| Flag | Default | Description |
|------|---------|-------------|
| path | . | Directory containing mdvs.toml |

Global flags (-o, -v, --logs) are described in Configuration.

What it does

clean deletes the .mdvs/ directory, which contains the Parquet files that make up the search index. The mdvs.toml configuration file is never touched — you can rebuild the index at any time with build.

The command is idempotent — running it when .mdvs/ doesn’t exist is a no-op. It also refuses to delete if .mdvs/ is a symlink, as a safety measure.
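
The two safety properties above (no-op when missing, refusal on symlinks) can be illustrated with a small sketch. This is not the mdvs implementation; clean_index and its return strings are hypothetical:

```python
import shutil
from pathlib import Path


def clean_index(root: str) -> str:
    """Delete <root>/.mdvs and return a short status string.

    Mirrors the documented behavior: a missing directory is a no-op,
    and a symlink is refused rather than followed.
    """
    index_dir = Path(root) / ".mdvs"
    # Check for a symlink first: a dangling symlink reports exists() == False.
    if index_dir.is_symlink():
        raise RuntimeError(f"{index_dir} is a symlink; remove it manually")
    if not index_dir.exists():
        return "nothing to clean"
    shutil.rmtree(index_dir)
    return "cleaned"
```

Only the .mdvs/ directory is touched, so a second call simply reports that there is nothing to clean.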

Output

Compact (default)

mdvs clean example_kb
Cleaned "example_kb/.mdvs"

When there’s nothing to clean:

Nothing to clean — "example_kb/.mdvs" does not exist

Verbose (-v)

Delete index: "example_kb/.mdvs" (2 files, 113.6 KB)

Cleaned "example_kb/.mdvs"

2 files | 113.6 KB

Verbose output shows the file count and total size of the deleted directory.

Exit codes

| Code | Meaning |
|------|---------|
| 0 | Success (including when nothing to clean) |
| 2 | Pipeline error (symlink detected, I/O failure) |

Errors

| Error | Cause |
|-------|-------|
| .mdvs is a symlink | Refuses to delete symlinks for safety; remove it manually |

Configuration

All configuration lives in mdvs.toml, created by init and updated by update. This page is a complete reference of every section and field.

Sections overview

mdvs.toml has two groups of sections:

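Validation (always present):

  • [scan]: file discovery
  • [check]: check command settings
  • [fields]: field definitions and ignore list

Build & search (written by init, model/chunking filled by first build):

  • [embedding_model]: embedding model for semantic search
  • [chunking]: text splitting before embedding
  • [build]: build workflow settings
  • [search]: search command settings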

Global flags

These flags apply to all commands:

| Flag | Values | Default | Description |
|------|--------|---------|-------------|
| -o, --output | text, json | text | Output format |
| -v, --verbose | | | Show detailed output (pipeline steps, expanded records) |
| --logs | info, debug, trace | (none) | Enable diagnostic logging to stderr |

[scan]

Controls how markdown files are discovered.

[scan]
glob = "**"
include_bare_files = true
skip_gitignore = false

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| glob | String | "**" | Glob pattern for matching markdown files |
| include_bare_files | Boolean | true | Include files without YAML frontmatter |
| skip_gitignore | Boolean | false | Don’t read .gitignore patterns during scan |

When include_bare_files is true, files without frontmatter participate in inference (empty field set) and validation (can trigger MissingRequired). When false, they’re excluded from the scan entirely.

[update]

Placeholder for future update-specific settings. Currently empty — this section is hidden from mdvs.toml by default.

[check]

Check command settings.

[check]
auto_update = true

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| auto_update | Boolean | false | Auto-run update before validating |

When auto_update is true, check runs the update pipeline (scan, infer, write config) before validating. Set to false or use --no-update for deterministic CI validation against the committed mdvs.toml.

[embedding_model]

Specifies the embedding model for semantic search. See Embedding for available models.

[embedding_model]
provider = "model2vec"
name = "minishlab/potion-base-8M"

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| provider | String | "model2vec" | Embedding provider (currently only "model2vec") |
| name | String | "minishlab/potion-base-8M" | HuggingFace model ID |
| revision | String | (none) | Pin to a specific HuggingFace commit SHA for reproducibility |

The provider field can be omitted — it defaults to "model2vec". The revision field only appears when explicitly set (e.g., via build --set-revision).

Changing the model or revision after a build requires build --force to re-embed all files.

[chunking]

Controls semantic text splitting before embedding.

[chunking]
max_chunk_size = 1024

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| max_chunk_size | Integer | 1024 | Maximum chunk size in characters |

The text splitter breaks each file’s body into semantic chunks respecting markdown structure (headings, paragraphs, lists). Changing the chunk size after a build requires build --force.
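
To make the effect of max_chunk_size concrete, here is a deliberately simplified stand-in for the splitter: greedy packing of blank-line-separated paragraphs under a character budget. The real splitter is structure-aware (headings, paragraphs, lists) and will produce different boundaries; chunk_paragraphs is a hypothetical name used only for this sketch:

```python
def chunk_paragraphs(body: str, max_chunk_size: int = 1024):
    """Greedily pack blank-line-separated paragraphs into chunks.

    A paragraph longer than max_chunk_size is hard-split by characters.
    Every returned chunk is at most max_chunk_size characters long.
    """
    chunks, current = [], ""
    for para in (p.strip() for p in body.split("\n\n")):
        if not para:
            continue
        # Oversized paragraph: flush what we have, then cut fixed-size pieces.
        while len(para) > max_chunk_size:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(para[:max_chunk_size])
            para = para[max_chunk_size:]
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chunk_size:
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks
```

A smaller budget yields more, shorter chunks per file, which is why changing the chunk size invalidates every stored embedding.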

[build]

Build workflow settings.

[build]
auto_update = true

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| auto_update | Boolean | false | Auto-run update before building |

When auto_update is true, build runs the update pipeline before building. Use --no-update to skip.

[search]

Settings for the search command, including how internal columns are named in --where queries.

[search]
default_limit = 10

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| default_limit | Integer | 10 | Maximum results when --limit is not specified |
| internal_prefix | String | "" | Prefix for internal column names in --where queries |
| aliases | Map | {} | Per-column name overrides for internal columns |
| auto_update | Boolean | false | Auto-run update before building (when auto_build is true) |
| auto_build | Boolean | false | Auto-run build before searching |

Internal column names

Beyond your frontmatter fields, the search index stores bookkeeping columns that mdvs uses internally. These internal columns are available in --where queries:

| Column | Contains |
|--------|----------|
| filepath | Relative file path (e.g., blog/post.md) |
| file_id | Unique identifier for each file |
| content_hash | Hash of the file body |
| built_at | Timestamp of last build |

By default, these are exposed with their raw names:

--where "filepath LIKE 'blog/%'"

If a frontmatter field name collides with an internal column (e.g., you have a field called filepath), search will error and suggest resolutions:

  1. Set a prefix to namespace all internal columns:

    [search]
    internal_prefix = "_"
    

    Internal columns become _filepath, _file_id, etc.

  2. Set a per-column alias to rename just the colliding column:

    [search.aliases]
    filepath = "path"
    

    The internal column becomes path, your frontmatter filepath stays bare.

  3. Rename the frontmatter field in your markdown files.

Aliases take precedence over the prefix. See the Search Guide for full --where reference.
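
The naming rules above (prefix applies to every internal column, an alias overrides it for one column) can be sketched as follows. The function names are hypothetical; only the column names and the precedence rule come from this page:

```python
INTERNAL_COLUMNS = ["filepath", "file_id", "content_hash", "built_at"]


def exposed_names(prefix="", aliases=None):
    """Map each internal column to the name used in --where.

    An alias wins over the prefix; otherwise the prefix is prepended.
    """
    aliases = aliases or {}
    return {col: aliases.get(col, prefix + col) for col in INTERNAL_COLUMNS}


def collisions(frontmatter_fields, prefix="", aliases=None):
    """Return frontmatter field names that clash with an exposed internal name."""
    taken = set(exposed_names(prefix, aliases).values())
    return sorted(set(frontmatter_fields) & taken)
```

With a frontmatter field called filepath, collisions(["filepath", "title"]) reports the clash, and either internal_prefix = "_" or an alias for filepath clears it.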

[fields]

Defines field constraints and the ignore list. This is the largest section — it contains one [[fields.field]] entry per constrained field.

Ignore list

[fields]
ignore = ["internal_id", "temp_notes"]

Fields in the ignore list are known but unconstrained — they skip all validation and are not reported as new fields by check or update. A field cannot be in both ignore and [[fields.field]].

Field definitions

Each [[fields.field]] entry defines constraints on a frontmatter field:

[[fields.field]]
name = "title"
type = "String"
allowed = ["blog/**", "projects/**"]
required = ["blog/**", "projects/**"]
nullable = false

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| name | String | (required) | Frontmatter key |
| type | FieldType | "String" | Expected value type |
| allowed | String[] | ["**"] | Glob patterns where the field may appear |
| required | String[] | [] | Glob patterns where the field must be present |
| nullable | Boolean | true | Whether null values are accepted |

All fields except name have permissive defaults. A minimal entry with just a name:

[[fields.field]]
name = "title"

is equivalent to:

[[fields.field]]
name = "title"
type = "String"
allowed = ["**"]
required = []
nullable = true

This is not the same as putting the field in the ignore list. Both prevent the field from being reported as new during update, but a [[fields.field]] entry tracks the field — it appears in info output with its type and patterns, and can be targeted by update --reinfer. The ignore list simply silences the field: no validation, no detail in info.

Type syntax

Scalar types are plain strings:

type = "String"    # also: "Boolean", "Integer", "Float"

Arrays use an inline table:

type = { array = "String" }

Objects use a nested inline table:

type = { object = { author = "String", count = "Integer" } }

See Types for the full type system, including widening rules.
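
The three shapes of the type value (plain string, array table, object table) nest recursively. A sketch of how a parsed frontmatter value could be checked against such a spec, under loud assumptions: matching is strict (an integer does not satisfy Float), object keys missing from the value are tolerated, keys not in the spec are rejected, and null handling is left out. The actual validation and widening rules are described in Types and may differ; matches is a hypothetical helper:

```python
SCALARS = {
    "String": str,
    "Boolean": bool,
    "Integer": int,
    "Float": float,
}


def matches(value, type_spec) -> bool:
    """Check a parsed value against a type spec as written in mdvs.toml."""
    if isinstance(type_spec, str):
        if type_spec == "Integer" and isinstance(value, bool):
            return False  # bool is a subclass of int in Python
        return isinstance(value, SCALARS[type_spec])
    if "array" in type_spec:
        # { array = "String" }: every element must match the element type.
        return isinstance(value, list) and all(
            matches(item, type_spec["array"]) for item in value
        )
    if "object" in type_spec:
        # { object = { key = type, ... } }: recurse into each known key.
        fields = type_spec["object"]
        return isinstance(value, dict) and all(
            key in fields and matches(item, fields[key])
            for key, item in value.items()
        )
    raise ValueError(f"unrecognized type spec: {type_spec!r}")
```

For example, the spec { object = { author = "String", count = "Integer" } } corresponds to the Python dict {"object": {"author": "String", "count": "Integer"}} once the TOML is parsed.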

Path patterns

allowed and required are lists of glob patterns matched against relative file paths:

allowed = ["blog/**", "projects/alpha/**"]
required = ["blog/published/**"]

Patterns must end with /* (direct children) or /** (full subtree), or be exactly * or **. Bare paths like blog or file names like blog/post.md are not valid.

The invariant required ⊆ allowed is enforced — every required glob must be covered by some allowed glob. For example, allowed = ["meetings/**"] covers required = ["meetings/all-hands/**"] because any path matching the required pattern also matches the allowed one.
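
Both rules above (the accepted pattern shapes and the required ⊆ allowed invariant) are easy to check mechanically. The sketch below assumes the directory part of a pattern is a literal path with no wildcards, and that /** includes the directory's direct children; these helper functions are illustrative, not mdvs's own checker:

```python
def parse_pattern(pattern: str):
    """Split a pattern into (directory, kind); kind is "children" or "subtree".

    Raises ValueError for shapes the rule above rejects (bare paths, file names).
    """
    if pattern == "*":
        return "", "children"
    if pattern == "**":
        return "", "subtree"
    if pattern.endswith("/**"):
        directory, kind = pattern[:-3], "subtree"
    elif pattern.endswith("/*"):
        directory, kind = pattern[:-2], "children"
    else:
        raise ValueError(f"invalid pattern: {pattern!r}")
    if not directory:
        raise ValueError(f"invalid pattern: {pattern!r}")
    return directory, kind


def covers(allowed: str, required: str) -> bool:
    """True when every path matching `required` also matches `allowed`."""
    a_dir, a_kind = parse_pattern(allowed)
    r_dir, _r_kind = parse_pattern(required)
    if a_kind == "subtree":
        # A subtree covers itself and anything nested below it.
        return a_dir == "" or r_dir == a_dir or r_dir.startswith(a_dir + "/")
    # Direct children are only covered by the same direct-children pattern.
    return _r_kind == "children" and r_dir == a_dir


def required_subset_of_allowed(allowed, required) -> bool:
    """Every required glob must be covered by some allowed glob."""
    return all(any(covers(a, r) for a in allowed) for r in required)
```

Note that projects/alpha/* does not cover projects/alpha/notes/**, which is why a field required in a notes subtree needs that subtree listed in allowed as well.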

See Schema Inference for how these patterns are computed.

Example

A representative subset from example_kb/mdvs.toml (37 fields total, 4 shown):

[scan]
glob = "**"
include_bare_files = true
skip_gitignore = false

[embedding_model]
provider = "model2vec"
name = "minishlab/potion-base-8M"

[chunking]
max_chunk_size = 1024

[search]
default_limit = 10

[fields]
ignore = []

[[fields.field]]
name = "title"
type = "String"
allowed = ["blog/**", "meetings/**", "people/**", "projects/**", "reference/protocols/**"]
required = ["blog/**", "meetings/**", "people/**", "projects/**", "reference/protocols/**"]
nullable = false

[[fields.field]]
name = "tags"
allowed = ["blog/**", "projects/alpha/*", "projects/alpha/notes/**", "projects/archived/**", "projects/beta/*", "projects/beta/notes/**"]
required = ["blog/published/**", "projects/alpha/notes/**", "projects/archived/**", "projects/beta/notes/**"]
nullable = false
type = { array = "String" }

[[fields.field]]
name = "drift_rate"
type = "Float"
allowed = ["projects/alpha/notes/**"]
required = ["projects/alpha/notes/**"]
nullable = true

[[fields.field]]
name = "calibration"
allowed = ["projects/alpha/notes/**"]
required = []
nullable = false
type = { object = { adjusted = { object = { intensity = "Float", wavelength = "Float" } }, baseline = { object = { intensity = "Float", notes = "String", wavelength = "Float" } } } }

Search Guide

The --where flag on search lets you filter results by frontmatter fields using SQL syntax. The filter is combined with similarity ranking in a single query — files that don’t match are excluded before results are returned.

Under the hood, mdvs uses DataFusion as its SQL engine, so any expression valid in DataFusion’s SQL dialect works in --where.

Scalar fields

Use bare field names for simple comparisons:

String

mdvs search "experiment" --where "status = 'active'"
mdvs search "experiment" --where "author = 'Giulia Ferretti'"
mdvs search "experiment" --where "status IN ('active', 'archived')"
mdvs search "experiment" --where "title LIKE '%sensor%'"

Numeric

mdvs search "experiment" --where "sample_count > 20"
mdvs search "experiment" --where "drift_rate >= 0.01 AND drift_rate <= 0.05"
mdvs search "experiment" --where "wavelength_nm BETWEEN 600 AND 800"
Searched "experiment" — 2 hits

╭────────────┬─────────────────────────────────────────────────┬───────────────╮
│ 1          │ "projects/alpha/notes/experiment-3.md"          │ 0.420         │
│ 2          │ "projects/alpha/notes/experiment-1.md"          │ 0.356         │
╰────────────┴─────────────────────────────────────────────────┴───────────────╯

Boolean

mdvs search "announcement" --where "draft = false"
mdvs search "ideas" --where "draft = true"

Null checks

mdvs search "notes" --where "drift_rate IS NOT NULL"
mdvs search "notes" --where "review_score IS NULL"

Combining conditions

Use AND, OR, and NOT to build compound filters:

mdvs search "experiment" --where "status = 'active' AND priority = 1"
mdvs search "notes" --where "author = 'REMO' OR author = 'Marco Bianchi'"
mdvs search "notes" --where "NOT status = 'archived'"

Array fields

Fields typed as String[] (like tags, attendees, action_items) support array functions.

Containment

mdvs search "calibration" --where "array_has(tags, 'calibration')"
Searched "calibration" — 3 hits

╭────────────┬─────────────────────────────────────────────────┬───────────────╮
│ 1          │ "projects/alpha/notes/experiment-1.md"          │ 0.478         │
│ 2          │ "projects/alpha/overview.md"                    │ 0.462         │
│ 3          │ "projects/alpha/notes/experiment-3.md"          │ 0.424         │
╰────────────┴─────────────────────────────────────────────────┴───────────────╯

The SQL-standard ANY syntax also works:

mdvs search "calibration" --where "'calibration' = ANY(tags)"

Multiple tags

Combine with AND to require multiple values:

mdvs search "calibration" --where "array_has(tags, 'calibration') AND array_has(tags, 'SPR-A1')"

Array length

mdvs search "meeting" --where "array_length(action_items) > 2"

Filtering by file path

Filter results by file path using the filepath column:

mdvs search "experiment" --where "filepath LIKE 'projects/alpha/%'"
Searched "experiment" — 3 hits

╭────────────┬─────────────────────────────────────────────────┬───────────────╮
│ 1          │ "projects/alpha/notes/experiment-3.md"          │ 0.420         │
│ 2          │ "projects/alpha/overview.md"                    │ 0.391         │
│ 3          │ "projects/alpha/meetings/2031-08-20.md"         │ 0.386         │
╰────────────┴─────────────────────────────────────────────────┴───────────────╯

File paths are stored as relative paths (e.g., projects/alpha/notes/experiment-1.md), so use LIKE with % for path prefix matching:

# All blog posts
--where "filepath LIKE 'blog/%'"

# Only published blog posts
--where "filepath LIKE 'blog/published/%'"

# Files in a nested meetings directory (a top-level meetings/ has no leading slash; use 'meetings/%' for it)
--where "filepath LIKE '%/meetings/%'"
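
To preview which relative paths a LIKE pattern would select, the pattern can be translated into a regular expression. This sketch models only the two wildcards (% for any sequence, _ for a single character) and ignores ESCAPE clauses; like is a hypothetical helper, and the second meetings path below is made up for illustration:

```python
import re


def like_to_regex(pattern: str) -> re.Pattern:
    """Translate a SQL LIKE pattern into an equivalent compiled regex."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")   # any sequence, including empty
        elif ch == "_":
            parts.append(".")    # exactly one character
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


def like(text: str, pattern: str) -> bool:
    """True when the whole of `text` matches the LIKE pattern."""
    return like_to_regex(pattern).fullmatch(text) is not None


nested = like("projects/alpha/meetings/2031-08-20.md", "%/meetings/%")
top_level = like("meetings/all-hands/kickoff.md", "%/meetings/%")
```

Because stored paths are relative, nested is true while top_level is false: the pattern requires a slash before meetings.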

Nested objects

Fields typed as Object (like calibration in example_kb) are stored as nested Struct columns. Access nested values with bracket notation:

mdvs search "sensor" --where "calibration['baseline']['wavelength'] > 600"
Searched "sensor" — 2 hits

╭────────────┬─────────────────────────────────────────────────┬───────────────╮
│ 1          │ "projects/alpha/notes/experiment-2.md"          │ 0.414         │
│ 2          │ "projects/alpha/notes/experiment-1.md"          │ 0.362         │
╰────────────┴─────────────────────────────────────────────────┴───────────────╯

The top-level field name (calibration) can be used bare. Only the nested access needs brackets:

# These are equivalent:
--where "calibration['baseline']['wavelength'] > 600"
--where "_data['calibration']['baseline']['wavelength'] > 600"

Field names with special characters

Some field names need quoting in SQL. The init, update, and info commands show hints in their output when this applies.

Spaces

Double-quote the field name:

mdvs search "query" --where "\"lab section\" = 'optics'"

Single quotes in field names

Also use double-quoting:

mdvs search "query" --where "\"author's_note\" IS NOT NULL"

Double quotes in field names

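Double the double quotes inside the identifier, then wrap the whole name in double quotes. For a field named notes"v2", the quoted identifier is "notes""v2""":

mdvs search "query" --where "\"notes\"\"v2\"\"\" = true"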

String values with special characters

To include a literal single quote inside a string value, double it:

mdvs search "query" --where "title = 'What''s New?'"

mdvs validates quote balance before running the query. If you see “unmatched single quote”, check that every ' in a value is doubled.
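
When filters are generated by a script, the quoting rules in this section can be applied programmatically. A small sketch with hypothetical helper names; it assumes that a double-quoted simple name is treated the same as the bare name, which is not confirmed here:

```python
def quote_ident(name: str) -> str:
    """Double-quote a field name, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def quote_literal(value: str) -> str:
    """Single-quote a string value, doubling any embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def eq_filter(field: str, value: str) -> str:
    """Build a `field = 'value'` expression for --where."""
    return f"{quote_ident(field)} = {quote_literal(value)}"


clause = eq_filter("lab section", "optics")
```

Passing the resulting clause as a single argument (for example in an argument list given to subprocess.run) avoids the extra layer of shell backslash-escaping shown in the command-line examples.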

Tips

  • Case sensitivity: field names and string values are case-sensitive. Use LOWER() for case-insensitive matching:

    --where "LOWER(author) = 'giulia ferretti'"
    
  • LIKE patterns: % matches any sequence, _ matches a single character:

    --where "title LIKE 'Project%'"       # starts with "Project"
    --where "title LIKE '%sensor%'"       # contains "sensor"
    
  • NULL semantics: a comparison against NULL never evaluates to true, so those files are filtered out. Use IS NULL / IS NOT NULL, not = NULL.

  • No aggregates in --where: functions like COUNT() or SUM() don’t work in --where because the filter applies per file, not across results.

Obsidian

mdvs works well with Obsidian vaults — it can validate your YAML frontmatter for consistency and provide semantic search across all your notes. Everything runs locally, no external services needed.

Setup

Point mdvs at your vault:

mdvs init path/to/vault

This scans all markdown files, infers a typed schema from your frontmatter, and writes mdvs.toml. If auto-build is enabled (the default), it also downloads the embedding model and builds the search index.

Two artifacts are created:

  • mdvs.toml — commit this to version control
  • .mdvs/ — add to .gitignore (search index, can be rebuilt)

.gitignore

mdvs respects .gitignore by default. If your vault has .obsidian/ in .gitignore (many do), those files are automatically excluded from scanning. No extra configuration needed.

.mdvsignore

For additional exclusions, create a .mdvsignore file at the vault root. It uses the same syntax as .gitignore:

# AI working directories
.claude/
.gemini/

# Template files (if using Templater)
_templates/

# Attachments (no frontmatter)
attachments/
assets/

Any directory that doesn’t contain markdown with frontmatter is a good candidate for exclusion — it speeds up scanning and avoids noise in the schema.

Common frontmatter patterns

Obsidian vaults typically use frontmatter like:

---
title: My Note
tags: [project, research]
status: active
date: 2026-03-14
draft: false
---

mdvs infers types automatically:

| Field | Inferred type | Notes |
|-------|---------------|-------|
| title | String | |
| tags | String[] | Array of strings |
| status | String | |
| date | String | No Date type yet; dates are stored as strings |
| draft | Boolean | |

Inconsistent types

If the same field has different types across notes (e.g., priority is an integer in some files and a string like "high" in others), mdvs widens to the broadest compatible type — usually String. See Types & Widening for the full rules.

Dataview fields

If you use the Dataview plugin, its inline fields (e.g., key:: value) are not picked up by mdvs — only YAML frontmatter between --- fences is scanned. Dataview fields that appear in the YAML block are handled normally.

Validation

Once mdvs.toml exists, use check to verify your frontmatter:

mdvs check path/to/vault

This catches:

  • Wrong types — a Boolean field with a string value
  • Missing required fields — a field that should be present in certain directories
  • Disallowed fields — a field appearing where it shouldn’t
  • Null violations — null where it’s not allowed

See Validation for the full rules.

Tightening constraints

The inferred schema is permissive by default. To enforce stricter rules, edit mdvs.toml directly. For example, to require tags in all daily notes:

[[fields.field]]
name = "tags"
type = { array = "String" }
allowed = ["**"]
required = ["daily/**"]
nullable = false

Updating the schema

When you introduce new frontmatter fields, run update to incorporate them:

mdvs update path/to/vault

This discovers new fields and adds them to mdvs.toml without touching existing field definitions. Use --reinfer to re-infer specific fields if you’ve reorganized your vault.

Search

Build the index and search:

mdvs build path/to/vault
mdvs search "topic of interest" path/to/vault

Filter with --where on your frontmatter:

# Only active notes
mdvs search "topic" path/to/vault --where "status = 'active'"

# Notes with a specific tag
mdvs search "topic" path/to/vault --where "array_has(tags, 'research')"

# Notes in a specific directory
mdvs search "topic" path/to/vault --where "filepath LIKE 'projects/%'"

See the Search Guide for the full --where reference.

Tips

  • Incremental builds — only notes whose body changed since the last build are re-embedded. Frontmatter-only changes (updating tags, status) don’t trigger re-embedding. Run mdvs build freely — it’s fast when nothing changed.

  • Alongside Obsidian search — mdvs search is semantic (finds conceptually related notes), while Obsidian’s built-in search is keyword-based. They complement each other.

  • Large vaults — mdvs has been tested on vaults with 500+ files and 2000+ chunks. A full build from scratch completes in under a second. Subsequent builds are incremental, re-embedding only changed files.

  • Ignore noisy fields — if some frontmatter fields are auto-generated and you don’t want to validate them, add them to the ignore list in mdvs.toml:

    [fields]
    ignore = ["cssclass", "kanban-plugin"]
    

CI

TBD — this page will cover running mdvs check as a CI linter in GitHub Actions.