Configuration
All configuration lives in mdvs.toml, created by init and updated by update. This page is a complete reference of every section and field.
Sections overview
mdvs.toml has two groups of sections:
Validation (always present):
[scan]— file discovery[check]— check command settings[fields]— field definitions and ignore list
Build & search (written by init, model/chunking filled by first build):
[embedding_model]— model identity[chunking]— chunk sizing[build]— build workflow settings[search]— search defaults and auto-build/update
Global flags
These flags apply to all commands:
| Flag | Values | Default | Description |
|---|---|---|---|
-o, --output | text, json | text | Output format |
-v, --verbose | Show detailed output (pipeline steps, expanded records) | ||
--logs | info, debug, trace | (none) | Enable diagnostic logging to stderr |
[scan]
Controls how markdown files are discovered.
[scan]
glob = "**"
include_bare_files = true
skip_gitignore = false
frontmatter_format = "auto"
| Field | Type | Default | Description |
|---|---|---|---|
glob | String | "**" | Glob pattern for matching markdown files |
include_bare_files | Boolean | true | Include files without frontmatter |
skip_gitignore | Boolean | false | Don’t read .gitignore patterns during scan |
frontmatter_format | String | "auto" | Which frontmatter format(s) to accept — see Frontmatter format |
When include_bare_files is true, files without frontmatter participate in inference (empty field set) and validation (can trigger MissingRequired). When false, they’re excluded from the scan entirely.
Frontmatter format
mdvs accepts YAML, TOML, and JSON frontmatter. The frontmatter_format field takes one of four values:
| Value | Behavior |
|---|---|
"auto" (default) | Detect per file from the opening delimiter. See the probe table below. |
"yaml" | Parse every file as YAML; reject +++ or {-opened files with a clear error. |
"toml" | Parse every file as TOML; reject --- or {-opened files. |
"json" | Parse every file as JSON; reject --- or +++-opened files. |
In auto mode (the default), mdvs reads the first non-empty line of each file to pick the engine:
| First non-empty line of a file | Format used |
|---|---|
--- | YAML |
+++ | TOML |
starts with { | JSON (Hugo convention — the braces are part of the JSON object) |
| anything else | treated as a bare file (no frontmatter) |
The probe is one line per file. A single vault can mix all three formats freely.
The forced modes ("yaml" / "toml" / "json") skip the probe and assume every scanned file uses that format. Files whose actual leading delimiter belongs to a different format produce a FrontmatterUnrepresentable error naming both the configured and detected formats. This is useful for opinionated repos (e.g., a Hugo site committed to TOML that wants mdvs check to fail loudly if someone slips in a --- file).
Naming note. frontmatter_format = "toml" controls how mdvs parses frontmatter in .md files. It has nothing to do with mdvs.toml itself — mdvs.toml is always TOML because it’s a config file. Two unrelated uses of “TOML” in the project.
[update]
Placeholder for future update-specific settings. Currently empty — this section is hidden from mdvs.toml by default.
[check]
Check command settings.
[check]
auto_update = true
| Field | Type | Default | Description |
|---|---|---|---|
auto_update | Boolean | false | Auto-run update before validating |
When auto_update is true, check runs the update pipeline (scan, infer, write config) before validating. Set to false or use --no-update for deterministic CI validation against the committed mdvs.toml.
[embedding_model]
Specifies the embedding model for semantic search. See Embedding for available models.
[embedding_model]
provider = "model2vec"
name = "minishlab/potion-base-8M"
| Field | Type | Default | Description |
|---|---|---|---|
provider | String | "model2vec" | Embedding provider (currently only "model2vec") |
name | String | "minishlab/potion-base-8M" | HuggingFace model ID |
revision | String | (none) | Pin to a specific HuggingFace commit SHA for reproducibility |
The provider field can be omitted — it defaults to "model2vec". The revision field only appears when explicitly set (e.g., via build --set-revision).
Changing the model or revision after a build requires build --force to re-embed all files.
[chunking]
Controls semantic text splitting before embedding.
[chunking]
max_chunk_size = 1024
| Field | Type | Default | Description |
|---|---|---|---|
max_chunk_size | Integer | 1024 | Maximum chunk size in characters |
The text splitter breaks each file’s body into semantic chunks respecting markdown structure (headings, paragraphs, lists). Changing the chunk size after a build requires build --force.
[build]
Build workflow settings.
[build]
auto_update = true
| Field | Type | Default | Description |
|---|---|---|---|
auto_update | Boolean | false | Auto-run update before building |
When auto_update is true, build runs the update pipeline before building. Use --no-update to skip.
[search]
Settings for the search command, including how internal columns are named in --where queries.
[search]
default_limit = 10
| Field | Type | Default | Description |
|---|---|---|---|
default_limit | Integer | 10 | Maximum results when --limit is not specified |
internal_prefix | String | "" | Prefix for internal column names in --where queries |
aliases | Map | {} | Per-column name overrides for internal columns |
auto_update | Boolean | false | Auto-run update before building (when auto_build is true) |
auto_build | Boolean | false | Auto-run build before searching |
Internal column names
Beyond your frontmatter fields, the search index stores bookkeeping columns that mdvs uses internally. These internal columns are available in --where queries:
| Column | Contains |
|---|---|
filepath | Relative file path (e.g., blog/post.md) |
file_id | Unique identifier for each file |
chunk_text | The plain-text body of each chunk (useful for --where "chunk_text LIKE '%foo%'") |
content_hash | Hash of the file body |
built_at | Timestamp of last build |
(Other columns — chunk_id, chunk_index, start_line, end_line, embedding — exist too but are rarely useful in --where.)
By default, these are referenced by their raw names:
--where "filepath LIKE 'blog/%'"
If a frontmatter field name collides with an internal column name (e.g., you have a field called filepath), the search command will error and suggest resolutions:
-
Set a prefix so internal columns are addressed with a leading marker in
--where:[search] internal_prefix = "_"Now
_filepath,_file_id, etc. refer to the internal columns in--whereclauses, leaving the barefilepathfree to mean your frontmatter field. (The on-disk column names don’t change — only how the--wheretranslator interprets them.) -
Set a per-column alias to rename just the colliding column in
--where:[search.aliases] filepath = "path"Now
pathrefers to the internalfilepathcolumn, and barefilepathrefers to your frontmatter field. -
Rename the frontmatter field in your markdown files.
Aliases take precedence over the prefix. See the Search Guide for full --where reference.
[fields]
Defines field constraints and the ignore list. This is the largest section — it contains one [[fields.field]] entry per constrained field.
Ignore list
[fields]
ignore = ["internal_id", "temp_notes"]
Fields in the ignore list are known but unconstrained — they skip all validation and are not reported as new fields by check or update. A field cannot be in both ignore and [[fields.field]].
Field definitions
Each [[fields.field]] entry defines constraints on a frontmatter field:
[[fields.field]]
name = "title"
type = "String"
allowed = ["blog/**", "projects/**"]
required = ["blog/**", "projects/**"]
nullable = false
| Field | Type | Default | Description |
|---|---|---|---|
name | String | (required) | Frontmatter key |
type | FieldType | "String" | Expected value type |
allowed | Array(String) | ["**"] | Glob patterns where the field may appear |
required | Array(String) | [] | Glob patterns where the field must be present |
nullable | Boolean | true | Whether null values are accepted |
constraints | Table | (absent) | Optional value constraints (see Constraints) |
preprocess | Array(String) | [] | Stage 2 value preprocessors — see Preprocessors |
All fields except name have permissive defaults. A minimal entry with just a name:
[[fields.field]]
name = "title"
is equivalent to:
[[fields.field]]
name = "title"
type = "String"
allowed = ["**"]
required = []
nullable = true
This is not the same as putting the field in the ignore list. Both prevent the field from being reported as new during update, but a [[fields.field]] entry tracks the field — it appears in info output with its type and patterns, and can be targeted by update reinfer. The ignore list simply silences the field: no validation, no detail in info.
Type syntax
Scalar types are plain strings:
type = "String" # also: "Boolean", "Integer", "Float", "Date", "DateTime"
Date and DateTime accept RFC 3339 values only (YYYY-MM-DD for Date, YYYY-MM-DDTHH:MM:SS[.frac]<Z|±HH:MM> for DateTime). See Date and DateTime for the exact accepted shapes and storage semantics.
Arrays use a function-style string:
type = "Array(String)"
Structured types are not supported on disk. Nested Objects in frontmatter are expressed via dotted-name leaf fields — see Types for the flattening rule. Arrays of structured items (Array(Object{...})) have no first-class representation in v0; use parallel scalar arrays as a workaround:
# Instead of an unsupported Array(Object{timestamp, value}):
[[fields.field]]
name = "measurement_timestamps"
type = "Array(String)"
[[fields.field]]
name = "measurement_values"
type = "Array(Float)"
The valid type grammar is:
Type := Scalar | Array(Scalar)
Scalar := String | Integer | Float | Boolean | Date | DateTime
See Types for the full type system, including widening rules.
Path patterns
allowed and required are lists of glob patterns matched against relative file paths:
allowed = ["blog/**", "projects/alpha/**"]
required = ["blog/published/**"]
Patterns must end with /* (direct children) or /** (full subtree), or be exactly * or **. Bare paths like blog or file names like blog/post.md are not valid.
The invariant required ⊆ allowed is enforced — every required glob must be covered by some allowed glob. For example, allowed = ["meetings/**"] covers required = ["meetings/all-hands/**"] because any path matching the required pattern also matches the allowed one.
See Schema Inference for how these patterns are computed.
Constraints
The optional [fields.field.constraints] sub-table adds value constraints beyond type checking.
categories — restricts values to an enumerated set (String, Integer, or arrays of either):
[[fields.field]]
name = "status"
type = "String"
[fields.field.constraints]
categories = ["active", "archived", "completed", "draft", "published"]
min / max — restricts numeric values to an inclusive range (Integer, Float, or arrays of either). Both bounds are optional:
[[fields.field]]
name = "rating"
type = "Integer"
[fields.field.constraints]
min = 1
max = 5
min_length / max_length — bounds string length (Unicode scalar count) or array length:
[[fields.field]]
name = "slug"
type = "String"
[fields.field.constraints]
min_length = 3
max_length = 64
pattern — regex applied to string values, compiled at config load:
[[fields.field]]
name = "version"
type = "String"
[fields.field.constraints]
pattern = '^v\d+\.\d+\.\d+$'
Categories are auto-inferred during init and update reinfer. Range constraints are not auto-inferred but can be inferred on demand with update reinfer <field> --with=range. Length and pattern are not auto-inferred — add them by hand. See Constraints for the full reference.
Preprocessors
The optional preprocess array on a field declares value transformations that run before validation. Two built-in stages:
| Stage | Applies to | Effect |
|---|---|---|
coerce-to-string | String, Array(String) | Serialize non-string JSON values to their JSON string form before validation |
widen-int-to-float | Float, Array(Float) | Treat integer values as their float equivalent |
[[fields.field]]
name = "priority"
type = "String"
preprocess = ["coerce-to-string"]
Preprocessors are auto-inferred during init and update reinfer based on observed type-widening events: a field that widened to String because of mixed-type observations gets coerce-to-string; a Float field that observed integers gets widen-int-to-float. An empty preprocess array means strict validation — no coercion.
Each entry must be applicable to the field’s type, and duplicates are rejected at config load. See Types & Widening for the full rules.
Inference thresholds
Two optional fields in [fields] control categorical auto-inference:
[fields]
max_categories = 10
min_category_repetition = 3
| Field | Type | Default | Description |
|---|---|---|---|
max_categories | Integer | 10 | Max distinct values for a field to be inferred as categorical |
min_category_repetition | Integer | 3 | Min average repetition (occurrences / distinct) for categorical inference |
These are hidden from mdvs.toml when set to their defaults. They only affect auto-inference — manually written categories are unaffected.
Example
A representative subset from example_kb/mdvs.toml (37 fields total, 4 shown):
[scan]
glob = "**"
include_bare_files = true
skip_gitignore = false
[embedding_model]
provider = "model2vec"
name = "minishlab/potion-base-8M"
[chunking]
max_chunk_size = 1024
[search]
default_limit = 10
[fields]
ignore = []
[[fields.field]]
name = "title"
type = "String"
allowed = ["blog/**", "meetings/**", "people/**", "projects/**", "reference/protocols/**"]
required = ["blog/**", "meetings/**", "people/**", "projects/**", "reference/protocols/**"]
nullable = false
[[fields.field]]
name = "tags"
type = "Array(String)"
allowed = ["blog/**", "projects/alpha/*", "projects/alpha/notes/**", "projects/archived/**", "projects/beta/*", "projects/beta/notes/**"]
required = ["blog/published/**", "projects/alpha/notes/**", "projects/archived/**", "projects/beta/notes/**"]
nullable = false
[[fields.field]]
name = "drift_rate"
type = "Float"
allowed = ["projects/alpha/notes/**"]
required = ["projects/alpha/notes/**"]
nullable = true
# Nested YAML (calibration.baseline.wavelength, etc.) is expressed as
# one [[fields.field]] per leaf — see Types.
[[fields.field]]
name = "calibration.baseline.wavelength"
type = "Float"
allowed = ["projects/alpha/notes/**"]
required = []
nullable = false