Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Types & Widening

mdvs infers a type for every frontmatter field it encounters. When the same field appears with different types across files, mdvs resolves the conflict automatically through type widening.

The supported types

TypeYAML exampleexample_kb field
Booleandraft: falsedraft in blog posts
Integersample_count: 24sample_count in experiments
Floatdrift_rate: 0.023drift_rate in experiments
Stringauthor: Giulia Ferrettiauthor across many files
Datejoined: 2023-02-01joined, date, commission_date, last_reviewed
DateTimesynced_at: "2024-04-02T16:14:30+02:00"synced_at in experiments
Array(Scalar)tags: [calibration, SPR-A1]tags in projects and blog

The on-disk type grammar is tight:

Type   := Scalar | Array(Scalar)
Scalar := String | Integer | Float | Boolean | Date | DateTime

Array(Array(...)) and Array(Object{...}) are not representable on disk — see Arrays of structured items below for the workaround.

Date and DateTime are described in detail in Date and DateTime below.

Nested Objects in YAML are expressed as dotted-name leaf fields in mdvs.toml. A frontmatter shape like:

calibration:
  baseline:
    wavelength: 632.8
    intensity: 0.95
  adjusted:
    wavelength: 633.1
    intensity: 0.97

infers as five separate leaf fields, one per nested path:

  • calibration.baseline.wavelength → Float
  • calibration.baseline.intensity → Float
  • calibration.adjusted.wavelength → Float
  • calibration.adjusted.intensity → Float

Each leaf gets its own nullability and allowed/required glob set. This avoids the readability and per-leaf-validation problems of monolithic Object types. Top-level Object types are not supported in mdvs.toml, and neither are Objects nested inside Array fields — see Arrays of structured items below.

Arrays of structured items

A YAML field like:

measurements:
  - timestamp: "14:02:11"
    value: 0.612
  - timestamp: "14:03:00"
    value: 0.598

has no first-class representation on disk in v0. Inference detects the Array(Object{...}) shape, skips the field, and emits a warning to stderr:

warning: skipped field 'measurements' — Array(Object{...}) isn't representable on disk.
  Consider parallel scalar arrays (see TODO-0156). (first observed in projects/alpha/notes/experiment-2.md)

The recommended workaround is parallel scalar arrays — one field per element-leaf. Replace the YAML above with:

measurement_timestamps: ["14:02:11", "14:03:00"]
measurement_values: [0.612, 0.598]

and the corresponding mdvs.toml:

[[fields.field]]
name = "measurement_timestamps"
type = "Array(String)"

[[fields.field]]
name = "measurement_values"
type = "Array(Float)"

The downside is the loss of per-element grouping — there’s no schema-level guarantee that measurement_timestamps[3] and measurement_values[3] belong to the same record. A first-class Array-of-structured-item representation is tracked in TODO-0156.

Date and DateTime

Both types use RFC 3339 as the canonical wire format — a strict subset of ISO 8601 designed for machine interoperability.

Date — calendar date, no time

Date   = YYYY-MM-DD

Rules:

  • Exactly 4-digit year, 2-digit month, 2-digit day (no 2024-1-1 shorthand).
  • Hyphen separators.
  • Calendar-valid: 2024-13-01 (month 13) and 2024-02-30 (no Feb 30th) are rejected.
  • No time component, no timezone.

Accepted:

2023-02-01
1990-05-12
2024-02-29        ← valid leap-year date

Rejected:

2024-1-1          ← single-digit components not allowed
2024-13-01        ← month must be 01-12
"see 2024-01-15"  ← must be the whole string
2024/01/15        ← only hyphens

Stored as Arrow Date32 (days since 1970-01-01). Native date arithmetic works in --where queries — e.g. WHERE date > '2024-01-01', WHERE date_part('year', published) = 2024, WHERE date BETWEEN '2024-06-01' AND '2024-06-30'. See Date and DateTime in –where queries for worked examples including EXTRACT, INTERVAL, date subtraction, and compound filters.

DateTime — date + time, mandatory timezone

DateTime = YYYY-MM-DDTHH:MM:SS[.frac]<tz>

<tz>     = 'Z'                        ← UTC shorthand
         | '+HH:MM'                   ← positive offset
         | '-HH:MM'                   ← negative offset

Rules:

  • Date part: same as Date above.
  • T separator between date and time is mandatory — no space alternative.
  • HH:MM:SS (24-hour, all two digits). Seconds are required.
  • Fractional seconds optional, any number of digits.
  • Timezone is mandatory — naive 2024-01-15T14:30:00 is rejected (not valid RFC 3339).

Accepted:

2024-01-15T14:30:00Z              ← Zulu = UTC
2024-01-15T14:30:00+00:00         ← same moment, explicit offset
2024-04-02T16:14:30+02:00         ← positive offset
2024-01-15T14:30:00-08:00         ← negative offset
2024-01-15T14:30:00.123Z          ← fractional seconds
2024-01-15T14:30:00.123456789Z    ← nanosecond precision

Rejected:

2024-01-15T14:30:00               ← no timezone
2024-01-15 14:30:00Z              ← space instead of T
2024-01-15T14:30                  ← seconds required
2024-13-01T14:30:00Z              ← invalid month
2024-01-15T25:30:00Z              ← invalid hour

Stored as Arrow Timestamp(Millisecond, "UTC"). Offsets are normalized to UTC at storage time — 2024-04-02T16:14:30+02:00 and 2024-04-02T14:14:30Z are the same absolute moment and store identically. The original offset is intentionally not preserved.

example_kb demonstration

Both types are auto-inferred in the example vault:

FieldTypeFiles
joinedDatepeople/**
dateDatemeetings + blog/published
commission_dateDatepeople/*
last_reviewedDatereference/protocols/**
synced_atDateTimeexperiment-1.md uses Z, experiment-2.md uses +02:00

No manual configuration was needed for any of these — inference detects the RFC 3339 shape and assigns the appropriate type. See Type widening in practice below for the inference rule.

Validation

JSON Schema’s format: date and format: date-time keywords validate values at check time. Bad shapes (invalid calendar dates, missing timezones, wrong separators) produce WrongType violations with a rule like format date or format date-time.

Constraints

  • categories applies (e.g. categories = ["2024-01-01", "2024-12-31"] on a Date field; values are strings, the runtime format validator catches malformed entries).
  • pattern, min, max, min_length, max_length do not apply — the type’s format is itself the pattern. Bounded date ranges (e.g. “published in 2024”) are tracked as a future feature.

Preprocessors

No preprocessor applies to Date or DateTime in v1. Unlike String (which can opt in to coerce-to-string) or Float (which can opt in to widen-int-to-float), date types are strict — either the string parses as RFC 3339 or it doesn’t.

Type hierarchy

When two values have different types, mdvs widens to a common type. The hierarchy looks like this:

graph BT
    Integer --> Float
    Float --> String
    Boolean --> String
    Date --> String
    DateTime --> String
    Array["Array(T)"] --> String

Each arrow means “widens to.” String is the top type — every type eventually reaches it.

The one special case is Integer → Float: integers widen to floats (not directly to String) because the conversion is lossless. Date and DateTime have no internal cross-promotion — mixed Date + DateTime observations widen to String (the two shapes are disjoint).

Two same-category combinations widen internally instead of jumping to String:

  • Array + Array — element types are widened recursively (e.g., Array(Integer) + Array(String)Array(String))
  • Object + Object — at the leaf level: each dotted path’s type is widened independently across files. A file with cal.wave = 850 (Integer) and another with cal.wave = 632.8 (Float) yields cal.wave: Float. New leaf paths in some files are added to the schema; leaves absent from some files affect nullability/required-globs naturally.

Everything else (Boolean + any other type, Array + scalar, Object + scalar) widens to String. The one exception is Array containing ObjectArray(Object{...}) isn’t representable on disk, so inference drops the field with a warning instead of widening to String (see Arrays of structured items).

Type widening in practice

When mdvs scans your files and the same field has different types, it picks the least upper bound — the most specific type that covers all observed values.

Integer + Float → Float

In example_kb, the wavelength_nm field appears in three experiment notes:

# experiment-1.md
wavelength_nm: 850       # Integer

# experiment-2.md
wavelength_nm: 632.8     # Float

# experiment-3.md
wavelength_nm: 780.0     # Float

Result: wavelength_nm is inferred as Float. The integer 850 is safely represented as a float.

Integer + String → String

The priority field uses numbers in one project and text in another:

# projects/alpha/overview.md
priority: 1              # Integer

# projects/beta/overview.md
priority: high           # String

Result: priority is inferred as String. There’s no numeric type that can hold "high", so mdvs widens to String.

Boolean + any non-Boolean → String

If the same field is true in one file and 3 in another, there’s no numeric or boolean type that can hold both. The result is String.

This doesn’t happen in example_kb because booleans (draft) are used consistently — but it’s a common mistake in organically grown vaults where someone writes draft: yes (String) instead of draft: true (Boolean).

Date and DateTime inference

A string is inferred as Date or DateTime when every observation across all files matches the RFC 3339 shape AND parses as a real value. A single non-matching value downgrades the whole field to String.

Pure-date observations across files:

# people/alice.md
joined: 2023-02-01

# people/bob.md
joined: 2024-09-15

Result: joined is inferred as Date.

One non-date value forces String:

# people/alice.md
joined: 2023-02-01

# people/carol.md
joined: "see HR records"        # not a date

Result: joined widens to String — the second observation can’t be typed as Date, and Date + String → String is the widening rule.

Same logic for invalid calendar dates:

# fileA.md
published: 2024-06-01

# fileB.md
published: 2024-13-01           # invalid month — typed String per-value

Result: published widens to String. The typo gets silently absorbed into String typing; the user only catches it via a WrongType violation if they manually set type = "Date" in mdvs.toml.

Date + DateTime are cross-shape — never auto-promote:

# meeting/a.md
when: 2024-01-15                # Date

# meeting/b.md
when: 2024-01-15T14:30:00Z      # DateTime

Result: when widens to String. Pick one shape consistently to get a typed field.

Array element widening

The tags field is a string array in most files, but one file accidentally used integers:

# projects/alpha/overview.md
tags:
  - biosensor
  - metamaterial          # Array(String)

# projects/beta/notes/replication.md
tags:
  - 1
  - 2
  - 3                     # Array(Integer)

Result: tags is inferred as Array(String). The array element types (String vs Integer) are widened to String, giving Array(String).

Object leaf merging (dotted-name flattening)

When two files have nested keys at the same paths, each leaf is inferred independently. New leaves seen in one file but not another are added to the schema; their required glob naturally narrows to just the files that contain them.

In example_kb, the calibration object appears in two experiment files with different structures:

# experiment-1.md (simpler calibration, integer values)
calibration:
  baseline:
    wavelength: 850            # Integer
    intensity: 1               # Integer
    notes: "initial reference" # only in this file

# experiment-2.md (full calibration, float values)
calibration:
  baseline:
    wavelength: 632.8          # Float
    intensity: 0.95            # Float
  adjusted:                    # only in this file
    wavelength: 633.1
    intensity: 0.97

Result: five dotted-name leaf fields are inferred in mdvs.toml:

[[fields.field]]
name = "calibration.adjusted.intensity"
type = "Float"

[[fields.field]]
name = "calibration.adjusted.wavelength"
type = "Float"

[[fields.field]]
name = "calibration.baseline.intensity"
type = "Float"
preprocess = ["widen-int-to-float"]   # Integer + Float mix → opted in

[[fields.field]]
name = "calibration.baseline.notes"
type = "String"

[[fields.field]]
name = "calibration.baseline.wavelength"
type = "Float"
preprocess = ["widen-int-to-float"]

What happened:

  • calibration.baseline.wavelength seen as both Integer (850) and Float (632.8) → widened to Float with widen-int-to-float preprocessor recording the mix
  • calibration.baseline.intensity similar: Integer (1) + Float (0.95) → Float with the preprocessor
  • calibration.baseline.notes only in experiment-1 → still inferred as String (with a required glob narrowed to just the files that have it)
  • calibration.adjusted.* only in experiment-2 → inferred from that file alone

The user-facing schema is flat, but its semantics still match the YAML’s nested shape. Validation, storage, and --where queries all operate on the natural nested structure — the dotted-name form is purely a mdvs.toml UX choice.

The full widening matrix

Every possible combination of types and its result:

BooleanIntegerFloatStringDateDateTimeArrayObject
BooleanBooleanStringStringStringStringStringStringString
IntegerStringIntegerFloatStringStringStringStringString
FloatStringFloatFloatStringStringStringStringString
StringStringStringStringStringStringStringStringString
DateStringStringStringStringDateStringStringString
DateTimeStringStringStringStringStringDateTimeStringString
ArrayStringStringStringStringStringStringArray*dropped**
ObjectStringStringStringStringStringStringdropped**Object*

* Array + Array: element types are widened recursively.

* Object + Object: not a top-level on-disk type. Nested Objects in YAML flatten to dotted-name leaves before widening; each leaf path is widened independently.

** Inference observed Array(Object{…}) — not representable on disk in v0. The field is dropped from the schema and a warning is emitted (see Arrays of structured items).

Date and DateTime are cross-shape — they never auto-promote into each other. The single non-trivial pair is Date + DateTime → String.

The matrix is symmetric — widen(A, B) always equals widen(B, A).

Nullable

Separately from the type, mdvs tracks whether null was observed for a field. This is shown as a ? suffix in output — e.g., Float? means “Float, but sometimes null.”

How it works

In example_kb, the drift_rate field is Float in two experiment files but null in a third:

# experiment-1.md
drift_rate: 0.023        # Float

# experiment-2.md
drift_rate: null          # sensor malfunction — Giulia discarded the data

# experiment-3.md
drift_rate: 0.012         # Float

Result: drift_rate is inferred as Float? — the type is Float (null doesn’t affect the type), and nullable is set to true.

Null-only fields

If the only value ever observed is null, the type defaults to String:

# blog/drafts/grant-ideas.md
review_score: null        # no real values seen

Result: review_score is inferred as String?.

Key rules

  • Null is transparent in widening — it doesn’t affect the inferred type
  • Null-only fields default to String (the safest fallback)
  • nullable is a separate boolean, not part of the type itself
  • In validation: null values skip type checks, but a non-nullable required field with a null value triggers a NullNotAllowed violation (see Validation)

Widening and preprocessors

Widening picks the type. Preprocessors are how the schema declares what coercions were needed to get there. Inference auto-populates them — you rarely write them by hand.

When inference observes a field as a mix of types (some files have priority: 1, others priority: high), it widens to String and writes:

[[fields.field]]
name = "priority"
type = "String"
preprocess = ["coerce-to-string"]

The coerce-to-string entry tells validation: “before checking this value is a string, serialize whatever you find to its JSON representation.” Without it, the field is strict — integers and booleans fail validation.

Same for Float: a mix of 5 and 5.0 widens to Float with preprocess = ["widen-int-to-float"]. Without it, integers fail the float check.

The two built-in Stage 2 preprocessors:

PreprocessorApplies toEffect
coerce-to-stringString, Array(String)Serialize non-strings to their JSON string representation before validation
widen-int-to-floatFloat, Array(Float)Treat integer values as their float equivalent

preprocess = [] means strict. If you delete a preprocessor from mdvs.toml, the field rejects values that would have been coerced. Conversely, you can hand-add a preprocessor to a strict-inferred field if you want to accept type variation.

No preprocessor applies to Date or DateTime. Those types are strict by design — values either parse as RFC 3339 or they don’t. There is no parse-loose-date opt-in; non-ISO formats fall back to String (and the user can add a pattern constraint if they want a custom shape).

In storage — when validation accepts a coerced value, the coerced form is what gets stored. A priority: 1 value with coerce-to-string becomes "1" in the search index. No data is silently dropped.

Re-run mdvs update reinfer <field> to refresh both the inferred type and the inferred preprocessors after editing source files.

Edge cases

  • Empty arrays [] default to Array(String) — if real values are added later, the field must be re-inferred with mdvs update reinfer <field> to pick up the new element type
  • Empty frontmatter (--- followed immediately by ---) is a file with zero fields — not a bare file. It still counts as “having frontmatter” for inference purposes.
  • Bare files (no --- fences at all) are handled differently — see Schema Inference