Sixteen years ago, in 2010, I published a paper on formalizing schemas for tree-structured data, like XML and JSON. My goal was to move beyond guesswork: could we mathematically decide if a schema change is backward-compatible? I wanted to see if we could prove schema equivalence, or reliably infer schemas from examples.
The research offered a solid theoretical framework, though it remained largely academic. For years, I hoped to build something practical from it, but the demands of a full-time job made finding the time to bridge that gap nearly impossible.
AI finally made that possible. Working with it in my spare time, I turned that research into Omnist: a functional, tested, open-source Python project, built in two weeks. This post covers what Omnist does. (How two weeks was even possible is its own post, coming next.)
What Omnist actually does
Omnist’s core idea is a single tree model underneath everything: a node is either a scalar value or an ordered list of labeled edges. JSON, YAML, TOML, and XML all map onto that same structure, which is what lets Omnist convert between them without a bespoke converter for every pair.
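To make that model concrete, here is a minimal sketch of such a tree in Python, with one possible JSON mapping. The names Node and from_json are illustrative, not Omnist’s actual API, and the array handling (repeated edges under the same label, with a placeholder "item" label for arrays that have no enclosing key) is an assumption on my part.

```python
import json
from dataclasses import dataclass
from typing import Optional, Tuple, Union

Scalar = Union[str, int, float, bool, None]


@dataclass(frozen=True)
class Node:
    """Either a scalar leaf (edges is None) or an ordered list of labeled edges."""
    scalar: Scalar = None
    edges: Optional[Tuple[Tuple[str, "Node"], ...]] = None

    @property
    def is_leaf(self) -> bool:
        return self.edges is None


def from_json(value) -> Node:
    """Map a parsed JSON value onto the tree model (illustrative mapping only)."""
    if isinstance(value, dict):
        edges = []
        for key, child in value.items():
            # Assumption: an array under a key becomes repeated edges with that label.
            items = child if isinstance(child, list) else [child]
            edges.extend((key, from_json(item)) for item in items)
        return Node(edges=tuple(edges))
    if isinstance(value, list):
        # Assumption: an array with no enclosing key gets a placeholder label.
        return Node(edges=tuple(("item", from_json(item)) for item in value))
    return Node(scalar=value)


tree = from_json(json.loads('{"host": "a", "tags": ["x", "y"]}'))
labels = [label for label, _ in tree.edges]
# labels == ["host", "tags", "tags"]
```

Once every format reduces to this one shape, a converter only has to be written twice per format (into the tree and out of it) rather than once per pair.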
Omnist also has its own native data and schema languages built on that model. The Omnist Markup Language (OML) is the native data format — the one with zero adjustments needed to map onto the underlying tree. The Omnist Schema Definition (OSD) is the text syntax for defining schemas: named record types with closed fields, each given a cardinality range, where a field’s type is either a fixed scalar or a reference to another record.
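As a rough illustration of that schema shape, the sketch below models records as plain Python data and validates a JSON-like document against them, reporting each problem with its path. Everything here (Field, RECORDS, validate, the "$" path root, and the OSD-style comment for a record reference) is hypothetical and written for this post; it is not Omnist’s implementation or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    type: str      # a scalar name ("string", "integer") or the name of a record
    min: int = 1   # cardinality lower bound
    max: int = 1   # cardinality upper bound


SCALARS = {"string": str, "integer": int}

# Hypothetical in-memory form of something like:
#   record Server { "host": string, "port" [0,1]: integer }
#   record Config { "server" [1,3]: Server }
RECORDS = {
    "Server": {"host": Field("string"), "port": Field("integer", 0, 1)},
    "Config": {"server": Field("Server", 1, 3)},
}


def validate(value, type_name, path="$"):
    """Return a list of 'path: message' strings; an empty list means valid."""
    if type_name in SCALARS:
        ok = isinstance(value, SCALARS[type_name]) and not isinstance(value, bool)
        return [] if ok else [f"{path}: expected {type_name}"]
    if not isinstance(value, dict):
        return [f"{path}: expected record {type_name}"]
    fields = RECORDS[type_name]
    # Closed records: any key the record does not declare is an error.
    errors = [f"{path}.{key}: unknown field"
              for key in sorted(value.keys() - fields.keys())]
    for name, field in fields.items():
        raw = value.get(name, [])
        items = raw if isinstance(raw, list) else [raw]
        if not field.min <= len(items) <= field.max:
            errors.append(f"{path}.{name}: expected {field.min}..{field.max} "
                          f"values, got {len(items)}")
        for i, item in enumerate(items):
            child = f"{path}.{name}[{i}]" if isinstance(raw, list) else f"{path}.{name}"
            errors.extend(validate(item, field.type, child))
    return errors


doc = {"server": [{"host": "a", "port": "80"}, {"hostname": "b"}]}
errors = validate(doc, "Config")
# errors == [
#     "$.server[0].port: expected integer",
#     "$.server[1].hostname: unknown field",
#     "$.server[1].host: expected 1..1 values, got 0",
# ]
```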
Simplicity is the design philosophy behind that schema model. Every field has exactly one type — no unions, no enums, no open-ended “any.” That’s a real constraint, but it’s a deliberate trade: it’s what keeps compatible_with and equivalent actual computations instead of heuristics. Once a field can be “this type, or that, or null,” there’s ambiguity left over for compatibility checking to get stuck on. Omnist’s model doesn’t leave any.
That foundation gives you:
- Convert between any of those formats, ingesting and exporting freely without hand-writing a converter for each pair, or write a plugin to add a new format.
- Validate a document against a schema, with exact, path-based error reporting instead of a vague “something is wrong somewhere.”
- Schema-directed deserialization: read an untyped text format into typed data, upgrading values like ISO date strings into real date/time/datetime types whenever the conversion is value-exact.
- Infer a schema automatically from example documents.
- Ask equivalent(a, b) and compatible_with(v1, v2): do two schemas accept exactly the same set of documents, or is one backward-compatible with the other?
- Normalize a schema down to its minimal equivalent form.
All of it works as a CLI and as a Python library, so it drops into a CI pipeline or a script equally well.
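One way to read “value-exact” in the deserialization bullet above is as a round-trip test: upgrade a string only if formatting the parsed value gives back the original text. The sketch below shows that idea with the standard library alone. upgrade_scalar is a hypothetical helper, and Omnist’s actual rule may be stricter or looser.

```python
from datetime import date, datetime, time


def upgrade_scalar(text: str):
    """Return a date, datetime, or time if `text` round-trips exactly, else `text`."""
    for parse in (date.fromisoformat, datetime.fromisoformat, time.fromisoformat):
        try:
            value = parse(text)
        except ValueError:
            continue  # not this kind of value; try the next parser
        if value.isoformat() == text:
            return value  # nothing was lost or reinterpreted in the conversion
    return text  # leave anything ambiguous or lossy as a plain string


upgrade_scalar("2024-01-31")           # date(2024, 1, 31)
upgrade_scalar("2024-01-31T12:30:00")  # datetime(2024, 1, 31, 12, 30)
upgrade_scalar("2024-02-30")           # stays "2024-02-30": not a real date
```

The round-trip check means the original text can always be reproduced from the typed value, so the upgrade never silently changes what the document says.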
Omnist is built for software engineers, not researchers — it’s meant to answer practical questions you run into while shipping software, not to explore schema theory for its own sake. A few places it fits:
- CI gate for config and API schema changes. Run compatible_with in a pipeline step so a backward-incompatible schema change fails the build instead of breaking a downstream consumer in production.
- Migrating a config format. Moving a project from YAML to TOML, or vice versa, without writing and maintaining a one-off converter script.
- Onboarding a legacy data source. Infer a schema from a pile of existing JSON examples when no one wrote one down, then validate new data against it going forward.
- Type-safe ingestion in a data pipeline. Deserialize untyped JSON or YAML straight into typed structures, with errors that point at the exact path that’s wrong instead of a generic parse failure.
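For the legacy-data case, the core of inference can be sketched for flat records: look at how often each field occurs across the examples and which scalar type it carries. This is a toy version under strong assumptions (no nested records, no nulls, and the type names "float" and "boolean" are my own), and infer_fields is a hypothetical function, not Omnist’s API.

```python
def infer_fields(examples):
    """Infer {field: (type_name, min_count, max_count)} from flat example records."""
    type_names = {str: "string", int: "integer", float: "float", bool: "boolean"}
    names = sorted({name for doc in examples for name in doc})
    schema = {}
    for name in names:
        counts, seen_types = [], set()
        for doc in examples:
            raw = doc.get(name, [])                 # a missing field occurs 0 times
            items = raw if isinstance(raw, list) else [raw]
            counts.append(len(items))
            seen_types.update(type_names[type(item)] for item in items)
        if len(seen_types) != 1:
            # One type per field: conflicting examples cannot be expressed.
            raise ValueError(f"{name}: cannot pick one type from {sorted(seen_types)}")
        schema[name] = (seen_types.pop(), min(counts), max(counts))
    return schema


examples = [
    {"host": "a", "port": 80},
    {"host": "b"},
    {"host": "c", "tags": ["x", "y"]},
]
inferred = infer_fields(examples)
# inferred == {
#     "host": ("string", 1, 1),
#     "port": ("integer", 0, 1),
#     "tags": ("string", 0, 2),
# }
```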
Compatibility, in practice
A config schema evolves — you add an optional field. Is that safe for everyone still running the old schema?
v1 = parse_schema('record R { "host": string }\nroot R')
v2 = parse_schema('record R { "host": string, "port" [0,1]: integer }\nroot R')
v1.compatible_with(v2) # True — every v1 document is still valid under v2
v2.compatible_with(v1) # False — a v2 document with a port isn't valid under v1
That’s the whole feature, in four lines. No example corpus, no fixtures, no “looks fine to me” — an actual, decidable answer, the same way a type checker gives you an actual answer instead of a guess.
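To show why one type per field makes this a computation rather than a heuristic, here is a sketch of the check for a heavily simplified model: a single flat record whose fields are (type, min, max) triples, with different scalar types assumed to share no values. The function names mirror the post, but the code is my own illustration and not Omnist’s algorithm, which also has to handle references between records.

```python
ABSENT = (None, 0, 0)  # a field the record does not declare: exactly zero occurrences


def compatible_with(old, new):
    """True if every document valid under `old` is also valid under `new`."""
    for name in old.keys() | new.keys():
        old_type, old_min, old_max = old.get(name, ABSENT)
        new_type, new_min, new_max = new.get(name, ABSENT)
        if not (new_min <= old_min and old_max <= new_max):
            return False  # old allows an occurrence count that new rejects
        if old_max > 0 and old_type != new_type:
            return False  # old allows values that new would reject as mistyped
    return True


def equivalent(a, b):
    """Two schemas are equivalent when each accepts everything the other does."""
    return compatible_with(a, b) and compatible_with(b, a)


v1 = {"host": ("string", 1, 1)}
v2 = {"host": ("string", 1, 1), "port": ("integer", 0, 1)}

compatible_with(v1, v2)  # True
compatible_with(v2, v1)  # False
```

Because each field is an independent range check, the answer never depends on sampling documents. A field declared with range [0,0] is also equivalent to not declaring it at all, which is the kind of redundancy a normal form can remove.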
Try it
pip install omnist gets you the library and the omnist CLI. Head to omnist.dev to get started:
- Quickstart — up and running in five minutes
- Real-life example — a worked end-to-end walkthrough
- API reference — full Python API docs
- CLI docs — all commands and flags
The source is at github.com/omnist-dev/omnist.
If you try it, I’d love to hear what you think — what worked, what didn’t, or what you wish it could do. Find me on X (@lee_tom) or LinkedIn.
Next up
How a sixteen-year-old paper became a working package in two weeks — the workflow that did most of the work, and the line between what a human has to decide and what an AI agent can be left alone to execute.