The Promise vs. The Reality
Zhamak Dehghani's Data Mesh thesis is compelling: treat data as a product, give domain teams ownership of their analytical data, and let a self-serve platform handle the infrastructure. The four principles are elegant:
Domain Ownership
The team that produces the data owns its quality, schema, and SLAs.
Data as a Product
Domain datasets are discoverable, addressable, trustworthy, and self-describing.
Self-Serve Platform
Infrastructure is abstracted so domain teams can publish data without DevOps bottlenecks.
Federated Governance
Global standards (naming, quality, interoperability) are enforced computationally, not by committee.
The theory is sound. The practice? Most data mesh adoptions stall at principle two. Teams build domain-owned pipelines but have no enforceable interface between them. Each domain defines quality differently. Schema changes propagate by Slack message. "Governance" is a wiki page nobody reads.
Data mesh without data contracts is just decentralized chaos with better branding.
The Missing Layer: Data Contracts
A data contract is the enforceable interface between a data producer domain and its consumers. It's not a schema — it's a schema plus quality rules, lineage tracking, quarantine policy, and engine portability. It defines not just the shape of the data, but its meaning and guarantees.
In a data mesh, data contracts solve three critical gaps:
- Inter-domain trust. The CRM domain publishes a contract for its orders data. The Finance domain consumes it for revenue reporting. If CRM breaks its contract, it's caught at the source — not six layers downstream.
- Federated governance, computationally enforced. Global rules (no PII in the silver layer, all monetary fields are non-negative, all timestamps are UTC) are encoded in YAML and validated automatically — not documented in a governance PDF.
- Engine portability. Domain teams choose their own engine. The CRM team uses Spark. The HR team uses Polars. The Finance team uses DuckDB. One data contract runs on all three — no logic drift.
A Practical Example: Enterprise Domains as a Data Mesh
Let's make this concrete. In a real enterprise, data domains map to business functions — not individual tables. CRM owns orders, customers, and products. Finance owns invoices and revenue. HR owns employees and payroll. Sales owns pipeline and forecasts. Each domain team owns their contracts and publishes validated data to a shared lakehouse.
┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│ CRM Domain │ │ Finance Domain │ │ HR Domain │ │ Sales Domain │
│ │ │ │ │ │ │ │
│ orders.yaml │ │ invoices.yaml │ │ employees.yaml │ │ pipeline.yaml │
│ customers.yaml │ │ revenue.yaml │ │ payroll.yaml │ │ forecasts.yaml │
│ products.yaml │ │ accounts.yaml │ │ benefits.yaml │ │ leads.yaml │
│ ┌────────────┐ │ │ ┌────────────┐ │ │ ┌────────────┐ │ │ ┌────────────┐ │
│ │ Schema │ │ │ │ Schema │ │ │ │ Schema │ │ │ │ Schema │ │
│ │ Quality │ │ │ │ Quality │ │ │ │ Quality │ │ │ │ Quality │ │
│ │ Quarantine │ │ │ │ Quarantine │ │ │ │ Quarantine │ │ │ │ Quarantine │ │
│ │ Lineage │ │ │ │ Lineage │ │ │ │ Lineage │ │ │ │ Lineage │ │
│ └────────────┘ │ │ └────────────┘ │ │ └────────────┘ │ │ └────────────┘ │
└────────┬─────────┘ └────────┬─────────┘ └────────┬─────────┘ └────────┬─────────┘
│ │ │ │
└─────────────────────┼─────────────────────┼─────────────────────┘
▼ ▼
┌──────────────────────────────────────────────┐
│ Gold Layer │
│ revenue_analytics.yaml (CRM × Finance) │
│ workforce_costs.yaml (HR × Finance) │
│ sales_forecast.yaml (Sales × CRM) │
└──────────────────────────────────────────────┘
The CRM Domain: Orders Contract
The CRM team owns this contract. It defines what "a valid order" means — globally, computationally, and unambiguously. Note the info section: the domain and system fields classify where this data lives in the mesh, giving lineage and governance tools the context they need to trace data across domain boundaries.
```yaml
# CRM domain — owned by the CRM team
info:
  version: 1
  name: "orders_silver"
  domain: "crm"
  system: "shopify"
  owner: "crm-engineering@company.com"

model:
  fields:
    - name: order_id
      type: string
    - name: customer_id
      type: string
    - name: order_amount
      type: float
    - name: order_status
      type: string
    - name: order_date
      type: date

quality:
  row_rules:
    - sql: "order_amount >= 0"
    - sql: "order_status IN ('pending', 'shipped', 'delivered', 'returned')"
    - sql: "order_id IS NOT NULL"
    - sql: "customer_id IS NOT NULL"

quarantine:
  enabled: true
  target: "quarantine/crm/orders"
  include_error_reason: true

lineage:
  enabled: true
```
The info.domain and info.system fields are the key to making data mesh lineage
work. When the Finance team's revenue_analytics gold contract joins CRM orders with Finance
invoices, the lineage trace shows exactly which domain and source system produced each input —
not just which file.
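To make that concrete, here is a sketch of the kind of lineage record such a trace implies. The field names (`inputs`, `layer`) and the `netsuite` system name are illustrative assumptions, not LakeLogic's documented output format:

```python
# Illustrative only: a lineage record shaped like the trace described above.
# Field names and the "netsuite" system are hypothetical.
lineage_record = {
    "contract": "revenue_analytics",
    "layer": "gold",
    "inputs": [
        {"name": "orders_silver", "domain": "crm", "system": "shopify"},
        {"name": "invoices_silver", "domain": "finance", "system": "netsuite"},
    ],
}

def upstream_domains(record: dict) -> set:
    """Which domains does this gold dataset depend on?"""
    return {inp["domain"] for inp in record["inputs"]}

print(upstream_domains(lineage_record))
```

Because every input carries its domain and system, a gold-layer anomaly can be routed to the owning team mechanically — no tribal knowledge required.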
Running It — Each Domain, Any Engine
Here's the key insight for data mesh: each domain team chooses their own engine, but the contract format is identical. The CRM team uses Spark because they process 200M rows/day. The HR team uses Polars because their datasets are smaller. The governance is the same for both:
```python
from lakelogic import DataProcessor

# CRM domain — runs on Spark (200M orders/day)
orders = DataProcessor("contracts/crm/orders.yaml", engine="spark").run_source()

# Finance domain — runs on DuckDB (10M invoices, single node)
invoices = DataProcessor("contracts/finance/invoices.yaml", engine="duckdb").run_source()

# HR domain — runs on Polars (50K employees, no cluster needed)
employees = DataProcessor("contracts/hr/employees.yaml").run_source()

# Sales domain — runs on Polars (pipeline is lean)
pipeline = DataProcessor("contracts/sales/pipeline.yaml").run_source()

# Same contract format. Same quality rules. Different engines.
# Each domain owns its contracts. Zero logic drift.
```
Mapping Data Contracts to the Four Principles
Here's how data contracts fill the gap in each of the four data mesh principles:
| Data Mesh Principle | Without Contracts | With Data Contracts |
|---|---|---|
| Domain Ownership | Teams own pipelines but quality is undefined | Each domain owns a YAML contract — the definition is the code |
| Data as a Product | "Trustworthy" means "I think it's fine" | Trust is computed: pass rate, quarantine count, lineage trace |
| Self-Serve Platform | Platform team writes validation for every domain | Domain teams self-serve: write YAML, pick an engine, run_source() |
| Federated Governance | Wiki page that nobody reads | Global rules in YAML, enforced computationally on every run |
Federated Governance That Actually Works
The hardest data mesh principle is federated governance. How do you enforce global standards when each domain owns its own pipeline? Most teams try org charts and review committees. That doesn't scale.
With data contracts, governance becomes code, not process:
- Global quality rules — "All monetary fields must be >= 0" is a row rule in every contract. It's enforced at compute time, not review time.
- Quarantine policy — Bad rows are captured with error reasons, not silently dropped or allowed through. The governance team can audit quarantine tables across all domains.
- Lineage tracking — Every contract run produces a lineage record. You can trace any gold-layer anomaly back to the exact domain, contract version, and source file that produced it.
- Schema evolution — When the orders team adds a field, the contract catches downstream breakage before deployment, not after.
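The first point — global rules as code — can itself be checked in CI. The sketch below is a standalone illustration (not a LakeLogic feature): it scans a parsed contract for monetary fields and fails if the global `>= 0` rule is missing. The contract layout mirrors the orders example above; `MONETARY_FIELDS` is a hypothetical governance registry.

```python
# Hypothetical governance check: every contract that declares a monetary
# field must carry the global ">= 0" row rule. Illustrative sketch only.
MONETARY_FIELDS = {"order_amount", "invoice_amount", "salary"}

def missing_monetary_rules(contract: dict) -> list:
    """Return monetary fields that lack the global non-negativity rule."""
    fields = {f["name"] for f in contract["model"]["fields"]}
    rules = {r["sql"] for r in contract.get("quality", {}).get("row_rules", [])}
    return [
        f for f in sorted(fields & MONETARY_FIELDS)
        if f"{f} >= 0" not in rules
    ]

# A compliant contract (mirrors the orders example) and a non-compliant one.
orders = {
    "model": {"fields": [{"name": "order_id"}, {"name": "order_amount"}]},
    "quality": {"row_rules": [{"sql": "order_amount >= 0"}]},
}
bad = {
    "model": {"fields": [{"name": "invoice_id"}, {"name": "invoice_amount"}]},
    "quality": {"row_rules": []},
}

print(missing_monetary_rules(orders))  # []
print(missing_monetary_rules(bad))     # ['invoice_amount']
```

Run a check like this against every contract in the repo on each pull request, and federated governance becomes a failing build instead of a meeting.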
The key insight
Federated governance doesn't mean "no governance." It means governance that travels with the data. The contract is the governance layer — it moves through Bronze, Silver, and Gold alongside the data it protects. No separate enforcement step. No manual review gate. The rules are the pipeline.
Data Mesh vs. Data Lake: It's Not Either/Or
A common misconception: data mesh replaces the data lake. It doesn't. Data mesh is an organizational architecture. The data lake (or lakehouse) is the storage architecture. You need both.
In practice, a data mesh runs on top of a medallion lakehouse:
- Bronze layer — raw ingestion, owned by each domain. Contract validates schema and basic completeness.
- Silver layer — cleaned, deduplicated, quality-validated. Contract enforces business rules, quarantines bad rows.
- Gold layer — cross-domain joins and aggregations. Contracts define the join keys and output guarantees.
Each layer has its own contract. Each domain owns its Bronze-to-Silver pipeline. The platform team owns the Gold layer joins. The contracts are the interfaces between them.
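To illustrate the Gold layer's role as an interface, a contract for the revenue_analytics dataset might look roughly like this. The `inputs` section and field names are assumptions for illustration — not a documented LakeLogic format — while the `info` layout follows the orders contract above:

```yaml
# Illustrative gold-layer contract. The "inputs" block and field names
# are hypothetical; only the info/model/quality layout mirrors the
# orders example above.
info:
  version: 1
  name: "revenue_analytics"
  domain: "finance"        # owning domain for the gold product
  owner: "platform-engineering@company.com"

inputs:                    # cross-domain interfaces this contract consumes
  - contract: "contracts/crm/orders.yaml"
  - contract: "contracts/finance/invoices.yaml"

model:
  fields:
    - name: order_id
      type: string
    - name: invoiced_amount
      type: float

quality:
  row_rules:
    - sql: "invoiced_amount >= 0"   # global monetary rule, restated here
```

The point is structural: the gold contract names its upstream contracts explicitly, so a breaking change in CRM's orders contract is detectable before the join ever runs.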
Getting Started: Three Steps
You don't need to go full data mesh overnight. Start with contracts:
- Pick one domain. Choose your most critical data pipeline (usually orders or transactions). Write a contract for it.
- Bootstrap from existing data. Use LakeLogic's infer_contract to auto-generate a starting contract from your landing zone files, then refine the rules.
- Expand domain by domain. As each team adopts contracts, inter-domain trust grows organically. The governance layer builds itself.
```bash
# Install
pip install lakelogic

# Auto-generate contracts from your existing data
lakelogic bootstrap --landing data/orders/ --output contracts/ \
  --registry contracts/reg.yaml --suggest-rules

# Validate — same contract, any engine
lakelogic run --contract contracts/orders.yaml --source data/orders.csv

# Check quarantine — see exactly which rows failed and why
lakelogic run --contract contracts/orders.yaml --source data/orders.csv \
  --show-quarantine
```
One Data Contract. Any Engine. Every Domain.
Write your quality rules once in YAML. Run them on Polars, Spark, DuckDB, or Pandas. Let the contract be the governance layer your data mesh is missing.