The Promise vs. The Reality
Zhamak Dehghani's Data Mesh thesis is compelling: treat data as a product, give domain teams ownership of their analytical data, and let a self-serve platform handle the infrastructure. The four principles are elegant:
Domain Ownership
The team that produces the data owns its quality, schema, and SLAs.
Data as a Product
Domain datasets are discoverable, addressable, trustworthy, and self-describing.
Self-Serve Platform
Infrastructure is abstracted so domain teams can publish data without DevOps bottlenecks.
Federated Governance
Global standards (naming, quality, interoperability) are enforced computationally, not by committee.
The theory is sound. The practice? Most data mesh adoptions stall at principle two. Teams build domain-owned pipelines but have no enforceable interface between them. Each domain defines quality differently. Schema changes propagate by Slack message. "Governance" is a wiki page nobody reads.
Data mesh without data contracts is just decentralized chaos with better branding.
The Missing Layer: Data Contracts
A data contract is the enforceable interface between a data producer domain and its consumers. It's not a schema — it's a schema plus quality rules, lineage tracking, quarantine policy, and engine portability. It defines not just the shape of the data, but its meaning and guarantees.
In a data mesh, data contracts solve three critical gaps:
- Inter-domain trust. The CRM domain publishes a contract for its orders data. The Finance domain consumes it for revenue reporting. If CRM breaks its contract, it's caught at the source — not six layers downstream.
- Federated governance, computationally enforced. Global rules (no PII in the silver layer, all monetary fields are non-negative, all timestamps are UTC) are encoded in YAML and validated automatically — not documented in a governance PDF.
- Engine portability. Domain teams choose their own engine. The CRM team uses Spark. The HR team uses Polars. The Finance team uses DuckDB. One data contract runs on all three — no logic drift.
A Practical Example: Enterprise Domains as a Data Mesh
Let's make this concrete. In a real enterprise, data domains map to business functions — not individual tables. CRM owns orders, customers, and products. Finance owns invoices and revenue. HR owns employees and payroll. Sales owns pipeline and forecasts. Each domain team owns their contracts and publishes validated data to a shared lakehouse.
┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│ CRM Domain │ │ Finance Domain │ │ HR Domain │ │ Sales Domain │
│ │ │ │ │ │ │ │
│ orders.yaml │ │ invoices.yaml │ │ employees.yaml │ │ pipeline.yaml │
│ customers.yaml │ │ revenue.yaml │ │ payroll.yaml │ │ forecasts.yaml │
│ products.yaml │ │ accounts.yaml │ │ benefits.yaml │ │ leads.yaml │
│ ┌────────────┐ │ │ ┌────────────┐ │ │ ┌────────────┐ │ │ ┌────────────┐ │
│ │ Schema │ │ │ │ Schema │ │ │ │ Schema │ │ │ │ Schema │ │
│ │ Quality │ │ │ │ Quality │ │ │ │ Quality │ │ │ │ Quality │ │
│ │ Quarantine │ │ │ │ Quarantine │ │ │ │ Quarantine │ │ │ │ Quarantine │ │
│ │ Lineage │ │ │ │ Lineage │ │ │ │ Lineage │ │ │ │ Lineage │ │
│ └────────────┘ │ │ └────────────┘ │ │ └────────────┘ │ │ └────────────┘ │
└────────┬─────────┘ └────────┬─────────┘ └────────┬─────────┘ └────────┬─────────┘
│ │ │ │
└─────────────────────┼─────────────────────┼─────────────────────┘
▼ ▼
┌──────────────────────────────────────────────┐
│ Gold Layer │
│ revenue_analytics.yaml (CRM × Finance) │
│ workforce_costs.yaml (HR × Finance) │
│ sales_forecast.yaml (Sales × CRM) │
└──────────────────────────────────────────────┘
The CRM Domain: Orders Contract
The CRM team owns this contract. It defines what "a valid order" means — globally, computationally, and unambiguously. Note the info section: the domain and system fields classify where this data lives in the mesh, giving lineage and governance tools the context they need to trace data across domain boundaries.
```yaml
# CRM domain — owned by the CRM team
info:
  version: 1
  name: "orders_silver"
  domain: "crm"
  system: "shopify"
  owner: "crm-engineering@company.com"

model:
  fields:
    - name: order_id
      type: string
    - name: customer_id
      type: string
    - name: order_amount
      type: float
    - name: order_status
      type: string
    - name: order_date
      type: date

quality:
  row_rules:
    - sql: "order_amount >= 0"
    - sql: "order_status IN ('pending', 'shipped', 'delivered', 'returned')"
    - sql: "order_id IS NOT NULL"
    - sql: "customer_id IS NOT NULL"

quarantine:
  enabled: true
  target: "quarantine/crm/orders"
  include_error_reason: true

lineage:
  enabled: true
```
The info.domain and info.system fields are the key to making data mesh lineage
work. When the Finance team's revenue_analytics gold contract joins CRM orders with Finance
invoices, the lineage trace shows exactly which domain and source system produced each input —
not just which file.
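To make that concrete, here is a sketch of the kind of lineage record such a trace implies. The field names (`inputs`, `layer`) and the `netsuite` system name are illustrative assumptions, not LakeLogic's documented output format:

```python
# Illustrative only: a lineage record shaped like the trace described above.
# Field names and the "netsuite" system are hypothetical.
lineage_record = {
    "contract": "revenue_analytics",
    "layer": "gold",
    "inputs": [
        {"name": "orders_silver", "domain": "crm", "system": "shopify"},
        {"name": "invoices_silver", "domain": "finance", "system": "netsuite"},
    ],
}

def upstream_domains(record: dict) -> set:
    """Which domains does this gold dataset depend on?"""
    return {inp["domain"] for inp in record["inputs"]}

print(upstream_domains(lineage_record))
```

Because every input carries its domain and system, a gold-layer anomaly can be routed to the owning team mechanically — no tribal knowledge required.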
Running It — Each Domain, Any Engine
Here's the key insight for data mesh: each domain team chooses their own engine, but the contract format is identical. The CRM team uses Spark because they process 200M rows/day. The HR team uses Polars because their datasets are smaller. The governance is the same for both:
```python
from lakelogic import DataProcessor

# CRM domain — runs on Spark (200M orders/day)
orders = DataProcessor("contracts/crm/orders.yaml", engine="spark").run_source()

# Finance domain — runs on DuckDB (10M invoices, single node)
invoices = DataProcessor("contracts/finance/invoices.yaml", engine="duckdb").run_source()

# HR domain — runs on Polars (50K employees, no cluster needed)
employees = DataProcessor("contracts/hr/employees.yaml").run_source()

# Sales domain — runs on Polars (pipeline is lean)
pipeline = DataProcessor("contracts/sales/pipeline.yaml").run_source()

# Same contract format. Same quality rules. Different engines.
# Each domain owns its contracts. Zero logic drift.
```
Mapping Data Contracts to the Four Principles
Here's how data contracts fill the gap in each of the four data mesh principles:
| Data Mesh Principle | Without Contracts | With Data Contracts |
|---|---|---|
| Domain Ownership | Teams own pipelines but quality is undefined | Each domain owns a YAML contract — the definition is the code |
| Data as a Product | "Trustworthy" means "I think it's fine" | Trust is computed: pass rate, quarantine count, lineage trace |
| Self-Serve Platform | Platform team writes validation for every domain | Domain teams self-serve: write YAML, pick an engine, run_source() |
| Federated Governance | Wiki page that nobody reads | Global rules in YAML, enforced computationally on every run |
Federated Governance That Actually Works
The hardest data mesh principle is federated governance. How do you enforce global standards when each domain owns its own pipeline? Most teams try org charts and review committees. That doesn't scale.
With data contracts, governance becomes code, not process:
- Global quality rules — "All monetary fields must be >= 0" is a row rule in every contract. It's enforced at compute time, not review time.
- Quarantine policy — Bad rows are captured with error reasons, not silently dropped or allowed through. The governance team can audit quarantine tables across all domains.
- Lineage tracking — Every contract run produces a lineage record. You can trace any gold-layer anomaly back to the exact domain, contract version, and source file that produced it.
- Schema evolution — When the orders team adds a field, the contract catches downstream breakage before deployment, not after.
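The first point — global rules as code — can itself be checked in CI. The sketch below is a standalone illustration (not a LakeLogic feature): it scans a parsed contract for monetary fields and fails if the global `>= 0` rule is missing. The contract layout mirrors the orders example above; `MONETARY_FIELDS` is a hypothetical governance registry.

```python
# Hypothetical governance check: every contract that declares a monetary
# field must carry the global ">= 0" row rule. Illustrative sketch only.
MONETARY_FIELDS = {"order_amount", "invoice_amount", "salary"}

def missing_monetary_rules(contract: dict) -> list:
    """Return monetary fields that lack the global non-negativity rule."""
    fields = {f["name"] for f in contract["model"]["fields"]}
    rules = {r["sql"] for r in contract.get("quality", {}).get("row_rules", [])}
    return [
        f for f in sorted(fields & MONETARY_FIELDS)
        if f"{f} >= 0" not in rules
    ]

# A compliant contract (mirrors the orders example) and a non-compliant one.
orders = {
    "model": {"fields": [{"name": "order_id"}, {"name": "order_amount"}]},
    "quality": {"row_rules": [{"sql": "order_amount >= 0"}]},
}
bad = {
    "model": {"fields": [{"name": "invoice_id"}, {"name": "invoice_amount"}]},
    "quality": {"row_rules": []},
}

print(missing_monetary_rules(orders))  # []
print(missing_monetary_rules(bad))     # ['invoice_amount']
```

Run a check like this against every contract in the repo on each pull request, and federated governance becomes a failing build instead of a meeting.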
The key insight
Federated governance doesn't mean "no governance." It means governance that travels with the data. The contract is the governance layer — it moves through Bronze, Silver, and Gold alongside the data it protects. No separate enforcement step. No manual review gate. The rules are the pipeline.
Data Mesh vs. Data Lake: It's Not Either/Or
A common misconception: data mesh replaces the data lake. It doesn't. Data mesh is an organizational architecture. The data lake (or lakehouse) is the storage architecture. You need both.
In practice, a data mesh runs on top of a medallion lakehouse:
- Bronze layer — raw ingestion, owned by each domain. Contract validates schema and basic completeness.
- Silver layer — cleaned, deduplicated, quality-validated. Contract enforces business rules, quarantines bad rows.
- Gold layer — cross-domain joins and aggregations. Contracts define the join keys and output guarantees.
Each layer has its own contract. Each domain owns its Bronze-to-Silver pipeline. The platform team owns the Gold layer joins. The contracts are the interfaces between them.
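To illustrate the Gold layer's role as an interface, a contract for the revenue_analytics dataset might look roughly like this. The `inputs` section and field names are assumptions for illustration — not a documented LakeLogic format — while the `info` layout follows the orders contract above:

```yaml
# Illustrative gold-layer contract. The "inputs" block and field names
# are hypothetical; only the info/model/quality layout mirrors the
# orders example above.
info:
  version: 1
  name: "revenue_analytics"
  domain: "finance"        # owning domain for the gold product
  owner: "platform-engineering@company.com"

inputs:                    # cross-domain interfaces this contract consumes
  - contract: "contracts/crm/orders.yaml"
  - contract: "contracts/finance/invoices.yaml"

model:
  fields:
    - name: order_id
      type: string
    - name: invoiced_amount
      type: float

quality:
  row_rules:
    - sql: "invoiced_amount >= 0"   # global monetary rule, restated here
```

The point is structural: the gold contract names its upstream contracts explicitly, so a breaking change in CRM's orders contract is detectable before the join ever runs.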
Getting Started: Three Steps
You don't need to go full data mesh overnight. Start with contracts:
- Pick one domain. Choose your most critical data pipeline (usually orders or transactions). Write a contract for it.
- Bootstrap from existing data. Use LakeLogic's infer_contract to auto-generate a starting contract from your landing zone files, then refine the rules.
- Expand domain by domain. As each team adopts contracts, inter-domain trust grows organically. The governance layer builds itself.
```bash
# Install
pip install lakelogic

# Auto-generate contracts from your existing data
lakelogic bootstrap --landing data/orders/ --output contracts/ \
  --registry contracts/reg.yaml --suggest-rules

# Validate — same contract, any engine
lakelogic run --contract contracts/orders.yaml --source data/orders.csv

# Check quarantine — see exactly which rows failed and why
lakelogic run --contract contracts/orders.yaml --source data/orders.csv \
  --show-quarantine
```
One Data Contract. Any Engine. Every Domain.
Write your quality rules once in YAML. Run them on Polars, Spark, DuckDB, or Pandas. Let the contract be the governance layer your data mesh is missing.