TL;DR: The data quality stack I kept inheriting across data engineering engagements looked like this: Great Expectations for schema rules, Faker for test data (maintained separately, drifted constantly), custom PySpark scripts for business rules, and a lineage tracker that was always two sprints behind. Four tools. Four config formats. Four runtime environments. Four different ways to fail silently. LakeLogic is one YAML contract that replaces all four. Here's what that migration looks like end-to-end, and where the gaps still are.
The Stack I Kept Inheriting
The pattern is consistent enough that I stopped being surprised by it. You join a data engineering team, or you audit one, and the validation stack looks some version of this:
- Great Expectations for schema and column-level quality rules at Bronze ingest
- Faker (or equivalent) for generating synthetic test data in CI — maintained separately, drifted from real schemas constantly
- Custom PySpark scripts for business rule validation — undocumented, no standard format, different failure behaviour per script
- A lineage tracker bolted onto ADF or Airflow — never quite accurate, always lagging behind real pipeline state
The incident that crystallised this for me: a Monday morning where a downstream Power BI report showed revenue figures 23% lower than expected. The data had been wrong since Thursday. Nobody knew.
Root cause: a field in an upstream CRM export changed from `VARCHAR(50)` to `VARCHAR(255)`. Harmless on its own. But it broke a type-casting assumption buried in a PySpark validation script that nobody had touched in eight months. Great Expectations passed the batch. The PySpark script failed silently. The lineage tracker didn't catch the discrepancy. The Faker-generated test data in CI was still producing `VARCHAR(50)` values — so CI also passed.
That incident wasn't a data quality failure. It was a coordination failure — four tools with no shared understanding of what the data was supposed to look like.
That's the problem LakeLogic is built to solve. One YAML file that defines schema, quality rules, test data generation, and lineage stamping. One runtime that runs identically across Polars, Spark, DuckDB, and Pandas. One place to look when something breaks.
Why Not Just Fix Great Expectations?
Before building anything, I evaluated the obvious options:
- Double down on Great Expectations — extend the existing suite, enforce better discipline around updates. The problem: GE addresses schema validation well, but it doesn't generate test data, doesn't stamp lineage, and doesn't run the same rules on Polars, Spark, and DuckDB from the same config. You still need the three other tools.
- Soda.io — good cloud-hosted observability angle, but per-column pricing gets expensive fast at scale. And it's still a separate system from your test data and lineage tooling.
- Build a thin wrapper — something that coordinates GE + Faker + lineage. After two prototypes this felt like the wrong abstraction — stitching together tools with different data models rather than unifying the underlying contract.
The deciding factor: a data contract should be the single source of truth for everything related to what a dataset is supposed to look like. Schema, types, nullability, quality rules, test data distributions, lineage metadata — all of it should derive from one definition, not four.
Another signal: `infer_contract()` on an existing file can bootstrap a usable contract in seconds, covering schema, type inference, nullability, and statistically derived quality rules automatically. That means migrating an existing source starts at ~70% done with one command.
The Before: What 1,847 Lines of Validation Looked Like
Here's a representative example — a Bronze Orders source from an e-commerce platform, one of the most common data engineering patterns you'll encounter. This is what the validation configuration looks like across the four-tool stack:
```
# great_expectations/expectations/orders_bronze.json
# 340 lines for a single source table
{
  "expectation_suite_name": "orders_bronze",
  "expectations": [
    {
      "expectation_type": "expect_column_to_exist",
      "kwargs": { "column": "order_id" }
    },
    {
      "expectation_type": "expect_column_values_to_not_be_null",
      "kwargs": { "column": "order_id", "mostly": 1.0 }
    },
    {
      "expectation_type": "expect_column_values_to_match_regex",
      "kwargs": { "column": "order_id", "regex": "^ORD-[0-9]{8}$" }
    },
    {
      "expectation_type": "expect_column_values_to_be_between",
      "kwargs": { "column": "order_amount", "min_value": 0 }
    }
    /* ... 280 more lines */
  ]
}

# SEPARATE: faker_orders.py          — 147 lines, schema-drifted
# SEPARATE: validate_orders_spark.py — 203 lines, undocumented
# SEPARATE: lineage_tracker.py       — 97 lines, broken since Q2

# Total for ONE source: 787 lines, 4 files, 4 formats
# Across 23 sources: 1,847 lines nobody fully owns
```
The critical failure mode isn't the line count — it's that when a business rule changes (say, operations adds "exchange" to the order status enum), you need to update four separate files. Miss one and you won't know which environment is enforcing the stale rule until bad records reach your Gold layer and break a downstream dashboard.
The After: One Contract, Same Rules Everywhere
Here is the equivalent LakeLogic contract for the same Bronze Orders source:
version: "1.0.0" dataset: bronze_orders info: title: Bronze Orders owner: data-platform source: type: raw_landing path: adls://landing/orders/ model: fields: - name: order_id type: string required: true - name: customer_id type: string required: true - name: order_amount type: float required: true - name: order_date type: date required: true - name: order_status type: string required: true - name: sku type: string quality: row_rules: - name: positive_amount sql: "order_amount >= 0" category: correctness - name: valid_status sql: "order_status IN ('pending','processing','shipped','delivered','cancelled','refunded')" category: correctness - regex_match: field: order_id pattern: "^ORD-[0-9]{8}$" lineage: enabled: true capture_source_path: true capture_run_id: true quarantine: enabled: true include_error_reason: true
A 36-line file replaces 787. The same contract validates Bronze ingest, generates CI test data, and stamps lineage metadata. One file. One format. One runtime.
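To make that concrete with the enum change from the previous section: when operations adds "exchange" as an order status, the edit is one line in one file, and every engine enforces it on the next run. A sketch of the only change needed:

```yaml
# contracts/bronze_orders.yaml — the single edit for the new status.
# All four engines pick it up on their next run; no other file changes.
- name: valid_status
  sql: "order_status IN ('pending','processing','shipped','delivered','cancelled','refunded','exchange')"
  category: correctness
```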
What Migration Looks Like in Practice
Step 1: Bootstrap from Existing Data
For any existing Bronze source, run `infer_contract()` against a sample file from the landing zone. It infers schema, nullability, and statistical quality rules from actual data distributions:
```python
from lakelogic import infer_contract

draft = infer_contract(
    'adls://landing/orders/sample_2024_q4.parquet',
    dataset='bronze_orders',
    suggest_rules=True,   # infers rules from actual data distribution
    detect_pii=True,      # flags customer_id, email columns
)
draft.show()  # inspect before saving
draft.save('contracts/bronze_orders.yaml')
```
The bootstrapped contract covers roughly 70% of what you'd write by hand — schema, types, and nullability are inferred correctly from the data. You then add business rules manually: domain-specific constraints like `allowed_values` and `not_future` that require domain knowledge rather than statistical inference.
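As a sketch of what that manual pass might look like, here are the two rule types named above appended to the drafted contract. The exact key shapes for `allowed_values` and `not_future` are my assumption, modeled on the `regex_match` shape in the Bronze Orders contract:

```yaml
# Hand-written business rules appended after infer_contract().
# Key shapes for allowed_values / not_future are assumed, modeled on
# the regex_match rule shown in the Bronze Orders contract.
quality:
  row_rules:
    - allowed_values:
        field: order_status
        values: [pending, processing, shipped, delivered, cancelled, refunded]
    - not_future:
        field: order_date
```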
Step 2: Validate Against Existing Data — and Find Hidden Issues
```python
from lakelogic import DataProcessor

result = DataProcessor('contracts/bronze_orders.yaml').run(
    'adls://bronze/orders/2024-11-*'
)
print(result.summary())
# Passed: 98.7%
# Quarantined: 1.3% (1,247 rows)
# Reject reasons:
#   positive_amount: order_amount < 0:          891 rows
#   regex_match:     order_id format invalid:   247 rows
#   valid_status:    order_status unknown value: 109 rows
```
This is the key difference from Great Expectations: you get per-row reject reasons, not batch pass/fail. Take the 891 rows with `order_amount < 0`: negative order amounts are physically impossible, and they point directly to a sign-flip bug in the upstream order export script that had been silently corrupting revenue reporting for weeks. Great Expectations would have told you the batch failed. LakeLogic tells you specifically which rows, and why.
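In practice, the follow-up to a summary like that is pulling the quarantined rows and routing a sample to each upstream owner. A sketch, assuming a `quarantined` accessor that returns a Pandas-like frame carrying the reject-reason column implied by `include_error_reason: true`; neither name is confirmed API:

```python
from lakelogic import DataProcessor

result = DataProcessor('contracts/bronze_orders.yaml').run(
    'adls://bronze/orders/2024-11-*'
)

# Assumed accessor: the quarantined rows, carrying the reject-reason
# column implied by `quarantine.include_error_reason: true`.
bad = result.quarantined

# One sample file per failed rule, so each upstream owner gets a repro.
for reason, rows in bad.groupby('error_reason'):
    print(f"{reason}: {len(rows)} rows quarantined")
    rows.head(20).to_csv(f"rejects_{reason}.csv", index=False)
```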
Step 3: Replace Schema-Drifted Test Data Generation
The most structurally important change: when test data is generated from the same contract that validates production data, schema drift between test and production is architecturally impossible. You can't have a CI Faker script that produces `VARCHAR(50)` values for a field that's been `VARCHAR(255)` in production for a year — because both derive from the same YAML.
```python
from lakelogic import DataGenerator, DataProcessor

# Test data generated from the same contract that validates production.
# Schema drift is structurally impossible.
df = (
    DataGenerator.from_contract('contracts/bronze_orders.yaml')
    .generate(rows=50_000, seed=42)
)

# Same contract validates the generated data — CI is testing the real rules
result = DataProcessor('contracts/bronze_orders.yaml').run(df)
```
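Because the generated frame is in-memory, this also doubles as a check of engine portability: the same contract that runs against ADLS paths in production validates a local frame in CI. A minimal pytest sketch, assuming the API shown above plus a `pass_rate` accessor (my assumption, implied by `summary()`):

```python
# test_bronze_orders_contract.py: minimal CI gate using the API shown
# above. `pass_rate` is an assumed accessor, implied by summary().
from lakelogic import DataGenerator, DataProcessor

CONTRACT = 'contracts/bronze_orders.yaml'

def test_generated_data_passes_contract():
    # Deterministic synthetic batch from the production contract.
    df = DataGenerator.from_contract(CONTRACT).generate(rows=10_000, seed=42)

    # Validate with the exact rules production enforces.
    result = DataProcessor(CONTRACT).run(df)

    assert result.pass_rate == 1.0, result.summary()
```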
What This Eliminates — By Design
Before → After
| Capability | Four-tool stack | LakeLogic |
|---|---|---|
| Config format | GE JSON + Python + dbt YAML + custom | One YAML, any engine |
| Bad row visibility | Batch pass/fail only | Per-row reject reason column |
| Test data | Separately maintained, drifts from schema | Generated from contract — drift impossible |
| Lineage | Manual tracker, always lagging | Auto-stamped per run (source, run_id, timestamp) |
| New source onboarding | 2–3 sprint days across all four tools | `infer_contract()` + manual business rules |
Current Gaps — What's Still Missing
I'd rather be direct about where LakeLogic doesn't yet fully solve the problem than sell something it isn't. Three honest gaps:
1. Business Rules Still Need Domain Experts
`infer_contract()` is excellent for schema and statistical rules. It's less helpful for domain-specific constraints — things like "a refunded order must have a non-null `refund_reason`" or "`order_amount` for enterprise customers can't exceed their pre-approved credit limit."
Those require domain knowledge that no amount of inference can substitute for. Plan for domain experts to spend time articulating rules they may never have had to write down before — that articulation is often more valuable than the contract itself, but it takes longer than expected.
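Once an expert does articulate such a rule, though, it drops straight into the contract. The refund example above, written in the `row_rules` sql style from the Bronze Orders contract:

```yaml
# A rule no statistical inference could discover: refunded orders must
# carry a reason. Uses the row_rules sql style shown earlier;
# refund_reason is the upstream field named in the example above.
- name: refund_has_reason
  sql: "order_status != 'refunded' OR refund_reason IS NOT NULL"
  category: correctness
```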
2. No UI for Non-Technical Contract Review
Compliance teams and data owners often need to review contracts before they go to production. The YAML is readable by engineers, but not by analysts or compliance reviewers, so you'll likely need to render contracts as HTML tables or a readable summary for non-technical review. This is the gap I most want to close: a hosted contract registry with a review interface. It's on the roadmap; it's not there yet.
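Until that registry exists, a workable stopgap is a few lines of PyYAML that turn the contract into a reviewer-friendly Markdown table. A rough sketch, assuming only the contract layout shown in the Bronze Orders example:

```python
# render_contract.py: stopgap contract rendering for non-technical review.
# Assumes only the contract layout shown in the Bronze Orders example.
import yaml

with open('contracts/bronze_orders.yaml') as f:
    contract = yaml.safe_load(f)

print(f"# {contract['info']['title']}  (owner: {contract['info']['owner']})\n")
print("| Field | Type | Required |")
print("|---|---|---|")
for field in contract['model']['fields']:
    print(f"| {field['name']} | {field['type']} | {field.get('required', False)} |")

print("\n## Quality rules")
for rule in contract['quality']['row_rules']:
    # Named sql rules carry a 'name' key; typed rules (e.g. regex_match)
    # are single-key mappings, so fall back to that key.
    label = rule.get('name', next(iter(rule)))
    print(f"- {label}: {rule.get('sql', '(see contract)')}")
```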
3. Parallel Running During Migration Has a Cost
If you're migrating from an existing validation stack, running both systems in parallel for confidence is the right call — but it doubles your validation compute cost during that period. Be deliberate about decommissioning the old system once you have confidence rather than leaving it running indefinitely out of caution.
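One way to keep that window short: diff only the headline numbers from both stacks on the same batches, and decommission once they agree for a fixed stretch. A sketch under stated assumptions (a `pass_rate` accessor on the LakeLogic result; your legacy stack's reporting stubbed out, since its interface varies):

```python
# parallel_check.py: bound the parallel-running window by diffing headline
# numbers between the two stacks on the same batch. `pass_rate` is an
# assumed LakeLogic accessor; the legacy call is a stub for whatever
# your existing GE/PySpark jobs report today.
from lakelogic import DataProcessor

def legacy_pass_rate(batch_path: str) -> float:
    """Stub: return the pass rate your existing stack recorded for this batch."""
    raise NotImplementedError

def stacks_agree(batch_path: str, tolerance: float = 0.001) -> bool:
    result = DataProcessor('contracts/bronze_orders.yaml').run(batch_path)
    drift = abs(result.pass_rate - legacy_pass_rate(batch_path))
    if drift > tolerance:
        print(f"{batch_path}: stacks disagree by {drift:.2%}, investigate before decommissioning")
        return False
    return True
```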
Should You Make This Switch?
Any of the following is a signal that the four-tool stack is costing you more than it's worth:
- You have validation logic duplicated across Spark, dbt, and CI
- You've had a data quality incident caused by schema drift in the last 12 months
- You can't tell a stakeholder exactly which rows failed validation and why
- Onboarding a new data source takes more than a day of engineering time
- Your test data generation is maintained separately from your schema definitions
If none of those apply, your data quality setup is unusually healthy and you should write about it — I'd genuinely like to know what you're doing.
One Data Contract. Any Engine.
Write your quality rules once in YAML. Run them on Polars, Spark, DuckDB, or Pandas. Per-row reject reasons. Auto-stamped lineage. Open source, MIT licensed.