TL;DR: The data quality stack I kept inheriting across data engineering engagements looked like this: Great Expectations for schema rules, Faker for test data (maintained separately, drifted constantly), custom PySpark scripts for business rules, and a lineage tracker that was always two sprints behind. Four tools. Four config formats. Four runtime environments. Four different ways to fail silently. LakeLogic is one YAML contract that replaces all four. Here's what that migration looks like end-to-end, and where the gaps still are.
The Stack I Kept Inheriting
The pattern is consistent enough that I stopped being surprised by it. You join a data engineering team, or you audit one, and the validation stack looks some version of this:
- Great Expectations for schema and column-level quality rules at Bronze ingest
- Faker (or equivalent) for generating synthetic test data in CI — maintained separately, drifted from real schemas constantly
- Custom PySpark scripts for business rule validation — undocumented, no standard format, different failure behaviour per script
- A lineage tracker bolted onto ADF or Airflow — never quite accurate, always lagging behind real pipeline state
The incident that crystallised this for me: a Monday morning where a downstream Power BI report showed revenue figures 23% lower than expected. The data had been wrong since Thursday. Nobody knew.
Root cause: a field in an upstream CRM export changed from `VARCHAR(50)` to `VARCHAR(255)`. Harmless on its own. But it broke a type-casting assumption buried in a PySpark validation script that nobody had touched in eight months. Great Expectations passed the batch. The PySpark script failed silently. The lineage tracker didn't catch the discrepancy. The Faker-generated test data in CI was still producing `VARCHAR(50)` values — so CI also passed.
That incident wasn't a data quality failure. It was a coordination failure — four tools with no shared understanding of what the data was supposed to look like.
That's the problem LakeLogic is built to solve. One YAML file that defines schema, quality rules, test data generation, and lineage stamping. One runtime that runs identically across Polars, Spark, DuckDB, and Pandas. One place to look when something breaks.
Why Not Just Fix Great Expectations?
Before building anything, I evaluated the obvious options:
- Double down on Great Expectations — extend the existing suite, enforce better discipline around updates. The problem: GE addresses schema validation well, but it doesn't generate test data, doesn't stamp lineage, and doesn't run the same rules on Polars, Spark, and DuckDB from the same config. You still need the three other tools.
- Soda.io — good cloud-hosted observability angle, but per-column pricing gets expensive fast at scale. And it's still a separate system from your test data and lineage tooling.
- Build a thin wrapper — something that coordinates GE + Faker + lineage. After two prototypes this felt like the wrong abstraction — stitching together tools with different data models rather than unifying the underlying contract.
The deciding factor: a data contract should be the single source of truth for everything related to what a dataset is supposed to look like. Schema, types, nullability, quality rules, test data distributions, lineage metadata — all of it should derive from one definition, not four.
Another signal: `infer_contract()` on an existing file can bootstrap a usable contract in seconds, covering schema, type inference, nullability, and statistically derived quality rules automatically. That means migrating an existing source starts at ~70% done with one command.
The Before: What 1,847 Lines of Validation Looked Like
Here's a representative example — a Bronze Orders source from an e-commerce platform, one of the most common data engineering patterns you'll encounter. This is what the validation configuration looks like across the four-tool stack:
```
# great_expectations/expectations/orders_bronze.json
# 340 lines for a single source table
{
  "expectation_suite_name": "orders_bronze",
  "expectations": [
    {
      "expectation_type": "expect_column_to_exist",
      "kwargs": { "column": "order_id" }
    },
    {
      "expectation_type": "expect_column_values_to_not_be_null",
      "kwargs": { "column": "order_id", "mostly": 1.0 }
    },
    {
      "expectation_type": "expect_column_values_to_match_regex",
      "kwargs": { "column": "order_id", "regex": "^ORD-[0-9]{8}$" }
    },
    {
      "expectation_type": "expect_column_values_to_be_between",
      "kwargs": { "column": "order_amount", "min_value": 0 }
    }
    /* ... 280 more lines */
  ]
}

# SEPARATE: faker_orders.py          — 147 lines, schema-drifted
# SEPARATE: validate_orders_spark.py — 203 lines, undocumented
# SEPARATE: lineage_tracker.py       — 97 lines, broken since Q2

# Total for ONE source: 787 lines, 4 files, 4 formats
# Across 23 sources: 1,847 lines nobody fully owns
```
The critical failure mode isn't the line count — it's that when a business rule changes (say, operations adds "exchange" to the order status enum), you need to update four separate files. Miss one and you won't know which environment is enforcing the stale rule until bad records reach your Gold layer and break a downstream dashboard.
The After: One Contract, Same Rules Everywhere
Here is the equivalent LakeLogic contract for the same Bronze Orders source:
version: "1.0.0" dataset: bronze_orders info: title: Bronze Orders owner: data-platform source: type: raw_landing path: adls://landing/orders/ model: fields: - name: order_id type: string required: true - name: customer_id type: string required: true - name: order_amount type: float required: true - name: order_date type: date required: true - name: order_status type: string required: true - name: sku type: string quality: row_rules: - name: positive_amount sql: "order_amount >= 0" category: correctness - name: valid_status sql: "order_status IN ('pending','processing','shipped','delivered','cancelled','refunded')" category: correctness - regex_match: field: order_id pattern: "^ORD-[0-9]{8}$" lineage: enabled: true capture_source_path: true capture_run_id: true quarantine: enabled: true include_error_reason: true
A 36-line file replaces 787. The same contract validates Bronze ingest, generates CI test data, and stamps lineage metadata. One file. One format. One runtime.
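To make that concrete with the enum change from the previous section: when operations adds "exchange" as an order status, the edit is one line in one file, and every engine enforces it on the next run. A sketch of the only change needed:

```yaml
# contracts/bronze_orders.yaml — the single edit for the new status.
# All four engines pick it up on their next run; no other file changes.
- name: valid_status
  sql: "order_status IN ('pending','processing','shipped','delivered','cancelled','refunded','exchange')"
  category: correctness
```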
What Migration Looks Like in Practice
Step 1: Bootstrap from Existing Data
For any existing Bronze source, run `infer_contract()` against a sample file from the landing zone. It infers schema, nullability, and statistical quality rules from actual data distributions:
```python
from lakelogic import infer_contract

draft = infer_contract(
    'adls://landing/orders/sample_2024_q4.parquet',
    dataset='bronze_orders',
    suggest_rules=True,   # infers rules from actual data distribution
    detect_pii=True,      # flags customer_id, email columns
)
draft.show()  # inspect before saving
draft.save('contracts/bronze_orders.yaml')
```
The bootstrapped contract covers roughly 70% of what you'd write by hand — schema, types, and nullability are inferred correctly from the data. You then add business rules manually: domain-specific constraints like `allowed_values` and `not_future` that require domain knowledge rather than statistical inference.
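As a sketch of what that manual pass might look like, here are the two rule types named above appended to the drafted contract. The exact key shapes for `allowed_values` and `not_future` are my assumption, modeled on the `regex_match` shape in the Bronze Orders contract:

```yaml
# Hand-written business rules appended after infer_contract().
# Key shapes for allowed_values / not_future are assumed, modeled on
# the regex_match rule shown in the Bronze Orders contract.
quality:
  row_rules:
    - allowed_values:
        field: order_status
        values: [pending, processing, shipped, delivered, cancelled, refunded]
    - not_future:
        field: order_date
```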
Step 2: Validate Against Existing Data — and Find Hidden Issues
```python
from lakelogic import DataProcessor

result = DataProcessor('contracts/bronze_orders.yaml').run(
    'adls://bronze/orders/2024-11-*'
)
print(result.summary())
# Passed: 98.7%
# Quarantined: 1.3% (1,247 rows)
# Reject reasons:
#   positive_amount: order_amount < 0:          891 rows
#   regex_match:     order_id format invalid:   247 rows
#   valid_status:    order_status unknown value: 109 rows
```
This is the key difference from Great Expectations: you get per-row reject reasons, not batch pass/fail. Take the 891 rows with `order_amount < 0`: negative order amounts are physically impossible, and they point directly to a sign-flip bug in the upstream order export script that had been silently corrupting revenue reporting for weeks. Great Expectations would have told you the batch failed. LakeLogic tells you specifically which rows, and why.
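In practice, the follow-up to a summary like that is pulling the quarantined rows and routing a sample to each upstream owner. A sketch, assuming a `quarantined` accessor that returns a Pandas-like frame carrying the reject-reason column implied by `include_error_reason: true`; neither name is confirmed API:

```python
from lakelogic import DataProcessor

result = DataProcessor('contracts/bronze_orders.yaml').run(
    'adls://bronze/orders/2024-11-*'
)

# Assumed accessor: the quarantined rows, carrying the reject-reason
# column implied by `quarantine.include_error_reason: true`.
bad = result.quarantined

# One sample file per failed rule, so each upstream owner gets a repro.
for reason, rows in bad.groupby('error_reason'):
    print(f"{reason}: {len(rows)} rows quarantined")
    rows.head(20).to_csv(f"rejects_{reason}.csv", index=False)
```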
Step 3: Replace Schema-Drifted Test Data Generation
The most structurally important change: when test data is generated from the same contract that validates production data, schema drift between test and production is architecturally impossible. You can't have a CI Faker script that produces `VARCHAR(50)` values for a field that's been `VARCHAR(255)` in production for a year — because both derive from the same YAML.
```python
from lakelogic import DataGenerator, DataProcessor

# Test data generated from the same contract that validates production.
# Schema drift is structurally impossible.
df = (
    DataGenerator.from_contract('contracts/bronze_orders.yaml')
    .generate(rows=50_000, seed=42)
)

# Same contract validates the generated data — CI is testing the real rules
result = DataProcessor('contracts/bronze_orders.yaml').run(df)
```
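Because the generated frame is in-memory, this also doubles as a check of engine portability: the same contract that runs against ADLS paths in production validates a local frame in CI. A minimal pytest sketch, assuming the API shown above plus a `pass_rate` accessor (my assumption, implied by `summary()`):

```python
# test_bronze_orders_contract.py: minimal CI gate using the API shown
# above. `pass_rate` is an assumed accessor, implied by summary().
from lakelogic import DataGenerator, DataProcessor

CONTRACT = 'contracts/bronze_orders.yaml'

def test_generated_data_passes_contract():
    # Deterministic synthetic batch from the production contract.
    df = DataGenerator.from_contract(CONTRACT).generate(rows=10_000, seed=42)

    # Validate with the exact rules production enforces.
    result = DataProcessor(CONTRACT).run(df)

    assert result.pass_rate == 1.0, result.summary()
```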
What This Eliminates — By Design
Before → After
| Capability | Four-tool stack | LakeLogic |
|---|---|---|
| Config format | GE JSON + Python + dbt YAML + custom | One YAML, any engine |
| Bad row visibility | Batch pass/fail only | Per-row reject reason column |
| Test data | Separately maintained, drifts from schema | Generated from contract — drift impossible |
| Lineage | Manual tracker, always lagging | Auto-stamped per run (source, run_id, timestamp) |
| New source onboarding | 2–3 sprint days across all four tools | `infer_contract()` + manual business rules |
Current Gaps — What's Still Missing
I'd rather be direct about where LakeLogic doesn't yet fully solve the problem than sell something it isn't. Three honest gaps:
1. Business Rules Still Need Domain Experts
`infer_contract()` is excellent for schema and statistical rules. It's less helpful for domain-specific constraints — things like "a refunded order must have a non-null `refund_reason`" or "`order_amount` for enterprise customers can't exceed their pre-approved credit limit."
Those require domain knowledge that no amount of inference can substitute for. Plan for domain experts to spend time articulating rules they may never have had to write down before — that articulation is often more valuable than the contract itself, but it takes longer than expected.
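Once an expert does articulate such a rule, though, it drops straight into the contract. The refund example above, written in the `row_rules` sql style from the Bronze Orders contract:

```yaml
# A rule no statistical inference could discover: refunded orders must
# carry a reason. Uses the row_rules sql style shown earlier;
# refund_reason is the upstream field named in the example above.
- name: refund_has_reason
  sql: "order_status != 'refunded' OR refund_reason IS NOT NULL"
  category: correctness
```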
2. No UI for Non-Technical Contract Review
Compliance teams and data owners often need to review contracts before they go to production. The YAML is readable by engineers, but not by analysts or compliance reviewers, so you'll likely need to render contracts as HTML tables or a readable summary for non-technical review. This is the gap I most want to close: a hosted contract registry with a review interface. It's on the roadmap; it's not there yet.
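Until that registry exists, a workable stopgap is a few lines of PyYAML that turn the contract into a reviewer-friendly Markdown table. A rough sketch, assuming only the contract layout shown in the Bronze Orders example:

```python
# render_contract.py: stopgap contract rendering for non-technical review.
# Assumes only the contract layout shown in the Bronze Orders example.
import yaml

with open('contracts/bronze_orders.yaml') as f:
    contract = yaml.safe_load(f)

print(f"# {contract['info']['title']}  (owner: {contract['info']['owner']})\n")
print("| Field | Type | Required |")
print("|---|---|---|")
for field in contract['model']['fields']:
    print(f"| {field['name']} | {field['type']} | {field.get('required', False)} |")

print("\n## Quality rules")
for rule in contract['quality']['row_rules']:
    # Named sql rules carry a 'name' key; typed rules (e.g. regex_match)
    # are single-key mappings, so fall back to that key.
    label = rule.get('name', next(iter(rule)))
    print(f"- {label}: {rule.get('sql', '(see contract)')}")
```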
3. Parallel Running During Migration Has a Cost
If you're migrating from an existing validation stack, running both systems in parallel for confidence is the right call — but it doubles your validation compute cost during that period. Be deliberate about decommissioning the old system once you have confidence rather than leaving it running indefinitely out of caution.
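One way to keep that window short: diff only the headline numbers from both stacks on the same batches, and decommission once they agree for a fixed stretch. A sketch under stated assumptions (a `pass_rate` accessor on the LakeLogic result; your legacy stack's reporting stubbed out, since its interface varies):

```python
# parallel_check.py: bound the parallel-running window by diffing headline
# numbers between the two stacks on the same batch. `pass_rate` is an
# assumed LakeLogic accessor; the legacy call is a stub for whatever
# your existing GE/PySpark jobs report today.
from lakelogic import DataProcessor

def legacy_pass_rate(batch_path: str) -> float:
    """Stub: return the pass rate your existing stack recorded for this batch."""
    raise NotImplementedError

def stacks_agree(batch_path: str, tolerance: float = 0.001) -> bool:
    result = DataProcessor('contracts/bronze_orders.yaml').run(batch_path)
    drift = abs(result.pass_rate - legacy_pass_rate(batch_path))
    if drift > tolerance:
        print(f"{batch_path}: stacks disagree by {drift:.2%}, investigate before decommissioning")
        return False
    return True
```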
Should You Make This Switch?
Any of the following is a signal that the four-tool stack is costing you more than it's worth:
- You have validation logic duplicated across Spark, dbt, and CI
- You've had a data quality incident caused by schema drift in the last 12 months
- You can't tell a stakeholder exactly which rows failed validation and why
- Onboarding a new data source takes more than a day of engineering time
- Your test data generation is maintained separately from your schema definitions
If none of those apply, your data quality setup is unusually healthy and you should write about it — I'd genuinely like to know what you're doing.
One Data Contract. Any Engine.
Write your quality rules once in YAML. Run them on Polars, Spark, DuckDB, or Pandas. Per-row reject reasons. Auto-stamped lineage. Open source, MIT licensed.