The Problem: The Same Rule, Written Three Times
Here's something most data engineering teams don't notice until it bites them. The rule `status IN ('active', 'churned')` exists in your codebase at least three times:
```python
# 1. Spark job — Silver layer processing
df.filter(col("status").isin(["active", "churned"]))

# 2. Lambda / ACA — Bronze ingestion pre-check
assert df["status"].is_in(["active", "churned"]).all()

# 3. dbt — Gold layer test
# accepted_values: {column: status, values: [active, churned]}
```
Product adds "trial" to the status enum. One PR. Which of those three breaks tonight?
This is the real Spark Tax — not the cost of running Spark, but the cost of maintaining the same validation logic separately in every engine across your stack.
When the rule drifts between environments, your CI passes and your production job fails. Or worse: both pass, but they're enforcing different rules, and you don't know it until a bad record reaches your Gold layer six weeks later.
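The drift scenario is easy to reproduce in miniature. The sketch below is illustrative pure Python, not LakeLogic or Spark code: the same enum rule is copied into two places, the PR updates only one of them, and a `trial` record passes one check while failing the other.

```python
# Two copies of the "same" rule — the PR added "trial" to one and missed the other.
SPARK_JOB_STATUSES = {"active", "churned", "trial"}    # updated in the PR
LAMBDA_CHECK_STATUSES = {"active", "churned"}          # forgotten

record = {"status": "trial"}

passes_spark = record["status"] in SPARK_JOB_STATUSES
passes_lambda = record["status"] in LAMBDA_CHECK_STATUSES

# The record sails through the Spark job but fails the Lambda pre-check.
print(passes_spark, passes_lambda)  # True False
```

Neither check is "wrong" in isolation; the bug only exists in the gap between them, which is exactly why it evades per-component tests.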
The Fix: Same Data Contract, Different Engine
LakeLogic is engine-agnostic. You write your quality rules once in YAML, and the same data contract runs identically on Polars, Spark, DuckDB, or Pandas. No code changes. No separate validation logic per environment.
```yaml
# Same YAML. Runs on Polars, Spark, DuckDB, Pandas.
quality:
  row_rules:
    - sql: "revenue >= 0"
    - sql: "email LIKE '%@%.%'"
    - sql: "status IN ('active', 'churned', 'pending')"
  quarantine:
    enabled: true
    target: "quarantine/customers"
```
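Conceptually, a contract like this splits each batch into rows that satisfy every rule and rows that go to quarantine. Here is a minimal pure-Python sketch of that split; the rule list and `validate` helper are hypothetical stand-ins for illustration, not LakeLogic internals.

```python
# One shared rule list — the only place the rules are defined.
RULES = [
    ("revenue", lambda v: v >= 0),
    ("status", lambda v: v in {"active", "churned", "pending"}),
]

def validate(rows):
    """Split rows into (good, quarantined) using the shared rule list."""
    good, bad = [], []
    for row in rows:
        ok = all(check(row[col]) for col, check in RULES)
        (good if ok else bad).append(row)
    return good, bad

rows = [
    {"revenue": 10, "status": "active"},
    {"revenue": -5, "status": "active"},   # fails revenue >= 0
    {"revenue": 3, "status": "trial"},     # fails the status enum
]
good, bad = validate(rows)
print(len(good), len(bad))  # 1 2
```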
Running on Polars (Development + Lightweight Production)
```python
from lakelogic import DataProcessor

# Uses Polars under the hood — no JVM, no cluster
result = DataProcessor("contract.yaml").run_source()
print(f"✅ {len(result.good)} valid | ❌ {len(result.bad)} quarantined")
```
Running on Spark (Large-Scale Production)
```python
from lakelogic import DataProcessor

# Same contract — just change the engine
result = DataProcessor(
    "contract.yaml",
    engine="spark",
).run_source()
```
The key insight: you don't need to rewrite your validation logic when you move between engines. Write your data contract once locally on Polars (instant startup, zero dependencies), deploy to Spark only when your data volumes actually require distributed compute.
When to Use What
| Scenario | Engine | Why |
|---|---|---|
| Files under 1 GB | Polars | 10x faster startup, $5/month container |
| 1–50 GB files | DuckDB | Single-node analytical power, no cluster |
| 50+ GB / multi-TB | Spark | Distributed compute is justified at this scale |
| Local development | Polars or Pandas | Instant, no infrastructure needed |
The rule of thumb: if your file fits in memory on a single node, you don't need a distributed compute cluster to validate it.
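The table above can be read as a simple sizing function. This is an illustrative helper, not part of the LakeLogic API; the thresholds come straight from the table, and real deployments would also weigh memory headroom and file format.

```python
def choose_engine(size_gb: float) -> str:
    """Hypothetical rule-of-thumb engine picker based on input size."""
    if size_gb < 1:
        return "polars"   # fast startup, single node
    if size_gb <= 50:
        return "duckdb"   # single-node analytical engine
    return "spark"        # distributed compute is justified

print(choose_engine(0.2), choose_engine(20), choose_engine(500))
# polars duckdb spark
```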
The Real Cost: Maintenance, Drift, and the 2am Incident
The Spark Tax isn't primarily a cloud bill problem — it's an engineering reliability problem. Every time your validation logic lives in multiple places, you're accumulating maintenance debt:
- Rule changes require multiple PRs — update Spark, update Lambda, update dbt. Hope nobody misses one.
- Environments enforce different rules — CI says valid, production says invalid. Or both say valid and both are wrong.
- Onboarding is harder — a new engineer has to find and understand validation logic scattered across the stack.
- Testing is fragile — your unit tests cover the Polars path. Your integration tests cover Spark. Neither covers the case where they diverge.
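That last failure mode, two engine paths silently diverging, is cheap to guard against once both paths read the same rule definition. The sketch below uses hypothetical adapter functions (not LakeLogic code) to show the shape of such a test: run the same fixture rows through both paths and assert they agree.

```python
# Single source of truth for the rule — both "engines" read from it.
ALLOWED = {"active", "churned", "pending"}

def engine_a(row):  # stands in for e.g. the Polars path
    return row["status"] in ALLOWED

def engine_b(row):  # stands in for e.g. the Spark path
    return row["status"] in ALLOWED

fixtures = [{"status": s} for s in ("active", "trial", "pending")]
for row in fixtures:
    assert engine_a(row) == engine_b(row), f"engines diverge on {row}"
print("engines agree on all fixtures")
```

Because both adapters derive from one `ALLOWED` set, the assertion can only fail if someone reintroduces a second copy of the rule, which is precisely the regression this test exists to catch.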
LakeLogic doesn't tell you to stop using Spark. It tells you to stop reimplementing the same rules for every engine you touch. Use Spark where you genuinely need distributed compute — 50 GB+ loads, multi-TB reprocessing, streaming at scale. Use Polars or DuckDB for everything else. And let one data contract drive both.
Try It Yourself
```bash
# Install
pip install lakelogic

# Bootstrap contracts from your landing zone
lakelogic bootstrap --landing data/ --output contracts/ \
  --registry contracts/reg.yaml --suggest-rules

# Run validation on Polars (default)
lakelogic run --contract contracts/orders.yaml --source data/orders.csv

# Generate test data to verify quarantine
lakelogic generate --contract contracts/orders.yaml \
  --rows 1000 --invalid-ratio 0.1 --preview 5
```
One Data Contract. Any Engine.
Write your quality rules once in YAML. Run them on Polars, Spark, DuckDB, or Pandas — no rewrites. Open source, MIT licensed.