Data contracts for the modern lakehouse. One YAML. Any engine.
LakeLogic is an open-source, contract-driven data engineering framework. Define your schema once in YAML — and get quality enforcement, quarantine routing, synthetic data generation, incremental processing, and data lineage out of the box. No boilerplate. No sprawl.
LakeLogic is easy to install using your standard package manager. Select your environment and get started.
pip install lakelogic
poetry add lakelogic
conda install -c conda-forge lakelogic
Integrates with your stack
Define schema, types, nullability, and quality rules once. Every ingest is validated — bad rows quarantined with exact failure reasons, not silent drops.
Polars-native throughout — no Pandas overhead. Validate millions of rows locally in seconds. The same contract runs identically on a laptop or a Databricks cluster.
LakeLogic is and always will be open source — MIT license. The full framework lives on GitHub. Every engineer is encouraged to contribute.
Most teams stitch together Great Expectations for quality, Faker for test data, custom scripts for ingestion logic, and a separate orchestrator for lineage. Each has a different config format, different runtime, different failure mode.
LakeLogic replaces all four with a single YAML contract that a data engineer writes once — and the framework handles the rest. The same contract that validates your Bronze ingest also generates 50,000 realistic test rows for your CI pipeline.
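As an illustration, such a contract might look like the sketch below. The keys shown (`fields`, `rules`, `quarantine`) are hypothetical and chosen for readability — they are not LakeLogic's confirmed contract schema:

```yaml
# Hypothetical sketch — key names are illustrative,
# not LakeLogic's confirmed contract format.
title: Orders Bronze
fields:
  order_id:
    type: string
    nullable: false
  amount:
    type: float
    rules:
      - min: 0
  status:
    type: string
    rules:
      - allowed: [placed, shipped, cancelled]
quarantine:
  reason_column: reject_reason
```

The same file would then drive validation, quarantine routing, and test-data generation.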
Bootstrap from an existing file in one command — the output is always a portable, version-controllable YAML.
Drop any file on infer_contract() — LakeLogic infers the schema, detects null patterns, suggests quality rules from your data's actual distribution, and returns a ContractDraft you can inspect, save, or chain directly into a data generator. No configuration files. No Python classes to subclass. One function call.
```python
from lakelogic import infer_contract, DataGenerator, DataProcessor

# Infer a full contract from any file
draft = infer_contract(
    "data/orders.csv",
    title="Orders Bronze",
    suggest_rules=True,
    detect_pii=True,
)
draft.show()                                # print YAML to inspect
draft.save("contracts/bronze_orders.yaml")

# Generate 10k test rows from the inferred contract
df = draft.to_generator(seed=42).generate(rows=10_000)

# Or skip YAML entirely — pure in-memory flow
df = DataGenerator.from_file("data/orders.csv").generate(rows=5_000)

# Validate on ingest
result = DataProcessor("contracts/bronze_orders.yaml").run(df)
```
LakeLogic reads and validates any format your lakehouse ingests. Contracts are format-agnostic — the same YAML covers CSV landing files today and Delta tables tomorrow.
Drop any file on infer_contract() — LakeLogic infers schema, nullability, and quality rules from your data's actual distribution.
DataProcessor enforces every rule at runtime. Bad rows route to quarantine with a per-row reason column — no silent failures, no pipeline crashes.
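The routing pattern itself is easy to reason about. Here is a minimal pure-Python sketch of the idea — it is not LakeLogic's implementation (which operates on columnar frames), and the rule names are invented for the example:

```python
def route_rows(rows, rules):
    """Toy sketch of quarantine routing: split rows into clean and
    quarantined, stamping each bad row with every rule it failed."""
    clean, quarantined = [], []
    for row in rows:
        # Collect the name of every rule this row violates
        reasons = [name for name, check in rules.items() if not check(row)]
        if reasons:
            quarantined.append({**row, "reject_reason": "; ".join(reasons)})
        else:
            clean.append(row)
    return clean, quarantined

# Hypothetical rules, mirroring what a contract might declare
rules = {
    "amount_non_negative": lambda r: r["amount"] >= 0,
    "order_id_present": lambda r: bool(r.get("order_id")),
}

clean, bad = route_rows(
    [{"order_id": "A1", "amount": 10.0},
     {"order_id": "", "amount": -5.0}],
    rules,
)
# bad[0]["reject_reason"] → "amount_non_negative; order_id_present"
```

The key design point is that a failing row is annotated and set aside, never dropped and never allowed to abort the run.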
DataGenerator produces realistic rows from the same contract — seeded from your real data's distributions. CI pipelines that actually catch issues.
Built-in watermark strategies — max target, lookback window, CDC — keep Bronze→Silver runs idempotent automatically. No manual bookmarks.
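Conceptually, the max-target strategy reduces to "only process rows newer than what the target already holds". A toy sketch in plain Python (again, not LakeLogic's actual implementation):

```python
def incremental_batch(source_rows, target_rows, ts_key="updated_at"):
    """Max-target watermark sketch: keep only source rows strictly
    newer than the newest timestamp already in the target. Re-runs
    are idempotent because loaded rows fall at or below the mark."""
    watermark = max((r[ts_key] for r in target_rows), default=None)
    if watermark is None:
        return list(source_rows)   # first load: take everything
    return [r for r in source_rows if r[ts_key] > watermark]

target = [{"id": 1, "updated_at": "2024-05-01"}]
source = [
    {"id": 1, "updated_at": "2024-05-01"},   # already loaded, skipped
    {"id": 2, "updated_at": "2024-05-02"},   # new, picked up
]
batch = incremental_batch(source, target)
# batch → [{"id": 2, "updated_at": "2024-05-02"}]
```

Running the same batch again after it lands in the target yields nothing new — which is the idempotence property the framework advertises.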
Every run stamped with source path, run ID, and timestamp. Full lineage from source file → Bronze → Silver → Gold. Debug in minutes, not days.
YAML contracts are version-controlled, PR-reviewed, and environment-aware. Compliance teams get an audit trail. Engineers get a review process.
Your team has the same data validation rule written in at least three places — Spark, Lambda, dbt. When one changes, the others drift. One YAML data contract fixes all three.
Reliability: One bad row in a data pipeline shouldn't crash a 2-hour job. Route bad records out automatically, with a reject reason column.
📐 Architecture: Schema validation checks shape. A data contract enforces meaning — quality rules, lineage, quarantine, and engine portability.
The full framework is open source under the MIT license. No feature gates, no usage limits.
LakeLogic is MIT-licensed and free forever. Validate, quarantine, and generate data from a single YAML contract.
Release notes & new playbooks. No spam. Unsubscribe any time.