Data contracts for the modern lakehouse. One YAML. Any engine.
LakeLogic is an open-source, contract-driven data engineering framework. Define your schema once in YAML — and get quality enforcement, quarantine routing, synthetic data generation, incremental processing, and data lineage out of the box. No boilerplate. No sprawl.
LakeLogic is easy to install using your standard package manager. Select your environment and get started.
pip install lakelogic
poetry add lakelogic
conda install -c conda-forge lakelogic
Integrates with your stack
Define schema, types, nullability, and quality rules once. Every ingest is validated — bad rows quarantined with exact failure reasons, not silent drops.
Polars-native throughout — no Pandas overhead. Validate millions of rows locally in seconds. The same contract runs identically on a laptop or a Databricks cluster.
LakeLogic is and always will be open source — MIT license. The full framework lives on GitHub. Every engineer is encouraged to contribute.
Most teams stitch together Great Expectations for quality, Faker for test data, custom scripts for ingestion logic, and a separate orchestrator for lineage. Each has a different config format, different runtime, different failure mode.
LakeLogic replaces all four with a single YAML contract that a data engineer writes once — and the framework handles the rest. The same contract that validates your Bronze ingest also generates 50,000 realistic test rows for your CI pipeline.
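As an illustration, such a contract might look like the sketch below. The keys shown (`fields`, `rules`, `quarantine`) are hypothetical and chosen for readability — they are not LakeLogic's confirmed contract schema:

```yaml
# Hypothetical sketch — key names are illustrative,
# not LakeLogic's confirmed contract format.
title: Orders Bronze
fields:
  order_id:
    type: string
    nullable: false
  amount:
    type: float
    rules:
      - min: 0
  status:
    type: string
    rules:
      - allowed: [placed, shipped, cancelled]
quarantine:
  reason_column: reject_reason
```

The same file would then drive validation, quarantine routing, and test-data generation.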
Bootstrap from an existing file in one command — the output is always a portable, version-controllable YAML.
Drop any file on infer_contract() — LakeLogic infers the schema, detects null patterns, suggests quality rules from your data's actual distribution, and returns a ContractDraft you can inspect, save, or chain directly into a data generator. No configuration files. No Python classes to subclass. One function call.
```python
from lakelogic import infer_contract, DataGenerator, DataProcessor

# Infer a full contract from any file
draft = infer_contract(
    "data/orders.csv",
    title="Orders Bronze",
    suggest_rules=True,
    detect_pii=True,
)
draft.show()                                # print YAML to inspect
draft.save("contracts/bronze_orders.yaml")

# Generate 10k test rows from the inferred contract
df = draft.to_generator(seed=42).generate(rows=10_000)

# Or skip YAML entirely — pure in-memory flow
df = DataGenerator.from_file("data/orders.csv").generate(rows=5_000)

# Validate on ingest
result = DataProcessor("contracts/bronze_orders.yaml").run(df)
```
LakeLogic reads and validates any format your lakehouse ingests. Contracts are format-agnostic — the same YAML covers CSV landing files today and Delta tables tomorrow.
Drop any file on infer_contract() — LakeLogic infers schema, nullability, and quality rules from your data's actual distribution.
DataProcessor enforces every rule at runtime. Bad rows route to quarantine with a per-row reason column — no silent failures, no pipeline crashes.
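The routing pattern itself is easy to reason about. Here is a minimal pure-Python sketch of the idea — it is not LakeLogic's implementation (which operates on columnar frames), and the rule names are invented for the example:

```python
def route_rows(rows, rules):
    """Toy sketch of quarantine routing: split rows into clean and
    quarantined, stamping each bad row with every rule it failed."""
    clean, quarantined = [], []
    for row in rows:
        # Collect the name of every rule this row violates
        reasons = [name for name, check in rules.items() if not check(row)]
        if reasons:
            quarantined.append({**row, "reject_reason": "; ".join(reasons)})
        else:
            clean.append(row)
    return clean, quarantined

# Hypothetical rules, mirroring what a contract might declare
rules = {
    "amount_non_negative": lambda r: r["amount"] >= 0,
    "order_id_present": lambda r: bool(r.get("order_id")),
}

clean, bad = route_rows(
    [{"order_id": "A1", "amount": 10.0},
     {"order_id": "", "amount": -5.0}],
    rules,
)
# bad[0]["reject_reason"] → "amount_non_negative; order_id_present"
```

The key design point is that a failing row is annotated and set aside, never dropped and never allowed to abort the run.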
DataGenerator produces realistic rows from the same contract — seeded from your real data's distributions. CI pipelines that actually catch issues.
Built-in watermark strategies — max target, lookback window, CDC — keep Bronze→Silver runs idempotent automatically. No manual bookmarks.
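Conceptually, the max-target strategy reduces to "only process rows newer than what the target already holds". A toy sketch in plain Python (again, not LakeLogic's actual implementation):

```python
def incremental_batch(source_rows, target_rows, ts_key="updated_at"):
    """Max-target watermark sketch: keep only source rows strictly
    newer than the newest timestamp already in the target. Re-runs
    are idempotent because loaded rows fall at or below the mark."""
    watermark = max((r[ts_key] for r in target_rows), default=None)
    if watermark is None:
        return list(source_rows)   # first load: take everything
    return [r for r in source_rows if r[ts_key] > watermark]

target = [{"id": 1, "updated_at": "2024-05-01"}]
source = [
    {"id": 1, "updated_at": "2024-05-01"},   # already loaded, skipped
    {"id": 2, "updated_at": "2024-05-02"},   # new, picked up
]
batch = incremental_batch(source, target)
# batch → [{"id": 2, "updated_at": "2024-05-02"}]
```

Running the same batch again after it lands in the target yields nothing new — which is the idempotence property the framework advertises.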
Every run stamped with source path, run ID, and timestamp. Full lineage from source file → Bronze → Silver → Gold. Debug in minutes, not days.
YAML contracts are version-controlled, PR-reviewed, and environment-aware. Compliance teams get an audit trail. Engineers get a review process.
Your team has the same data validation rule written in at least three places — Spark, Lambda, dbt. When one changes, the others drift. One YAML data contract fixes all three.
Reliability: One bad row in a data pipeline shouldn't crash a 2-hour job. Route bad records out automatically, with a reject reason column.
📐 Architecture: Schema validation checks shape. A data contract enforces meaning — quality rules, lineage, quarantine, and engine portability.
The full framework is open source under the MIT license. No feature gates, no usage limits.
LakeLogic is MIT-licensed and free forever. Validate, quarantine, and generate data from a single YAML contract.
Release notes & new playbooks. No spam. Unsubscribe any time.