Your Lakehouse.
Under Contract.

Data contracts for the modern lakehouse. One YAML. Any engine.

Open source · MIT License · Polars-native speed · Databricks & Fabric-ready
★ Azure Reference Architecture · Bronze → Silver → Gold · 40+ contracts included

LakeLogic is an open-source, contract-driven data engineering framework. Define your schema once in YAML — and get quality enforcement, quarantine routing, synthetic data generation, incremental processing, and data lineage out of the box. No boilerplate. No sprawl.

Quick install

LakeLogic installs with your package manager of choice. Pick your environment and get started.

pip install lakelogic
poetry add lakelogic
conda install -c conda-forge lakelogic

Integrates with your stack

Databricks
Snowflake
dbt
Delta Lake
Azure ADLS
AWS S3
BigQuery
Benefits

Everything your data lake was missing

01

Contract-first quality

Define schema, types, nullability, and quality rules once. Every ingest is validated — bad rows quarantined with exact failure reasons, not silent drops.
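As a sketch of what such a contract might look like in practice (the field names and rule keys below are illustrative assumptions, not LakeLogic's documented schema):

```yaml
# Hypothetical contract sketch: key names are illustrative,
# not LakeLogic's actual YAML schema.
title: Orders Bronze
columns:
  order_id:
    type: int64
    nullable: false
    rules:
      - unique
  amount:
    type: float64
    nullable: false
    rules:
      - min: 0
  email:
    type: string
    nullable: true
    pii: true
quarantine:
  on_failure: route   # keep failed rows with a reason, never drop silently
```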

02

Embarrassingly fast

Polars-native throughout — no Pandas overhead. Validate millions of rows locally in seconds. The same contract runs identically on a laptop or a Databricks cluster.

03

Open source

LakeLogic is and always will be open source — MIT license. The full framework lives on GitHub. Every engineer is encouraged to contribute.

Built by data engineers
to replace four tools with one

Why LakeLogic

Most teams stitch together Great Expectations for quality, Faker for test data, custom scripts for ingestion logic, and a separate orchestrator for lineage. Each has a different config format, a different runtime, and a different failure mode.

LakeLogic replaces all four with a single YAML contract that a data engineer writes once — and the framework handles the rest. The same contract that validates your Bronze ingest also generates 50,000 realistic test rows for your CI pipeline.

Bootstrap from an existing file in one command — the output is always a portable, version-controllable YAML.

LakeLogic: 1 contract · Replaces GE + Faker + custom scripts · 4 tools → 1
Zero to pipeline

From raw file to
contract in one line

Drop any file on infer_contract() — LakeLogic infers the schema, detects null patterns, suggests quality rules from your data's actual distribution, and returns a ContractDraft you can inspect, save, or chain directly into a data generator.

No configuration files. No Python classes to subclass. One function call.

quickstart.py
from lakelogic import DataGenerator, DataProcessor, infer_contract

# Infer a full contract from any file
draft = infer_contract(
    "data/orders.csv",
    title="Orders Bronze",
    suggest_rules=True,
    detect_pii=True,
)

draft.show()   # print YAML to inspect
draft.save("contracts/bronze_orders.yaml")

# Generate 10k test rows from the inferred contract
df = draft.to_generator(seed=42).generate(rows=10_000)

# Or skip YAML entirely: pure in-memory flow
df = DataGenerator.from_file("data/orders.csv").generate(rows=5_000)

# Validate on ingest
result = DataProcessor("contracts/bronze_orders.yaml").run(df)
Support

Works with all common data formats

LakeLogic reads and validates any format your lakehouse ingests. Contracts are format-agnostic — the same YAML covers CSV landing files today and Delta tables tomorrow.

  • Text: CSV, JSON, NDJSON
  • Binary: Parquet, Avro, Excel, ORC
  • Open table formats: Delta Lake, Apache Iceberg
  • Cloud storage: S3, Azure Blob / ADLS, GCS
  • In-memory: Polars DataFrame, Pandas DataFrame
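The idea behind format-agnostic contracts can be sketched in a few lines: the reader is picked from the file suffix, while the contract itself never changes. This is an illustrative toy dispatcher, not LakeLogic's internals; the reader names are placeholders.

```python
from pathlib import Path

# Illustrative sketch only: map a landing-file suffix to a reader.
# The contract stays the same regardless of which reader is chosen.
READERS = {
    ".csv": "read_csv",
    ".parquet": "read_parquet",
    ".ndjson": "read_ndjson",
    ".avro": "read_avro",
}

def pick_reader(path: str) -> str:
    """Return the reader name for a given landing-file path."""
    suffix = Path(path).suffix.lower()
    try:
        return READERS[suffix]
    except KeyError:
        raise ValueError(f"unsupported format: {suffix}")

print(pick_reader("landing/2024/orders.parquet"))  # read_parquet
```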

From raw file to production pipeline
in under 10 minutes

01

Ship your first contract in seconds

Drop any file on infer_contract() — LakeLogic infers schema, nullability, and quality rules from your data's actual distribution.

02

Catch bad rows before they reach Silver

DataProcessor enforces every rule at runtime. Bad rows route to quarantine with a per-row reason column — no silent failures, no pipeline crashes.
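The quarantine pattern is simple enough to show end to end. This is a stdlib-only toy (plain dicts, no Polars) of the general technique, not LakeLogic's DataProcessor: rows that fail any rule are routed to a quarantine set with a reason column instead of being dropped.

```python
# Toy sketch of quarantine routing, not LakeLogic's implementation.
def route(rows, rules):
    """Split rows into (valid, quarantined); rules map column -> predicate."""
    valid, quarantined = [], []
    for row in rows:
        reasons = [
            f"{col}: failed {check.__name__}"
            for col, check in rules.items()
            if not check(row.get(col))
        ]
        if reasons:
            # Failed rows keep their data plus an exact failure reason.
            quarantined.append({**row, "_reason": "; ".join(reasons)})
        else:
            valid.append(row)
    return valid, quarantined

def non_null(v): return v is not None
def non_negative(v): return v is not None and v >= 0

rows = [{"order_id": 1, "amount": 9.5}, {"order_id": None, "amount": -2.0}]
good, bad = route(rows, {"order_id": non_null, "amount": non_negative})
print(bad[0]["_reason"])  # order_id: failed non_null; amount: failed non_negative
```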

03

50,000 realistic test rows from one command

DataGenerator produces realistic rows from the same contract — seeded from your real data's distributions. CI pipelines that actually catch issues.
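The principle behind seeded generation can be shown with a stdlib-only sketch: sample each column from the empirical values of a real sample, with a fixed seed so every CI run sees identical fixtures. This illustrates the technique under those assumptions; it is not LakeLogic's DataGenerator.

```python
import random

# Illustrative sketch of seeded synthetic data, not LakeLogic's generator:
# sample columns from the empirical distribution of a real sample.
def generate(sample_rows, n, seed=42):
    rng = random.Random(seed)  # same seed -> identical CI fixtures
    columns = sample_rows[0].keys()
    pools = {c: [r[c] for r in sample_rows] for c in columns}
    return [{c: rng.choice(pools[c]) for c in columns} for _ in range(n)]

sample = [{"status": "paid", "amount": 10.0}, {"status": "refunded", "amount": 25.0}]
rows = generate(sample, n=5)
assert rows == generate(sample, n=5)  # reproducible with the same seed
```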

04

Never reprocess what you've already ingested

Built-in watermark strategies — max target, lookback window, CDC — keep Bronze→Silver runs idempotent automatically. No manual bookmarks.
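Two of the named strategies are easy to sketch; the semantics below are my reading of the terms, not LakeLogic's exact API. "Max target" resumes after the newest row already loaded; "lookback window" re-reads a trailing safety window to catch late-arriving data.

```python
from datetime import datetime, timedelta

# Sketch of two watermark strategies (assumed semantics, not LakeLogic's API).
def max_target_watermark(target_timestamps):
    """Only rows newer than this are ingested on the next run."""
    return max(target_timestamps, default=None)

def lookback_watermark(target_timestamps, hours=24):
    """Re-read a trailing window so late-arriving rows are still picked up."""
    high = max_target_watermark(target_timestamps)
    return None if high is None else high - timedelta(hours=hours)

loaded = [datetime(2024, 5, 1, 12), datetime(2024, 5, 2, 8)]
print(max_target_watermark(loaded))          # 2024-05-02 08:00:00
print(lookback_watermark(loaded, hours=6))   # 2024-05-02 02:00:00
```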

05

Know exactly where every bad row came from

Every run stamped with source path, run ID, and timestamp. Full lineage from source file → Bronze → Silver → Gold. Debug in minutes, not days.
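Run stamping amounts to tagging every row with its provenance at ingest time. A minimal stdlib sketch (the `_source_path`, `_run_id`, and `_ingested_at` column names are my own illustrative choices, not LakeLogic's):

```python
import uuid
from datetime import datetime, timezone

# Sketch of lineage stamping with illustrative column names: each row is
# tagged with its source path, a run ID, and an ingest timestamp so a bad
# Gold row can be traced back to the exact Bronze file and run.
def stamp(rows, source_path):
    run_id = str(uuid.uuid4())
    ts = datetime.now(timezone.utc).isoformat()
    return [
        {**row, "_source_path": source_path, "_run_id": run_id, "_ingested_at": ts}
        for row in rows
    ]

stamped = stamp([{"order_id": 1}], "s3://landing/orders/2024-05-02.csv")
print(stamped[0]["_source_path"])  # s3://landing/orders/2024-05-02.csv
```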

06

Your contracts live in git. So does your trust.

YAML contracts are version-controlled, PR-reviewed, and environment-aware. Compliance teams get an audit trail. Engineers get a review process.

From the blog

Data engineering,
without the guesswork

Read all posts →
Pricing

100% free. Forever.

The full framework is open source under the MIT license. No feature gates, no usage limits.

LakeLogic Framework
$0
Free forever · MIT License
  • Full DataProcessor + DataGenerator
  • infer_contract() from any file format
  • Quarantine with per-row error reasons
  • Incremental watermark strategies
  • CLI: generate, bootstrap, run, validate
  • Databricks + Spark + Polars engines
  • Community support (GitHub Issues)
View on GitHub →

Open Source. Contract-Driven. Production-Ready.

LakeLogic is MIT-licensed and free forever. Validate, quarantine, and generate data from a single YAML contract.

Release notes & new playbooks. No spam. Unsubscribe any time.

Get started with pip · ★ Star on GitHub