What Is the Platform Tax?
If you've evaluated enterprise data quality tools — Informatica Data Quality, Talend, Ataccama, Collibra — you've seen the pattern. They solve real problems, but they come with hidden costs that compound over time:
Per-Seat Licensing
Every data engineer who needs to write or modify a quality rule needs a license. At $500–2,000/user/month, a team of 10 engineers costs $60K–240K/year before you write a single rule.
Proprietary Rule Format
Your quality rules live in a vendor-specific format — a GUI, a custom DSL, or a database. You can't version-control them. You can't code-review them. You can't run them in CI/CD.
Dedicated Infrastructure
Most platforms require their own servers, agents, or cloud instances. Your data leaves your pipeline to be validated externally, then returns. More hops, more latency, more blast radius.
Vendor Lock-in
After 2 years and 500 rules, migration is effectively impossible. The vendor knows this. Your renewal price reflects it.
This is the platform tax: the ongoing cost of using a tool that solves a code problem with infrastructure. Data quality management doesn't need its own platform. It needs the right abstraction.
The best data quality tool is the one your data engineers already control — code in version control, validated at compute time, with zero licensing cost.
What You Actually Need from DQM
Strip away the marketing and enterprise sales deck. Data quality management for a modern data team comes down to five capabilities:
- Schema enforcement. Types, nullability, field presence — validated before data reaches downstream consumers.
- Row-level quality rules. Business logic like "amount must be non-negative" or "status must be a valid enum" — applied to every row, not sampled.
- Quarantine. Bad rows isolated with exact failure reasons, not dropped silently or mixed into production tables.
- Lineage. Know where every row came from, when it was processed, and which pipeline run produced it.
- Engine portability. Rules that work on Polars, Spark, DuckDB, or Snowflake — not rules locked to one vendor's runtime.
Every one of these is expressible in YAML + SQL. No proprietary platform required.
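To make the quarantine and row-rule semantics concrete, here is a minimal Python sketch of the behavior the contract describes: every row is checked against every rule, and failing rows are isolated with an exact reason rather than dropped. The rule expressions and field names mirror the orders example used in this article; the code is illustrative, not LakeLogic's implementation.

```python
# Illustrative sketch of row-level rules plus quarantine -- not LakeLogic's
# implementation, just the contract semantics: check every row, isolate
# failures with an exact error reason instead of dropping them silently.

VALID_STATUSES = {"pending", "shipped", "delivered", "returned"}

# Each rule returns an error-reason string, or None if the row passes.
RULES = [
    lambda row: None if row.get("amount", 0) >= 0 else "amount < 0",
    lambda row: None if row.get("status") in VALID_STATUSES
    else f"status {row.get('status')!r} not in accepted values",
]

def validate(rows):
    """Split rows into (valid, quarantined); quarantined rows carry reasons."""
    valid, quarantined = [], []
    for row in rows:
        reasons = [r for rule in RULES if (r := rule(row)) is not None]
        if reasons:
            quarantined.append({**row, "_error_reason": "; ".join(reasons)})
        else:
            valid.append(row)
    return valid, quarantined

rows = [
    {"order_id": "a1", "amount": 19.99, "status": "shipped"},
    {"order_id": "a2", "amount": -5.00, "status": "shipped"},
    {"order_id": "a3", "amount": 12.00, "status": "lost"},
]
valid, quarantined = validate(rows)
print(len(valid), len(quarantined))  # 1 valid row, 2 quarantined with reasons
```

The key design point: quarantined rows keep all their original columns plus a machine-readable failure reason, so they can be inspected, fixed, and replayed.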
YAML Contracts: DQM as Code
A data contract is a declarative definition of what "valid data" means for a dataset. It's a YAML file that lives in your Git repo, gets reviewed in pull requests, and runs in your pipeline — exactly where the data is:
```yaml
info:
  version: 1
  name: "orders_silver"
  domain: "crm"
  system: "shopify"
  owner: "crm-engineering@company.com"

model:
  fields:
    - name: order_id
      type: string
      required: true
    - name: amount
      type: float
    - name: status
      type: string

quality:
  row_rules:
    - sql: "amount >= 0"
    - accepted_values:
        field: status
        values: ["pending", "shipped", "delivered", "returned"]
  dataset_rules:
    - unique: order_id
    - null_ratio:
        field: status
        max: 0.05

quarantine:
  enabled: true
  include_error_reason: true

lineage:
  enabled: true
```
This single YAML file replaces what would take a GUI wizard, a proprietary rule engine, and a separate quarantine infrastructure in a traditional DQM platform.
Platform Tax vs. YAML Contracts: Side by Side
| Dimension | Enterprise DQM Platform | YAML Data Contracts |
|---|---|---|
| Cost | $60K–500K/year licensing | $0 — MIT open source |
| Rule authoring | Proprietary GUI or DSL | YAML + standard SQL in your IDE |
| Version control | Export/import or API workaround | Native Git — PR reviews, branching, blame |
| CI/CD integration | Requires API orchestration | `lakelogic run` in any CI pipeline |
| Engine support | Vendor's runtime only | Polars, Spark, DuckDB, Pandas, Snowflake, BigQuery |
| Quarantine | Log-based or report-based | Row-level isolation with per-row error reasons |
| Lineage | Separate lineage tool required | Built-in: run ID, source path, timestamp, domain |
| Vendor migration | 6–12 month project | N/A — you own the YAML |
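The CI/CD row deserves a concrete illustration. Because contracts are files in the repo and the runner is a CLI, wiring validation into CI is a few lines of workflow config. This is a hypothetical sketch assuming GitHub Actions; the `lakelogic run` invocation matches the quickstart shown later in this article.

```yaml
# .github/workflows/data-quality.yml -- hypothetical CI wiring (GitHub
# Actions assumed); contract and data paths are illustrative.
name: data-quality
on: [pull_request]

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install lakelogic
      - run: lakelogic run --contract contracts/orders.yaml --source data/orders.parquet
```

A failing contract fails the pull request — quality rules get the same gate as application code.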
What About Governance Frameworks?
A common objection: "We need a governance framework, not just validation." Fair point. Data governance encompasses data cataloging, access control, policy management, and compliance reporting. LakeLogic doesn't replace your catalog or your IAM layer.
But here's what most teams discover: the enforcement layer of their governance framework — the part that actually validates data at runtime — is the part that's most expensive and most locked-in. That's the part YAML contracts replace.
- Cataloging — keep your existing catalog (Unity Catalog, DataHub, Collibra). LakeLogic's `info.domain` and `info.system` metadata fields integrate with any catalog via the contract YAML.
- Access control — keep your IAM. Contracts define what valid data looks like, not who can see it.
- Quality enforcement — this is what contracts replace. SQL rules, quarantine, lineage — all declarative, all version-controlled, all free.
Data Lineage Without a Separate Tool
Enterprise lineage tools (Atlan, Alation, DataHub) solve discovery and visualization. But the raw lineage data — "where did this row come from?" — should be injected at compute time, not scraped after the fact.
With `lineage.enabled: true` in your contract, every output row automatically gets lineage columns:

- `_lakelogic_run_id` — unique run identifier
- `_lakelogic_source` — source file or table path
- `_lakelogic_processed_at` — processing timestamp
- `_lakelogic_domain` — domain from contract metadata
- `_lakelogic_system` — source system from contract metadata
Your catalog tool can read these columns to build lineage graphs. No scraping. No API integration. The lineage is in the data.
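Because lineage lives in the data itself, tracing a row back to its pipeline run is an ordinary lookup or group-by — no lineage API required. A small Python sketch, using the injected column names from above (the row values are made up for illustration):

```python
# Sketch: lineage-in-the-data means "which run produced this row?" is a
# plain query over ordinary columns. Column names come from the contract's
# lineage feature; the rows here are fabricated examples.
from collections import defaultdict

rows = [
    {"order_id": "a1", "_lakelogic_run_id": "run-042",
     "_lakelogic_source": "s3://raw/orders/2024-06-01.parquet"},
    {"order_id": "a2", "_lakelogic_run_id": "run-042",
     "_lakelogic_source": "s3://raw/orders/2024-06-01.parquet"},
    {"order_id": "b7", "_lakelogic_run_id": "run-043",
     "_lakelogic_source": "s3://raw/orders/2024-06-02.parquet"},
]

# Group order IDs by the run that produced them.
by_run = defaultdict(list)
for row in rows:
    by_run[row["_lakelogic_run_id"]].append(row["order_id"])

# Which run produced order b7, and from which source file?
producer = next(r for r in rows if r["order_id"] == "b7")
print(producer["_lakelogic_run_id"], producer["_lakelogic_source"])
# run-043 s3://raw/orders/2024-06-02.parquet
```

The same query works as SQL in any engine that can read the output table, which is exactly how a catalog tool would build its lineage graph from these columns.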
Getting Started: Replace One Pipeline
You don't need to rip out your existing DQM platform overnight. Start with one pipeline:
```bash
# Install (no license key, no signup, no infrastructure)
pip install lakelogic

# Auto-generate a contract from your existing data
lakelogic bootstrap --source data/orders.parquet --output contracts/ \
  --suggest-rules

# Run it
lakelogic run --contract contracts/orders.yaml --source data/orders.parquet

# See what failed and why
lakelogic run --contract contracts/orders.yaml --source data/orders.parquet \
  --show-quarantine
```
If the contract catches the same issues your DQM platform catches — and it will —
you've just replaced a six-figure license with a pip install.
Your Quality Rules. Version-Controlled. Free.
Stop paying platform tax for data quality. Define your rules in YAML, run them on any engine, review them in Git.