What Is the Platform Tax?

If you've evaluated enterprise data quality tools — Informatica Data Quality, Talend, Ataccama, Collibra — you've seen the pattern. They solve real problems, but they come with hidden costs that compound over time:

💳 Per-Seat Licensing

Every data engineer who needs to write or modify a quality rule needs a license. At $500–2,000/user/month, a team of 10 engineers costs $60K–240K/year before you write a single rule.

🔒 Proprietary Rule Format

Your quality rules live in a vendor-specific format — a GUI, a custom DSL, or a database. You can't version-control them. You can't code-review them. You can't run them in CI/CD.

๐Ÿ—๏ธ Dedicated Infrastructure

Most platforms require their own servers, agents, or cloud instances. Your data leaves your pipeline to be validated externally, then returns. More hops, more latency, more blast radius.

🔄 Vendor Lock-in

After 2 years and 500 rules, migration is effectively impossible. The vendor knows this. Your renewal price reflects it.

This is the platform tax: the ongoing cost of using a tool that solves a code problem with infrastructure. Data quality management doesn't need its own platform. It needs the right abstraction.

The best data quality tool is the one your data engineers already control — code in version control, validated at compute time, with zero licensing cost.

What You Actually Need from DQM

Strip away the marketing and enterprise sales deck. Data quality management for a modern data team comes down to five capabilities:

  1. Schema enforcement. Types, nullability, field presence — validated before data reaches downstream consumers.
  2. Row-level quality rules. Business logic like "amount must be non-negative" or "status must be a valid enum" — applied to every row, not sampled.
  3. Quarantine. Bad rows isolated with exact failure reasons, not dropped silently or mixed into production tables.
  4. Lineage. Know where every row came from, when it was processed, and which pipeline run produced it.
  5. Engine portability. Rules that work on Polars, Spark, DuckDB, or Snowflake — not rules locked to one vendor's runtime.

Every one of these is expressible in YAML + SQL. No proprietary platform required.
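To make that claim concrete, here is a minimal sketch of capabilities 2 and 3 — row-level rules and quarantine — expressed as ordinary SQL predicates. This is not lakelogic's implementation; stdlib sqlite3 stands in for whatever engine you run (Polars, DuckDB, Spark), and the rule names and column layout are illustrative:

```python
# Sketch only: row rules and quarantine as plain SQL, on stdlib sqlite3.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_id TEXT, amount REAL, status TEXT)")
con.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [("o-1", 19.99, "shipped"), ("o-2", -5.0, "pending"), ("o-3", 42.5, "unknown")],
)

# Each quality rule is an ordinary SQL predicate -- nothing proprietary.
rules = {
    "amount_non_negative": "amount >= 0",
    "status_accepted": "status IN ('pending','shipped','delivered','returned')",
}
all_pass = " AND ".join(f"({p})" for p in rules.values())
reason = "CASE " + " ".join(
    f"WHEN NOT ({p}) THEN '{name}'" for name, p in rules.items()
) + " END"

# Valid rows flow downstream; failing rows are isolated with a reason.
valid = con.execute(f"SELECT * FROM orders WHERE {all_pass}").fetchall()
bad = con.execute(
    f"SELECT order_id, {reason} AS error_reason FROM orders WHERE NOT ({all_pass})"
).fetchall()

print(valid)  # [('o-1', 19.99, 'shipped')]
print(bad)    # [('o-2', 'amount_non_negative'), ('o-3', 'status_accepted')]
```

Every row is checked, not sampled, and each quarantined row carries the exact rule it failed.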

YAML Contracts: DQM as Code

A data contract is a declarative definition of what "valid data" means for a dataset. It's a YAML file that lives in your Git repo, gets reviewed in pull requests, and runs in your pipeline — exactly where the data is:

contracts/orders.yaml
info:
  version: 1
  name: "orders_silver"
  domain: "crm"
  system: "shopify"
  owner: "crm-engineering@company.com"

model:
  fields:
    - name: order_id
      type: string
      required: true
    - name: amount
      type: float
    - name: status
      type: string

quality:
  row_rules:
    - sql: "amount >= 0"
    - accepted_values:
        field: status
        values: ["pending", "shipped", "delivered", "returned"]
  dataset_rules:
    - unique: order_id
    - null_ratio:
        field: status
        max: 0.05

quarantine:
  enabled: true
  include_error_reason: true

lineage:
  enabled: true

This single YAML file replaces what would take a GUI wizard, a proprietary rule engine, and a separate quarantine infrastructure in a traditional DQM platform.
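The dataset_rules in the contract are equally unexotic. A toy sketch — not lakelogic's engine — showing that `unique: order_id` and the null_ratio cap are a few lines of ordinary code once the YAML is parsed:

```python
# Sketch only: evaluating the contract's dataset_rules on in-memory rows.
rows = [
    {"order_id": "o-1", "status": "shipped"},
    {"order_id": "o-2", "status": None},
    {"order_id": "o-2", "status": "pending"},  # duplicate key
]

# unique: order_id
ids = [r["order_id"] for r in rows]
unique_ok = len(ids) == len(set(ids))

# null_ratio on status, max 0.05
null_ratio = sum(r["status"] is None for r in rows) / len(rows)
null_ok = null_ratio <= 0.05

print(unique_ok, null_ok)  # False False  (duplicate o-2; 1/3 nulls > 5%)
```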

Platform Tax vs. YAML Contracts: Side by Side

| Dimension | Enterprise DQM Platform | YAML Data Contracts |
| --- | --- | --- |
| Cost | $60K–500K/year licensing | $0 — MIT open source |
| Rule authoring | Proprietary GUI or DSL | YAML + standard SQL in your IDE |
| Version control | Export/import or API workaround | Native Git — PR reviews, branching, blame |
| CI/CD integration | Requires API orchestration | lakelogic run in any CI pipeline |
| Engine support | Vendor's runtime only | Polars, Spark, DuckDB, Pandas, Snowflake, BigQuery |
| Quarantine | Log-based or report-based | Row-level isolation with per-row error reasons |
| Lineage | Separate lineage tool required | Built-in: run ID, source path, timestamp, domain |
| Vendor migration | 6–12 month project | N/A — you own the YAML |
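The CI/CD point is worth making concrete. A hedged sketch, assuming GitHub Actions — any runner that can pip install works identically — gating pull requests on the contract:

```yaml
# Sketch: validate the contract on every PR (GitHub Actions assumed).
name: data-quality
on: [pull_request]
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install lakelogic
      - run: lakelogic run --contract contracts/orders.yaml --source data/orders.parquet
```

Because the rules are files in the repo, changing a rule and gating on it is one pull request, not an API orchestration project.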

What About Governance Frameworks?

A common objection: "We need a governance framework, not just validation." Fair point. Data governance encompasses data cataloging, access control, policy management, and compliance reporting. LakeLogic doesn't replace your catalog or your IAM layer.

But here's what most teams discover: the enforcement layer of their governance framework — the part that actually validates data at runtime — is the part that's most expensive and most locked-in. That's the part YAML contracts replace.

Data Lineage Without a Separate Tool

Enterprise lineage tools (Atlan, Alation, DataHub) solve discovery and visualization. But the raw lineage data — "where did this row come from?" — should be injected at compute time, not scraped after the fact.

With lineage.enabled: true in your contract, every output row automatically gets lineage columns:

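The idea is easy to sketch. The column names below are illustrative, not lakelogic's actual schema — the point is that a run ID, source path, timestamp, and domain are attached while the rows are being computed:

```python
# Illustrative only: injecting lineage columns at compute time.
import uuid
from datetime import datetime, timezone

def with_lineage(rows, source_path, domain):
    """Attach run ID, source path, timestamp, and domain to every row."""
    run_id = str(uuid.uuid4())
    ts = datetime.now(timezone.utc).isoformat()
    return [
        {**row, "_run_id": run_id, "_source_path": source_path,
         "_processed_at": ts, "_domain": domain}
        for row in rows
    ]

out = with_lineage([{"order_id": "o-1", "amount": 19.99}],
                   "data/orders.parquet", "crm")
print(sorted(out[0]))  # every output row now carries its lineage columns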
Your catalog tool can read these columns to build lineage graphs. No scraping. No API integration. The lineage is in the data.

Getting Started: Replace One Pipeline

You don't need to rip out your existing DQM platform overnight. Start with one pipeline:

terminal
# Install (no license key, no signup, no infrastructure)
pip install lakelogic

# Auto-generate a contract from your existing data
lakelogic bootstrap --source data/orders.parquet --output contracts/ \
    --suggest-rules

# Run it
lakelogic run --contract contracts/orders.yaml --source data/orders.parquet

# See what failed and why
lakelogic run --contract contracts/orders.yaml --source data/orders.parquet \
    --show-quarantine

If the contract catches the same issues your DQM platform catches — and it will — you've just replaced a six-figure license with a pip install.

Your Quality Rules. Version-Controlled. Free.

Stop paying platform tax for data quality. Define your rules in YAML, run them on any engine, review them in Git.