Your Data Pipeline Is Lying to You

data-engineering · observability · quality · reliability

The most expensive pipeline failures are usually the quiet ones. Jobs complete, dashboards load, and nobody notices the problem until decisions have already been made from bad data. By then, the issue is not just technical. It is organizational trust debt.

This happens when teams treat execution success as a proxy for data correctness.

A green orchestrator run means the code executed. It does not mean the output is true.

Execution health versus data truth

These are related but different signals, and conflating them is where many teams get hurt.

Signal category   | What it confirms                                        | What it does not confirm
Execution health  | Jobs ran, tasks completed, retries converged            | Correctness of business logic and data semantics
Data truth        | Freshness, completeness, consistency, contract validity | Whether infrastructure remained fully available

You need both. Monitoring only one creates blind spots that surface later as business incidents.
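That dual requirement is easy to encode. A minimal sketch (names here are illustrative, not from any particular framework): a run is healthy only if execution succeeded and every data-truth check passed.

```python
def pipeline_healthy(execution_ok: bool, data_checks: dict) -> bool:
    """Both signal categories must pass: a green run with a failed
    freshness or completeness check is still an incident."""
    return execution_ok and all(data_checks.values())

# A green run with a failed completeness check is not healthy.
print(pipeline_healthy(True, {"freshness": True, "completeness": False}))
```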

Why "green" pipelines still produce bad output

Most production issues in this category are predictable: silent row drops from schema mismatch, merge-key collisions, null expansion in critical dimensions, and stale upstream extracts that still pass structural checks. None of these necessarily fail an orchestrator task.
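The "silent row drop" case is worth seeing concretely. A minimal sketch with hypothetical data: an upstream schema drift changes one key's type, an inner join quietly discards the row, and nothing raises an error.

```python
# Hypothetical order rows; one key arrives as an int after upstream drift.
orders = [
    {"order_id": "1001", "amount": 50},
    {"order_id": "1002", "amount": 75},
    {"order_id": 1003, "amount": 20},  # int key, no longer a string
]
customers = {"1001": "alice", "1002": "bob", "1003": "carol"}

# Inner join on order_id: the int-typed key never matches the string keys,
# so that row vanishes from the output without any exception.
joined = [
    {**o, "customer": customers[o["order_id"]]}
    for o in orders
    if o["order_id"] in customers
]

print(len(orders), len(joined))  # 3 rows in, 2 rows out, job still "green"
```

The job completes, the orchestrator shows green, and one order is simply gone.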

That is why runtime status dashboards can look healthy while downstream metrics drift.

A practical quality gate example

Quality checks need to run as first-class deployment gates, not optional observability add-ons.

-- Fail publish if daily order volume drops unexpectedly
WITH baseline AS (
  SELECT AVG(order_count) AS avg_orders
  FROM quality_daily_orders
  WHERE ds BETWEEN DATE_SUB(CURRENT_DATE, 14) AND DATE_SUB(CURRENT_DATE, 1)
),
today AS (
  SELECT COUNT(*) AS order_count
  FROM silver_orders
  WHERE ds = CURRENT_DATE
)
SELECT
  CASE
    WHEN today.order_count < (baseline.avg_orders * 0.65) THEN 'FAIL'
    ELSE 'PASS'
  END AS quality_status,
  today.order_count,
  baseline.avg_orders
FROM today, baseline;

The exact threshold varies by domain, but the pattern is consistent: publish decisions should be conditioned on quality outcomes, not only job completion.
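Wired into an orchestrator, the gate reduces to a small function. A sketch under the same 0.65 threshold (function names are illustrative; the real publish step depends on your platform):

```python
def volume_guardrail(order_count: int, baseline_avg: float,
                     floor_ratio: float = 0.65) -> str:
    """Return 'FAIL' when today's volume drops below the baseline floor."""
    return "FAIL" if order_count < baseline_avg * floor_ratio else "PASS"

def publish_if_healthy(order_count: int, baseline_avg: float) -> bool:
    """Condition the publish decision on the quality outcome."""
    if volume_guardrail(order_count, baseline_avg) == "FAIL":
        # Hold the publish; consumers keep the last good snapshot.
        return False
    return True  # the hypothetical publish step would run here

print(publish_if_healthy(4182, 10662.0))  # False: volume below the floor
print(publish_if_healthy(9800, 10662.0))  # True: above the floor
```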

Triage is fast when the payload is structured

When quality fails, the incident payload should already include enough context to route work and reduce guesswork.

{
  "run_id": "orders-2025-08-21-02",
  "dataset": "gold_orders_daily",
  "quality_check": "volume_drop_guardrail",
  "status": "FAIL",
  "observed": 4182,
  "expected_floor": 6930,
  "upstream_candidate": "silver_orders_parse_errors_spike",
  "recommended_action": "hold_publish_and_backfill_after_parse_fix"
}

This kind of payload turns incident handling into a directed workflow instead of an open-ended investigation.
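Building that payload at check time is cheap. A sketch (field names match the example above; the routing hints would come from your own lineage and runbook metadata):

```python
import json

def build_incident_payload(run_id: str, dataset: str, check: str,
                           observed: int, expected_floor: int,
                           upstream_candidate: str,
                           recommended_action: str) -> dict:
    """Assemble a routable incident payload from quality-check output."""
    return {
        "run_id": run_id,
        "dataset": dataset,
        "quality_check": check,
        "status": "FAIL" if observed < expected_floor else "PASS",
        "observed": observed,
        "expected_floor": expected_floor,
        "upstream_candidate": upstream_candidate,
        "recommended_action": recommended_action,
    }

payload = build_incident_payload(
    "orders-2025-08-21-02", "gold_orders_daily", "volume_drop_guardrail",
    4182, 6930, "silver_orders_parse_errors_spike",
    "hold_publish_and_backfill_after_parse_fix")
print(json.dumps(payload, indent=2))
```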

Where observability should go deeper

If you want faster root-cause detection, instrument data movement and transformation semantics, not only task states. Stage-level row transitions, rejection reason distributions, schema change deltas, and lineage from source object to serving table usually provide the fastest path to isolation.

Lineage in particular is operationally critical. It tells you exactly where bad data entered, where it was amplified, and which downstream assets need rollback or backfill.
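Stage-level row transitions make that isolation mechanical. A sketch with hypothetical stage names: record rows in and out per stage, then point triage at the first stage whose loss exceeds a tolerance.

```python
# (stage_name, rows_in, rows_out) per transformation stage
stages = [
    ("bronze_ingest",  10000, 10000),
    ("silver_parse",   10000,  9940),  # rejections begin here
    ("silver_dedupe",   9940,  9938),
    ("gold_aggregate",  9938,  9938),
]

def first_lossy_stage(transitions, tolerance=0.001):
    """Return the first stage whose row loss exceeds the tolerance ratio."""
    for name, rows_in, rows_out in transitions:
        if rows_in and (rows_in - rows_out) / rows_in > tolerance:
            return name
    return None

print(first_lossy_stage(stages))  # silver_parse
```

Combined with lineage, the same record answers the follow-up question: every gold table downstream of silver_parse is a backfill candidate.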

A rollout strategy that sticks

The most durable pattern is incremental. Start with one revenue- or operations-critical pipeline, add a small set of high-signal checks, wire them into publish controls, and attach runbooks with explicit ownership. Once that loop works reliably, expand to adjacent domains.

Broad mandates without enforcement often create documentation growth without reliability improvement.

Final note

Pipelines become trustworthy when they can prove output correctness under change. Runtime reliability still matters, but semantic reliability is the signal business teams actually care about. Treating those as separate concerns is what keeps data platforms credible at scale.

Contact

Questions, feedback, or project ideas? I read every message.