System Monitors

Revefi automatically creates system monitors for your datasets. This page describes each of those monitors and what it detects.

System-level monitors are auto-created for discovered assets; no configuration is required. Revefi periodically recalibrates thresholds to maintain accuracy.

Table Monitors

Supported data sources: Snowflake, BigQuery, Databricks, Redshift

Monitors are created automatically for discovered tables. You can exclude specific tables from automatic monitoring. The following monitors are created by default:

Total Row Count

Detects anomalies in a table’s overall size—flagging unexpected flatlines, drops, or spikes in total rows compared to learned historical patterns.

  • What it catches: unexpected deletions, stalled loads, duplicate/backfill inserts, or abnormal surges.
  • How it works: thresholds are auto-tuned from recent behavior; alerts trigger on significant deviation from the expected row count trend.

Detection Scenarios

  • A nightly ingest for fact_sales normally adds ~1–2M rows. One run adds 0 rows → the monitor alerts, indicating an upstream job failure or missing partition.
  • A backfill script reruns without proper idempotency and doubles events table rows → the monitor flags the spike before downstream metrics are distorted.
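The learned-baseline logic described above can be sketched as a simple deviation check on daily row-count deltas. This is a minimal illustration of the technique, not Revefi's actual algorithm; the function name, k-factor, and numbers are hypothetical.

```python
from statistics import mean, stdev

# Illustrative learned-baseline check: flag a new total row count whose
# daily delta deviates sharply from the recent trend.
def flag_row_count(history, latest, k=3.0):
    """Return True if `latest` breaks the learned row-count trend."""
    deltas = [b - a for a, b in zip(history, history[1:])]
    mu, sigma = mean(deltas), stdev(deltas)
    newest_delta = latest - history[-1]
    # Floor sigma so a perfectly flat history still yields a usable band.
    return abs(newest_delta - mu) > k * max(sigma, 1)

counts = [100_000_000, 101_500_000, 103_100_000, 104_500_000,
          106_100_000, 107_600_000]              # ~1.5M rows/night
flag_row_count(counts, counts[-1] + 1_500_000)   # normal growth -> False
flag_row_count(counts, counts[-1])               # 0-row load -> True
```

Because the check is on the delta rather than the absolute count, it catches flatlines (delta near zero) and doubled backfills (delta near 2x baseline) with the same rule.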

Incremental Row Count

Tracks row-level write activity (inserts, updates) between intervals, surfacing abnormally low or high activity. Note: Incremental Row Count can differ from the change in total rows for the same period (e.g., due to compaction, deduplication, late-arriving data, or CDC merges).

  • What it catches: partial or duplicate loads, missing partitions, unexpected delete-heavy runs, silent truncations, and runaway backfills.
  • How it works: Learns typical increments over recent history and auto-tunes thresholds; flags negative or zero increments when growth is expected, and unusually large spikes relative to baseline.

Detection Scenarios

  • A daily ingest for events typically adds +50–60M rows. One run reports –10M incremental rows → the monitor flags a likely truncation or bulk delete.
  • An upsert pipeline reprocesses day D twice → incremental shows +120M versus the usual +55M → the monitor flags a duplicate write before downstream costs and metrics balloon.
  • Late-arriving data replaces stale records (delete+insert) → total row count remains flat, but incremental activity spikes → the monitor surfaces the anomaly that a total-row check would miss.
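The third scenario hinges on gross write activity diverging from the net change in total rows. A sketch of that contrast, with hypothetical names and thresholds (not Revefi's API):

```python
# Compare gross write activity to the net change in total rows; the
# 50% tolerance and 1% "flat" band are illustrative stand-ins for
# auto-tuned thresholds.
def classify_activity(total_before, total_after, rows_written, rows_deleted,
                      expected_writes, tolerance=0.5):
    net_change = total_after - total_before
    gross = rows_written + rows_deleted
    # Heavy writes but a nearly flat total: likely delete+insert churn
    # (late-arriving replacements, CDC merges, duplicate reruns).
    if gross > expected_writes * (1 + tolerance) and abs(net_change) < 0.01 * total_before:
        return "churn-spike"
    if rows_written < expected_writes * (1 - tolerance):
        return "under-load"
    return "normal"

# Late data replaces ~900k stale rows: total barely moves, activity doubles.
classify_activity(1_000_000, 1_000_500, 900_000, 899_500,
                  expected_writes=500_000)   # -> "churn-spike"
```

A monitor that only watched total rows would return "normal" here, which is exactly the gap Incremental Row Count closes.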

Total Load Bytes Processed

Monitors the total bytes scanned/processed during data load operations—highlighting unexpected increases or drops relative to learned baselines.

  • What it catches: cost spikes, sudden data volume changes, inefficient transformations, and performance regressions (e.g., lost pruning or compression).
  • How it works: Learns typical byte ranges from recent history and auto-tunes thresholds; flags outliers (both high and low) against expected patterns.

Detection Scenarios

  • A daily load for fact_sales usually processes 120–150 GB. Today it processes 420 GB → the monitor alerts, indicating a query plan change (e.g., partition pruning disabled) or an uncompressed source file drop.
  • A pipeline normally processes 80 GB, but a code change reduces it to 5 GB → the monitor flags an abnormal drop that could mean missing partitions or filtered-out data, preventing silent data loss despite a “successful” run.
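A two-sided band around the typical volume captures both scenarios above. The 0.5x-2x band below is a hypothetical stand-in for auto-tuned thresholds:

```python
from statistics import median

# Illustrative band check on bytes processed per load.
def bytes_outlier(history_gb, latest_gb, low=0.5, high=2.0):
    typical = median(history_gb)
    # Both directions matter: a spike can mean lost pruning or an
    # uncompressed source drop; a collapse can mean silently missing data.
    return latest_gb < low * typical or latest_gb > high * typical

daily_gb = [120, 135, 150, 140, 125]
bytes_outlier(daily_gb, 420)   # pruning lost -> True
bytes_outlier(daily_gb, 5)     # missing partitions -> True
bytes_outlier(daily_gb, 132)   # within band -> False
```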

Daily Load Count

Monitors the number of write operations (e.g., INSERT, MERGE, UPSERT, and similar) that load data into a table each day.

  • What it catches: missing scheduled jobs, skipped ETL runs, unexpected duplicate loads, off-schedule reruns, and runaway backfills.
  • How it works: Learns the expected load cadence (e.g., 1× daily, 24× hourly) and auto-tunes thresholds; flags zero when loads are expected, and flags spikes when counts exceed the baseline.

Detection Scenarios

  • fact_orders is expected to load 24 times/day (hourly). Today shows 0 loads → the monitor alerts, indicating a failed scheduler or upstream outage.
  • events usually loads once per day. Today shows 3 loads → the monitor flags unexpected duplicate/off-schedule runs that could inflate downstream metrics or costs.
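A cadence check like the one described can be sketched by counting loads per calendar day and flagging both missing days and spikes. The function name and the 2x spike threshold are illustrative:

```python
from collections import Counter
from datetime import date, datetime, timedelta

# Sketch of a load-cadence check over a date range.
def check_load_cadence(load_times, start, end, expected_per_day):
    per_day = Counter(t.date() for t in load_times)
    flagged = {}
    day = start
    while day <= end:
        n = per_day.get(day, 0)
        if n == 0 or n > 2 * expected_per_day:   # missing day or spike
            flagged[day.isoformat()] = n
        day += timedelta(days=1)
    return flagged

loads = [datetime(2024, 1, 1, 2), datetime(2024, 1, 2, 2),
         datetime(2024, 1, 2, 5), datetime(2024, 1, 2, 9)]
check_load_cadence(loads, date(2024, 1, 1), date(2024, 1, 3),
                   expected_per_day=1)
# -> {"2024-01-02": 3, "2024-01-03": 0}
```

Walking the full date range (rather than only the days that appear in the log) is what makes a silently skipped day visible as an explicit zero.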

Freshness

Monitors the timeliness of a table by tracking how long it has been since the last successful update/load and comparing that latency to the expected cadence or SLA.

  • What it catches: upstream pipeline failures, paused/failed schedulers, stuck streams, source outages, and broken data feeds before they affect downstream users.
  • How it works: Learns typical update intervals and auto-tunes thresholds. Enabled for tables with regular updates; not applied to static/reference tables or to continuously updated tables, where freshness does not meaningfully apply.

Detection Scenarios

  • fact_sales normally finishes the 02:00 UTC daily load by 02:15. At 10:00, no update is recorded → the monitor flags a freshness breach so downstream dashboards don’t publish stale figures.

Pipeline Monitors

Supported data sources: Composer

Monitors are created automatically for discovered pipelines. The following monitors are created by default:

Execution Time

Monitors the duration of DAG runs and/or critical tasks—flagging abnormal slowdowns or speedups relative to learned baselines.

  • What it catches: performance regressions, stuck/long-running tasks, retries that inflate runtime, inefficient query plans, upstream/backfill pressure, and resource contention that can lead to SLA misses.
  • How it works: Learns typical runtimes by schedule/day-of-week and auto-tunes thresholds; optionally checks against explicit Airflow SLAs. Flags outliers in both directions (unusually long or suspiciously short runs).

Detection Scenarios

  • A nightly etl_sales_dag usually completes in 18–22 minutes. Today it takes 58 minutes → the monitor alerts, indicating a possible BigQuery scan regression or downstream API latency that may miss the publishing SLA.
  • events_backfill_dag typically runs in ~4 minutes. A run completes in 20 seconds → the monitor flags an abnormally short duration suggesting tasks were skipped (e.g., empty partition or misconfigured filter), preventing silent under-processing.
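Both scenarios fall out of a two-sided deviation check on run duration. This sketch uses median absolute deviation (a robust spread measure); the k-factor and floor are illustrative, not Revefi's tuned values:

```python
from statistics import median

# Robust runtime check using median absolute deviation (MAD).
def runtime_outlier(durations_min, latest_min, k=5.0):
    med = median(durations_min)
    mad = median(abs(d - med) for d in durations_min)
    # Flag both directions: slow runs risk SLA misses, suspiciously
    # fast runs suggest skipped work.
    return abs(latest_min - med) > k * max(mad, 0.5)

runs = [18, 19, 20, 21, 22, 20, 19]   # minutes
runtime_outlier(runs, 58)             # slowdown -> True
runtime_outlier(runs, 0.3)            # ~20-second run -> True
runtime_outlier(runs, 21)             # within band -> False
```

MAD is preferred over standard deviation here because one earlier stuck run would otherwise inflate the band and mask the next slowdown.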

Queue Time

Monitors the delay between when a DAG run or task is scheduled/queued and when it actually starts executing in Airflow—surfacing abnormal scheduler or worker backlog.

  • What it catches: scheduler bottlenecks, insufficient worker slots, over-subscribed pools/queues, GKE/worker cold starts, and upstream bursts that cause SLA risk even before execution begins.
  • How it works: Learns typical queue times by hour/day and auto-tunes thresholds; optionally checks against explicit SLAs or max queue targets. Flags both prolonged delays and unusually short queue times that may indicate mis-scheduling or skipped concurrency controls.

Detection Scenarios

  • During the morning peak, hourly_events_dag tasks wait 12–15 minutes in the queued state (baseline <1 min) → the monitor alerts on a worker shortage or pool limit, allowing you to raise concurrency or scale Composer before SLAs are missed.
  • An off-peak run shows 0s queue time when the pool should limit concurrency → the monitor flags a configuration drift (e.g., task not assigned to the intended pool/queue), preventing unexpected cluster pressure.
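Queue time is just the gap between the queued and start timestamps, classified against a learned baseline. The thresholds below (10x baseline for backlog, 5% of baseline for suspiciously fast) are hypothetical:

```python
from datetime import datetime

# Sketch of a queue-delay classifier from queued/start timestamps.
def classify_queue_delay(queued_at, started_at, baseline_s,
                         high=10.0, low=0.05):
    delay = (started_at - queued_at).total_seconds()
    if delay > high * baseline_s:
        return "backlog"              # worker shortage / pool limit
    if delay < low * baseline_s:
        return "suspiciously-fast"    # possible pool/queue misassignment
    return "normal"

classify_queue_delay(datetime(2024, 1, 1, 8, 0),
                     datetime(2024, 1, 1, 8, 13),
                     baseline_s=60)   # 13 min queued -> "backlog"
```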

Total Run Count

Monitors how many Airflow DAG runs occur in a given interval (e.g., per hour or per day) and compares the count to the expected schedule/cadence.

  • What it catches: missing/silent skips, paused/failed schedulers, unexpected duplicate/off-schedule triggers, runaway reruns/retries, and excessive backfill activity that may strain resources.
  • How it works: Learns the expected cadence (e.g., 1× daily, 24× hourly) and auto-tunes thresholds; flags zero/low counts when runs are expected and spikes when counts exceed baseline. Supports filters by DAG, owner, queue, or tag.

Detection Scenarios

  • hourly_events_dag should run 24×/day. Today shows 0 runs → the monitor alerts, indicating a paused DAG or scheduler outage before downstream data goes stale.
  • fact_sales_daily normally runs once/day. Today shows 3 runs (one manual, two retries) → the monitor flags an abnormal spike so you can investigate duplicate triggers or unstable tasks.

Failed Run Count

Monitors the number of failed Airflow DAG runs within a given interval (e.g., hourly, daily) and compares it to the expected baseline or SLO.

  • What it catches: rising failure rates, bursty/intermittent failures, systemic outages (e.g., credentials/quotas), schema or dependency changes, and bad deploys causing coordinated run failures.
  • How it works: Learns typical failure counts/rates and auto-tunes thresholds; flags spikes above baseline and sustained elevation over rolling periods. Can pair with alerts on error classes (e.g., AirflowException, HttpError, BigQuery job errors).

Detection Scenarios

  • fact_sales_daily normally has 0 failed runs. Today shows 3 failed runs after a code release → the monitor alerts, pointing to a likely breaking change (e.g., column rename or dependency upgrade).
  • hourly_events_dag averages ≤1 failed run/day. The last 6 hours show 5 failures clustered around rate-limit errors → the monitor flags a systemic issue (API quota/credentials) before SLAs are missed.
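The "sustained elevation over rolling periods" idea can be sketched as a failure budget over a sliding window of run outcomes. Window size and budget below are illustrative:

```python
from collections import deque

# Rolling-window failure budget over a sequence of run outcomes.
def failure_spike(outcomes, window=24, max_failures=2):
    """Return the index where failures in the rolling window first
    exceed the budget, or None if they never do."""
    recent = deque(maxlen=window)
    for i, failed in enumerate(outcomes):
        recent.append(failed)
        if sum(recent) > max_failures:
            return i
    return None

runs = [False] * 20 + [True, False, True, True]   # failures cluster late
failure_spike(runs)   # -> 23
```

A clustered burst trips the budget even when the all-time failure rate still looks acceptable, which is how intermittent rate-limit errors get surfaced early.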

What’s Next

Create custom monitors for your specific business use case.