System Monitors
Revefi automatically creates system monitors for datasets. This page describes those monitors.
System-level monitors are auto-created for discovered assets; no configuration is required. Revefi periodically recalibrates thresholds to maintain accuracy.
Table Monitors
Supported data sources:
Monitors are created automatically for discovered tables. You can exclude specific tables from automatic monitoring. The following monitors are created by default:
Total Row Count
Detects anomalies in a table’s overall size—flagging unexpected flatlines, drops, or spikes in total rows compared to learned historical patterns.
- What it catches: unexpected deletions, stalled loads, duplicate/backfill inserts, or abnormal surges.
- How it works: thresholds are auto-tuned from recent behavior; alerts trigger on significant deviation from the expected row count trend.
Detection Scenarios
- A nightly ingest for `fact_sales` normally adds ~1–2M rows. One run adds 0 rows → the monitor alerts, indicating an upstream job failure or missing partition.
- A backfill script reruns without proper idempotency and doubles `events` table rows → the monitor flags the spike before downstream metrics are distorted.
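The auto-tuned thresholds described above can be approximated with a simple deviation check against recent history. This is a minimal sketch, not Revefi's actual algorithm; the function name and the `k=3` multiplier are hypothetical:

```python
from statistics import mean, stdev

def row_count_anomaly(history, latest, k=3.0):
    """Flag `latest` total row count if it deviates more than k standard
    deviations from the mean of recent observations (a stand-in for the
    auto-tuned bands described above)."""
    mu = mean(history)
    sigma = stdev(history) or 1.0  # guard against zero-variance history
    return abs(latest - mu) > k * sigma

# A table that normally holds ~100M rows is suddenly halved:
history = [100_000_000, 101_500_000, 99_800_000, 102_000_000, 100_700_000]
print(row_count_anomaly(history, 50_000_000))   # flagged
print(row_count_anomaly(history, 101_000_000))  # within normal range
```

A production system would also model trend and seasonality rather than a flat mean, which is why the real monitor recalibrates continuously.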
Incremental Row Count
Tracks row-level write activity (adds, updates) between intervals—surfacing abnormal low/high activity. Note: Incremental Row Count can differ from the change in total rows for the same period (e.g., due to compaction, deduplication, late-arriving data, or CDC merges).
- What it catches: partial or duplicate loads, missing partitions, unexpected delete-heavy runs, silent truncations, and runaway backfills.
- How it works: Learns typical increments over recent history and auto-tunes thresholds; flags negative or zero increments when growth is expected, and unusually large spikes relative to baseline.
Detection Scenarios
- A daily ingest for `events` typically adds +50–60M rows. One run reports –10M incremental rows → the monitor flags a likely truncation or bulk delete.
- An upsert pipeline reprocesses day D twice → incremental shows +120M versus the usual +55M → the monitor flags a duplicate write before downstream costs and metrics balloon.
- Late-arriving data replaces stale records (delete+insert) → total row count remains flat, but incremental activity spikes → the monitor surfaces the anomaly that a total-row check would miss.
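The key idea (incremental activity can diverge from the change in total rows) can be sketched as a cross-check. All names and the 1.5×/0.5×/0.1× multipliers are hypothetical simplifications of the learned thresholds:

```python
def incremental_anomalies(total_delta, incremental_rows, baseline_incr):
    """Cross-check incremental write activity against the change in total
    rows for the same period (illustrative thresholds only)."""
    findings = []
    if incremental_rows < 0:
        findings.append("negative increment: likely truncation or bulk delete")
    elif incremental_rows > 1.5 * baseline_incr:
        findings.append("spike: possible duplicate load or runaway backfill")
    # Flat total but high activity: delete+insert replacement or CDC merge.
    if abs(total_delta) < 0.1 * baseline_incr and incremental_rows > 0.5 * baseline_incr:
        findings.append("flat total with high activity: delete+insert or CDC merge")
    return findings

# Late-arriving data replaced stale records: total flat, activity normal-sized.
print(incremental_anomalies(0, 55_000_000, baseline_incr=55_000_000))
```

Note how the third scenario above produces a finding even though a total-row check would see no change at all.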
Total Load Bytes Processed
Monitors the total bytes scanned/processed during data load operations—highlighting unexpected increases or drops relative to learned baselines.
- What it catches: cost spikes, sudden data volume changes, inefficient transformations, and performance regressions (e.g., lost pruning or compression).
- How it works: Learns typical byte ranges from recent history and auto-tunes thresholds; flags outliers (both high and low) against expected patterns.
Detection Scenarios
- A daily load for `fact_sales` usually processes 120–150 GB. Today it processes 420 GB → the monitor alerts, indicating a query plan change (e.g., partition pruning disabled) or an uncompressed source file drop.
- A pipeline normally processes 80 GB, but a code change reduces it to 5 GB → the monitor flags an abnormal drop that could mean missing partitions or filtered-out data, preventing silent data loss despite a “successful” run.
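A two-sided band check captures the "outliers in both directions" behavior. The `low`/`high` multipliers are illustrative stand-ins for the auto-tuned thresholds, not values the product uses:

```python
def bytes_outlier(baseline_gb, observed_gb, low=0.5, high=2.0):
    """Flag loads whose processed bytes fall outside a band derived from
    recent history (hypothetical multipliers)."""
    lo_bound = min(baseline_gb) * low
    hi_bound = max(baseline_gb) * high
    if observed_gb > hi_bound:
        return "high"  # e.g., lost partition pruning or uncompressed input
    if observed_gb < lo_bound:
        return "low"   # e.g., missing partitions or filtered-out data
    return "ok"

print(bytes_outlier([120, 135, 150], 420))  # matches the first scenario
print(bytes_outlier([120, 135, 150], 5))    # matches the second scenario
```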
Daily Load Count
Monitors the number of write operations (e.g., INSERT, MERGE/UPSERT/...) that load data into a table each day.
- What it catches: missing scheduled jobs, skipped ETL runs, unexpected duplicate loads, off-schedule reruns, and runaway backfills.
- How it works: Learns the expected load cadence (e.g., 1× daily, 24× hourly) and auto-tunes thresholds; flags zero when loads are expected, and flags spikes when counts exceed the baseline.
Detection Scenarios
- `fact_orders` is expected to load 24 times/day (hourly). Today shows 0 loads → the monitor alerts, indicating a failed scheduler or upstream outage.
- `events` usually loads once per day. Today shows 3 loads → the monitor flags unexpected duplicate/off-schedule runs that could inflate downstream metrics or costs.
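The cadence check reduces to comparing today's count against the learned expectation. A minimal sketch (the real monitor learns `expected_per_day` from history rather than taking it as a parameter):

```python
def load_count_status(expected_per_day, observed_today):
    """Compare today's number of write operations with the expected cadence
    (simplified, hypothetical messages)."""
    if observed_today == 0 and expected_per_day > 0:
        return "missing loads: failed scheduler or upstream outage?"
    if observed_today > expected_per_day:
        return "extra loads: duplicate or off-schedule runs?"
    return "ok"

print(load_count_status(24, 0))  # hourly table saw no loads today
print(load_count_status(1, 3))   # daily table loaded three times
```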
Freshness
Monitors the timeliness of a table by tracking how long it has been since the last successful update/load and comparing that latency to the expected cadence or SLA.
- What it catches: upstream pipeline failures, paused/failed schedulers, stuck streams, source outages, and broken data feeds before they affect downstream users.
- How it works: Learns typical update intervals and auto-tunes thresholds. Enabled for tables with a regular update cadence; not applied to static/reference tables or tables with near-constant updates, where freshness does not meaningfully apply.
Detection Scenarios
- `fact_sales` normally finishes the 02:00 UTC daily load by 02:15. At 10:00, no update is recorded → the monitor flags a freshness breach so downstream dashboards don’t publish stale figures.
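At its core, the freshness check compares elapsed time since the last successful load against the learned interval plus some tolerance. The `slack=1.25` factor below is a hypothetical stand-in for the auto-tuned threshold:

```python
from datetime import datetime, timedelta, timezone

def freshness_breach(last_update, expected_interval, now, slack=1.25):
    """True when elapsed time since the last load exceeds the learned
    interval by a slack factor (illustrative value)."""
    return (now - last_update) > expected_interval * slack

# Daily table last loaded yesterday at 02:15; it is now 10:00 the next day.
last = datetime(2024, 1, 1, 2, 15, tzinfo=timezone.utc)
now = datetime(2024, 1, 2, 10, 0, tzinfo=timezone.utc)
print(freshness_breach(last, timedelta(hours=24), now))  # stale
```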
Pipeline Monitors
Supported data sources:
Monitors are created automatically for discovered pipelines. The following monitors are created by default:
Execution Time
Monitors the duration of DAG runs and/or critical tasks—flagging abnormal slowdowns or speedups relative to learned baselines.
- What it catches: performance regressions, stuck/long-running tasks, retries that inflate runtime, inefficient query plans, upstream/backfill pressure, and resource contention that can lead to SLA misses.
- How it works: Learns typical runtimes by schedule/day-of-week and auto-tunes thresholds; optionally checks against explicit Airflow SLAs. Flags outliers in both directions (unusually long or suspiciously short runs).
Detection Scenarios
- A nightly `etl_sales_dag` usually completes in 18–22 minutes. Today it takes 58 minutes → the monitor alerts, indicating a possible BigQuery scan regression or downstream API latency that may miss the publishing SLA.
- `events_backfill_dag` typically runs in ~4 minutes. A run completes in 20 seconds → the monitor flags an abnormally short duration suggesting tasks were skipped (e.g., empty partition or misconfigured filter), preventing silent under-processing.
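Flagging "outliers in both directions" can be sketched as a two-sided check against the median of recent runtimes. The multipliers are illustrative, not the product's actual thresholds:

```python
from statistics import median

def runtime_outlier(history_minutes, latest_minutes, low=0.25, high=2.0):
    """Flag runs that are unusually long or suspiciously short relative to
    the median of recent runtimes (hypothetical multipliers)."""
    base = median(history_minutes)
    if latest_minutes > base * high:
        return "slow"  # possible regression, retries, or contention
    if latest_minutes < base * low:
        return "fast"  # possible skipped tasks or an empty partition
    return "ok"

print(runtime_outlier([18, 20, 22, 19, 21], 58))    # abnormally slow run
print(runtime_outlier([18, 20, 22, 19, 21], 0.33))  # 20-second run
```

Bucketing the history by schedule or day-of-week, as described above, would simply mean keeping one such baseline per bucket.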
Queue Time
Monitors the delay between when a DAG run or task is scheduled/queued and when it actually starts executing in Airflow—surfacing abnormal scheduler or worker backlog.
- What it catches: scheduler bottlenecks, insufficient worker slots, over-subscribed pools/queues, GKE/worker cold starts, and upstream bursts that cause SLA risk even before execution begins.
- How it works: Learns typical queue times by hour/day and auto-tunes thresholds; optionally checks against explicit SLAs or max queue targets. Flags both prolonged delays and unusually short queue times that may indicate mis-scheduling or skipped concurrency controls.
Detection Scenarios
- During the morning peak, `hourly_events_dag` tasks wait 12–15 minutes in the `queued` state (baseline <1 min) → the monitor alerts on a worker shortage or pool limit, allowing you to raise concurrency or scale Composer before SLAs are missed.
- An off-peak run shows 0s queue time when the pool should limit concurrency → the monitor flags a configuration drift (e.g., task not assigned to the intended pool/queue), preventing unexpected cluster pressure.
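Queue time is simply execution start minus enqueue time; both scenarios above fall out of a two-sided check on that delay. The `high_mult=5` factor and messages are hypothetical:

```python
from datetime import datetime

def queue_anomaly(queued_at, started_at, baseline_s, high_mult=5.0):
    """Flag long scheduler/worker backlogs and suspicious zero waits
    (illustrative threshold, not the learned one)."""
    delay = (started_at - queued_at).total_seconds()
    if delay > baseline_s * high_mult:
        return "backlog"            # worker shortage or pool limit
    if delay == 0 and baseline_s > 0:
        return "check pool config"  # task may bypass the intended pool/queue
    return "ok"

q = datetime(2024, 1, 1, 9, 0, 0)
print(queue_anomaly(q, datetime(2024, 1, 1, 9, 13, 0), baseline_s=60))  # 13-minute wait
```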
Total Run Count
Monitors how many Airflow DAG runs occur in a given interval (e.g., per hour or per day) and compares the count to the expected schedule/cadence.
- What it catches: missing/silent skips, paused/failed schedulers, unexpected duplicate/off-schedule triggers, runaway reruns/retries, and excessive backfill activity that may strain resources.
- How it works: Learns the expected cadence (e.g., 1× daily, 24× hourly) and auto-tunes thresholds; flags zero/low counts when runs are expected and spikes when counts exceed baseline. Supports filters by DAG, owner, queue, or tag.
Detection Scenarios
- `hourly_events_dag` should run 24×/day. Today shows 0 runs → the monitor alerts, indicating a paused DAG or scheduler outage before downstream data goes stale.
- `fact_sales_daily` normally runs once/day. Today shows 3 runs (one manual, two retries) → the monitor flags an abnormal spike so you can investigate duplicate triggers or unstable tasks.
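One wrinkle with run counts is partial intervals: an hourly DAG checked at noon should have ~12 runs, not 24. A sketch that pro-rates the cadence by hours elapsed (the `slack` tolerance is hypothetical; the real monitor learns its thresholds):

```python
import math

def run_count_anomaly(expected_per_day, hours_elapsed, observed, slack=1):
    """Compare observed DAG runs against the cadence pro-rated to the
    elapsed portion of the day (illustrative tolerance)."""
    expected_so_far = math.floor(expected_per_day * hours_elapsed / 24)
    if observed < expected_so_far - slack:
        return "missing runs"  # paused DAG or scheduler outage?
    if observed > expected_so_far + slack:
        return "extra runs"    # duplicate triggers, retries, or backfills?
    return "ok"

print(run_count_anomaly(24, 12, 0))  # hourly DAG, noon, zero runs
print(run_count_anomaly(1, 24, 3))   # daily DAG ran three times
```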
Failed Run Count
Monitors the number of failed Airflow DAG runs within a given interval (e.g., hourly, daily) and compares it to the expected baseline or SLO.
- What it catches: rising failure rates, bursty/intermittent failures, systemic outages (e.g., credentials/quotas), schema or dependency changes, and bad deploys causing coordinated run failures.
- How it works: Learns typical failure counts/rates and auto-tunes thresholds; flags spikes above baseline and sustained elevation over rolling periods. Can pair with alerts on error classes (e.g., `AirflowException`, `HttpError`, BigQuery job errors).
Detection Scenarios
- `fact_sales_daily` normally has 0 failed runs. Today shows 3 failed runs after a code release → the monitor alerts, pointing to a likely breaking change (e.g., column rename or dependency upgrade).
- `hourly_events_dag` averages ≤1 failed run/day. The last 6 hours show 5 failures clustered around rate-limit errors → the monitor flags a systemic issue (API quota/credentials) before SLAs are missed.
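The rolling-window comparison against baseline can be sketched with a small stateful detector. The class, window size, and margin are all illustrative, not Revefi's actual algorithm:

```python
from collections import deque

class FailureSpikeDetector:
    """Rolling-window check on failed-run counts: flag when the latest count
    exceeds the recent maximum by a fixed margin (hypothetical logic)."""
    def __init__(self, window=7, margin=2):
        self.history = deque(maxlen=window)  # last `window` interval counts
        self.margin = margin

    def observe(self, failed_count):
        baseline = max(self.history) if self.history else 0
        is_spike = failed_count > baseline + self.margin
        self.history.append(failed_count)
        return is_spike

# A DAG with near-zero failures suddenly fails 5 times after a release:
d = FailureSpikeDetector()
for count in [0, 1, 0, 0]:
    d.observe(count)
print(d.observe(5))  # flagged as a spike
```

Tracking sustained elevation, as mentioned above, would add a second check on the window's running average rather than a single count.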
