refactor(alerts): expand alert evaluation to include all alert-visible sample types fixes #633

This commit is contained in:
Raj Nandan Sharma
2026-06-07 18:43:38 +05:30
parent e63a2f6311
commit ed1a70d75b
5 changed files with 65 additions and 3 deletions
+33
View File
@@ -35,9 +35,42 @@ The stored order in which a Group Monitor's members are checked before aggregati
**Eligible Monitor**:
A monitor that may become a Member: ACTIVE, not a Group Monitor, and not the group being edited itself.
**Monitoring Sample**:
One recorded data point for a monitor at a timestamp: a status, a latency, and a sample type describing how it was produced. Every sample is either Observed or Synthetic.
_Avoid_: Data point, record, check result
**Observed Sample**:
A Monitoring Sample produced by a check that actually ran against the target, whatever the outcome: a clean evaluation (`REALTIME`), a timed-out check (`TIMEOUT`), or a check that errored (`ERROR`). A check failing to reach the target is itself an observation — for most monitor types that is exactly what "down" looks like.
**Synthetic Sample**:
A Monitoring Sample written by the system or an admin rather than by a check: a raw heartbeat receipt (`SIGNAL`), a status pushed through the data API (`MANUAL`), a default-status fill (`DEFAULT_STATUS`), or an incident/maintenance overlay (`INCIDENT`, `MAINTENANCE`).
**Stale Member**:
A Member whose monitor is no longer an Eligible Monitor (paused or deleted after being added). It remains a Member until explicitly removed, but is excluded from the group score.
### Alerting
**Alert Configuration**:
A per-monitor alerting rule: a condition (status, latency, or uptime against a value), a Failure Threshold, a Success Threshold, whether triggering creates an incident, and the Triggers to notify.
_Avoid_: Alert rule, alarm
**Alert**:
A fired instance of an Alert Configuration for a monitor. TRIGGERED when the condition holds, RESOLVED when the resolve condition later holds. May own the incident it created.
_Avoid_: Alarm, notification (that's what Triggers send)
**Trigger**:
A notification channel (email, webhook, Discord, Slack) that Alert Configurations notify on trigger and on resolve.
_Avoid_: Notifier, channel
**Alert-Visible Sample**:
A Monitoring Sample that alert evaluation can see: every Observed Sample, plus data-API pushes (`MANUAL`) and default-status fill (`DEFAULT_STATUS`). Raw heartbeat receipts (`SIGNAL`) and incident/maintenance overlays are never alert-visible — while an overlay is active the alert window freezes (alerts neither trigger nor resolve). All alert conditions (status and latency alike) evaluate the same alert-visible timeline.
**Failure Threshold**:
The number of consecutive Alert-Visible Samples matching the condition required to trigger an Alert.
**Success Threshold**:
The number of consecutive Alert-Visible Samples meeting the resolve condition required to resolve an Alert.
### Pages
**Page**:
@@ -0,0 +1,9 @@
# Alerts evaluate alert-visible samples, not just REALTIME ones
The consecutive-sample checks behind alert evaluation (`consecutivelyStatusFor`, `consecutivelyLatencyGreaterThan`, `consecutivelyLatencyLessThan` in `src/lib/server/db/repositories/monitoring.ts`) consider samples whose type is `REALTIME`, `ERROR`, `TIMEOUT`, `MANUAL`, or `DEFAULT_STATUS` — the "alert-visible" set — instead of `REALTIME` only. Both data-API PATCH endpoints (single timestamp and range) enqueue one alert evaluation after writing `MANUAL` rows. `SIGNAL` rows and `INCIDENT`/`MAINTENANCE` overlay rows remain invisible to alerting.
Two issues drove this. In #633, a GameDig monitor showed DOWN on the status page but never alerted: a down game server makes `GameDig.query` throw, so every down-sample is recorded as `ERROR`, which the old `type = REALTIME` filter excluded — the "N consecutive DOWN" condition could never become true. The same failure mode silently broke gRPC, SQL, and SSL monitors (hard-down records `ERROR`) and API monitors whose outage manifests as timeouts (`TIMEOUT`). In #720, a NONE monitor driven by the data API never alerted for two stacked reasons: PATCH writes `MANUAL` rows the filter excluded, and the endpoint never enqueued evaluation at all. The status page and UPTIME alerts have no type filter, which is why users saw DOWN while alerts stayed silent.
The whitelist is exactly the set of types written by flows that trigger alert evaluation — scheduler checks (`REALTIME`/`ERROR`/`TIMEOUT`), default-status fill (`DEFAULT_STATUS`), and data-API pushes (`MANUAL`). That invariant ("every enqueuer of evaluation contributes a row the evaluator can see") is what keeps the un-time-bounded last-N query self-healing: without `DEFAULT_STATUS` in the set, a NONE monitor with a default status evaluates every minute against rows it cannot see, so stale backfilled `MANUAL` rows would rank as "the last N" and fire alerts weeks after the fact. The rejected alternatives: fixing only `gamedigCall` to emit `REALTIME` on query failure (fixes one monitor type out of five, loses the stored down-vs-errored diagnostic, and does nothing for existing data), and time-bounding the query (the cutoff must scale with each monitor's cron, which means parsing cron expressions in the alert path for marginal benefit).
The trade-offs, accepted deliberately for one uniform rule across status and latency alerts: a service that degrades from slow to hard-down resolves an active latency alert (error samples carry latency 0, satisfying "consecutively below threshold") — the status alert is the one that covers outages; a NONE monitor with a default status auto-resolves MANUAL-pushed alerts once default fill resumes, because a default status is an explicit statement that absence of pushes means that status; and the fix is retroactive, so monitors that were already down at upgrade time alert shortly after — which is the bug report, inverted.
+12 -3
View File
@@ -10,6 +10,15 @@ import type {
TimestampStatusCountByMonitor,
} from "../../types/db.js";
/**
* Sample types alert evaluation can see (see docs/adr/0005-alerts-evaluate-alert-visible-samples.md).
* Exactly the types written by flows that enqueue alert evaluation: scheduler checks
* (REALTIME/ERROR/TIMEOUT), default-status fill (DEFAULT_STATUS), and data-API pushes (MANUAL).
* SIGNAL rows (raw heartbeat receipts) and INCIDENT/MAINTENANCE overlays stay invisible, so the
* alert window freezes during manual overlays instead of triggering or resolving on them.
*/
const ALERT_VISIBLE_TYPES = [GC.REALTIME, GC.ERROR, GC.TIMEOUT, GC.MANUAL, GC.DEFAULT_STATUS];
/**
* Repository for monitoring data operations
*/
@@ -262,7 +271,7 @@ export class MonitoringRepository extends BaseRepository {
qb.select("*")
.from("monitoring_data")
.where("monitor_tag", monitor_tag)
.andWhere("type", "=", GC.REALTIME)
.whereIn("type", ALERT_VISIBLE_TYPES)
.orderBy("timestamp", "desc")
.limit(lastX);
})
@@ -288,7 +297,7 @@ export class MonitoringRepository extends BaseRepository {
qb.select("*")
.from("monitoring_data")
.where("monitor_tag", monitor_tag)
.andWhere("type", "=", GC.REALTIME)
.whereIn("type", ALERT_VISIBLE_TYPES)
.orderBy("timestamp", "desc")
.limit(lastX);
})
@@ -310,7 +319,7 @@ export class MonitoringRepository extends BaseRepository {
qb.select("*")
.from("monitoring_data")
.where("monitor_tag", monitor_tag)
.andWhere("type", "=", GC.REALTIME)
.whereIn("type", ALERT_VISIBLE_TYPES)
.orderBy("timestamp", "desc")
.limit(lastX);
})
@@ -11,6 +11,7 @@ import GC from "$lib/global-constants";
import { UpdateMonitoringData } from "$lib/server/controllers/monitorsController";
import { GetMinuteStartTimestampUTC } from "$lib/server/tool";
import { SetLastMonitoringValue } from "$lib/server/cache/setGet";
import alertingQueue from "$lib/server/queues/alertingQueue";
export const GET: RequestHandler = async ({ locals, url }) => {
// Monitor is validated by middleware and available in locals
@@ -169,6 +170,11 @@ export const PATCH: RequestHandler = async ({ locals, request }) => {
await SetLastMonitoringValue(monitorTag, latestData);
}
// MANUAL samples are alert-visible (docs/adr/0005), so re-evaluate alerts once for the
// last written sample — for NONE monitors nothing else would ever trigger evaluation.
const lastWrittenTs = body.start_ts + Math.floor((body.end_ts - body.start_ts) / 60) * 60;
await alertingQueue.push(monitorTag, lastWrittenTs, body.status);
// Calculate the number of data points that will be returned by GET
// GET uses: timestamp >= start_ts AND timestamp < end_ts
// Data is stored at minute-aligned timestamps
@@ -10,6 +10,7 @@ import type {
import GC from "$lib/global-constants";
import { GetMinuteStartTimestampUTC } from "$lib/server/tool";
import { SetLastMonitoringValue } from "$lib/server/cache/setGet";
import alertingQueue from "$lib/server/queues/alertingQueue";
export const GET: RequestHandler = async ({ params, locals }) => {
// Monitor is validated by middleware and available in locals
@@ -160,6 +161,10 @@ export const PATCH: RequestHandler = async ({ params, locals, request }) => {
await SetLastMonitoringValue(monitorTag, latestData);
}
// MANUAL samples are alert-visible (docs/adr/0005), so re-evaluate alerts for this
// sample — for NONE monitors nothing else would ever trigger evaluation.
await alertingQueue.push(monitorTag, timestamp, status);
// Fetch the updated data
const updatedData = await db.getMonitoringDataAt(monitorTag, timestamp);