2.9 KiB
Alerts evaluate alert-visible samples, not just REALTIME ones
The consecutive-sample checks behind alert evaluation (consecutivelyStatusFor, consecutivelyLatencyGreaterThan, consecutivelyLatencyLessThan in src/lib/server/db/repositories/monitoring.ts) consider samples whose type is REALTIME, ERROR, TIMEOUT, MANUAL, or DEFAULT_STATUS — the "alert-visible" set — instead of REALTIME only. Both data-API PATCH endpoints (single timestamp and range) enqueue one alert evaluation after writing MANUAL rows. SIGNAL rows and INCIDENT/MAINTENANCE overlay rows remain invisible to alerting.
Two issues drove this. In #633, a GameDig monitor showed DOWN on the status page but never alerted: a down game server makes GameDig.query throw, so every down-sample is recorded as ERROR, which the old type = REALTIME filter excluded — the "N consecutive DOWN" condition could never become true. The same failure mode silently broke gRPC, SQL, and SSL monitors (hard-down records ERROR) and API monitors whose outage manifests as timeouts (TIMEOUT). In #720, a NONE monitor driven by the data API never alerted for two stacked reasons: PATCH writes MANUAL rows the filter excluded, and the endpoint never enqueued evaluation at all. The status page and UPTIME alerts have no type filter, which is why users saw DOWN while alerts stayed silent.
The whitelist is exactly the set of types written by flows that trigger alert evaluation — scheduler checks (REALTIME/ERROR/TIMEOUT), default-status fill (DEFAULT_STATUS), and data-API pushes (MANUAL). That invariant ("every enqueuer of evaluation contributes a row the evaluator can see") is what keeps the un-time-bounded last-N query self-healing: without DEFAULT_STATUS in the set, a NONE monitor with a default status evaluates every minute against rows it cannot see, so stale backfilled MANUAL rows would rank as "the last N" and fire alerts weeks after the fact. The rejected alternatives: fixing only gamedigCall to emit REALTIME on query failure (fixes one monitor type out of five, loses the stored down-vs-errored diagnostic, and does nothing for existing data), and time-bounding the query (the cutoff must scale with each monitor's cron, which means parsing cron expressions in the alert path for marginal benefit).
The trade-offs, accepted deliberately for one uniform rule across status and latency alerts: a service that degrades from slow to hard-down resolves an active latency alert (error samples carry latency 0, satisfying "consecutively below threshold") — the status alert is the one that covers outages; a NONE monitor with a default status auto-resolves MANUAL-pushed alerts once default fill resumes, because a default status is an explicit statement that absence of pushes means that status; and the fix is retroactive, so monitors that were already down at upgrade time alert shortly after — which is the bug report, inverted.