feat(confirmation): preserve observed error on held rows; append confirmation note on backfill (#756)

Held (pending) rows now keep the observed error text, tagged with
'| Status held during grace period', instead of dropping it, so no diagnostic
info is lost. On confirmation, the backfill appends
'| Down confirmed after N consecutive checks' to the existing text
(pipe-separated) rather than overwriting it; a confirmed recovery clears the
error. The append runs per row for cross-DB portability and is idempotent on
replay.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Raj Nandan Sharma
2026-06-13 20:20:40 +05:30
parent e61873164b
commit f0362fd919
3 changed files with 46 additions and 10 deletions
@@ -8,6 +8,8 @@ Damping happens at **write time**, not read time. The rejected alternative — r
 The flip is **retroactive**, not forward-only. Forward-only (first N-1 failing samples stay `UP` forever, `DOWN` begins at confirmation) is simpler (history would stay append-only), but it systematically shaves N-1 minutes off the front of every real outage, permanently flattering uptime and misreporting outage start times. Retroactive backfill rewrites a bounded window (at most threshold-1 rows, one `UPDATE`) at the moment of crossing; it also makes the threshold compose with the alert Failure Threshold by `max` rather than `+`, because once backfilled the alert lookback sees N consecutive confirmed rows immediately. The cost accepted: `monitoring_data` is no longer append-only over the pending window, so the last-value cache, and any reader that observes those rows mid-window, sees the pre-flip side until convergence.
+A held (pending) row keeps the check's real measured latency and observed error text (no diagnostic information is discarded), tagged with `| Status held during grace period`. On confirmation the backfill preserves that text and appends `| Down confirmed after N consecutive checks` rather than overwriting it; a confirmed recovery clears the error, since the row becomes the UP side. `raw_status` remains the canonical record of what each check observed.
 Deliberately accepted boundaries, chosen for one uniform rule rather than special cases:
 - **Count, not minutes.** The threshold is consecutive observations, well-defined for any cron, matching the alerting thresholds' unit; for the common every-minute cron, count and minutes coincide.
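The difference between forward-only and retroactive flipping can be illustrated with a small simulation. This is a minimal sketch, not the project's implementation: `damp`, `countDown`, and the sample series are hypothetical, and recovery is taken immediately here for brevity, whereas the real code confirms recoveries as well.

```typescript
type Status = "UP" | "DOWN";

// Hypothetical helper: damp raw per-check statuses with a confirmation threshold n.
// With retroactive=true, the held rows are rewritten at the moment of confirmation.
function damp(raw: Status[], n: number, retroactive: boolean): Status[] {
  const out: Status[] = [];
  let streak = 0; // consecutive raw DOWN observations so far
  for (let i = 0; i < raw.length; i++) {
    if (raw[i] === "UP") {
      streak = 0;
      out.push("UP"); // assumption: recovery is not damped in this sketch
      continue;
    }
    streak += 1;
    if (streak < n) {
      out.push("UP"); // pending: hold the previously confirmed side
    } else {
      out.push("DOWN");
      if (retroactive && streak === n) {
        // Crossing the threshold: backfill the n-1 held rows just before this one.
        for (let j = i - (n - 1); j < i; j++) out[j] = "DOWN";
      }
    }
  }
  return out;
}

const countDown = (s: Status[]): number => s.filter((x) => x === "DOWN").length;

// Five consecutive failing checks, one per minute.
const raw: Status[] = ["UP", "UP", "DOWN", "DOWN", "DOWN", "DOWN", "DOWN", "UP", "UP", "UP"];
const forwardOnly = damp(raw, 3, false);
const retroactive = damp(raw, 3, true);

console.log(countDown(raw), countDown(forwardOnly), countDown(retroactive));
// prints "5 3 5"
```

With a threshold of 3, the forward-only series reports 3 minutes of downtime for a 5-minute outage, losing N-1 = 2 minutes, while the retroactive series matches what the checks observed.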
+37 -6
@@ -362,14 +362,45 @@ export class MonitoringRepository extends BaseRepository {
     message: string | null,
   ): Promise<number> {
     if (timestamps.length === 0) return 0;
-    return await this.knex("monitoring_data")
+    // Recovery (confirmed UP): rows become the UP side, so clear any held error text in one update.
+    if (message === null) {
+      return await this.knex("monitoring_data")
+        .where("monitor_tag", monitor_tag)
+        .whereIn("timestamp", timestamps)
+        .whereNotNull("raw_status")
+        .update({
+          status: this.knex.ref("raw_status"),
+          error_message: null,
+        });
+    }
+    // Confirmed unhealthy: set each row's status from its observed raw_status and APPEND the
+    // confirmation note to the existing error text (preserving the observed failure reason).
+    // Done per-row for portable string concatenation (|| vs CONCAT differ across SQLite/PG/MySQL)
+    // and to stay idempotent if the backfill is ever replayed.
+    const rows = await this.knex("monitoring_data")
+      .select("timestamp", "error_message", "raw_status")
       .where("monitor_tag", monitor_tag)
       .whereIn("timestamp", timestamps)
-      .whereNotNull("raw_status")
-      .update({
-        status: this.knex.ref("raw_status"),
-        error_message: message,
-      });
+      .whereNotNull("raw_status");
+    let updated = 0;
+    for (const row of rows) {
+      const existing: string | null = row.error_message;
+      let nextMessage: string;
+      if (!existing) {
+        nextMessage = message;
+      } else if (existing.indexOf(message) !== -1) {
+        nextMessage = existing; // already appended; keep idempotent
+      } else {
+        nextMessage = `${existing} | ${message}`;
+      }
+      updated += await this.knex("monitoring_data")
+        .where({ monitor_tag, timestamp: row.timestamp })
+        .update({ status: row.raw_status, error_message: nextMessage });
+    }
+    return updated;
   }
   async updateMonitoringData(
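The per-row merge in this hunk can be read as a pure function, which makes the replay behavior easy to check in isolation. The sketch below is illustrative only: `mergeConfirmationNote` and the sample error text `connect ETIMEDOUT` are assumptions, not names or data from the repository.

```typescript
// Hypothetical pure-function mirror of the per-row merge in the backfill.
function mergeConfirmationNote(existing: string | null, note: string): string {
  if (!existing) return note; // nothing observed: the note stands alone
  if (existing.includes(note)) return existing; // replay: already appended
  return `${existing} | ${note}`; // preserve the observed text, append the note
}

const note = "Down confirmed after 3 consecutive checks";
const held = "connect ETIMEDOUT | Status held during grace period"; // sample text, assumed

const once = mergeConfirmationNote(held, note);
const twice = mergeConfirmationNote(once, note); // simulate a replayed backfill

console.log(once);
// prints "connect ETIMEDOUT | Status held during grace period | Down confirmed after 3 consecutive checks"
console.log(once === twice);
// prints "true"
```

Because the replay guard is a substring check, it only recognizes an identical note; a replay that passed a different count would append a second note.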
+7 -4
@@ -140,10 +140,13 @@ const addWorker = () => {
       });
       realtimeData[ts].status = resolved.status;
       if (resolved.pendingHold) {
-        // Hold the confirmed side for display, but keep the real measured latency; zeroing it
-        // would lose data and dent the latency chart during every grace window. Only the error
-        // text is dropped so a held row never shows a status-contradicting failure message.
-        delete realtimeData[ts].error_message;
+        // Hold the confirmed side for display, but PRESERVE the observed latency and error text;
+        // no diagnostic info is discarded. Tag the row to record that the status is being held
+        // during the grace period; on confirmation the backfill appends the confirmation note (#756).
+        const observedError = realtimeData[ts].error_message;
+        realtimeData[ts].error_message = observedError
+          ? `${observedError} | Status held during grace period`
+          : "Status held during grace period";
       }
     }
   }
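The hold-time tagging in this hunk reduces to a small string rule. The following is a minimal sketch under assumptions: `tagHeld`, `HELD_TAG`, and the sample error `HTTP 503` are hypothetical and do not appear in the commit.

```typescript
// The tag text matches the diff; the constant and function names are illustrative.
const HELD_TAG = "Status held during grace period";

// Keep whatever the check observed and mark the row as held.
// A row with no observed error carries only the tag.
function tagHeld(observedError: string | null | undefined): string {
  return observedError ? `${observedError} | ${HELD_TAG}` : HELD_TAG;
}

console.log(tagHeld("HTTP 503"));
// prints "HTTP 503 | Status held during grace period"
console.log(tagHeld(undefined));
// prints "Status held during grace period"
```

Keeping the observed text ahead of the tag means the original failure reason still reads first once the backfill later appends its confirmation note.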