mirror of
https://github.com/rajnandan1/kener.git
synced 2026-06-23 04:10:22 +00:00
feat(confirmation): preserve observed error on held rows; append confirmation note on backfill (#756)
Held (pending) rows now keep the real error text tagged '| Status held during grace period' instead of dropping it, so no diagnostic info is lost. On confirmation the backfill appends '| Down confirmed after N consecutive checks' to the existing text (pipe-separated) rather than overwriting it; recovery clears the error. Append is per-row for cross-DB safety and idempotent on replay.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@@ -8,6 +8,8 @@ Damping happens at **write time**, not read time. The rejected alternative — r

 The flip is **retroactive**, not forward-only. Forward-only (first N−1 failing samples stay `UP` forever, `DOWN` begins at confirmation) is simpler — history would stay append-only — but it systematically shaves N−1 minutes off the front of every real outage, permanently flattering uptime and misreporting outage start times. Retroactive backfill rewrites a bounded window (at most threshold−1 rows, one `UPDATE`) at the moment of crossing; it also makes the threshold compose with the alert Failure Threshold by `max` rather than `+`, because once backfilled the alert lookback sees N consecutive confirmed rows immediately. The cost accepted: `monitoring_data` is no longer append-only over the pending window, so the last-value cache and any reader that observed those rows mid-window saw the pre-flip side until convergence.

+A held (pending) row keeps the check's real measured latency and observed error text — no diagnostic information is discarded — tagged with `| Status held during grace period`. On confirmation the backfill preserves that text and appends `| Down confirmed after N consecutive checks` (rather than overwriting it); a confirmed recovery clears the error, since the row becomes the UP side. `raw_status` remains the canonical record of what each check observed.
+
 Deliberately accepted boundaries, chosen for one uniform rule rather than special cases:

 - **Count, not minutes.** The threshold is consecutive observations, well-defined for any cron, matching the alerting thresholds' unit; for the common every-minute cron, count and minutes coincide.
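The retroactive flip described in this hunk can be sketched on an in-memory list of rows. This is a minimal illustration under stated assumptions, not the project's implementation: `Row`, `record`, and the two-value `Status` type are hypothetical names, and the real code persists rows in `monitoring_data` rather than an array.

```typescript
// Hypothetical in-memory model of damped status writes (names are illustrative).
type Status = "UP" | "DOWN";

interface Row {
  rawStatus: Status; // what the check actually observed
  status: Status;    // what is displayed; may be held during the grace window
}

// Append one observation. The confirmed side is held until `threshold`
// consecutive contrary observations arrive; at that crossing the whole
// pending run is flipped retroactively to its observed status.
function record(rows: Row[], observed: Status, confirmed: Status, threshold: number): Status {
  rows.push({ rawStatus: observed, status: confirmed });
  if (observed === confirmed) return confirmed;

  // Count the trailing run of observations that disagree with the confirmed side.
  let run = 0;
  for (let i = rows.length - 1; i >= 0 && rows[i].rawStatus === observed; i--) run++;
  if (run < threshold) return confirmed; // still inside the grace window

  // Crossing: backfill the pending run so the outage starts at its first failing sample.
  for (let i = rows.length - run; i < rows.length; i++) rows[i].status = rows[i].rawStatus;
  return observed;
}
```

With a threshold of 3, three consecutive `DOWN` observations leave all three rows marked `DOWN`, so the outage start is not shaved. A run interrupted by an `UP` never crosses and every row stays on the `UP` side.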
@@ -362,14 +362,45 @@ export class MonitoringRepository extends BaseRepository {
     message: string | null,
   ): Promise<number> {
     if (timestamps.length === 0) return 0;
-    return await this.knex("monitoring_data")
+
+    // Recovery (confirmed UP): rows become the UP side — clear any held error text in one update.
+    if (message === null) {
+      return await this.knex("monitoring_data")
+        .where("monitor_tag", monitor_tag)
+        .whereIn("timestamp", timestamps)
+        .whereNotNull("raw_status")
+        .update({
+          status: this.knex.ref("raw_status"),
+          error_message: null,
+        });
+    }
+
+    // Confirmed unhealthy: set each row's status from its observed raw_status and APPEND the
+    // confirmation note to the existing error text (preserving the observed failure reason).
+    // Done per-row for portable string concatenation (|| vs CONCAT differ across SQLite/PG/MySQL)
+    // and to stay idempotent if the backfill is ever replayed.
+    const rows = await this.knex("monitoring_data")
+      .select("timestamp", "error_message", "raw_status")
       .where("monitor_tag", monitor_tag)
       .whereIn("timestamp", timestamps)
-      .whereNotNull("raw_status")
-      .update({
-        status: this.knex.ref("raw_status"),
-        error_message: message,
-      });
+      .whereNotNull("raw_status");
+
+    let updated = 0;
+    for (const row of rows) {
+      const existing: string | null = row.error_message;
+      let nextMessage: string;
+      if (!existing) {
+        nextMessage = message;
+      } else if (existing.indexOf(message) !== -1) {
+        nextMessage = existing; // already appended — keep idempotent
+      } else {
+        nextMessage = `${existing} | ${message}`;
+      }
+      updated += await this.knex("monitoring_data")
+        .where({ monitor_tag, timestamp: row.timestamp })
+        .update({ status: row.raw_status, error_message: nextMessage });
+    }
+    return updated;
   }

   async updateMonitoringData(
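The per-row append rule in this hunk can be isolated as a pure function, which makes the idempotence claim easy to check without a database. This is a sketch only: `appendNote` is a hypothetical helper name and does not exist in the repository.

```typescript
// Illustrative pure helper mirroring the three branches of the per-row loop.
// `appendNote` is a hypothetical name, not a function in the repository.
function appendNote(existing: string | null, note: string): string {
  if (!existing) return note;                        // no observed text: the note stands alone
  if (existing.indexOf(note) !== -1) return existing; // replay: the note is already present
  return `${existing} | ${note}`;                    // keep the observed reason, pipe-separated
}
```

Because the replay check is a substring match on the full note, applying the same note twice is a no-op, while a replay carrying a different N would append a second note.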
@@ -140,10 +140,13 @@ const addWorker = () => {
       });
       realtimeData[ts].status = resolved.status;
       if (resolved.pendingHold) {
-        // Hold the confirmed side for display, but keep the real measured latency — zeroing it
-        // would lose data and dent the latency chart during every grace window. Only the error
-        // text is dropped so a held row never shows a status-contradicting failure message.
-        delete realtimeData[ts].error_message;
+        // Hold the confirmed side for display, but PRESERVE the observed latency and error text —
+        // no diagnostic info is discarded. Tag the row to record that the status is being held
+        // during the grace period; on confirmation the backfill appends the confirmation note (#756).
+        const observedError = realtimeData[ts].error_message;
+        realtimeData[ts].error_message = observedError
+          ? `${observedError} | Status held during grace period`
+          : "Status held during grace period";
       }
     }
   }
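The hold step in this hunk amounts to a small transformation of one realtime row. The sketch below shows it as a pure function under stated assumptions: `RealtimeRow`, its `latency` field, `HELD_TAG`, and `holdRow` are hypothetical names chosen for illustration, and the worker itself mutates `realtimeData[ts]` in place.

```typescript
// Hypothetical row shape and helper; the field and function names are assumptions.
const HELD_TAG = "Status held during grace period";

interface RealtimeRow {
  status: string;
  latency: number;
  error_message?: string;
}

// Show the confirmed side while keeping the observed latency and error text,
// tagging the row so readers can tell that the status is being held.
function holdRow(row: RealtimeRow, confirmedStatus: string): RealtimeRow {
  const observedError = row.error_message;
  return {
    ...row, // latency is carried over unchanged
    status: confirmedStatus,
    error_message: observedError ? `${observedError} | ${HELD_TAG}` : HELD_TAG,
  };
}
```

A failing check observed with `Connection timeout` while `UP` is confirmed would be stored as `UP` with the text `Connection timeout | Status held during grace period`, which is the text the confirmation backfill later appends to.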