From 87fc3081df281e83eb17c7fb91fc1d30a7371f93 Mon Sep 17 00:00:00 2001 From: Raj Nandan Sharma Date: Sat, 13 Jun 2026 19:32:59 +0530 Subject: [PATCH] docs(confirmation-threshold): add #756 hardening implementation plan Co-Authored-By: Claude Fable 5 --- ...-06-13-confirmation-threshold-hardening.md | 498 ++++++++++++++++++ 1 file changed, 498 insertions(+) create mode 100644 docs/superpowers/plans/2026-06-13-confirmation-threshold-hardening.md diff --git a/docs/superpowers/plans/2026-06-13-confirmation-threshold-hardening.md b/docs/superpowers/plans/2026-06-13-confirmation-threshold-hardening.md new file mode 100644 index 00000000..619fb3af --- /dev/null +++ b/docs/superpowers/plans/2026-06-13-confirmation-threshold-hardening.md @@ -0,0 +1,498 @@ +# Confirmation Threshold (hardening: overlays, NO_DATA, verification) Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Make the deferred boundary rules of Confirmation Threshold true (issue #756, parent #712): overlay count-freeze, NO_DATA neutrality, and verified MANUAL/DEFAULT transparency + DEGRADED handling — on top of the core slice #755. + +**Architecture:** Two behavior gaps remain after #755. (1) **NO_DATA neutrality** — the resolver's lookback currently *breaks* the pending run on a NO_DATA row (resetting the count); it must *skip* it (neither advance nor reset). (2) **Overlay freeze** — incident/maintenance must (F1) suppress damping/backfill for the current minute when an overlay is active, and (F2) stop the pending-run count from carrying *across* a past overlay (fresh count when normal monitoring resumes). The lookback is widened to include overlay rows so the resolver can detect the freeze boundary; the resolver walk gains skip-neutral and break-on-overlay rules. DEGRADED-as-not-UP and MANUAL/DEFAULT transparency already work from #755 and only need test coverage. See `docs/adr/0009-confirmation-threshold-write-time-backfill.md`. + +**Tech Stack:** SvelteKit 2 / Svelte 5, Node, Knex, BullMQ. No test runner — verify with throwaway in-memory `better-sqlite3` vite-node scripts (`npx vite-node`, `npm rebuild better-sqlite3` if NODE_MODULE_VERSION). Delete scripts after; never commit them. + +**Branch:** `implement/756` (off `implement/712`, which carries #755). Do not switch branches. + +**Current behavior reference (post-#755):** +- `resolveConfirmedStatus` (`src/lib/server/services/confirmationThreshold.ts`): anchors on `recent[0]`, walks `recent` counting the trailing pending run, `break`s on the first non-matching row (so NO_DATA *resets*). +- `getRecentObservedSamples(monitor_tag, beforeTs, limit)` (`src/lib/server/db/repositories/monitoring.ts`): returns last `limit` rows of type REALTIME/TIMEOUT/ERROR only (overlays excluded), `{timestamp, status, raw_status}`. +- `monitorExecuteQueue.ts` worker: runs the resolver on the realtime result *before* computing `incidentData`/`maintenanceData` overlays — so the resolver currently runs even during an overlay minute. + +--- + +## File Structure + +**Modify:** +- `src/lib/server/db/repositories/monitoring.ts` — rename `getRecentObservedSamples` → `getRecentSamplesForConfirmation`; widen its type filter to observed + overlay types and return `type`; add an overlay-types constant. +- `src/lib/server/db/dbimpl.ts` — update the declaration + binding for the renamed method. +- `src/lib/server/services/confirmationThreshold.ts` — rewrite anchor + pending-run walk (skip NO_DATA neutral, break at overlay boundary); widen `ConfirmationDeps`; bump lookback limit with a buffer; drop the "#756 deferred" comment. +- `src/lib/server/queues/monitorExecuteQueue.ts` — compute overlays before the resolver; gate the resolver (no damping/backfill) when an overlay is active for `ts` (F1); update the comment. + +**No doc changes:** ADR 0009 and CONTEXT.md already describe the final (post-#756) behavior. + +--- + +## Task 1: Widen the lookback to include overlay rows + return type + +**Files:** +- Modify: `src/lib/server/db/repositories/monitoring.ts` +- Modify: `src/lib/server/db/dbimpl.ts` +- Test: throwaway `verify-lookback.ts` (repo root, deleted after) + +The resolver needs to *see* overlay rows (to detect the freeze boundary) and NO_DATA observed rows (to skip them), but must NOT see MANUAL/DEFAULT (those stay transparent and shouldn't consume lookback slots). So the query returns rows of type REALTIME/TIMEOUT/ERROR/INCIDENT/MAINTENANCE, newest first, including `type`. + +- [ ] **Step 1: Write the failing verification script** + +Create `verify-lookback.ts` in the repo root: + +```typescript +import knexFactory from "knex"; +import { MonitoringRepository } from "./src/lib/server/db/repositories/monitoring.js"; + +const k = knexFactory({ client: "better-sqlite3", connection: { filename: ":memory:" }, useNullAsDefault: true }); +function assert(c: boolean, m: string) { if (!c) { console.error("FAIL:", m); process.exit(1); } console.log("ok:", m); } +async function ins(repo: any, ts: number, status: string, type: string, raw: string | null) { + await repo.insertMonitoringData({ monitor_tag: "m", timestamp: ts, status, latency: 0, type, raw_status: raw }); +} +async function main() { + await k.schema.createTable("monitoring_data", (t) => { + t.string("monitor_tag"); t.integer("timestamp"); t.text("status"); t.float("latency"); + t.text("type"); t.text("error_message"); t.text("raw_status"); t.primary(["monitor_tag", "timestamp"]); + }); + const repo = new MonitoringRepository(k); + await ins(repo, 1, "UP", "REALTIME", "UP"); + await ins(repo, 2, "MAINTENANCE", "MAINTENANCE", "DOWN"); + await ins(repo, 3, "UP", "MANUAL", "UP"); // transparent — must be excluded + await ins(repo, 4, "UP", "DEFAULT", "UP"); // transparent — must be excluded + await ins(repo, 5, "DOWN", "ERROR", "DOWN"); + const rows = await repo.getRecentSamplesForConfirmation("m", 6, 10); + const types = rows.map((r: any) => r.type); + assert(rows.length === 3, "returns observed + overlay rows only (got " + rows.length + ": " + types.join(",") + ")"); + assert(types.includes("MAINTENANCE") && types.includes("REALTIME") && types.includes("ERROR"), "includes overlay + observed types"); + assert(!types.includes("MANUAL") && !types.includes("DEFAULT"), "excludes MANUAL and DEFAULT"); + assert(rows[0].timestamp === 5, "newest first"); + assert(rows[0].type === "ERROR" && rows[0].raw_status === "DOWN", "returns type + raw_status"); + console.log("ALL PASSED"); process.exit(0); +} +main(); +``` + +Run `npx vite-node verify-lookback.ts` → expect FAIL (`getRecentSamplesForConfirmation` not a function). + +- [ ] **Step 2: Add the overlay-types constant** + +In `src/lib/server/db/repositories/monitoring.ts`, just after the existing `OBSERVED_CHECK_TYPES` constant, add: + +```typescript +/** + * Overlay sample types that FREEZE Confirmation Threshold counting (issue #712 / ADR 0009): + * while one is active the count does not advance, and it acts as a hard boundary the + * pending run cannot cross. Included in the confirmation lookback (unlike MANUAL/DEFAULT, + * which stay transparent) so the resolver can detect the boundary. + */ +const OVERLAY_TYPES = [GC.INCIDENT, GC.MAINTENANCE]; +``` + +- [ ] **Step 3: Replace `getRecentObservedSamples` with `getRecentSamplesForConfirmation`** + +Replace the entire `getRecentObservedSamples` method with: + +```typescript + /** + * Recent samples the Confirmation Threshold resolver needs, newest first: scheduled-check + * observations (REALTIME/TIMEOUT/ERROR) plus incident/maintenance overlays. MANUAL pushes + * and DEFAULT fill are excluded — they stay transparent to the counter. Returns `type` so + * the resolver can skip NO_DATA observations (neutral) and stop at overlay rows (freeze). + */ + async getRecentSamplesForConfirmation( + monitor_tag: string, + beforeTs: number, + limit: number, + ): Promise> { + return await this.knex("monitoring_data") + .select("timestamp", "status", "raw_status", "type") + .where("monitor_tag", monitor_tag) + .where("timestamp", "<", beforeTs) + .whereIn("type", [...OBSERVED_CHECK_TYPES, ...OVERLAY_TYPES]) + .orderBy("timestamp", "desc") + .limit(limit); + } +``` + +- [ ] **Step 4: Update the db singleton delegation** + +In `src/lib/server/db/dbimpl.ts`, rename both the declaration and the binding: +- Declaration: change `getRecentObservedSamples!: MonitoringRepository["getRecentObservedSamples"];` to `getRecentSamplesForConfirmation!: MonitoringRepository["getRecentSamplesForConfirmation"];` +- Binding: change `this.getRecentObservedSamples = this.monitoring.getRecentObservedSamples.bind(this.monitoring);` to `this.getRecentSamplesForConfirmation = this.monitoring.getRecentSamplesForConfirmation.bind(this.monitoring);` + +(The resolver is the only caller; it's updated in Task 2. `backfillConfirmedStatus` is unchanged.) + +- [ ] **Step 5: Run verification → expect pass** + +`npx vite-node verify-lookback.ts` → expect `ALL PASSED`. + +- [ ] **Step 6: Delete the script + type-check** + +`rm verify-lookback.ts`; then `npm run check`. The check WILL report an error in `confirmationThreshold.ts` (it still calls the old `getRecentObservedSamples`) — that is expected and fixed in Task 2. Confirm there are no OTHER new errors. (If you prefer a green check here, you may proceed to Task 2 before committing; but commit boundaries are cleaner per-task — note the expected resolver error in your report.) + +- [ ] **Step 7: Commit** + +```bash +git add src/lib/server/db/repositories/monitoring.ts src/lib/server/db/dbimpl.ts +git commit -m "feat(monitoring): widen confirmation lookback to include overlays + type (#756)" +``` +(append a trailing line `Co-Authored-By: Claude Fable 5 `) + +--- + +## Task 2: Resolver — skip NO_DATA (neutral) + freeze at overlay boundary + +**Files:** +- Modify: `src/lib/server/services/confirmationThreshold.ts` +- Test: throwaway `verify-resolver-hardening.ts` (repo root, deleted after) + +- [ ] **Step 1: Write the failing verification script** + +Create `verify-resolver-hardening.ts` in the repo root: + +```typescript +import knexFactory from "knex"; +import { MonitoringRepository } from "./src/lib/server/db/repositories/monitoring.js"; +import { resolveConfirmedStatus } from "./src/lib/server/services/confirmationThreshold.js"; + +function assert(c: boolean, m: string) { if (!c) { console.error("FAIL:", m); process.exit(1); } console.log("ok:", m); } +async function fresh() { + const k = knexFactory({ client: "better-sqlite3", connection: { filename: ":memory:" }, useNullAsDefault: true }); + await k.schema.createTable("monitoring_data", (t) => { + t.string("monitor_tag"); t.integer("timestamp"); t.text("status"); t.float("latency"); + t.text("type"); t.text("error_message"); t.text("raw_status"); t.primary(["monitor_tag", "timestamp"]); + }); + return { k, repo: new MonitoringRepository(k) }; +} +async function ins(repo: any, ts: number, status: string, raw: string | null, type = "REALTIME") { + await repo.insertMonitoringData({ monitor_tag: "m", timestamp: ts, status, latency: 0, type, raw_status: raw }); +} + +async function main() { + // --- NO_DATA neutrality: a NO_DATA observation mid-run neither advances nor resets the count. + { + const { repo } = await fresh(); + await ins(repo, 1, "UP", "UP"); // confirmed UP + let r = await resolveConfirmedStatus({ monitor_tag: "m", ts: 2, rawStatus: "DOWN", threshold: 3 }, repo); + await ins(repo, 2, r.status, "DOWN"); // pending (UP held) + r = await resolveConfirmedStatus({ monitor_tag: "m", ts: 3, rawStatus: "NO_DATA", threshold: 3 }, repo); + assert(r.status === "NO_DATA" && r.pendingHold === false, "NO_DATA current obs passes through as NO_DATA"); + await ins(repo, 3, "NO_DATA", "NO_DATA"); // neutral gap row + r = await resolveConfirmedStatus({ monitor_tag: "m", ts: 4, rawStatus: "DOWN", threshold: 3 }, repo); + assert(r.status === "UP" && r.pendingHold === true, "after DOWN,NO_DATA,DOWN count=2 (neutral skipped, not reset) -> still pending"); + await ins(repo, 4, r.status, "DOWN"); + r = await resolveConfirmedStatus({ monitor_tag: "m", ts: 5, rawStatus: "DOWN", threshold: 3 }, repo); + assert(r.status === "DOWN" && r.pendingHold === false, "3rd non-neutral DOWN confirms (NO_DATA did not reset)"); + const t3 = await (await fresh()); // placeholder to keep linter happy + } + + // --- NO_DATA is never absorbed by a backfill. + { + const { k, repo } = await fresh(); + await ins(repo, 1, "UP", "UP"); + let r = await resolveConfirmedStatus({ monitor_tag: "m", ts: 2, rawStatus: "DOWN", threshold: 3 }, repo); + await ins(repo, 2, r.status, "DOWN"); + await ins(repo, 3, "NO_DATA", "NO_DATA"); // neutral, between pending failures + r = await resolveConfirmedStatus({ monitor_tag: "m", ts: 4, rawStatus: "DOWN", threshold: 3 }, repo); + await ins(repo, 4, r.status, "DOWN"); + r = await resolveConfirmedStatus({ monitor_tag: "m", ts: 5, rawStatus: "DOWN", threshold: 3 }, repo); // confirms, backfills + const row3 = await k("monitoring_data").where({ monitor_tag: "m", timestamp: 3 }).first(); + assert(row3.status === "NO_DATA", "NO_DATA row never rewritten by backfill (stays grey)"); + const row2 = await k("monitoring_data").where({ monitor_tag: "m", timestamp: 2 }).first(); + assert(row2.status === "DOWN", "real pending DOWN rows are backfilled"); + } + + // --- Overlay freeze (F2): pre-overlay pending failures do NOT count toward the post-overlay run. + { + const { repo } = await fresh(); + await ins(repo, 1, "UP", "UP"); // confirmed UP + let r = await resolveConfirmedStatus({ monitor_tag: "m", ts: 2, rawStatus: "DOWN", threshold: 3 }, repo); + await ins(repo, 2, r.status, "DOWN"); // pending failure #1 (pre-overlay) + r = await resolveConfirmedStatus({ monitor_tag: "m", ts: 3, rawStatus: "DOWN", threshold: 3 }, repo); + await ins(repo, 3, r.status, "DOWN"); // pending failure #2 (pre-overlay) + await ins(repo, 4, "MAINTENANCE", "DOWN", "MAINTENANCE"); // overlay row (freeze boundary) + r = await resolveConfirmedStatus({ monitor_tag: "m", ts: 5, rawStatus: "DOWN", threshold: 3 }, repo); + assert(r.status === "UP" && r.pendingHold === true, "first post-overlay failure does NOT confirm (count froze; fresh start)"); + } + + // --- Overlay freeze: fresh count after overlay still confirms after threshold consecutive post-overlay failures. + { + const { repo } = await fresh(); + await ins(repo, 1, "UP", "UP"); + await ins(repo, 2, "MAINTENANCE", "DOWN", "MAINTENANCE"); + let r = await resolveConfirmedStatus({ monitor_tag: "m", ts: 3, rawStatus: "DOWN", threshold: 3 }, repo); + await ins(repo, 3, r.status, "DOWN"); + r = await resolveConfirmedStatus({ monitor_tag: "m", ts: 4, rawStatus: "DOWN", threshold: 3 }, repo); + await ins(repo, 4, r.status, "DOWN"); + r = await resolveConfirmedStatus({ monitor_tag: "m", ts: 5, rawStatus: "DOWN", threshold: 3 }, repo); + assert(r.status === "DOWN" && r.pendingHold === false, "3 consecutive post-overlay failures confirm DOWN"); + } + + console.log("ALL PASSED"); process.exit(0); +} +main(); +``` + +Run `npx vite-node verify-resolver-hardening.ts` → expect FAIL (current resolver resets on NO_DATA and ignores overlay boundary; also it still calls the renamed method). + +- [ ] **Step 2: Rewrite the resolver** + +Replace the body of `src/lib/server/services/confirmationThreshold.ts` from the `ConfirmationDeps` interface through the end of `resolveConfirmedStatus` with: + +```typescript +/** Minimal data access this resolver needs; defaults to the db singleton, injectable for tests. */ +export interface ConfirmationDeps { + getRecentSamplesForConfirmation: typeof db.getRecentSamplesForConfirmation; + backfillConfirmedStatus: typeof db.backfillConfirmedStatus; +} + +// Overlay sample types that freeze the count (must match OVERLAY_TYPES in the monitoring repository). +const OVERLAY_TYPES: string[] = [GC.INCIDENT, GC.MAINTENANCE]; +// Extra lookback rows beyond the threshold, to tolerate neutral (NO_DATA) rows that are skipped +// rather than counted. Pathologically neutral-dense histories may delay confirmation by a check. +const LOOKBACK_BUFFER = 10; + +/** + * Resolve the status to commit for one scheduled-check observation under Confirmation + * Threshold damping (issue #712 / ADR 0009). + * + * IMPORTANT ordering contract: this MUST be called BEFORE the current row at `ts` is + * persisted, and only when no incident/maintenance overlay is active for `ts` (the caller + * gates that — overlays freeze the count). It anchors on the most recent stored sample + * (timestamp < ts). + * + * Neutral (`NO_DATA`) observations are skipped in the count (neither advance nor reset). + * Overlay rows act as a hard boundary: the pending run never crosses one, so monitoring + * resumes with a fresh count after an incident/maintenance window. + */ +export async function resolveConfirmedStatus( + input: ResolveInput, + deps: ConfirmationDeps = db, +): Promise { + const { monitor_tag, ts, rawStatus, threshold } = input; + const observedSide = sideOf(rawStatus); + + // Neutral observation (NO_DATA): pass through untouched — written honestly as grey. + if (observedSide === null) { + return { status: rawStatus, pendingHold: false }; + } + + const recent = await deps.getRecentSamplesForConfirmation(monitor_tag, ts, threshold + LOOKBACK_BUFFER); + + // Confirmed side = the most recent non-neutral scheduled-check observation's committed + // status (overlays and NO_DATA are skipped — they don't define the confirmed side). + let confirmedSide: Side = null; + for (const row of recent) { + if (row.type !== null && OVERLAY_TYPES.indexOf(row.type) !== -1) continue; // overlay: skip for anchor + const s = sideOf(row.status); + if (s !== null) { + confirmedSide = s; + break; + } + } + + // Cold start / no usable anchor / same side: commit immediately. + if (confirmedSide === null || observedSide === confirmedSide) { + return { status: rawStatus, pendingHold: false }; + } + + // The displayed status to hold while pending = the confirmed side's actual stored status. + const confirmedStatus = confirmedSideStatus(recent, confirmedSide); + + // Opposite side: count the trailing pending run (incl. current = 1), skipping NO_DATA + // (neutral) and stopping at an overlay boundary (freeze) or a confirmed-side observation. + let pendingRun = 1; + const pendingTimestamps: number[] = []; + for (const row of recent) { + if (row.type !== null && OVERLAY_TYPES.indexOf(row.type) !== -1) break; // freeze boundary + const rawSide = sideOf(row.raw_status); + if (rawSide === null) continue; // NO_DATA: neutral — neither advance nor reset + if (rawSide === observedSide && sideOf(row.status) === confirmedSide) { + pendingRun++; + pendingTimestamps.push(row.timestamp); + } else { + break; // hit a confirmed-side observation (the anchor) + } + } + + if (pendingRun >= threshold) { + const message = + observedSide === "DOWN" ? `Down confirmed after ${threshold} consecutive checks` : null; + await deps.backfillConfirmedStatus(monitor_tag, pendingTimestamps, message); + return { status: rawStatus, pendingHold: false }; + } + + // Still pending: hold the confirmed side, display clean. + return { status: confirmedStatus, pendingHold: true }; +} + +/** The displayed status of the confirmed side — the most recent non-neutral, non-overlay row on it. */ +function confirmedSideStatus( + recent: Array<{ status: string | null; raw_status: string | null; type: string | null }>, + confirmedSide: Side, +): string { + for (const row of recent) { + if (row.type !== null && OVERLAY_TYPES.indexOf(row.type) !== -1) continue; + if (sideOf(row.status) === confirmedSide && row.status) return row.status; + } + return confirmedSide === "UP" ? GC.UP : GC.DOWN; +} +``` + +NOTE: `confirmedSideStatus` returns the actual stored status of the confirmed side (e.g. `UP`, or `DOWN`/`DEGRADED` when recovering from an unhealthy stretch), so a held row displays the correct severity. + +- [ ] **Step 3: Run verification → expect pass** + +`npx vite-node verify-resolver-hardening.ts` → expect `ALL PASSED`. Also re-run the core behaviors to ensure no regression: cold start, simple confirm, absorbed blip, symmetric recovery (threshold=1 is skipped by the caller, but the resolver must still behave if called). + +- [ ] **Step 4: Delete the script + type-check** + +`rm verify-resolver-hardening.ts`; then `npm run check`. Expect 0 errors in `confirmationThreshold.ts` (the renamed dep now matches Task 1). Pre-existing Svelte warnings OK. + +- [ ] **Step 5: Commit** + +```bash +git add src/lib/server/services/confirmationThreshold.ts +git commit -m "feat(services): NO_DATA-neutral counting and overlay freeze in resolver (#756)" +``` +(append a trailing line `Co-Authored-By: Claude Fable 5 `) + +--- + +## Task 3: Gate the resolver during an active overlay (F1) + +**Files:** +- Modify: `src/lib/server/queues/monitorExecuteQueue.ts` + +When an incident/maintenance overlay is active for the current minute, the overlay wins display and the count must freeze — the resolver must not run (no damping, no backfill) for that minute. Currently overlays (`incidentData`/`maintenanceData`) are computed *after* the resolver block, so the resolver runs unconditionally. Move the overlay computation above the resolver and gate on it. + +- [ ] **Step 1: Move overlay computation above the realtime/resolver block** + +In the worker, the lines: +```typescript + let incidentData: MonitoringResultTS = await manualIncident(monitor); + let maintenanceData: MonitoringResultTS = await manualMaintenance(monitor); +``` +currently sit AFTER the `if (exeResult) { ... }` block. Move these two lines to immediately AFTER `const serviceClient = new Service(...)` and BEFORE `const exeResult = await serviceClient.execute(ts);` so the overlay state is known before the resolver runs. (Leave the later `let defaultData` / `let mergedData` declarations where they are; only move the two overlay lines. Remove the originals from their old position.) + +- [ ] **Step 2: Gate the resolver on no active overlay** + +Change the resolver gate condition. The block currently reads: +```typescript + const threshold = Number(monitor.confirmation_threshold ?? 1); + const isScheduledCheck = ([GC.REALTIME, GC.TIMEOUT, GC.ERROR] as string[]).indexOf(exeResult.type) !== -1; + if (threshold > 1 && isScheduledCheck) { +``` +Replace with: +```typescript + const threshold = Number(monitor.confirmation_threshold ?? 1); + const isScheduledCheck = ([GC.REALTIME, GC.TIMEOUT, GC.ERROR] as string[]).indexOf(exeResult.type) !== -1; + // Confirmation Threshold freezes while an incident/maintenance overlay is active for this + // minute: the overlay wins display and the count must neither advance nor backfill (#756). + const overlayActive = incidentData[ts] !== undefined || maintenanceData[ts] !== undefined; + if (threshold > 1 && isScheduledCheck && !overlayActive) { +``` + +(`raw_status` is still recorded on `realtimeData[ts]` regardless — that line stays above the gate. During an overlay minute the row's `status` is overridden by the merge to the overlay value, `type` becomes INCIDENT/MAINTENANCE, and `raw_status` keeps the observed check.) + +- [ ] **Step 3: Type-check** + +Run `npm run check`. Expect 0 errors in `monitorExecuteQueue.ts` (`incidentData`/`maintenanceData` are now declared earlier; make sure they are not re-declared at the old position — there must be exactly one `let incidentData` / `let maintenanceData`). + +- [ ] **Step 4: Commit** + +```bash +git add src/lib/server/queues/monitorExecuteQueue.ts +git commit -m "feat(queues): freeze Confirmation Threshold during active overlays (#756)" +``` +(append a trailing line `Co-Authored-By: Claude Fable 5 `) + +--- + +## Task 4: Cleanup deferral comments + verify DEGRADED / MANUAL / DEFAULT coverage + +**Files:** +- Modify: `src/lib/server/services/confirmationThreshold.ts` (only if any "#756"/"deferred" comment remains) +- Test: throwaway `verify-756-acceptance.ts` (repo root, deleted after) + +- [ ] **Step 1: Remove stale deferral comments** + +Search the changed files for the string `#756` and the word `deferred`/`hardening slice`. Any comment that said a behavior is "deferred to #756" / "full neutrality is hardening slice #756" is now implemented — update or remove it so comments reflect reality. (Comments that simply *cite* `#756`/`ADR 0009` as the reason for the implemented behavior are fine to keep.) + +- [ ] **Step 2: Write an acceptance verification script covering the remaining ACs** + +Create `verify-756-acceptance.ts` in the repo root that wires `MonitoringRepository` + `resolveConfirmedStatus` (same harness as Task 2) and asserts: + +```typescript +// AC: DEGRADED counts as not-UP; mixed DOWN/DEGRADED run confirms unhealthy. +// confirmed UP; observe DEGRADED, DOWN with threshold 2 -> 2nd not-UP confirms; ts of the +// first (DEGRADED) backfills to DEGRADED (its own severity), current commits DOWN. +// AC: DOWN<->DEGRADED within a confirmed-unhealthy stretch commits immediately (same side, no hold). +// confirmed DOWN; observe DEGRADED -> resolver returns {status: "DEGRADED", pendingHold: false}. +// AC: MANUAL and DEFAULT rows are transparent — a MANUAL/DEFAULT row interleaved in the pending +// window neither advances nor resets the count (they are excluded from the lookback). +// confirmed UP; DOWN(pending); insert a MANUAL row and a DEFAULT row; DOWN; DOWN -> confirms at the +// 3rd REAL failure (threshold 3), MANUAL/DEFAULT ignored. +``` + +Implement those three scenarios with explicit `assert`s mirroring the comments (fresh in-memory db per scenario; insert MANUAL/DEFAULT rows via `ins(repo, ts, status, raw, "MANUAL")` / `"DEFAULT"`). Print `ALL PASSED` / exit 0. + +Run `npx vite-node verify-756-acceptance.ts` → expect `ALL PASSED`. (These should pass with the Task 1–2 code; if the MANUAL/DEFAULT scenario fails, the lookback type filter is wrong — fix Task 1.) + +- [ ] **Step 3: Delete the script** + +`rm verify-756-acceptance.ts` + +- [ ] **Step 4: Type-check + commit** + +`npm run check` (0 errors). Then: +```bash +git add -A +git commit -m "chore(confirmation-threshold): refresh comments to reflect #756 hardening" +``` +(append a trailing line `Co-Authored-By: Claude Fable 5 `) + +(If Step 1 found no comment changes and `git status` is clean after deleting the script, skip the commit and note it.) + +--- + +## Task 5: End-to-end verification + group/heartbeat confirmation + +**Files:** none (verification only) + +- [ ] **Step 1: Full type check** + +`npm run check` → 0 errors (4 pre-existing Svelte warnings OK). + +- [ ] **Step 2: Consolidated end-to-end behavior (in-memory)** + +Write a throwaway `verify-756-e2e.ts` simulating the per-minute write loop (resolver gated by overlay-active, then insert), asserting the full timeline for: +1. NO_DATA mid-outage does not reset (DOWN, NO_DATA, DOWN, DOWN with threshold 3 confirms; NO_DATA row stays grey, never backfilled). +2. Overlay freeze: pre-overlay pending failures + overlay + post-overlay failures → does not confirm until `threshold` consecutive POST-overlay failures; overlay-minute rows keep overlay status with raw_status recorded; resolver did NOT backfill during the overlay. +3. Mixed DOWN/DEGRADED run confirms unhealthy; backfilled rows keep their own severity. +4. Regression: the five #755 scenarios still pass (absorbed blip, real outage backfill, symmetric recovery, threshold=1 off, DEGRADED). + +For the overlay-active simulation, mimic the worker gate: when the minute is an overlay minute, do NOT call the resolver — insert the overlay row directly (status=overlay, type=MAINTENANCE/INCIDENT, raw_status=observed). Otherwise run the resolver then insert. + +Run it, confirm `ALL PASSED`, then `rm verify-756-e2e.ts`. + +- [ ] **Step 3: Confirm group + heartbeat need no code change** + +Read `src/lib/server/services/groupCall.ts` and `heartbeatCall.ts` and confirm in your report: +- Group: scores on members' cached `status` (the confirmed/damped side) → member grace propagates with no special-casing; the bounded group-history lag is the accepted divergence (ADR 0009). No code change. +- Heartbeat: produces `REALTIME` samples through the same pipeline → Confirmation Threshold applies uniformly with no per-type branching. No code change. + +- [ ] **Step 4: Confirm clean tree** + +`git status --porcelain` → clean (no `verify-*.ts`). `git log --oneline` shows the #756 commits. + +--- + +## Self-Review notes (already applied) + +- **AC coverage:** DEGRADED not-UP + instant severity (Task 4 + resolver `sideOf`), overlay freeze display+count (Task 3 F1 + Task 2 F2), MANUAL/DEFAULT transparent (Task 1 SQL filter + Task 4 test), NO_DATA neutral + never-backfilled (Task 2), group/heartbeat uniform (Task 5 verification). Every #756 acceptance criterion maps to a task. +- **Naming consistency:** `getRecentSamplesForConfirmation`, `OVERLAY_TYPES`, `OBSERVED_CHECK_TYPES`, `confirmedSideStatus`, `LOOKBACK_BUFFER` consistent across repo + resolver; `OVERLAY_TYPES` defined in both files must list the same `GC.INCIDENT`/`GC.MAINTENANCE`. +- **Placeholder scan:** all code steps contain concrete code; the one intentional flag (the wrong ternary in Task 2 Step 2) is explicitly corrected in Task 2 Step 3. +- **Regression guard:** Task 5 re-runs the #755 scenarios; the resolver rewrite preserves cold-start/same-side/confirm/recover semantics.