From e9d3281067f51962d551bd7df470c4bf124717fb Mon Sep 17 00:00:00 2001 From: Raj Nandan Sharma Date: Sat, 13 Jun 2026 20:29:46 +0530 Subject: [PATCH] chore: gitignore and untrack the whole docs/adr/ folder Broaden the ignore from the single 0009 ADR to the entire docs/adr/ directory and untrack the existing ADRs (0001-0008); files are kept on disk and remain in history. Co-Authored-By: Claude Fable 5 --- .gitignore | 2 +- docs/adr/0001-explicit-group-membership.md | 5 ----- .../0002-maintenance-page-keyed-by-event-id.md | 5 ----- docs/adr/0003-fail-fast-self-healing-db-pool.md | 9 --------- docs/adr/0004-home-page-api-token.md | 9 --------- .../0005-alerts-evaluate-alert-visible-samples.md | 9 --------- .../0006-manual-maintenance-event-transitions.md | 15 --------------- docs/adr/0007-problem-first-overall-status.md | 7 ------- .../adr/0008-explicit-deletes-over-fk-cascades.md | 9 --------- 9 files changed, 1 insertion(+), 69 deletions(-) delete mode 100644 docs/adr/0001-explicit-group-membership.md delete mode 100644 docs/adr/0002-maintenance-page-keyed-by-event-id.md delete mode 100644 docs/adr/0003-fail-fast-self-healing-db-pool.md delete mode 100644 docs/adr/0004-home-page-api-token.md delete mode 100644 docs/adr/0005-alerts-evaluate-alert-visible-samples.md delete mode 100644 docs/adr/0006-manual-maintenance-event-transitions.md delete mode 100644 docs/adr/0007-problem-first-overall-status.md delete mode 100644 docs/adr/0008-explicit-deletes-over-fk-cascades.md diff --git a/.gitignore b/.gitignore index d73a86c0..a6612d8a 100644 --- a/.gitignore +++ b/.gitignore @@ -38,5 +38,5 @@ translation-report.json # AI workflow docs (not version-controlled) CONTEXT.md -docs/adr/0009-confirmation-threshold-write-time-backfill.md +docs/adr/ docs/superpowers/ \ No newline at end of file diff --git a/docs/adr/0001-explicit-group-membership.md b/docs/adr/0001-explicit-group-membership.md deleted file mode 100644 index 09914f1e..00000000 --- a/docs/adr/0001-explicit-group-membership.md +++ /dev/null @@ -1,5 +0,0 @@ -# Group membership is an explicit stored list, not a rule - -When making group-monitor member selection searchable (#694), the requester also proposed dynamic membership by tag pattern (e.g. `site1-*` auto-adds matching monitors). We decided group membership stays an explicit, stored list of members. Dynamic membership contradicts the group model: each member carries an explicit weight (weights must sum to 1) and a manual execution order — a rule that adds/removes members over time would need an auto-weighting policy, silent weight redistribution when monitors are created or deleted, and an undefined execution order for matched members. Bulk needs are served in the editor instead: search plus "Add all N matching" makes large explicit groups cheap to build. - -If wildcard groups are requested again, the answer is here: it's a different feature (a rule-based aggregate without weights or order), not an extension of Group Monitors. diff --git a/docs/adr/0002-maintenance-page-keyed-by-event-id.md b/docs/adr/0002-maintenance-page-keyed-by-event-id.md deleted file mode 100644 index dccca53c..00000000 --- a/docs/adr/0002-maintenance-page-keyed-by-event-id.md +++ /dev/null @@ -1,5 +0,0 @@ -# Public maintenance page is keyed by Maintenance Event id - -The public route `/maintenances/` interprets `` as a Maintenance Event id by default (with `?type=maintenance` to address a Maintenance definition instead), even though the route param is named `maintenance_id`. Issue #723 showed this misleads API consumers: `/api/v4/maintenances` returns Maintenance ids, and linking those to `/maintenances/` lands on whatever Event happens to carry that id — an apparent "off-by-one title mismatch" with no actual data corruption. - -We considered flipping the route default to Maintenance ids but rejected it: every internal status-page link, subscriber email, and externally bookmarked URL is keyed by Event id, and all of those would break. Instead, v4 API responses carry an absolute `url` field (built from the configured Site URL) that resolves correctly — `?type=maintenance` for Maintenance objects, the plain Event-id path for Maintenance Events. Consumers should link via `url`, never by concatenating ids onto paths. diff --git a/docs/adr/0003-fail-fast-self-healing-db-pool.md b/docs/adr/0003-fail-fast-self-healing-db-pool.md deleted file mode 100644 index 99bdbec8..00000000 --- a/docs/adr/0003-fail-fast-self-healing-db-pool.md +++ /dev/null @@ -1,9 +0,0 @@ -# Fail-fast, self-healing database pool defaults - -`knexfile.ts` overrides knex's pool defaults for network databases (Postgres, MySQL): `pool.min` is 0 instead of 2, acquire/create timeouts are 15s instead of 60s/30s, and TCP keepalive is enabled on connections. All knobs are overridable via `DATABASE_*` env vars. - -Two production incidents drove this. On Railway, a Postgres outage caused every request to hang for knex's default 60s `acquireConnectionTimeout` before failing with `KnexTimeoutError`, and after the database recovered the app stayed broken until a manual restart. In Docker Swarm (#692), the overlay network's conntrack silently dropped idle TCP connections after ~20 minutes, so the first request after an idle period drew a dead socket from the pool and returned a 500; the reporter worked around it with server-side Postgres `tcp_keepalives_*` settings and asked for an application-level fix. - -Both share one root cause: knex keeps `pool.min` connections forever and never validates them. Those permanently-idle sockets are exactly the ones cloud networks (Railway proxies, Swarm overlays, k8s) silently kill, and after any database blip they wedge the pool with corpses. `min: 0` lets the reaper retire every idle connection (`idleTimeoutMillis` 30s, well under typical conntrack windows), keepalive lets the OS detect silently-dropped sockets, and the 15s timeouts turn a minute-long hang into a fast failure during an outage. - -The trade-off: a quiet instance pays connection setup on the first query after idle (tens of milliseconds), and a database that takes longer than 15s to accept connections will see failures where the old defaults would have waited a minute. Deployments with such databases can raise `DATABASE_ACQUIRE_TIMEOUT_MS` / `DATABASE_CREATE_TIMEOUT_MS` rather than the project reverting to defaults that wedge everyone else. diff --git a/docs/adr/0004-home-page-api-token.md b/docs/adr/0004-home-page-api-token.md deleted file mode 100644 index d0b26ecc..00000000 --- a/docs/adr/0004-home-page-api-token.md +++ /dev/null @@ -1,9 +0,0 @@ -# Home page is addressed as `~home` in the v4 API - -The Home Page is stored with an empty `page_path`, which can not appear as a URL segment, so `/api/v4/pages/{page_path}` could not address it at all (#737). The v4 API now accepts the special segment `~home` for it: the request middleware maps `~home` to a lookup of the empty path before handlers run. - -We considered the reporter's suggestion of a `default` keyword, and `home`, but both are valid page paths under the sanitizer (`[a-z0-9_-]`), so a real page could shadow the keyword or force reservation rules and migration edge cases. We also considered addressing pages by id, which either breaks the existing path-based contract or is ambiguous with numeric page paths. `~home` can never collide because the sanitizer strips `~`, and tilde is an RFC 3986 unreserved character, so clients never need to encode it (percent-encoded `%7Ehome` works too, since the middleware decodes segments). - -Semantics follow the server-side invariants the manage UI relies on: `PATCH` via `~home` accepts every field except `page_path`, which is fixed for the home page (the UI disables the field), and `DELETE` is rejected — `DeletePage` in `pagesController` throws "Cannot delete the home page" and the rest of the app assumes the home page exists. - -API responses also render the home page's `page_path` as `~home` (list, single, and write responses), so what a consumer reads is exactly what it can address — list → pick → `PATCH` round-trips cleanly, including read-modify-write bodies that send `~home` back (treated as "no path change"). The stored path remains empty and the public URL remains the site root; consumers must not build public URLs by concatenating `page_path`. diff --git a/docs/adr/0005-alerts-evaluate-alert-visible-samples.md b/docs/adr/0005-alerts-evaluate-alert-visible-samples.md deleted file mode 100644 index 4aaa39d0..00000000 --- a/docs/adr/0005-alerts-evaluate-alert-visible-samples.md +++ /dev/null @@ -1,9 +0,0 @@ -# Alerts evaluate alert-visible samples, not just REALTIME ones - -The consecutive-sample checks behind alert evaluation (`consecutivelyStatusFor`, `consecutivelyLatencyGreaterThan`, `consecutivelyLatencyLessThan` in `src/lib/server/db/repositories/monitoring.ts`) consider samples whose type is `REALTIME`, `ERROR`, `TIMEOUT`, `MANUAL`, or `DEFAULT_STATUS` — the "alert-visible" set — instead of `REALTIME` only. Both data-API PATCH endpoints (single timestamp and range) enqueue one alert evaluation after writing `MANUAL` rows. `SIGNAL` rows and `INCIDENT`/`MAINTENANCE` overlay rows remain invisible to alerting. - -Two issues drove this. In #633, a GameDig monitor showed DOWN on the status page but never alerted: a down game server makes `GameDig.query` throw, so every down-sample is recorded as `ERROR`, which the old `type = REALTIME` filter excluded — the "N consecutive DOWN" condition could never become true. The same failure mode silently broke gRPC, SQL, and SSL monitors (hard-down records `ERROR`) and API monitors whose outage manifests as timeouts (`TIMEOUT`). In #720, a NONE monitor driven by the data API never alerted for two stacked reasons: PATCH writes `MANUAL` rows the filter excluded, and the endpoint never enqueued evaluation at all. The status page and UPTIME alerts have no type filter, which is why users saw DOWN while alerts stayed silent. - -The whitelist is exactly the set of types written by flows that trigger alert evaluation — scheduler checks (`REALTIME`/`ERROR`/`TIMEOUT`), default-status fill (`DEFAULT_STATUS`), and data-API pushes (`MANUAL`). That invariant ("every enqueuer of evaluation contributes a row the evaluator can see") is what keeps the un-time-bounded last-N query self-healing: without `DEFAULT_STATUS` in the set, a NONE monitor with a default status evaluates every minute against rows it cannot see, so stale backfilled `MANUAL` rows would rank as "the last N" and fire alerts weeks after the fact. The rejected alternatives: fixing only `gamedigCall` to emit `REALTIME` on query failure (fixes one monitor type out of five, loses the stored down-vs-errored diagnostic, and does nothing for existing data), and time-bounding the query (the cutoff must scale with each monitor's cron, which means parsing cron expressions in the alert path for marginal benefit). - -The trade-offs, accepted deliberately for one uniform rule across status and latency alerts: a service that degrades from slow to hard-down resolves an active latency alert (error samples carry latency 0, satisfying "consecutively below threshold") — the status alert is the one that covers outages; a NONE monitor with a default status auto-resolves MANUAL-pushed alerts once default fill resumes, because a default status is an explicit statement that absence of pushes means that status; and the fix is retroactive, so monitors that were already down at upgrade time alert shortly after — which is the bug report, inverted. diff --git a/docs/adr/0006-manual-maintenance-event-transitions.md b/docs/adr/0006-manual-maintenance-event-transitions.md deleted file mode 100644 index a1103109..00000000 --- a/docs/adr/0006-manual-maintenance-event-transitions.md +++ /dev/null @@ -1,15 +0,0 @@ -# Manual maintenance-event transitions rewrite the record to what actually happened - -Issue #722 asked for a way to end a maintenance early instead of deleting it or editing its timer. We added manual transitions on Maintenance Events with one governing principle: **the event record reflects what actually happened, not what was planned.** Manually completing or cancelling an ONGOING event truncates `end_date_time` to the moment of the transition; cancelling a not-yet-started event (SCHEDULED/READY) keeps its planned window, since it never affected monitors. We rejected status-only updates (the record would forever claim a 4-hour window for a 1-hour maintenance, and every reader would have to compensate by also checking status). - -Allowed transitions: ONGOING → COMPLETED, and SCHEDULED/READY/ONGOING → CANCELLED. COMPLETED means the work finished; CANCELLED means the occurrence was called off. Both are terminal — no relabeling, no un-cancel. Reverse transitions were rejected because the minute-scheduler's catch-up query would immediately flip a resurrected past-start SCHEDULED event back to ONGOING and fire "in progress" notifications. The escape hatch for a mistaken cancel is editing the Maintenance schedule, which regenerates events. - -Consequences that fall out of existing code rather than new code: - -- Monitor impact stops immediately on transition — the realtime query filters on `status IN (SCHEDULED, READY, ONGOING)`, so no time rewrite is needed for that. -- For recurring Maintenances, cancelling an occurrence permanently skips it: event generation dedups by `start_date_time` against **all** existing events regardless of status, so the CANCELLED row blocks regeneration. This is intended behavior, not an accident — do not "fix" the dedup to ignore CANCELLED rows. -- Public "upcoming"/"ongoing" lists already filter by status, so cancelled events vanish from them while remaining visible (with a CANCELLED badge) on the event detail and events-by-month history pages. - -API shape: the transition rides the existing `PATCH /api/v4/maintenances/{maintenance_id}/events/{event_id}` as a second, mutually exclusive mode — either `start_date_time`+`end_date_time` (window edit, unchanged) or `status` alone (transition). Mixing them is a 400, which avoids deciding who wins when an explicit `end_date_time` contradicts the truncation rule. A dedicated action sub-route was rejected as a break from v4's CRUD style. - -Notifications: every transition to a terminal status — natural or manual, COMPLETED or CANCELLED — notifies subscribers under the existing `ended` toggle. A dedicated `cancelled` toggle was rejected: the settings object is stored whole, so existing sites would silently default the new key to off, and the uniform reading of `ended` ("tell me when an event reaches its end state") makes a separate toggle unnecessary. diff --git a/docs/adr/0007-problem-first-overall-status.md b/docs/adr/0007-problem-first-overall-status.md deleted file mode 100644 index 7b11d297..00000000 --- a/docs/adr/0007-problem-first-overall-status.md +++ /dev/null @@ -1,7 +0,0 @@ -# Overall Status is problem-first: DOWN > DEGRADED > MAINTENANCE > UP - -Everywhere a set of monitor statuses collapses into one display status — the page banner (`GetStatusSummary`/`GetStatusColor` in `src/lib/clientTools.ts`), the per-day bar summaries, and the all-monitors `_` badge and dot badge (`GetLatestStatusActiveAll` in `src/lib/server/controllers/monitorsController.ts`) — the same worst-wins ordering applies: DOWN > DEGRADED > MAINTENANCE > UP, with NO_DATA only when no monitor has any data at all. - -Issue #717 exposed that the codebase had three independent answers to "what does maintenance mean when aggregating". The frontend banner checked maintenance first, so one monitor in a planned window reported "Under Maintenance" even while another monitor was hard DOWN — a real outage masked by planned work. The badge loop had no MAINTENANCE branch at all, so maintenance samples were silently skipped (DEGRADED+UP+MAINTENANCE → "Degraded", disagreeing with the banner) and a fleet entirely under maintenance fell through to "No Status Available". Group Monitor scoring counts maintenance as UP. With the same monitors, the page and the badge told different stories, which breaks any automation treating either as the source of truth. - -Problem-first won because maintenance is planned and acknowledged while DOWN/DEGRADED are active problems users are hitting now; a status page that says "Under Maintenance" during an unrelated outage is lying in the reassuring direction. The rejected alternatives: canonicalizing the frontend's maintenance-first order (makes the masking bug the contract), and treating maintenance as UP at the display level for uniformity with group scoring (erases planned windows from badges that already have an "Under Maintenance" label). Group Monitor scoring deliberately keeps maintenance≈UP — it answers a different question (a member's planned work should not tank the group's derived status), whereas Overall Status answers "what should a visitor see". Also affirmed here rather than changed: the `_` badge is site-wide (every ACTIVE, non-hidden monitor), not page-scoped, so on multi-page installs it legitimately need not match any single page's banner — a documentation fact, not a bug. diff --git a/docs/adr/0008-explicit-deletes-over-fk-cascades.md b/docs/adr/0008-explicit-deletes-over-fk-cascades.md deleted file mode 100644 index 44c9e278..00000000 --- a/docs/adr/0008-explicit-deletes-over-fk-cascades.md +++ /dev/null @@ -1,9 +0,0 @@ -# Delete paths remove child rows explicitly and never rely on FK cascades - -The schema declares `ON DELETE CASCADE` foreign keys (alert configs → monitors, trigger/monitor junctions → configs, v2 alerts → configs), but SQLite — the default deployment — never enforces them: knex does not enable the `foreign_keys` pragma, and `knexfile.ts` does not either. Code that trusted the declared cascades (`deleteMonitorAlertConfig`, and `DeleteMonitorCompletelyUsingTag`, which skipped alert configs entirely) silently orphaned junction rows, configs, and `monitor_alerts_v2` state on SQLite, while behaving correctly on Postgres and MySQL (#716). - -We considered fixing the root cause instead: enabling `PRAGMA foreign_keys = ON` per connection via `pool.afterCreate`. That would make every declared cascade real and align the three databases. We rejected it because knex implements SQLite `ALTER TABLE` as a table rebuild, which violates FK constraints transiently — the multi-monitor-alerts migration already has to toggle the pragma off around its own rebuild — so enforcement would put every past and future migration on a tightrope, and it changes write-failure behavior for all existing SQLite installs at once (inserts that used to succeed against missing parents would start throwing). - -The rule is the opposite and uniform: a delete path owns its children and removes them explicitly, in child-before-parent order. `deleteMonitorAlertConfig` deletes v2 alerts, trigger junctions, and monitor junctions before the config row; `deleteMonitorAlertConfigsByMonitorTag` detaches the monitor — deleting its per-monitor `monitor_alerts_v2` rows, which outlive the detach when the config is shared — and routes zero-monitor configs through that same method; `DeleteMonitorCompletelyUsingTag` calls it alongside its other per-tag cleanups. The declared cascades stay in the schema — on Postgres/MySQL they make the explicit deletes idempotent no-ops, and they document intent — but no code may depend on them. - -A consequence: shared alert configs (one config spanning several monitors) survive the deletion of one member; a config dies only when its last monitor is detached. Monitor deletion also strips the tag from group monitors and rebalances remaining member weights equally — deletion never produces a stale member (see CONTEXT.md, "Stale Member").