docs: document Last known status default and amend ADR 0005

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Raj Nandan Sharma
2026-06-07 22:53:57 +05:30
parent 758cf5e4d5
commit c4c16d65a6
2 changed files with 37 additions and 1 deletions
@@ -1,6 +1,6 @@
# Alerts evaluate alert-visible samples, not just REALTIME ones
The consecutive-sample checks behind alert evaluation (`consecutivelyStatusFor`, `consecutivelyLatencyGreaterThan`, `consecutivelyLatencyLessThan` in `src/lib/server/db/repositories/monitoring.ts`) consider samples whose type is `REALTIME`, `ERROR`, `TIMEOUT`, `MANUAL`, or `DEFAULT_STATUS` — the "alert-visible" set — instead of `REALTIME` only. Both data-API PATCH endpoints (single timestamp and range) enqueue one alert evaluation after writing `MANUAL` rows. `SIGNAL` rows and `INCIDENT`/`MAINTENANCE` overlay rows remain invisible to alerting.
The consecutive-sample checks behind alert evaluation (`consecutivelyStatusFor`, `consecutivelyLatencyGreaterThan`, `consecutivelyLatencyLessThan` in `src/lib/server/db/repositories/monitoring.ts`) consider samples whose type is `REALTIME`, `ERROR`, `TIMEOUT`, `MANUAL`, or `DEFAULT_STATUS` — the "alert-visible" set — instead of `REALTIME` only. Both data-API PATCH endpoints (single timestamp and range) enqueue one alert evaluation after writing `MANUAL` rows. `SIGNAL` rows and `INCIDENT`/`MAINTENANCE` overlay rows remain invisible to alerting. Amended by ADR 0006: last-known-status fill (`CARRIED`) later joined the alert-visible set under the same invariant.
Two issues drove this. In #633, a GameDig monitor showed DOWN on the status page but never alerted: a down game server makes `GameDig.query` throw, so every down-sample is recorded as `ERROR`, which the old `type = REALTIME` filter excluded — the "N consecutive DOWN" condition could never become true. The same failure mode silently broke gRPC, SQL, and SSL monitors (hard-down records `ERROR`) and API monitors whose outage manifests as timeouts (`TIMEOUT`). In #720, a NONE monitor driven by the data API never alerted for two stacked reasons: PATCH writes `MANUAL` rows the filter excluded, and the endpoint never enqueued evaluation at all. The status page and UPTIME alerts have no type filter, which is why users saw DOWN while alerts stayed silent.
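
The effect of widening the filter can be sketched as follows. This is a minimal illustration, not the code in `monitoring.ts`: the `Sample` shape, the `lastNHaveStatus` helper, and the assumption that the check looks at the N most recent visible samples are all hypothetical.

```typescript
// Hypothetical row shape; the real schema may differ.
type SampleType =
  | "REALTIME" | "ERROR" | "TIMEOUT" | "MANUAL" | "DEFAULT_STATUS"
  | "SIGNAL" | "INCIDENT" | "MAINTENANCE";

interface Sample {
  timestamp: number; // unix seconds
  type: SampleType;
  status: "UP" | "DOWN" | "DEGRADED";
}

// The alert-visible set named in this ADR.
const ALERT_VISIBLE: ReadonlySet<SampleType> = new Set<SampleType>([
  "REALTIME", "ERROR", "TIMEOUT", "MANUAL", "DEFAULT_STATUS",
]);

// The old behavior: only REALTIME samples were considered.
const REALTIME_ONLY: ReadonlySet<SampleType> = new Set<SampleType>(["REALTIME"]);

// True when the `count` most recent samples whose type is in `visible`
// all carry `status`.
function lastNHaveStatus(
  samples: Sample[],
  status: Sample["status"],
  count: number,
  visible: ReadonlySet<SampleType>,
): boolean {
  const recent = samples
    .filter((s) => visible.has(s.type))
    .sort((a, b) => b.timestamp - a.timestamp)
    .slice(0, count);
  return recent.length === count && recent.every((s) => s.status === status);
}

// A GameDig-style outage: healthy REALTIME samples, then ERROR rows once the
// query starts throwing.
const gameServer: Sample[] = [
  { timestamp: 60, type: "REALTIME", status: "UP" },
  { timestamp: 120, type: "REALTIME", status: "UP" },
  { timestamp: 180, type: "REALTIME", status: "UP" },
  { timestamp: 240, type: "ERROR", status: "DOWN" },
  { timestamp: 300, type: "ERROR", status: "DOWN" },
  { timestamp: 360, type: "ERROR", status: "DOWN" },
];

const oldResult = lastNHaveStatus(gameServer, "DOWN", 3, REALTIME_ONLY);
const newResult = lastNHaveStatus(gameServer, "DOWN", 3, ALERT_VISIBLE);
```

With the REALTIME-only filter the three most recent visible samples are the stale UP rows, so the DOWN condition never holds. With the wider set the three ERROR rows satisfy it.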
@@ -29,6 +29,42 @@ So realtime checks do not override an active maintenance or incident state.
Monitors run from cron expressions (for example `* * * * *` for every minute). Use tighter schedules for critical services and relaxed schedules for low-risk dependencies.
## Default Status {#default-status}
Default Status is the monitor's answer to the question: **what does a minute with no monitoring sample mean?**
| Value | Behavior |
|---|---|
| `NONE` | Gap minutes show as no data (gray) |
| `UP` | A `DEFAULT` sample is written each minute marking the service UP |
| `DOWN` | A `DEFAULT` sample is written each minute marking the service DOWN |
| `DEGRADED` | A `DEFAULT` sample is written each minute marking the service DEGRADED |
| `LAST_KNOWN` | Each minute without a new sample, Kener writes a `CARRIED` row repeating the most recent alert-visible status and latency |
### Last known status {#last-known-status}
`LAST_KNOWN` is only available on **Manual (`NONE`-type) monitors**. If you select it on any other monitor type, the API resets it to `UP`. Changing a monitor's type away from Manual also resets it to `UP`.
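
The reset rule can be pictured as a small normalization step. The sketch below is an assumption about shape only: the function name `normalizeDefaultStatus` and the `"API"` type string are hypothetical, and the real validation may live elsewhere.

```typescript
type DefaultStatus = "NONE" | "UP" | "DOWN" | "DEGRADED" | "LAST_KNOWN";

// "NONE" is the Manual monitor type; any other type string is treated as
// non-Manual. The type strings are illustrative.
function normalizeDefaultStatus(
  monitorType: string,
  requested: DefaultStatus,
): DefaultStatus {
  // LAST_KNOWN is only valid on Manual monitors; otherwise fall back to UP.
  if (requested === "LAST_KNOWN" && monitorType !== "NONE") {
    return "UP";
  }
  return requested;
}

const onManual = normalizeDefaultStatus("NONE", "LAST_KNOWN");
const onOther = normalizeDefaultStatus("API", "LAST_KNOWN");
```

Running the same function when a monitor's type changes would cover the second case described above, since a Manual monitor that becomes another type is re-normalized to `UP`.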
How it works:
- On every scheduler tick with no new data, Kener writes a `CARRIED` sample copying the status and latency of the most recent alert-visible sample.
- Carry is tick-forward only — it starts at the next scheduler tick after you save the setting, with no backfill of past gaps.
- Carried rows persist in history even if you later change the setting.
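
The three points above can be sketched as one per-tick function. This is a simplified model, not Kener's scheduler: `CarryRow`, `carryOnTick`, the minute-index timestamps, and the contents of `CARRY_SOURCES` are assumptions made for illustration.

```typescript
type CarryStatus = "UP" | "DOWN" | "DEGRADED";

interface CarryRow {
  minute: number; // minute index, simplified for the example
  type: string;
  status: CarryStatus;
  latency: number;
}

// Illustrative stand-in for the alert-visible set a carry may copy from.
const CARRY_SOURCES: ReadonlySet<string> = new Set([
  "REALTIME", "ERROR", "TIMEOUT", "MANUAL", "CARRIED",
]);

// Called once per scheduler tick for the current minute only, which is why
// earlier gaps are never backfilled. Returns the row written, or null.
function carryOnTick(history: CarryRow[], minute: number): CarryRow | null {
  // New data already exists for this minute: nothing to carry.
  if (history.some((r) => r.minute === minute)) return null;

  // Find the most recent eligible row before this minute.
  let source: CarryRow | null = null;
  for (const r of history) {
    if (r.minute < minute && CARRY_SOURCES.has(r.type)) {
      if (source === null || r.minute > source.minute) source = r;
    }
  }
  if (source === null) return null; // nothing to repeat yet

  const carried: CarryRow = {
    minute,
    type: "CARRIED",
    status: source.status,
    latency: source.latency,
  };
  history.push(carried); // carried rows stay in history
  return carried;
}

// One DOWN push at minute 100, then three quiet ticks.
const carryHistory: CarryRow[] = [
  { minute: 100, type: "MANUAL", status: "DOWN", latency: 100 },
];
for (const m of [101, 102, 103]) carryOnTick(carryHistory, m);
```

After the loop, `carryHistory` holds the original push plus three `CARRIED` rows with status DOWN and latency 100. Minutes before 100 remain empty because no tick ever revisits them.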
Example push flow:
```bash
curl -X PATCH 'https://status.example.com/api/v4/monitors/my-service/data/<current_unix_minute>' \
-H 'Authorization: Bearer <api-key>' \
-H 'Content-Type: application/json' \
--data '{"status": "DOWN", "latency": 100}'
# With Default Status = Last known status, the monitor stays DOWN until you push UP.
```
> [!WARNING]
> - If your integration stops sending, the page keeps showing the last status indefinitely — Kener cannot tell "still up" from "stopped reporting". Use a [Heartbeat monitor](/docs/v4/monitors/heartbeat) to catch a silent integration.
> - Carried minutes count toward alert thresholds: after a single DOWN push, each carried DOWN minute adds to the consecutive-failure count, so alerts fire once your failure threshold is reached and stay triggered until you push a recovery.
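
To make the second warning concrete, here is a small sketch of how one push can reach a threshold. The `trailingStreak` helper and the threshold of 3 are hypothetical. It only counts the run of identical statuses at the end of a timeline and does not model Kener's alert engine.

```typescript
// Counts how many entries at the end of `statuses` equal `status`.
function trailingStreak(statuses: string[], status: string): number {
  let n = 0;
  for (let i = statuses.length - 1; i >= 0 && statuses[i] === status; i--) {
    n++;
  }
  return n;
}

const failureThreshold = 3; // illustrative value

// UP, UP, then one pushed DOWN followed by two carried DOWN minutes.
const beforeRecovery = ["UP", "UP", "DOWN", "DOWN", "DOWN"];
const alerting = trailingStreak(beforeRecovery, "DOWN") >= failureThreshold;

// Pushing UP ends the streak.
const afterRecovery = [...beforeRecovery, "UP"];
const stillAlerting = trailingStreak(afterRecovery, "DOWN") >= failureThreshold;
```

Here `alerting` is true even though only one DOWN was ever pushed, and `stillAlerting` becomes false only after the explicit UP push.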
## Uptime calculation {#uptime-calculation}
Default uptime formula: