wip: Declarative alerting starter (host-direct + OpenObserve capture)

lytedev commented

2026-07-01 10:57:14 -05:00

Owner

What & why

M6 security-audit follow-up to Daniel's question: "I THINK we have alerting in OpenObserve, but that isn't clear from the repo — can we encode it into the repo somehow?"

Discovered live reality (documented in `lib/doc/alerting.md`)

Telemetry: every lyte.server host runs one OpenTelemetry collector (lib/modules/nixos/server.nix) scraping hostmetrics + node_exporter (systemd unit states) + journald/filelog and shipping to OpenObserve. On beefcake it also scrapes zfs/postgres exporters.
Classic Prometheus + Grafana are dead code: prometheus.nix is not imported and grafana.nix is commented out ("replaced by OpenObserve"). So there is no rule engine and no Alertmanager running. node_exporter answers on :9100 only because the OTel collector scrapes it.
Only repo-encoded alerting was disk-alerts.nix (smartd/ZED → Matrix hookshot), deliberately host-direct.
OpenObserve alerts: the API endpoints exist (/api/default/alerts{,/templates,/destinations} all return 401), but enumerating the actual definitions needs the OO root creds — that live read was blocked, so whether any OO alerts currently exist is unconfirmed. The export helper below captures them.

Approach: don't rebuild the stack Daniel removed

Adding Prometheus+Alertmanager would duplicate a stack that was deliberately retired. Instead this follows the boundary disk-alerts.nix already set — box/pool-in-trouble alerts stay host-direct (must not depend on OO, which lives on that box/pool); metrics/log alerts go through OO.

Tier 1: host-direct, OO-independent (matrix-alerts.nix). It posts to the webhook URL held in the same disk-alert-webhook-url secret (no new secret, works on deploy):

OnFailure → Matrix on the critical units: caddy, stalwart, forgejo, tuwunel, knot, headscale (verified wired via nix eval). Fires with unit status + recent journal on crash/fail; a normal deploy restart is not a failure.
disk ≥ 90% hourly timer.
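
As an illustration of the disk check only, here is a minimal Python sketch of a "usage at or above 90%" test. It is not the script shipped in matrix-alerts.nix; the function names `used_percent`, `over_threshold`, and `check_path` are hypothetical, and the real module may compute usage differently (for example from `df` or ZFS pool capacity).

```python
import shutil


def used_percent(total: int, used: int) -> float:
    """Return used space as a percentage of total (0.0 if total is unknown)."""
    if total <= 0:
        return 0.0
    return 100.0 * used / total


def over_threshold(total: int, used: int, threshold: float = 90.0) -> bool:
    """True when usage is at or above the threshold (the PR uses >= 90%)."""
    return used_percent(total, used) >= threshold


def check_path(path: str = "/", threshold: float = 90.0) -> bool:
    """Check one mounted filesystem; an hourly timer would call this per path."""
    usage = shutil.disk_usage(path)
    return over_threshold(usage.total, usage.used, threshold)
```

Keeping the comparison in a pure function makes the boundary case (exactly 90%) easy to test without touching a real filesystem.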

Tier 2 — OpenObserve capture (openobserve-alerts/), the direct answer to "encode it into the repo":

openobserve-alerts-export (installed to $PATH): read-only dump of live alerts/templates/destinations → JSON under definitions/ to commit.
Opt-in reconcile oneshot (lyte.openobserveAlerts.enable, default off) that PUTs the committed JSON back to OO. It is off by default because it mutates live OO state and could not be tested here. The API paths were confirmed from the OO router.
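
The export side can be pictured roughly as follows. This is a sketch under stated assumptions, not the contents of openobserve-alerts-export: the three API paths are the ones named in this PR, but the file names (`alerts.json` and so on), the helper names `export_targets` and `write_definition`, and the base URL are hypothetical. Authentication and the actual HTTP fetch are left out on purpose, since the response schema has not been validated against the live OO instance.

```python
import json
from pathlib import Path

# Paths from the PR text: /api/default/alerts{,/templates,/destinations}
ENDPOINTS = {
    "alerts": "/api/default/alerts",
    "templates": "/api/default/alerts/templates",
    "destinations": "/api/default/alerts/destinations",
}


def export_targets(base_url: str, out_dir: str = "definitions") -> dict:
    """Map each definition kind to (URL to read, JSON file to write)."""
    base = base_url.rstrip("/")
    return {
        name: (base + path, Path(out_dir) / f"{name}.json")  # file names are an assumption
        for name, path in ENDPOINTS.items()
    }


def write_definition(path: Path, payload: object) -> None:
    """Write JSON with sorted keys and fixed indentation so diffs stay reviewable."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(payload, indent=2, sort_keys=True) + "\n")
```

Sorting keys on write keeps repeated exports byte-stable, so a commit under definitions/ changes only when the live definitions actually change.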

Testing

nix build .#nixosConfigurations.beefcake…toplevel ✅
nix eval confirms OnFailure on all 6 critical units + reconcile default = false ✅
python3 -m py_compile reconcile.py ✅
Not deployed (per task: prepare + build-check + PR only).

Follow-ups (noted in the doc)

beefcake-down isn't self-detectable (Tier 1 + OO both run on beefcake) — needs an external watcher (pebble/rascal) or an OO alert over remote hosts.
Run the export as Daniel to confirm/capture any existing OO alerts; validate the reconcile schema against the running OO version before enabling.
Delete the dead prometheus.nix/grafana.nix (separate PR).
Extract a shared lyte.matrix-notify helper (the hookshot poster is intentionally duplicated from disk-alerts.nix, per repo policy against inline refactors).


lytedev added 1 commit

2026-07-01 10:57:14 -05:00

feat(beefcake): declarative alerting starter (host-direct + OpenObserve capture)

/ check-format (push) Has been cancelled

/ build (push) Has been cancelled

563829c1fa

M6 security-audit follow-up. Before this, the only alerting encoded in the
repo was disk-alerts.nix (smartd/ZED -> Matrix); metrics/logs are collected
by the OpenTelemetry collector -> OpenObserve, but classic Prometheus and
Grafana are unimported dead code, so there is no rule engine / Alertmanager,
and any OpenObserve alerts live only in its DB/UI (invisible to the repo).

Rather than stand up a parallel Prometheus+Alertmanager stack that Daniel
deliberately removed, this follows the boundary disk-alerts.nix already set:
box/pool-in-trouble alerts stay host-direct; metrics/log alerts go through OO.

- lib/doc/alerting.md: documents the discovered live reality + the design.
- matrix-alerts.nix (Tier 1, host-direct, OO-independent): OnFailure -> Matrix
  on the critical units (caddy, stalwart, forgejo, tuwunel, knot, headscale)
  and an hourly disk>=90% timer. Reuses the disk-alert-webhook-url secret, so
  no new secret and it works on deploy.
- openobserve-alerts/ (Tier 2): openobserve-alerts-export dumps live OO
  alerts/templates/destinations to JSON for committing, plus an opt-in
  (default-off) reconcile oneshot that re-applies them via the OO API.

Build-checked (beefcake toplevel); not deployed. OO alert enumeration was
blocked (couldn't auth to the live API), so the export/reconcile schema needs
a live populate+validate pass by Daniel.

lytedev added 1 commit

2026-07-01 11:13:42 -05:00

docs(alerting): document the val.town external uptime monitor (Tier 0)

/ check-format (push) Has been cancelled

/ build (push) Has been cancelled

cccaba90c6

Daniel already runs an off-site val.town cron (lytedev/SimpleSiteUptimeMonitor)
that GETs files.lyte.dev + openobserve.h.lyte.dev and emails via val.town's own
std/email if either is >=400/unfetchable. That is the external dead-man's-switch
for a total beefcake/Caddy outage — the one gap Tier 1/Tier 2 (both on beefcake)
can't cover, and its notification path is independent of beefcake's Stalwart.

- alerting.md: add a Tier 0 section; reword the beefcake-down gap to note it's
  covered externally, with a follow-up to broaden the endpoint list beyond
  files/openobserve (so a single-service outage behind a healthy Caddy also
  pages) and optionally add a same-tailnet backup watcher.
- external-uptime-monitor.tsx: source snapshot for the record (the val is
  authoritative; not built by the flake).

lytedev changed title from "Declarative alerting starter (host-direct + OpenObserve capture)" to "wip: Declarative alerting starter (host-direct + OpenObserve capture)"

2026-07-01 11:25:49 -05:00

lytedev added 1 commit

2026-07-01 11:48:07 -05:00

docs(alerting): correct Tier 0 notification-path claim (email delivery depends on beefcake)

/ check-format (push) Has been cancelled

/ build (push) Has been cancelled

5a1294b285

Earlier wording over-claimed that the val.town watcher's alert 'still arrives
when beefcake is down'. Only the SENDING is external: std/email with no to:
field delivers to the val.town account owner's registered address, and if that
is a @lyte.dev mailbox on Stalwart, the alert queues at the VPS relay and isn't
readable until beefcake returns — defeating the dead-man's-switch. Flag the
requirement (off-beefcake mailbox or a fully-external push channel) prominently
and note the Tier-1 Matrix/hookshot path is likewise beefcake-dependent.

lytedev added 1 commit

2026-07-01 11:54:50 -05:00

docs(alerting): switch Tier 0 external watcher to ntfy.sh push

/ check-format (push) Has been cancelled

/ build (push) Has been cancelled

9eba6c5085

Email delivery to @lyte.dev depends on beefcake's Stalwart, so it can't reach
Daniel during a beefcake outage. Replace it with an ntfy.sh push: a hosted
service reached with one fetch and read by a phone app, so both detection and
notification stay entirely off beefcake.

- external-uptime-monitor.tsx: rewrite to POST to ntfy.sh; topic/token/server
  come from val.town env vars (NTFY_TOPIC/NTFY_TOKEN/NTFY_SERVER) so the topic
  isn't committed; emoji via Tags header (ASCII-only header values), urgent
  priority; expanded endpoint list (files/mail/git/matrix/openobserve).
- alerting.md: rewrite the Tier 0 section for ntfy, add one-time setup steps
  (reserved topic + token recommended), and mark the beefcake-down gap covered.
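
For readers unfamiliar with ntfy, the push described above amounts to a single HTTP POST whose body is the message and whose headers carry the title, priority, and tags. The sketch below is Python for illustration only; the real watcher is the TypeScript val, and the function name `build_ntfy_request`, the `rotating_light` tag, and the example topic are assumptions rather than values from external-uptime-monitor.tsx. It only builds the request and does not send it.

```python
import urllib.request
from typing import Optional


def build_ntfy_request(
    server: str,
    topic: str,
    title: str,
    message: str,
    token: Optional[str] = None,
) -> urllib.request.Request:
    """Build (but do not send) a POST that publishes one message to an ntfy topic."""
    headers = {
        "Title": title,            # header values must stay ASCII
        "Priority": "urgent",      # ntfy's highest priority level
        "Tags": "rotating_light",  # emoji short code, rendered by ntfy clients
    }
    if token:
        # Access token for a reserved topic; optional for public topics.
        headers["Authorization"] = f"Bearer {token}"
    url = server.rstrip("/") + "/" + topic
    return urllib.request.Request(
        url, data=message.encode("utf-8"), headers=headers, method="POST"
    )
```

Passing the returned request to `urllib.request.urlopen` would publish it. Reading the server, topic, and token from the environment, as the val does, keeps the topic name out of the repository.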

lytedev added 1 commit

2026-07-01 12:04:21 -05:00

chore(secrets): add ntfy-sh-topic-url to beefcake secrets

/ check-format (push) Has been cancelled

/ build (push) Has been cancelled

dfe42a6916

The ntfy topic URL for the Tier 0 external uptime watcher (see
lib/doc/alerting.md). Currently unreferenced by any nix module — the val.town
watcher reads the topic from its own env, not sops — so this is captured for the
record and for a future same-tailnet backup watcher (pebble/rascal) that could
push to the same topic without depending on val.town.

lytedev added 1 commit

2026-07-01 12:05:30 -05:00

docs(alerting): Tier 0 watcher emails AND pushes to ntfy (dual-channel)

/ check-format (push) Successful in 7s

/ build (push) Successful in 6m10s

ed5f40d584

Keep the std/email leg as a durable backup/record and add the ntfy.sh push as
the reliable, beefcake-independent alert. ntfy topic comes from NTFY_URL (full
topic URL, matching the sops ntfy-sh-topic-url key) + optional NTFY_TOKEN; both
channels fire best-effort so one failing doesn't suppress the other.
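
The "best-effort" behaviour can be sketched as a fan-out that isolates each channel's failure. This is a Python illustration of the idea, not the val's TypeScript; `notify_all` and the channel names are hypothetical.

```python
from typing import Callable, Dict, Optional


def notify_all(
    channels: Dict[str, Callable[[str], None]], message: str
) -> Dict[str, Optional[str]]:
    """Call every channel in turn; return name -> error text (None on success)."""
    results: Dict[str, Optional[str]] = {}
    for name, send in channels.items():
        try:
            send(message)
            results[name] = None
        except Exception as exc:  # one failing channel must not stop the rest
            results[name] = f"{type(exc).__name__}: {exc}"
    return results
```

With an email sender that raises and an ntfy sender that succeeds, the ntfy push still goes out and the returned mapping records the email error for logging.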

lytedev added 1 commit

2026-07-01 14:46:15 -05:00

docs(alerting): Tier 0 watcher hits health endpoints, not roots

/ check-format (push) Successful in 8s

/ build (push) Successful in 6m25s

8868eff18a

mail.lyte.dev serves 404 at / even when healthy (would false-page every run);
switch each check to its health/semantic endpoint (verified live): Stalwart
/healthz/live, Forgejo /api/healthz, Matrix /_matrix/client/versions, OpenObserve
/healthz, files root. Also confirms the backend actually serves, not just that
Caddy answered. Mirrors the change pushed to the live val.
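
A compact way to picture the per-service checks, again as a hedged Python sketch rather than the val's code: the paths are the ones listed in this commit message, the failure rule (status >= 400 or unfetchable) is the one stated for the original watcher, and the names `HEALTH_PATHS`, `health_url`, and `is_down` are hypothetical. Base URLs are passed in by the caller because not every hostname appears in this PR.

```python
from typing import Optional

# Health/semantic endpoints named in the commit message above.
HEALTH_PATHS = {
    "stalwart": "/healthz/live",
    "forgejo": "/api/healthz",
    "matrix": "/_matrix/client/versions",
    "openobserve": "/healthz",
    "files": "/",  # the files site is checked at its root
}


def health_url(base_url: str, service: str) -> str:
    """Join a service's base URL with its health path."""
    return base_url.rstrip("/") + HEALTH_PATHS[service]


def is_down(status: Optional[int]) -> bool:
    """A service counts as down if the fetch failed (None) or returned >= 400."""
    return status is None or status >= 400
```

Under this rule a healthy mail.lyte.dev answering 404 at `/` would page on every run, which is why the Stalwart check targets `/healthz/live` instead.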

lytedev referenced this pull request

2026-07-01 15:03:10 -05:00

Self-host ntfy on pebble for private off-site alerts (ntfy.e.lyte.dev) #690

This pull request has changes conflicting with the target branch.

secrets/beefcake/secrets.yml

Manual merge helper

Use this merge commit message when completing the merge manually.

Merge commit title

Merge pull request 'wip: Declarative alerting starter (host-direct + OpenObserve capture)' (#683) from sec-observability into main

Merge commit body

Reviewed-on: https://git.lyte.dev/lytedev/nix/pulls/683

Checkout

From your project repository, check out a new branch and test the changes.

git fetch -u origin sec-observability:sec-observability

git switch sec-observability
