wip: Declarative alerting starter (host-direct + OpenObserve capture) #683

Draft
lytedev wants to merge 7 commits from sec-observability into main
Owner

What & why

M6 security-audit follow-up to Daniel's question: "I THINK we have alerting in OpenObserve, but that isn't clear from the repo — can we encode it into the repo somehow?"

Discovered live reality (documented in lib/doc/alerting.md)

  • Telemetry: every lyte.server host runs one OpenTelemetry collector (lib/modules/nixos/server.nix) scraping hostmetrics + node_exporter (systemd unit states) + journald/filelog and shipping to OpenObserve. On beefcake it also scrapes zfs/postgres exporters.
  • Classic Prometheus + Grafana are dead code: prometheus.nix is not imported and grafana.nix is commented out ("replaced by OpenObserve"). So there is no rule engine and no Alertmanager running. node_exporter answers on :9100 only because the OTel collector scrapes it.
  • Only repo-encoded alerting was disk-alerts.nix (smartd/ZED → Matrix hookshot), deliberately host-direct.
  • OpenObserve alerts: the API endpoints exist (/api/default/alerts{,/templates,/destinations} all return 401 without credentials), but enumerating the actual definitions needs the OO root creds. That live read was blocked (authentication to the live API failed), so whether any OO alerts currently exist is unconfirmed. The export helper below captures them.

Approach: don't rebuild the stack Daniel removed

Adding Prometheus+Alertmanager would duplicate a stack that was deliberately retired. Instead this follows the boundary disk-alerts.nix already set — box/pool-in-trouble alerts stay host-direct (must not depend on OO, which lives on that box/pool); metrics/log alerts go through OO.

Tier 1 — host-direct, OO-independent (matrix-alerts.nix), posts to the same disk-alert-webhook-url secret (no new secret, works on deploy):

  • OnFailure → Matrix on the critical units: caddy, stalwart, forgejo, tuwunel, knot, headscale (verified wired via nix eval). Fires with unit status + recent journal on crash/fail; a normal deploy restart is not a failure.
  • disk ≥ 90% hourly timer.
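
For illustration only, the threshold logic behind the hourly disk check could look like the sketch below. It is not the code in matrix-alerts.nix (which this page does not show); the function names and the choice of `shutil.disk_usage` are assumptions, and only the 90% threshold comes from the description above.

```python
import shutil

# Threshold from the PR description: alert when a filesystem is >= 90% full.
THRESHOLD_PERCENT = 90.0


def used_percent(total: int, used: int) -> float:
    """Return used space as a percentage of total (0.0 for an empty filesystem)."""
    if total <= 0:
        return 0.0
    return 100.0 * used / total


def should_alert(total: int, used: int, threshold: float = THRESHOLD_PERCENT) -> bool:
    """True when usage has reached or passed the threshold."""
    return used_percent(total, used) >= threshold


def check_path(path: str = "/") -> tuple[float, bool]:
    """Measure one mount point; returns (percent used, whether to alert)."""
    usage = shutil.disk_usage(path)
    return used_percent(usage.total, usage.used), should_alert(usage.total, usage.used)


if __name__ == "__main__":
    pct, alert = check_path("/")
    print(f"/ is {pct:.1f}% full; alert={alert}")
```

Note that this computes used/total, which can differ slightly from the percentage `df` prints because `df` excludes root-reserved blocks.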

Tier 2 — OpenObserve capture (openobserve-alerts/), the direct answer to "encode it into the repo":

  • openobserve-alerts-export (installed to $PATH): read-only dump of live alerts/templates/destinations → JSON under definitions/ to commit.
  • Opt-in reconcile oneshot (lyte.openobserveAlerts.enable, default off) that PUTs committed JSON back to OO. Off by default because it mutates live OO state and was untestable here. API paths confirmed from the OO router.
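
As a rough sketch of the export half, assuming one JSON file per endpoint: the three API paths are the ones listed above, while the file names, the `fetch_json` callable, and the function name are hypothetical and not taken from the actual openobserve-alerts-export script. The HTTP layer is injected so the sketch never touches a live OO instance.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable

# API paths as listed in the PR description; the dictionary keys double as
# (hypothetical) output file names.
ENDPOINTS = {
    "alerts": "/api/default/alerts",
    "templates": "/api/default/alerts/templates",
    "destinations": "/api/default/alerts/destinations",
}


def export_definitions(fetch_json: Callable[[str], object], out_dir: Path) -> list[Path]:
    """Read each endpoint once (GET only) and write it as stable, diffable JSON."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for name, api_path in ENDPOINTS.items():
        payload = fetch_json(api_path)
        target = out_dir / f"{name}.json"
        # sort_keys + indent keep commits small when the live state changes.
        target.write_text(json.dumps(payload, indent=2, sort_keys=True) + "\n")
        written.append(target)
    return written


if __name__ == "__main__":
    # Stand-in fetcher; a real one would authenticate against OpenObserve.
    fake_fetch = lambda api_path: {"source": api_path, "list": []}
    with tempfile.TemporaryDirectory() as tmp:
        for path in export_definitions(fake_fetch, Path(tmp) / "definitions"):
            print(path.name)
```

Keeping the export read-only and the reconcile step separate matches the split described above: the dump is safe to run at any time, while anything that PUTs back stays opt-in.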

Testing

  • nix build .#nixosConfigurations.beefcake…toplevel
  • nix eval confirms OnFailure on all 6 critical units + reconcile default = false
  • python3 -m py_compile reconcile.py
  • Not deployed (per task: prepare + build-check + PR only).

Follow-ups (noted in the doc)

  • beefcake-down isn't self-detectable (Tier 1 + OO both run on beefcake) — needs an external watcher (pebble/rascal) or an OO alert over remote hosts.
  • Run the export as Daniel to confirm/capture any existing OO alerts; validate the reconcile schema against the running OO version before enabling.
  • Delete the dead prometheus.nix/grafana.nix (separate PR).
  • Extract a shared lyte.matrix-notify helper (the hookshot poster is intentionally duplicated from disk-alerts.nix, per repo policy against inline refactors).
feat(beefcake): declarative alerting starter (host-direct + OpenObserve capture)
Some checks failed
/ check-format (push) Has been cancelled
/ build (push) Has been cancelled
563829c1fa
M6 security-audit follow-up. Before this, the only alerting encoded in the
repo was disk-alerts.nix (smartd/ZED -> Matrix); metrics/logs are collected
by the OpenTelemetry collector -> OpenObserve, but classic Prometheus and
Grafana are unimported dead code, so there is no rule engine / Alertmanager,
and any OpenObserve alerts live only in its DB/UI (invisible to the repo).

Rather than stand up a parallel Prometheus+Alertmanager stack that Daniel
deliberately removed, this follows the boundary disk-alerts.nix already set:
box/pool-in-trouble alerts stay host-direct; metrics/log alerts go through OO.

- lib/doc/alerting.md: documents the discovered live reality + the design.
- matrix-alerts.nix (Tier 1, host-direct, OO-independent): OnFailure -> Matrix
  on the critical units (caddy, stalwart, forgejo, tuwunel, knot, headscale)
  and an hourly disk>=90% timer. Reuses the disk-alert-webhook-url secret, so
  no new secret and it works on deploy.
- openobserve-alerts/ (Tier 2): openobserve-alerts-export dumps live OO
  alerts/templates/destinations to JSON for committing, plus an opt-in
  (default-off) reconcile oneshot that re-applies them via the OO API.

Build-checked (beefcake toplevel); not deployed. OO alert enumeration was
blocked (couldn't auth to the live API), so the export/reconcile schema needs
a live populate+validate pass by Daniel.
docs(alerting): document the val.town external uptime monitor (Tier 0)
Some checks failed
/ check-format (push) Has been cancelled
/ build (push) Has been cancelled
cccaba90c6
Daniel already runs an off-site val.town cron (lytedev/SimpleSiteUptimeMonitor)
that GETs files.lyte.dev + openobserve.h.lyte.dev and emails via val.town's own
std/email if either is >=400/unfetchable. That is the external dead-man's-switch
for a total beefcake/Caddy outage — the one gap Tier 1/Tier 2 (both on beefcake)
can't cover, and its notification path is independent of beefcake's Stalwart.

- alerting.md: add a Tier 0 section; reword the beefcake-down gap to note it's
  covered externally, with a follow-up to broaden the endpoint list beyond
  files/openobserve (so a single-service outage behind a healthy Caddy also
  pages) and optionally add a same-tailnet backup watcher.
- external-uptime-monitor.tsx: source snapshot for the record (the val is
  authoritative; not built by the flake).
lytedev changed title from Declarative alerting starter (host-direct + OpenObserve capture) to wip: Declarative alerting starter (host-direct + OpenObserve capture) 2026-07-01 11:25:49 -05:00
docs(alerting): correct Tier 0 notification-path claim (email delivery depends on beefcake)
Some checks failed
/ check-format (push) Has been cancelled
/ build (push) Has been cancelled
5a1294b285
Earlier wording over-claimed that the val.town watcher's alert 'still arrives
when beefcake is down'. Only the SENDING is external: std/email with no to:
field delivers to the val.town account owner's registered address, and if that
is a @lyte.dev mailbox on Stalwart, the alert queues at the VPS relay and isn't
readable until beefcake returns — defeating the dead-man's-switch. Flag the
requirement (off-beefcake mailbox or a fully-external push channel) prominently
and note the Tier-1 Matrix/hookshot path is likewise beefcake-dependent.
docs(alerting): switch Tier 0 external watcher to ntfy.sh push
Some checks failed
/ check-format (push) Has been cancelled
/ build (push) Has been cancelled
9eba6c5085
Email delivery to @lyte.dev depends on beefcake's Stalwart, so it can't reach
Daniel during a beefcake outage. Replace it with an ntfy.sh push: a hosted
service reached with one fetch and read by a phone app, so both detection and
notification stay entirely off beefcake.

- external-uptime-monitor.tsx: rewrite to POST to ntfy.sh; topic/token/server
  come from val.town env vars (NTFY_TOPIC/NTFY_TOKEN/NTFY_SERVER) so the topic
  isn't committed; emoji via Tags header (ASCII-only header values), urgent
  priority; expanded endpoint list (files/mail/git/matrix/openobserve).
- alerting.md: rewrite the Tier 0 section for ntfy, add one-time setup steps
  (reserved topic + token recommended), and mark the beefcake-down gap covered.
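
The shape of such a push can be sketched in Python (the real watcher is a val.town TypeScript file, so this is an analogue, not its code). The env var names are the ones named in this commit; the `https://ntfy.sh` default, the `rotating_light` tag, and the function name are assumptions. The request is only built here, not sent.

```python
import urllib.request


def build_ntfy_request(env: dict[str, str], title: str, body: str) -> urllib.request.Request:
    """Build (but do not send) an urgent ntfy publish request."""
    # Assumed default: fall back to the hosted service when NTFY_SERVER is unset.
    server = env.get("NTFY_SERVER", "https://ntfy.sh").rstrip("/")
    topic = env["NTFY_TOPIC"]  # kept out of the repo, read from the environment
    headers = {
        "Title": title,            # header values must stay ASCII
        "Priority": "urgent",
        "Tags": "rotating_light",  # ntfy renders known tag names as emoji
    }
    token = env.get("NTFY_TOKEN")
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(
        f"{server}/{topic}",
        data=body.encode("utf-8"),
        headers=headers,
        method="POST",
    )


if __name__ == "__main__":
    req = build_ntfy_request({"NTFY_TOPIC": "example-topic"}, "Service down", "files is unreachable")
    print(req.get_method(), req.full_url)
```

Sending would be a single `urllib.request.urlopen(req, timeout=10)` call. Putting the emoji in `Tags` rather than in `Title` is what keeps the header values ASCII-only.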
chore(secrets): add ntfy-sh-topic-url to beefcake secrets
Some checks failed
/ check-format (push) Has been cancelled
/ build (push) Has been cancelled
dfe42a6916
The ntfy topic URL for the Tier 0 external uptime watcher (see
lib/doc/alerting.md). Currently unreferenced by any nix module — the val.town
watcher reads the topic from its own env, not sops — so this is captured for the
record and for a future same-tailnet backup watcher (pebble/rascal) that could
push to the same topic without depending on val.town.
docs(alerting): Tier 0 watcher emails AND pushes to ntfy (dual-channel)
All checks were successful
/ check-format (push) Successful in 7s
/ build (push) Successful in 6m10s
ed5f40d584
Keep the std/email leg as a durable backup/record and add the ntfy.sh push as
the reliable, beefcake-independent alert. ntfy topic comes from NTFY_URL (full
topic URL, matching the sops ntfy-sh-topic-url key) + optional NTFY_TOKEN; both
channels fire best-effort so one failing doesn't suppress the other.
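
A minimal Python sketch of that best-effort fan-out, assuming each channel is a callable that raises on failure (the channel names and the `notify_all` helper are hypothetical; the live val is TypeScript):

```python
from typing import Callable


def notify_all(channels: dict[str, Callable[[str], None]], message: str) -> dict[str, str]:
    """Try every channel; a failure in one never prevents the others from firing."""
    results: dict[str, str] = {}
    for name, send in channels.items():
        try:
            send(message)
            results[name] = "ok"
        except Exception as exc:  # deliberately broad: this is the isolation boundary
            results[name] = f"failed: {exc}"
    return results


if __name__ == "__main__":
    def broken_email(msg: str) -> None:
        raise RuntimeError("relay unreachable")

    delivered: list[str] = []
    outcome = notify_all({"email": broken_email, "ntfy": delivered.append}, "files is down")
    print(outcome, delivered)
```

Returning the per-channel outcome rather than raising lets the caller log a failed leg without losing the alert that did get through.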
docs(alerting): Tier 0 watcher hits health endpoints, not roots
All checks were successful
/ check-format (push) Successful in 8s
/ build (push) Successful in 6m25s
8868eff18a
mail.lyte.dev serves 404 at / even when healthy (would false-page every run);
switch each check to its health/semantic endpoint (verified live): Stalwart
/healthz/live, Forgejo /api/healthz, Matrix /_matrix/client/versions, OpenObserve
/healthz, files root. Also confirms the backend actually serves, not just that
Caddy answered. Mirrors the change pushed to the live val.
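
To illustrate the rule, here is a Python sketch of the check table and the "down" test (status >= 400 or unfetchable, as described for the original monitor). Only the three hosts named on this page are included; the Forgejo and Matrix hostnames are not stated here, so those checks are left out rather than guessed. The `https://` scheme, the `probe` callable, and the function names are assumptions.

```python
from typing import Callable, Optional

# Host names and health paths as given in this PR; https is assumed.
CHECKS = {
    "files": "https://files.lyte.dev/",
    "mail": "https://mail.lyte.dev/healthz/live",
    "openobserve": "https://openobserve.h.lyte.dev/healthz",
}


def is_down(status: Optional[int]) -> bool:
    """None means the fetch itself failed; any status >= 400 also counts as down."""
    return status is None or status >= 400


def failing(probe: Callable[[str], Optional[int]], checks: dict[str, str] = CHECKS) -> list[str]:
    """Return the names of all checks whose endpoint is down."""
    return [name for name, url in checks.items() if is_down(probe(url))]


if __name__ == "__main__":
    # Stand-in probe: a healthy Stalwart still answers 404 at its root,
    # which is why the table points at /healthz/live instead.
    statuses = {"https://mail.lyte.dev/healthz/live": 200}
    print(failing(lambda url: statuses.get(url, 200)))
```

Injecting `probe` keeps the decision logic testable without network access; a real probe would return the HTTP status, or `None` on a connection error or timeout.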
This pull request has changes conflicting with the target branch.
  • secrets/beefcake/secrets.yml
Reference
lytedev/nix!683