wip: Declarative alerting starter (host-direct + OpenObserve capture) #683
What & why
M6 security-audit follow-up to Daniel's question: "I THINK we have alerting in OpenObserve, but that isn't clear from the repo — can we encode it into the repo somehow?"
Discovered live reality (documented in lib/doc/alerting.md)
- lyte.server host runs one OpenTelemetry collector (lib/modules/nixos/server.nix) scraping hostmetrics + node_exporter (systemd unit states) + journald/filelog and shipping to OpenObserve. On beefcake it also scrapes zfs/postgres exporters.
- prometheus.nix is not imported and grafana.nix is commented out ("replaced by OpenObserve"). So there is no rule engine and no Alertmanager running. node_exporter answers on :9100 only because the OTel collector scrapes it.
- disk-alerts.nix (smartd/ZED → Matrix hookshot), deliberately host-direct.
- /api/default/alerts{,/templates,/destinations} all return 401, but enumerating the actual definitions needs the OO root creds. That live read was blocked, so whether any OO alerts currently exist is unconfirmed. The export helper below captures them.

Approach: don't rebuild the stack Daniel removed
Adding Prometheus+Alertmanager would duplicate a stack that was deliberately retired. Instead this follows the boundary disk-alerts.nix already set: box/pool-in-trouble alerts stay host-direct (they must not depend on OO, which lives on that box/pool); metrics/log alerts go through OO.

Tier 1: host-direct, OO-independent (matrix-alerts.nix), posts to the same disk-alert-webhook-url secret (no new secret, works on deploy):
- OnFailure → Matrix on the critical units: caddy, stalwart, forgejo, tuwunel, knot, headscale (verified wired via nix eval). Fires with unit status + recent journal on crash/fail; a normal deploy restart is not a failure.

Tier 2: OpenObserve capture (openobserve-alerts/), the direct answer to "encode it into the repo":
- openobserve-alerts-export (installed to $PATH): read-only dump of live alerts/templates/destinations → JSON under definitions/ to commit.
- An opt-in reconcile (lyte.openobserveAlerts.enable, default off) that PUTs committed JSON back to OO. Off by default because it mutates live OO state and was untestable here. API paths confirmed from the OO router.

Testing
- nix build .#nixosConfigurations.beefcake…toplevel ✅
- nix eval confirms OnFailure on all 6 critical units + reconcile default = false ✅
- python3 -m py_compile reconcile.py ✅

Follow-ups (noted in the doc)
- prometheus.nix/grafana.nix (separate PR).
- lyte.matrix-notify helper (the hookshot poster is intentionally duplicated from disk-alerts.nix, per repo policy against inline refactors).

Title changed from "Declarative alerting starter (host-direct + OpenObserve capture)" to "wip: Declarative alerting starter (host-direct + OpenObserve capture)"
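For reviewers who want a feel for the Tier 2 reconcile without touching live OO, here is a minimal dry-run sketch. It is not the reconcile.py in this PR. It assumes a hypothetical layout of definitions/{templates,destinations,alerts}/<name>.json, a per-item URL of <collection>/<name>, and a placeholder base URL; only the three collection paths come from the description above. It builds the list of PUT requests and sends nothing.

```python
import json
import tempfile
from pathlib import Path
from urllib.parse import quote

# Collection paths as listed in the PR description. The ordering is an
# assumption: alerts are planned last in case they reference destinations.
COLLECTIONS = {
    "templates": "/api/default/alerts/templates",
    "destinations": "/api/default/alerts/destinations",
    "alerts": "/api/default/alerts",
}


def plan_reconcile(definitions_dir, base_url):
    """Return the (method, url, body) tuples a reconcile run would send.

    Dry run only: reads committed JSON, performs no network I/O.
    The "/<name>" suffix on each URL is an assumption, not a confirmed route.
    """
    root = Path(definitions_dir)
    plan = []
    for kind, api_path in COLLECTIONS.items():
        kind_dir = root / kind
        if not kind_dir.is_dir():
            continue  # nothing committed for this kind yet
        for path in sorted(kind_dir.glob("*.json")):
            body = json.loads(path.read_text())  # raises on invalid JSON
            name = quote(path.stem, safe="")
            url = f"{base_url.rstrip('/')}{api_path}/{name}"
            plan.append(("PUT", url, body))
    return plan


def demo():
    """Build a throwaway definitions/ tree and plan against a placeholder host."""
    with tempfile.TemporaryDirectory() as tmp:
        alerts = Path(tmp) / "alerts"
        alerts.mkdir()
        (alerts / "disk-full.json").write_text(
            json.dumps({"name": "disk-full", "enabled": True})
        )
        return plan_reconcile(tmp, "http://oo.example/")


if __name__ == "__main__":
    for method, url, _body in demo():
        print(method, url)
```

Printing the plan before sending anything keeps the mutation step reviewable, which fits the reason the reconcile is off by default.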