Observability stack on meridian — NixOS module spec

This document specifies the services to add to meridian's NixOS config to support observability for be-platform (Rust + Rocket + SQLite) and be-pwa (TypeScript + React + Vite, served as static files), both of which already run as podman containers on meridian.

The audience is the NixOS-config agent. Application-side instrumentation is handled in separate tickets and is described here only as context.

Goal

Five services that together cover:

  1. Logs (Loki) — structured logs from be-platform + access logs from the PWA's reverse proxy
  2. Metrics (Prometheus) — backend RPS / latency / error rate / SQLite size, plus PWA web-vitals
  3. Visualization (Grafana) — dashboards for the above + Loki/Prometheus query UI
  4. Collector (Grafana Alloy) — single agent on meridian that reads journald, scrapes Prometheus targets, accepts OTLP + Faro from the PWA, ships to Loki + Prometheus
  5. Error tracking (GlitchTip) — Sentry-SDK-compatible self-hosted error tracking for both apps

We are explicitly NOT deploying Mimir or Tempo for now. They can be added later if Prometheus retention becomes a bottleneck (unlikely at this scale) or distributed tracing becomes necessary (also unlikely at this scale).

Access model — Tailscale-only

All five services bind to the Tailscale interface only. No Cloudflare, no public DNS. Reachable via <service>.meridian.<tailnet-suffix> from machines on the tailnet.

Exception worth considering: the GlitchTip event ingestion endpoint has to be reachable from outside the tailnet so the deployed PWA running on users' phones can POST errors. The GlitchTip web UI does not. Two options:

  • (a) Expose only the GlitchTip event ingestion endpoint via Cloudflare (the rest of GlitchTip — admin UI, settings — stays Tailscale-only)
  • (b) Run a thin proxy in be-platform that accepts events from the PWA and forwards to GlitchTip's internal endpoint

Default to (a). Less code to write, no proxy to maintain. Note this in the Cloudflare config — only the ingestion paths /api/<projectId>/envelope/ and /api/<projectId>/store/ get exposed.
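One way to express the Tailscale-only rule is to let each service listen on all interfaces and open its port only on the Tailscale interface in the NixOS firewall. The sketch below assumes Tailscale uses its default interface name, tailscale0, and that no other firewall rule opens these ports. Ports published by podman containers may not pass through these rules, so check GlitchTip's 8200 separately.

```nix
{ ... }:
{
  networking.firewall.enable = true;

  # Assumption: tailscale0 is the Tailscale interface name on meridian.
  # Ports are the ones listed in the Services section of this spec.
  networking.firewall.interfaces."tailscale0".allowedTCPPorts = [
    3100   # Loki
    9090   # Prometheus
    3000   # Grafana
    12345  # Alloy admin / self-metrics
    4317   # Alloy OTLP gRPC
    4318   # Alloy OTLP HTTP
    8200   # GlitchTip web UI + ingestion (host-side port)
  ];
}
```

Keeping the listeners on all interfaces also keeps the localhost URLs used elsewhere in this spec (for example Grafana's datasources) working.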

Services

1. Loki

  • Port: 3100
  • Mode: Single-binary (filesystem storage, no S3/MinIO)
  • Storage volume: /var/lib/observability/loki/ (chunks + index + WAL)
  • Retention: 30 days
  • Expected disk: ~5GB after 30 days at current log volume; size for 20GB to be safe
  • RAM: ~200MB
  • Config knobs:
      • auth_enabled: false (Tailscale ACL is the auth boundary)
      • Single tenant
      • compactor enabled for retention enforcement
  • NixOS module: services.loki exists in nixpkgs, prefer it over container
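The Loki settings above could translate into something like the following module fragment. It is a sketch, not a tested config: the schema start date, the sub-directory names under the data dir, and the in-memory ring are assumptions chosen for a single-node install, and the keys should be checked against the pinned Loki version.

```nix
{ ... }:
let
  lokiDir = "/var/lib/observability/loki";
in
{
  services.loki = {
    enable = true;
    dataDir = lokiDir;
    configuration = {
      auth_enabled = false;            # Tailscale ACL is the auth boundary
      server.http_listen_port = 3100;

      common = {
        path_prefix = lokiDir;
        replication_factor = 1;        # single binary, single node
        ring = {
          instance_addr = "127.0.0.1";
          kvstore.store = "inmemory";
        };
        storage.filesystem = {
          chunks_directory = "${lokiDir}/chunks";
          rules_directory = "${lokiDir}/rules";
        };
      };

      schema_config.configs = [
        {
          from = "2024-01-01";         # assumption: any date before first ingest
          store = "tsdb";
          object_store = "filesystem";
          schema = "v13";
          index = { prefix = "index_"; period = "24h"; };
        }
      ];

      # 30 days of retention, enforced by the compactor.
      limits_config.retention_period = "720h";
      compactor = {
        working_directory = "${lokiDir}/compactor";
        retention_enabled = true;
        delete_request_store = "filesystem";
      };
    };
  };
}
```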

2. Prometheus

  • Port: 9090
  • Storage volume: /var/lib/observability/prometheus/ (TSDB)
  • Retention: 90 days
  • Expected disk: ~3GB at 90 days with a single application scrape target; size for 10GB
  • RAM: ~300MB
  • Scrape targets (managed via NixOS config — list will grow):
      • be-platform (host: localhost, port: TBD — the backend agent will expose /metrics; assume localhost:8000/metrics for now, since 8000 is Rocket's default port)
      • node-exporter (already running? add if not — port 9100)
      • alloy self-metrics (port 12345/metrics)
  • Scrape interval: 15s default
  • NixOS module: services.prometheus
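A possible shape for the Prometheus module config, using the targets listed above. The be-platform port is still TBD in this spec, so 8000 is a placeholder. The stateDir value is an assumption: the option is relative to /var/lib, and a nested path has not been verified against the module.

```nix
{ ... }:
{
  services.prometheus = {
    enable = true;
    port = 9090;
    retentionTime = "90d";

    # Assumption: resolves to /var/lib/observability/prometheus.
    stateDir = "observability/prometheus";

    globalConfig.scrape_interval = "15s";

    # Only needed if node-exporter is not already running on meridian.
    exporters.node = {
      enable = true;
      port = 9100;
    };

    scrapeConfigs = [
      {
        job_name = "be-platform";
        metrics_path = "/metrics";
        # Placeholder port until the backend ticket confirms it.
        static_configs = [ { targets = [ "localhost:8000" ]; } ];
      }
      {
        job_name = "node";
        static_configs = [ { targets = [ "localhost:9100" ]; } ];
      }
      {
        job_name = "alloy";
        static_configs = [ { targets = [ "localhost:12345" ]; } ];
      }
    ];
  };
}
```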

3. Grafana

  • Port: 3000
  • Storage volume: /var/lib/observability/grafana/ (sqlite-backed config, dashboards, users)
  • Auth: Anonymous viewer disabled. Admin password from secrets store on first boot. Add a single shared "operator" user for now (Jeremy + future ops). Don't wire OAuth yet — Tailscale is the perimeter.
  • Datasources (provisioned via NixOS config, not the UI):
      • Loki at http://localhost:3100
      • Prometheus at http://localhost:9090
  • Dashboards: None to provision at install time. The application-side ticket will add starter dashboards as JSON in /var/lib/observability/grafana/provisioned/.
  • RAM: ~150MB
  • NixOS module: services.grafana — set provision.datasources.settings
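As an illustration, the Grafana settings above might look like this in the module. The secret path /run/secrets/grafana-admin-password is hypothetical and stands in for wherever the existing secrets setup places the file. The shared "operator" user is not covered here.

```nix
{ ... }:
{
  services.grafana = {
    enable = true;
    dataDir = "/var/lib/observability/grafana";

    settings = {
      server = {
        http_addr = "0.0.0.0";   # restricted to the tailnet by the firewall
        http_port = 3000;
      };
      "auth.anonymous".enabled = false;
      # Hypothetical path; Grafana reads the password from this file.
      security.admin_password = "$__file{/run/secrets/grafana-admin-password}";
    };

    provision = {
      enable = true;
      datasources.settings = {
        apiVersion = 1;
        datasources = [
          {
            name = "Prometheus";
            type = "prometheus";
            url = "http://localhost:9090";
            isDefault = true;
          }
          {
            name = "Loki";
            type = "loki";
            url = "http://localhost:3100";
          }
        ];
      };
    };
  };
}
```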

4. Grafana Alloy (collector)

  • Port: 12345 (admin/self-metrics), 4317 (OTLP gRPC), 4318 (OTLP HTTP), 12347 (Faro frontend telemetry)
  • Storage: None required (stateless collector)
  • Config file: /etc/alloy/config.alloy
  • Pipelines needed:
      • journald → Loki: read podman-balanced-engineering-platform.service + podman-balanced-engineering-pwa.service (or whatever the be-pwa service is named), tag with app, host, service_unit, send to Loki
      • OTLP → Loki/Prometheus: accept OTLP-format traces+logs+metrics from be-platform when we instrument it. For now this is a passthrough — traces just drop on the floor since we have no Tempo. Logs/metrics route appropriately.
      • Faro → Loki: accept browser RUM events from the PWA. Faro events are JSON; ship them to Loki tagged source=pwa-faro. (Faro adds a fourth listening port; if simpler, skip Faro for v1 and just have the PWA POST structured errors to GlitchTip.)
  • NixOS: Recent nixpkgs releases ship a services.alloy module; prefer it if meridian's pinned channel includes it. Otherwise install via tarball/binary + systemd unit, or use virtualisation.oci-containers with the official grafana/alloy image. Either way, pin to a specific version rather than :latest.
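The journald → Loki pipeline could be sketched as below. This assumes meridian's pinned nixpkgs is recent enough to include a services.alloy module that reads its config from /etc/alloy. If it is not, the same config.alloy text can be mounted into a container or passed to a hand-written systemd unit. The be-pwa unit name and the be-$1 label mapping are guesses to adjust, and the OTLP and Faro pipelines are left out.

```nix
{ ... }:
{
  # Assumption: services.alloy exists in the pinned nixpkgs.
  services.alloy = {
    enable = true;
    # Listen beyond localhost so the admin UI is reachable over the tailnet.
    extraFlags = [ "--server.http.listen-addr=0.0.0.0:12345" ];
  };

  environment.etc."alloy/config.alloy".text = ''
    // Keep only the two app units and derive the labels from the unit name.
    loki.relabel "journal" {
      forward_to = []

      rule {
        source_labels = ["__journal__systemd_unit"]
        regex         = "podman-balanced-engineering-(platform|pwa)\\.service"
        action        = "keep"
      }
      rule {
        source_labels = ["__journal__systemd_unit"]
        target_label  = "service_unit"
      }
      rule {
        source_labels = ["__journal__systemd_unit"]
        regex         = "podman-balanced-engineering-(.+)\\.service"
        target_label  = "app"
        replacement   = "be-$1"
      }
    }

    loki.source.journal "apps" {
      relabel_rules = loki.relabel.journal.rules
      labels        = { host = "meridian" }
      forward_to    = [loki.write.local.receiver]
    }

    loki.write "local" {
      endpoint {
        url = "http://localhost:3100/loki/api/v1/push"
      }
    }
  '';
}
```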

5. GlitchTip (error tracking)

GlitchTip is a self-hosted, Sentry-SDK-compatible error tracker. The standard Sentry SDKs for be-platform / be-pwa work when pointed at a GlitchTip DSN, with no GlitchTip-specific code. It runs ~4 services vs Sentry's ~10 and is Postgres-backed instead of ClickHouse + Kafka.

  • Web UI port: 8000 (internal — needs a different external port since be-platform also uses 8000; map to 8200)
  • Ingestion endpoint: Same port, paths /api/<projectId>/envelope/ and /api/<projectId>/store/
  • Dependencies it brings:
      • PostgreSQL — dedicated DB for glitchtip
      • Redis — cache + Celery broker
      • Celery worker — async event processing
      • Celery beat — scheduled tasks
  • Storage:
      • /var/lib/observability/glitchtip/postgres/ (Postgres data dir)
      • /var/lib/observability/glitchtip/uploads/ (attachments, minidumps)
  • Retention: 90 days of events
  • RAM: Postgres ~200MB + Redis ~50MB + 2× Celery + web = ~800MB total. Budget 1.5GB to leave headroom.
  • NixOS: No first-party module. Use virtualisation.oci-containers with the official glitchtip/glitchtip image. Pin a version.
  • Secrets needed (deploy via sops-nix or whatever you use today):
      • GLITCHTIP_SECRET_KEY — Django secret (generate 64 random bytes)
      • GLITCHTIP_DATABASE_URL — postgres://glitchtip:<pw>@localhost:5433/glitchtip (separate port from any existing Postgres)
      • GLITCHTIP_DEFAULT_FROM_EMAIL — from address for issue notifications
      • Email settings — reuse Resend (be-platform already uses it) or whatever SMTP is configured
  • External access: Cloudflare route for /api/*/envelope/ and /api/*/store/ only, pointed at meridian:8200. Admin UI stays Tailscale-only at meridian.<tailnet>:8200/.
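A minimal sketch of the GlitchTip web container under virtualisation.oci-containers follows. Several things in it are assumptions to verify against the GlitchTip docs for the pinned version: the image tag is a placeholder, the env file path is hypothetical, the /code/uploads mount point is assumed, and GlitchTip is assumed to read its settings from SECRET_KEY, DATABASE_URL and DEFAULT_FROM_EMAIL (so the secrets named above map onto those variables). The worker, beat, Postgres and Redis containers are omitted.

```nix
{ ... }:
let
  # Placeholder: substitute the exact tag pinned at deploy time.
  glitchtipImage = "glitchtip/glitchtip:<pinned-tag>";
in
{
  virtualisation.oci-containers.backend = "podman";

  virtualisation.oci-containers.containers.glitchtip-web = {
    image = glitchtipImage;

    # Host 8200 -> container 8000, because be-platform already uses 8000.
    ports = [ "8200:8000" ];

    environment = {
      PORT = "8000";
      # Assumed name of the retention setting; 90 days per this spec.
      GLITCHTIP_MAX_EVENT_LIFE_DAYS = "90";
    };

    # Hypothetical file rendered by the secrets tooling. Expected to hold
    # SECRET_KEY, DATABASE_URL, DEFAULT_FROM_EMAIL and the email settings.
    environmentFiles = [ "/run/secrets/glitchtip.env" ];

    volumes = [
      "/var/lib/observability/glitchtip/uploads:/code/uploads"
    ];
  };
}
```

With a published port the container has its own network namespace, so a DATABASE_URL pointing at localhost:5433 would not reach a Postgres on the host. Either use host networking or a host address the container can resolve.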

Resource budget for the whole stack

Component          RAM       Disk
Loki               200MB     20GB
Prometheus         300MB     10GB
Grafana            150MB     1GB
Alloy              200MB     none
GlitchTip (all)    1.5GB     10GB
Total              ~2.4GB    ~41GB

Meridian currently has 16GB RAM per the project-instructions memo (the Dockerfile dep-cache notes "16384 MiB"). 2.4GB is fine.

Storage: confirm /var/lib/observability/ has 50GB+ available. If meridian's root filesystem is tight, mount a dedicated volume.

Dependency / boot order

The systemd ordering should be roughly:

  1. postgresql-glitchtip.service + redis-glitchtip.service (or whatever you name them)
  2. glitchtip-web.service + glitchtip-worker.service + glitchtip-beat.service — depend on (1)
  3. loki.service
  4. prometheus.service
  5. alloy.service — depends on (3) + (4) being reachable (it'll retry, so soft dep is fine)
  6. grafana.service — depends on (3) + (4) being reachable

The two podman containers be-platform + be-pwa keep their existing boot order — they don't depend on observability being up. Observability should fail open.
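The soft dependencies in steps 5 and 6 can be expressed with wants plus after, which orders startup without failing the dependent unit if Loki or Prometheus is down. The sketch assumes the units are named loki.service, prometheus.service, alloy.service and grafana.service. The GlitchTip ordering in steps 1 and 2 would go through the containers' dependsOn option once their names are settled.

```nix
{ ... }:
let
  backends = [ "loki.service" "prometheus.service" ];
in
{
  # wants (not requires): start the backends if possible, but do not
  # stop or fail this unit when one of them is unavailable.
  systemd.services.alloy = {
    wants = backends;
    after = backends;
  };

  systemd.services.grafana = {
    wants = backends;
    after = backends;
  };
}
```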

Backup policy

  • /var/lib/observability/grafana/grafana.db — daily snapshot, retain 7 days. Contains dashboards + users.
  • /var/lib/observability/glitchtip/postgres/ — pg_dump daily, retain 14 days.
  • /var/lib/observability/loki/ — no backup. Logs are reproducible-ish from journald; not worth the disk.
  • /var/lib/observability/prometheus/ — no backup. Same reasoning.

If meridian dies, dashboards + error history matter. Logs and raw metrics don't.
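The Grafana snapshot could run as a systemd timer rather than a cron job, roughly as sketched here. The destination directory /var/backup/observability/grafana is an assumption. The source path follows this spec, but the NixOS Grafana module may place the database under a data/ sub-directory of its dataDir, so confirm the location first. The GlitchTip pg_dump job would follow the same shape with a 14-day window.

```nix
{ pkgs, ... }:
{
  systemd.services.grafana-db-backup = {
    description = "Daily snapshot of the Grafana SQLite database";
    startAt = "daily";                 # creates a matching systemd timer
    path = [ pkgs.sqlite pkgs.findutils pkgs.coreutils ];
    serviceConfig.Type = "oneshot";
    script = ''
      set -eu
      src=/var/lib/observability/grafana/grafana.db
      dest=/var/backup/observability/grafana   # assumed backup location
      mkdir -p "$dest"

      # .backup takes a consistent copy even while Grafana is writing.
      sqlite3 "$src" ".backup '$dest/grafana-$(date +%F).db'"

      # Drop snapshots older than about a week.
      find "$dest" -name 'grafana-*.db' -mtime +7 -delete
    '';
  };
}
```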

Application-side integration (informational — not your scope)

This is what the app-side tickets will do once the stack is up. Listed here so you understand which ports / paths need to exist.

be-platform:

  • Will add a Prometheus metrics endpoint at /metrics on port 8000 (or whatever its existing Rocket port is). This is the Prometheus scrape target.
  • Will configure tracing/logging to either: (a) write structured JSON to stderr (Alloy reads via journald), or (b) ship OTLP to Alloy at localhost:4318. Pick (a) for v1.
  • Will install the sentry crate pointed at GlitchTip's project DSN.

be-pwa:

  • Will install @sentry/react pointed at GlitchTip's project DSN.
  • May install Grafana Faro browser SDK pointed at Alloy's Faro endpoint (port 12347). Defer this until phase 2 if it's painful.

What I need from you

A NixOS PR (or branch on whatever repo manages meridian's config) that:

  1. Creates the persistent volumes under /var/lib/observability/
  2. Adds the 5 services with the configs above
  3. Exposes them on the Tailscale interface only (except the GlitchTip event ingestion path)
  4. Provisions Grafana datasources via config (not UI)
  5. Wires up the backup cron jobs
  6. Sets up the secrets via your existing secrets path
  7. Returns:
      • The URLs / Tailscale hostnames for Grafana, Loki, Prometheus, Alloy, GlitchTip
      • The Grafana initial admin password (after first boot)
      • The GlitchTip first-user signup URL (or seed an initial admin user)
      • The Sentry DSN for the GlitchTip "be-platform" and "be-pwa" projects (after creating them in the UI)
      • The Alloy OTLP HTTP endpoint (for be-platform to ship to)
      • Confirmation of the Cloudflare route for /api/*/envelope/ and /api/*/store/ → meridian:8200

Once those are returned, the next step on the application side is two tickets (one per app) wiring Sentry SDKs + Prometheus /metrics + structured logging. Those land independently.

Versions to pin

Don't track latest for any of these. Initial pins (current stable as of writing — update before deploy):

  • Loki: 3.5.x
  • Prometheus: 2.55.x (or 3.x if 3.x has stabilized)
  • Grafana: 11.3.x
  • Alloy: 1.5.x
  • GlitchTip: 4.x

Out of scope for this delivery

  • Mimir / Tempo — not now
  • Loki object storage (S3 / MinIO) — not at this scale
  • Alertmanager — phase 2; we'll wire alerts after we have dashboards worth alerting on
  • PWA Faro RUM — phase 2 unless you find it trivial
  • OAuth/SSO for Grafana — Tailscale is the perimeter; revisit if you ever expose Grafana publicly
  • Long-term log shipping (e.g. to S3) — phase 2 if compliance ever requires it

One thing I'd appreciate

If you hit a service where the NixOS module is awkward or the container image misbehaves, flag it before forcing it. Two of these (Alloy, GlitchTip) are container-only or might have rough edges in Nix. I'd rather know early than discover it after install.