Observability stack on meridian: NixOS module spec
This document specifies the services to add to meridian's NixOS config to
support observability for be-platform (Rust + Rocket + SQLite) and
be-pwa (TypeScript + React + Vite, served as static files), both of
which already run as podman containers on meridian.
The audience is the nixOS-config agent. Application-side instrumentation is the responsibility of separate tickets and is described here only as context.
Goal
Five services that together cover:
- Logs (Loki) — structured logs from be-platform + access logs from the PWA's reverse proxy
- Metrics (Prometheus) — backend RPS / latency / error rate / SQLite size, plus PWA web-vitals
- Visualization (Grafana) — dashboards for the above + Loki/Prometheus query UI
- Collector (Grafana Alloy) — single agent on meridian that reads journald, scrapes Prometheus targets, accepts OTLP + Faro from the PWA, ships to Loki + Prometheus
- Error tracking (GlitchTip) — Sentry-SDK-compatible self-hosted error tracking for both apps
We are explicitly NOT deploying Mimir or Tempo for now. They can be added later if Prometheus retention becomes a bottleneck (unlikely at this scale) or distributed tracing becomes necessary (also unlikely at this scale).
Access model: Tailscale-only
All five services bind to the Tailscale interface only. No Cloudflare,
no public DNS. Reachable via <service>.meridian.<tailnet-suffix> from
machines on the tailnet.
Exception worth considering: GlitchTip's event ingestion endpoint needs to be reachable from outside the tailnet, because the deployed PWA runs on users' phones and has to POST errors to it. Two options:
- (a) Expose only the GlitchTip event ingestion endpoint via Cloudflare (the rest of GlitchTip — admin UI, settings — stays Tailscale-only)
- (b) Run a thin proxy in be-platform that accepts events from the PWA and forwards to GlitchTip's internal endpoint
Default to (a): less code to write, no proxy to maintain. Note this in
the Cloudflare config: only the ingestion paths (`/api/<projectId>/envelope/`
and `/api/<projectId>/store/`) get exposed.
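One way to get the Tailscale-only exposure is to enforce it at the firewall instead of per-service bind addresses. The following is a minimal sketch, assuming the tailnet interface is named `tailscale0` (the NixOS default) and that the NixOS firewall is enabled; the port list simply mirrors the ports in this spec.

```nix
{
  # Assumption: meridian runs Tailscale through the NixOS module.
  services.tailscale.enable = true;

  networking.firewall = {
    enable = true;
    # Open the observability ports on the tailnet interface only:
    # 3000 Grafana, 3100 Loki, 9090 Prometheus, 12345 Alloy,
    # 4317/4318 OTLP, 8200 GlitchTip.
    interfaces."tailscale0".allowedTCPPorts = [
      3000 3100 9090 12345 4317 4318 8200
    ];
  };
}
```

Ports published by podman containers may not be governed by these rules, so it is worth verifying reachability from outside the tailnet after deploy.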
Services
1. Loki
- Port: `3100`
- Mode: Single-binary (filesystem storage, no S3/MinIO)
- Storage volume: `/var/lib/observability/loki/` (chunks + index + WAL)
- Retention: 30 days
- Expected disk: ~5GB after 30 days at current log volume; size for 20GB to be safe
- RAM: ~200MB
- Config knobs:
  - `auth_enabled: false` (Tailscale ACL is the auth boundary)
  - Single tenant
  - `compactor` enabled for retention enforcement
- NixOS module: `services.loki` exists in nixpkgs; prefer it over a container
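The Loki settings above could translate into a `services.loki` block roughly like this. It is a sketch under assumptions, not a tested config: the schema start date, the sub-directory names under the data dir, and the in-memory ring are illustrative choices that are not part of the spec.

```nix
{
  services.loki = {
    enable = true;
    dataDir = "/var/lib/observability/loki";
    configuration = {
      auth_enabled = false; # the Tailscale ACL is the auth boundary
      server.http_listen_port = 3100;

      common = {
        path_prefix = "/var/lib/observability/loki";
        replication_factor = 1;
        ring.kvstore.store = "inmemory"; # single-binary, single node
        storage.filesystem = {
          chunks_directory = "/var/lib/observability/loki/chunks";
          rules_directory = "/var/lib/observability/loki/rules";
        };
      };

      schema_config.configs = [
        {
          from = "2024-01-01"; # assumed start date, any past date works
          store = "tsdb";
          object_store = "filesystem";
          schema = "v13";
          index = { prefix = "index_"; period = "24h"; };
        }
      ];

      # 30 days of retention, enforced by the compactor.
      limits_config.retention_period = "720h";
      compactor = {
        working_directory = "/var/lib/observability/loki/compactor";
        retention_enabled = true;
        delete_request_store = "filesystem";
      };
    };
  };
}
```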
2. Prometheus
- Port: `9090`
- Storage volume: `/var/lib/observability/prometheus/` (TSDB)
- Retention: 90 days
- Expected disk: ~3GB at 90 days with a single application binary as the main scrape target; size for 10GB
- RAM: ~300MB
- Scrape targets (managed via NixOS config; list will grow):
  - `be-platform` (host: localhost, port: TBD; the backend agent will expose `/metrics`, and the working assumption is `8000/metrics` because 8000 is Rocket's default port)
  - `node-exporter` (already running? add if not; port `9100`)
  - `alloy` self-metrics (port `12345/metrics`)
- Scrape interval: 15s default
- NixOS module: `services.prometheus`
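A possible shape for the Prometheus module config, as a sketch only. The `be-platform` port is the provisional 8000 from this spec. The nested `stateDir` value is an assumption about how to land the TSDB under `/var/lib/observability/` (the option is a path relative to `/var/lib`). The remote-write flag is included on the assumption that Alloy will push OTLP-derived metrics to Prometheus.

```nix
{
  services.prometheus = {
    enable = true;
    port = 9090;
    retentionTime = "90d";
    # Relative to /var/lib; assumed to work as a nested path.
    stateDir = "observability/prometheus";
    globalConfig.scrape_interval = "15s";

    # Lets Alloy push metrics via remote write (assumption: needed for OTLP metrics).
    extraFlags = [ "--web.enable-remote-write-receiver" ];

    # Adds node-exporter if it is not already running.
    exporters.node = {
      enable = true;
      port = 9100;
    };

    scrapeConfigs = [
      {
        job_name = "be-platform";
        # Port is TBD in the spec; 8000 is the provisional value.
        static_configs = [ { targets = [ "localhost:8000" ]; } ];
      }
      {
        job_name = "node-exporter";
        static_configs = [ { targets = [ "localhost:9100" ]; } ];
      }
      {
        job_name = "alloy";
        static_configs = [ { targets = [ "localhost:12345" ]; } ];
      }
    ];
  };
}
```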
3. Grafana
- Port: `3000`
- Storage volume: `/var/lib/observability/grafana/` (sqlite-backed config, dashboards, users)
- Auth: Anonymous viewer disabled. Admin password from secrets store on first boot. Add a single shared "operator" user for now (Jeremy + future ops). Don't wire OAuth yet; Tailscale is the perimeter.
- Datasources (provisioned via NixOS config, not the UI):
  - Loki at `http://localhost:3100`
  - Prometheus at `http://localhost:9090`
- Dashboards: None to provision at install time. The application-side ticket will add starter dashboards as JSON in `/var/lib/observability/grafana/provisioned/`.
- RAM: ~150MB
- NixOS module: `services.grafana`; set `provision.datasources.settings`
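The Grafana items above might look like the following in the NixOS module. This is a hedged sketch: the secret file path and the dashboard provider name are hypothetical, and the admin password is read through Grafana's `$__file{}` expansion so it never lands in the Nix store.

```nix
{
  services.grafana = {
    enable = true;
    dataDir = "/var/lib/observability/grafana";

    settings = {
      server.http_port = 3000;
      "auth.anonymous".enabled = false;
      # Hypothetical secret path; provide it via the existing secrets setup.
      security.admin_password = "$__file{/run/secrets/grafana-admin-password}";
    };

    provision = {
      enable = true;

      datasources.settings = {
        apiVersion = 1;
        datasources = [
          {
            name = "Loki";
            type = "loki";
            access = "proxy";
            url = "http://localhost:3100";
          }
          {
            name = "Prometheus";
            type = "prometheus";
            access = "proxy";
            url = "http://localhost:9090";
            isDefault = true;
          }
        ];
      };

      # Empty at install time; the app-side ticket drops JSON files here.
      dashboards.settings = {
        apiVersion = 1;
        providers = [
          {
            name = "provisioned"; # hypothetical provider name
            options.path = "/var/lib/observability/grafana/provisioned";
          }
        ];
      };
    };
  };
}
```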
4. Grafana Alloy (collector)
- Port: `12345` (admin/self-metrics), `4317` (OTLP gRPC), `4318` (OTLP HTTP), `12347` (Faro frontend telemetry)
- Storage: None required (effectively stateless; Alloy keeps only a small WAL and read positions under its `--storage.path` directory)
- Config file: `/etc/alloy/config.alloy`
- Pipelines needed:
  - journald → Loki: read `podman-balanced-engineering-platform.service` + `podman-balanced-engineering-pwa.service` (or whatever the be-pwa service is named), tag with `app`, `host`, `service_unit`, send to Loki
  - OTLP → Loki/Prometheus: accept OTLP-format traces, logs, and metrics from be-platform once it is instrumented. For now traces are dropped, since there is no Tempo to receive them. Logs go to Loki and metrics go to Prometheus.
  - Faro → Loki: accept browser RUM events from the PWA. Faro events are JSON; ship them to Loki tagged `source=pwa-faro`. (Faro adds a fourth listening port; if simpler, skip Faro for v1 and just have the PWA POST structured errors to GlitchTip.)
- NixOS: Newer nixpkgs revisions ship a `services.alloy` module; check whether meridian's channel has it and prefer it if so. Otherwise install via tarball/binary + systemd unit, or use `virtualisation.oci-containers` with the official `grafana/alloy` image, pinned to a specific version tag rather than `latest`.
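The journald and OTLP pipelines could be sketched as below, assuming the pinned nixpkgs provides the `services.alloy` module. The component labels (`local`, `journal`, `be_platform`, `default`) are illustrative names, only the be-platform unit is shown (the be-pwa source would be a second `loki.source.journal` block), and the Faro receiver is left out.

```nix
{
  # Assumption: services.alloy exists in the pinned nixpkgs revision.
  services.alloy = {
    enable = true;
    configPath = "/etc/alloy";
  };

  environment.etc."alloy/config.alloy".text = ''
    // Push endpoint of the local Loki instance.
    loki.write "local" {
      endpoint {
        url = "http://localhost:3100/loki/api/v1/push"
      }
    }

    // Map journal metadata onto the labels the spec asks for.
    loki.relabel "journal" {
      forward_to = []

      rule {
        source_labels = ["__journal__systemd_unit"]
        target_label  = "service_unit"
      }
      rule {
        source_labels = ["__journal__hostname"]
        target_label  = "host"
      }
    }

    loki.source.journal "be_platform" {
      matches       = "_SYSTEMD_UNIT=podman-balanced-engineering-platform.service"
      labels        = { app = "be-platform" }
      relabel_rules = loki.relabel.journal.rules
      forward_to    = [loki.write.local.receiver]
    }

    // OTLP intake. No traces output is wired, so traces are dropped.
    otelcol.receiver.otlp "default" {
      grpc {
        endpoint = "0.0.0.0:4317"
      }
      http {
        endpoint = "0.0.0.0:4318"
      }
      output {
        logs    = [otelcol.exporter.loki.default.input]
        metrics = [otelcol.exporter.prometheus.default.input]
      }
    }

    otelcol.exporter.loki "default" {
      forward_to = [loki.write.local.receiver]
    }

    otelcol.exporter.prometheus "default" {
      forward_to = [prometheus.remote_write.local.receiver]
    }

    prometheus.remote_write "local" {
      endpoint {
        url = "http://localhost:9090/api/v1/write"
      }
    }
  '';
}
```

Prometheus only accepts the remote-write push if it is started with `--web.enable-remote-write-receiver`.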
5. GlitchTip (error tracking)
GlitchTip is a self-hosted, Sentry-SDK-compatible error tracker. Existing Sentry SDKs in be-platform / be-pwa work against it with no code changes once they are given a GlitchTip DSN. It runs as ~4 services versus Sentry's ~10, and is Postgres-backed instead of ClickHouse + Kafka.
- Web UI port: `8000` (internal; needs a different external port since be-platform also uses 8000; map to `8200`)
- Ingestion endpoint: Same port, paths `/api/<projectId>/envelope/` and `/api/<projectId>/store/`
- Dependencies it brings:
  - PostgreSQL: dedicated DB for glitchtip
  - Redis: cache + Celery broker
  - Celery worker: async event processing
  - Celery beat: scheduled tasks
- Storage:
  - `/var/lib/observability/glitchtip/postgres/` (Postgres data dir)
  - `/var/lib/observability/glitchtip/uploads/` (attachments, minidumps)
- Retention: 90 days of events
- RAM: Postgres ~200MB + Redis ~50MB + 2× Celery + web = ~800MB total. Budget 1.5GB to leave headroom.
- NixOS: No first-party module. Use `virtualisation.oci-containers` with the official `glitchtip/glitchtip` image. Pin a version.
- Secrets needed (deploy via sops-nix or whatever you use today). The names below are secrets-store names; the GlitchTip container itself reads them as `SECRET_KEY`, `DATABASE_URL`, and `DEFAULT_FROM_EMAIL`:
  - `GLITCHTIP_SECRET_KEY`: Django secret (generate 64 random bytes)
  - `GLITCHTIP_DATABASE_URL`: `postgres://glitchtip:<pw>@localhost:5433/glitchtip` (separate port from any existing Postgres)
  - `GLITCHTIP_DEFAULT_FROM_EMAIL`: from address for issue notifications
  - Email settings: reuse Resend (be-platform already uses it) or whatever SMTP is configured
- External access: Cloudflare route for `/api/*/envelope/` and `/api/*/store/` only, pointed at meridian:8200. Admin UI stays Tailscale-only at `meridian.<tailnet>:8200/`.
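As an illustration of the container approach, here is a sketch of the web container only; the worker and beat containers would follow the same pattern with a different command. The image tag is a placeholder to be replaced with the pinned release, and the env-file path and the in-container uploads path (`/code/uploads`) are assumptions to check against the image's documentation.

```nix
{
  virtualisation.oci-containers = {
    backend = "podman";

    containers.glitchtip-web = {
      # Placeholder tag: substitute the pinned release, never "latest".
      image = "glitchtip/glitchtip:<pinned-version>";

      # Host 8200 -> container 8000, leaving 8000 free for be-platform.
      ports = [ "8200:8000" ];

      environment = {
        PORT = "8000";
        # 90 days of event retention.
        GLITCHTIP_MAX_EVENT_LIFE_DAYS = "90";
      };

      # Hypothetical secrets-managed env file that provides
      # SECRET_KEY, DATABASE_URL, and DEFAULT_FROM_EMAIL.
      environmentFiles = [ "/run/secrets/glitchtip.env" ];

      # Assumed upload location inside the image.
      volumes = [
        "/var/lib/observability/glitchtip/uploads:/code/uploads"
      ];
    };
  };
}
```

Inside a container, `localhost` refers to the container itself, so a `localhost:5433` database URL only resolves to the host's Postgres under host networking or a shared pod; with the port mapping shown here the URL would need the host's address instead.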
Resource budget for the whole stack
| Component | RAM | Disk |
|---|---|---|
| Loki | 200MB | 20GB |
| Prometheus | 300MB | 10GB |
| Grafana | 150MB | 1GB |
| Alloy | 200MB | — |
| GlitchTip (all) | 1.5GB | 10GB |
| Total | ~2.4GB | ~41GB |
Meridian currently has 16GB RAM per the project-instructions memo (the Dockerfile dep-cache notes "16384 MiB"). 2.4GB is fine.
Storage: confirm /var/lib/observability/ has 50GB+ available. If
meridian's root filesystem is tight, mount a dedicated volume.
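If a dedicated volume turns out to be necessary, the mount plus the directories that no module creates on its own could be declared roughly as follows. The device label is hypothetical, and the ownership and modes are placeholders to adjust to whichever users the containers run as.

```nix
{
  # Hypothetical device label; only needed if root is too small.
  fileSystems."/var/lib/observability" = {
    device = "/dev/disk/by-label/observability";
    fsType = "ext4";
  };

  # Directories for the container-based services. The Loki, Prometheus,
  # and Grafana modules are expected to create their own data dirs.
  systemd.tmpfiles.rules = [
    "d /var/lib/observability 0755 root root -"
    "d /var/lib/observability/glitchtip 0750 root root -"
    "d /var/lib/observability/glitchtip/postgres 0750 root root -"
    "d /var/lib/observability/glitchtip/uploads 0750 root root -"
  ];
}
```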
Dependency / boot order
The systemd ordering should be roughly:
1. `postgresql-glitchtip.service` + `redis-glitchtip.service` (or whatever you name them)
2. `glitchtip-web.service` + `glitchtip-worker.service` + `glitchtip-beat.service`: depend on (1)
3. `loki.service`
4. `prometheus.service`
5. `alloy.service`: depends on (3) + (4) being reachable (it'll retry, so a soft dep is fine)
6. `grafana.service`: depends on (3) + (4) being reachable
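The ordering could be expressed with systemd dependency options along these lines. This is a fragment meant to merge with the unit definitions made elsewhere, and the GlitchTip unit names are the provisional ones from this spec (units generated by `virtualisation.oci-containers` are typically prefixed with `podman-`).

```nix
{
  # Soft dependencies: start after Loki and Prometheus when they are
  # present, but do not fail if either is down.
  systemd.services.alloy = {
    wants = [ "loki.service" "prometheus.service" ];
    after = [ "loki.service" "prometheus.service" ];
  };

  systemd.services.grafana = {
    wants = [ "loki.service" "prometheus.service" ];
    after = [ "loki.service" "prometheus.service" ];
  };

  # Hard dependency: GlitchTip cannot run without its database and broker.
  # Unit names are provisional; repeat for the worker and beat units.
  systemd.services.glitchtip-web = {
    requires = [ "postgresql-glitchtip.service" "redis-glitchtip.service" ];
    after = [ "postgresql-glitchtip.service" "redis-glitchtip.service" ];
  };
}
```

Using `wants` rather than `requires` for Alloy and Grafana keeps the stack failing open, in line with the soft-dependency note in the list.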
The two podman containers be-platform + be-pwa keep their existing
boot order — they don't depend on observability being up. Observability
should fail open.
Backup policy
- `/var/lib/observability/grafana/grafana.db`: daily snapshot, retain 7 days. Contains dashboards + users.
- `/var/lib/observability/glitchtip/postgres/`: `pg_dump` daily, retain 14 days.
- `/var/lib/observability/loki/`: no backup. Logs are reproducible-ish from journald; not worth the disk.
- `/var/lib/observability/prometheus/`: no backup. Same reasoning.
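The two backup jobs could be wired as systemd timers via `startAt`, sketched below. The destination directory under `/var/backups/observability/` and the `PGPASSFILE` path are hypothetical, and the pruning keeps files for roughly the stated number of days.

```nix
{ pkgs, ... }:
{
  systemd.services.backup-grafana-db = {
    startAt = "daily";
    serviceConfig.Type = "oneshot";
    path = [ pkgs.sqlite pkgs.coreutils pkgs.findutils ];
    script = ''
      dest=/var/backups/observability/grafana   # hypothetical location
      mkdir -p "$dest"
      # Online-safe copy of the SQLite database.
      sqlite3 /var/lib/observability/grafana/grafana.db \
        ".backup '$dest/grafana-$(date +%F).db'"
      # Drop snapshots older than about 7 days.
      find "$dest" -name 'grafana-*.db' -mtime +7 -delete
    '';
  };

  systemd.services.backup-glitchtip-db = {
    startAt = "daily";
    serviceConfig.Type = "oneshot";
    path = [ pkgs.postgresql pkgs.gzip pkgs.coreutils pkgs.findutils ];
    # Hypothetical pgpass file supplied by the secrets setup.
    environment.PGPASSFILE = "/run/secrets/glitchtip-pgpass";
    script = ''
      set -o pipefail
      dest=/var/backups/observability/glitchtip   # hypothetical location
      mkdir -p "$dest"
      pg_dump -h localhost -p 5433 -U glitchtip glitchtip \
        | gzip > "$dest/glitchtip-$(date +%F).sql.gz"
      # Drop dumps older than about 14 days.
      find "$dest" -name 'glitchtip-*.sql.gz' -mtime +14 -delete
    '';
  };
}
```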
If meridian dies, dashboards + error history matter. Logs and raw metrics don't.
Application-side integration (informational, not your scope)
This is what the app-side tickets will do once the stack is up. Listed here so you understand which ports / paths need to exist.
be-platform:
- Will add a Prometheus metrics endpoint at /metrics on port 8000 (or whatever its existing Rocket port is). Prometheus scrape target.
- Will configure tracing/logging to either: (a) write structured JSON to stderr (Alloy reads via journald), or (b) ship OTLP to Alloy at localhost:4318. Pick (a) for v1.
- Will install the sentry crate pointed at GlitchTip's project DSN.
be-pwa:
- Will install @sentry/react pointed at GlitchTip's project DSN.
- May install Grafana Faro browser SDK pointed at Alloy's Faro endpoint (port 12347). Defer this until phase 2 if it's painful.
What I need from you
A NixOS PR (or branch on whatever repo manages meridian's config) that:
- Creates the persistent volumes under `/var/lib/observability/`
- Adds the 5 services with the configs above
- Exposes them on the Tailscale interface only (except the GlitchTip event ingestion paths)
- Provisions Grafana datasources via config (not UI)
- Wires up the backup cron jobs
- Sets up the secrets via your existing secrets path
- Returns:
  - The URLs / Tailscale hostnames for Grafana, Loki, Prometheus, Alloy, GlitchTip
  - The Grafana initial admin password (after first boot)
  - The GlitchTip first-user signup URL (or seed an initial admin user)
  - The Sentry DSN for the GlitchTip "be-platform" and "be-pwa" projects (after creating them in the UI)
  - The Alloy OTLP HTTP endpoint (for be-platform to ship to)
  - Confirmation of the Cloudflare route for `/api/*/envelope/` and `/api/*/store/` → meridian:8200
Once those are returned, the next step on the application side is two
tickets (one per app) wiring Sentry SDKs + Prometheus /metrics +
structured logging. Those land independently.
Versions to pin
Don't track latest for any of these. Initial pins (current stable as
of writing; update before deploy):
- Loki: `3.5.x`
- Prometheus: `2.55.x` (or 3.x if 3.x has stabilized)
- Grafana: `11.3.x`
- Alloy: `1.5.x`
- GlitchTip: `4.x`
Out of scope for this delivery
- Mimir / Tempo — not now
- Loki object storage (S3 / MinIO) — not at this scale
- Alertmanager — phase 2; we'll wire alerts after we have dashboards worth alerting on
- PWA Faro RUM — phase 2 unless you find it trivial
- OAuth/SSO for Grafana — Tailscale is the perimeter; revisit if you ever expose Grafana publicly
- Long-term log shipping (e.g. to S3) — phase 2 if compliance ever requires it
One thing I'd appreciate
If you hit a service where the nixOS module is awkward or the container image misbehaves, flag it before forcing it. Two of these (Alloy, GlitchTip) are container-only or might have rough edges in nix. I'd rather know early than discover it after install.