Observability stack on meridian: NixOS module spec

This document specifies the services to add to meridian's NixOS config to support observability for be-platform (Rust + Rocket + SQLite) and be-pwa (TypeScript + React + Vite, served as static files), both of which already run as podman containers on meridian.

The audience is the NixOS-config agent. Application-side instrumentation is handled in separate tickets and is described here only as context.

Goal

Five services that together cover:

Logs (Loki) — structured logs from be-platform + access logs from the PWA's reverse proxy
Metrics (Prometheus) — backend RPS / latency / error rate / SQLite size, plus PWA web-vitals
Visualization (Grafana) — dashboards for the above + Loki/Prometheus query UI
Collector (Grafana Alloy) — single agent on meridian that reads journald, scrapes Prometheus targets, accepts OTLP + Faro from the PWA, ships to Loki + Prometheus
Error tracking (GlitchTip) — Sentry-SDK-compatible self-hosted error tracking for both apps

We are explicitly NOT deploying Mimir or Tempo for now. They can be added later if Prometheus retention becomes a bottleneck (unlikely at this scale) or distributed tracing becomes necessary (also unlikely at this scale).

Access model: Tailscale-only

All five services bind to the Tailscale interface only. No Cloudflare, no public DNS. They are reachable at <service>.meridian.<tailnet-suffix> from machines on the tailnet.
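One way to enforce the Tailscale-only rule is at the firewall rather than per service. The fragment below is a minimal sketch, assuming the Tailscale interface has its default name tailscale0 and that GlitchTip is published on host port 8200 as described later in this document. Ports published by containers are handled by the container runtime's own NAT rules and may not be governed by this list, so verify those separately.

```nix
{ ... }:
{
  # Assumption: Tailscale is managed by NixOS and uses the default
  # interface name tailscale0.
  services.tailscale.enable = true;

  # Open the observability ports on the tailnet interface only. The
  # global allowedTCPPorts list is left untouched, so these ports stay
  # closed on every other interface even if a service listens on 0.0.0.0.
  networking.firewall.interfaces."tailscale0".allowedTCPPorts = [
    3000  # Grafana
    3100  # Loki
    9090  # Prometheus
    12345 # Alloy admin / self-metrics
    8200  # GlitchTip web UI (host-side port)
  ];
}
```

The alternative is to bind each service to the Tailscale IP directly, which fails at boot if the interface is not up yet; the firewall approach avoids that ordering problem.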

Exception worth considering: GlitchTip's event ingestion endpoint may need to be publicly reachable so the deployed PWA running on users' phones can POST errors. Two options:

(a) Expose only the GlitchTip event ingestion endpoint via Cloudflare (the rest of GlitchTip — admin UI, settings — stays Tailscale-only)
(b) Run a thin proxy in be-platform that accepts events from the PWA and forwards to GlitchTip's internal endpoint

Default to (a): less code to write and no proxy to maintain. Note this in the Cloudflare config: only the ingestion paths (/api/<projectId>/envelope/ and /api/<projectId>/store/) get exposed.

Services

1. Loki

Port: 3100
Mode: Single-binary (filesystem storage, no S3/MinIO)
Storage volume: /var/lib/observability/loki/ (chunks + index + WAL)
Retention: 30 days
Expected disk: ~5GB after 30 days at current log volume; size for 20GB to be safe
RAM: ~200MB
Config knobs:
  auth_enabled: false (Tailscale ACL is the auth boundary)
  Single tenant
  Compactor enabled for retention enforcement
NixOS module: services.loki exists in nixpkgs; prefer it over a container
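The Loki settings above could translate into a module fragment along these lines. This is a sketch, not a tested config: the schema start date and the subdirectory names under the data directory are assumptions, and the exact Loki keys should be checked against the pinned Loki version.

```nix
{ ... }:
let
  lokiDir = "/var/lib/observability/loki";
in
{
  services.loki = {
    enable = true;
    dataDir = lokiDir;
    configuration = {
      # Single tenant; the Tailscale ACL is the auth boundary.
      auth_enabled = false;
      server.http_listen_port = 3100;

      # Single-binary mode with filesystem storage (no S3/MinIO).
      common = {
        path_prefix = lokiDir;
        replication_factor = 1;
        ring = {
          instance_addr = "127.0.0.1";
          kvstore.store = "inmemory";
        };
        storage.filesystem = {
          chunks_directory = "${lokiDir}/chunks";
          rules_directory = "${lokiDir}/rules";
        };
      };

      schema_config.configs = [
        {
          from = "2024-01-01"; # assumption: any date before first ingest
          store = "tsdb";
          object_store = "filesystem";
          schema = "v13";
          index = { prefix = "index_"; period = "24h"; };
        }
      ];

      # 30 days of retention, enforced by the compactor.
      limits_config.retention_period = "720h";
      compactor = {
        working_directory = "${lokiDir}/compactor";
        retention_enabled = true;
        delete_request_store = "filesystem";
      };
    };
  };
}
```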

2. Prometheus

Port: 9090
Storage volume: /var/lib/observability/prometheus/ (TSDB)
Retention: 90 days
Expected disk: ~3GB at 90 days with a single application scrape target; size for 10GB
RAM: ~300MB
Scrape targets (managed via NixOS config; the list will grow):
  be-platform (host: localhost, port: TBD; the backend agent will expose /metrics, and Rocket's default port is 8000, so assume localhost:8000/metrics until confirmed)
  node-exporter (port 9100; add it if it is not already running)
  Alloy self-metrics (localhost:12345/metrics)
Scrape interval: 15s default
NixOS module: services.prometheus
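As a starting point, the Prometheus settings above might look like the following. It is a sketch under two assumptions: be-platform ends up on port 8000, and the state directory can be nested under /var/lib (the module's stateDir option is relative to /var/lib, so confirm that the nested path is accepted on your nixpkgs release).

```nix
{ ... }:
{
  services.prometheus = {
    enable = true;
    port = 9090;
    # Relative to /var/lib, i.e. /var/lib/observability/prometheus.
    stateDir = "observability/prometheus";
    retentionTime = "90d";
    globalConfig.scrape_interval = "15s";

    # Host metrics; drop this if node-exporter is already configured.
    exporters.node = {
      enable = true;
      port = 9100;
    };

    scrapeConfigs = [
      {
        job_name = "be-platform";
        metrics_path = "/metrics";
        # Assumption: the port is still TBD; 8000 is Rocket's default.
        static_configs = [ { targets = [ "localhost:8000" ]; } ];
      }
      {
        job_name = "node";
        static_configs = [ { targets = [ "localhost:9100" ]; } ];
      }
      {
        job_name = "alloy";
        static_configs = [ { targets = [ "localhost:12345" ]; } ];
      }
    ];
  };
}
```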

3. Grafana

Port: 3000
Storage volume: /var/lib/observability/grafana/ (sqlite-backed config, dashboards, users)
Auth: Anonymous viewer disabled. Admin password from secrets store on first boot. Add a single shared "operator" user for now (Jeremy + future ops). Don't wire OAuth yet — Tailscale is the perimeter.
Datasources (provisioned via NixOS config, not the UI):
  Loki at http://localhost:3100
  Prometheus at http://localhost:9090
Dashboards: None to provision at install time. The application-side ticket will add starter dashboards as JSON in /var/lib/observability/grafana/provisioned/.
RAM: ~150MB
NixOS module: services.grafana — set provision.datasources.settings
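A sketch of the Grafana settings and datasource provisioning described above. The secret file path is hypothetical and stands in for whatever the secrets store renders; the listen address assumes access is restricted at the firewall rather than by bind address.

```nix
{ ... }:
{
  services.grafana = {
    enable = true;
    dataDir = "/var/lib/observability/grafana";

    settings = {
      server = {
        http_addr = "0.0.0.0"; # assumption: the firewall limits access
        http_port = 3000;
      };
      # Anonymous viewer disabled.
      "auth.anonymous".enabled = false;
      # Grafana reads the password from this file at startup.
      # Hypothetical path; point it at the file your secrets tool writes.
      security.admin_password = "$__file{/run/secrets/grafana-admin-password}";
    };

    provision = {
      enable = true;
      datasources.settings = {
        apiVersion = 1;
        datasources = [
          {
            name = "Loki";
            type = "loki";
            access = "proxy";
            url = "http://localhost:3100";
          }
          {
            name = "Prometheus";
            type = "prometheus";
            access = "proxy";
            url = "http://localhost:9090";
            isDefault = true;
          }
        ];
      };
    };
  };
}
```

The shared "operator" user is not covered here; it would be created through the Grafana UI or API after first boot.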

4. Grafana Alloy (collector)

Port: 12345 (admin/self-metrics), 4317 (OTLP gRPC), 4318 (OTLP HTTP), 12347 (Faro frontend telemetry)
Storage: None required (stateless collector)
Config file: /etc/alloy/config.alloy
Pipelines needed:
  journald → Loki: read podman-balanced-engineering-platform.service + podman-balanced-engineering-pwa.service (or whatever the be-pwa service is named), tag with app, host, service_unit, and send to Loki
  OTLP → Loki/Prometheus: accept OTLP-format traces, logs, and metrics from be-platform once it is instrumented. For now traces are dropped, since there is no Tempo to receive them; logs go to Loki and metrics to Prometheus.
  Faro → Loki: accept browser RUM events from the PWA. Faro events are JSON; ship them to Loki tagged source=pwa-faro. (Faro adds a fourth Alloy port; if simpler, skip Faro for v1 and just have the PWA POST structured errors to GlitchTip.)
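NixOS: Check whether the pinned nixpkgs release ships a services.alloy module (recent releases do); if it does, prefer it. Otherwise install via tarball/binary + systemd unit, or use virtualisation.oci-containers with the official grafana/alloy image. Either way, pin a specific version rather than latest.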
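However Alloy ends up installed, its config file can be managed from Nix. The fragment below sketches only the journald → Loki pipeline; the OTLP and Faro receivers are left out. The label values and the use of one journal source per unit are assumptions, and the component arguments should be checked against the pinned Alloy version.

```nix
{ ... }:
{
  # Writes /etc/alloy/config.alloy, the path named above.
  environment.etc."alloy/config.alloy".text = ''
    // Push endpoint of the local Loki instance.
    loki.write "local" {
      endpoint {
        url = "http://localhost:3100/loki/api/v1/push"
      }
    }

    // Copy the systemd unit name into the service_unit label.
    loki.relabel "journal" {
      forward_to = []
      rule {
        source_labels = ["__journal__systemd_unit"]
        target_label  = "service_unit"
      }
    }

    // One journal source per unit keeps the app label explicit.
    loki.source.journal "be_platform" {
      matches       = "_SYSTEMD_UNIT=podman-balanced-engineering-platform.service"
      relabel_rules = loki.relabel.journal.rules
      labels        = { app = "be-platform", host = "meridian" }
      forward_to    = [loki.write.local.receiver]
    }

    // Assumption: the be-pwa unit really has this name.
    loki.source.journal "be_pwa" {
      matches       = "_SYSTEMD_UNIT=podman-balanced-engineering-pwa.service"
      relabel_rules = loki.relabel.journal.rules
      labels        = { app = "be-pwa", host = "meridian" }
      forward_to    = [loki.write.local.receiver]
    }
  '';
}
```

If Alloy runs as a container, it also needs the host journal mounted read-only, which this sketch does not cover.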

5. GlitchTip (error tracking)

GlitchTip is a self-hosted, Sentry-SDK-compatible error tracker. The standard Sentry SDKs for be-platform and be-pwa work unmodified when pointed at a GlitchTip DSN. It runs about 4 services versus roughly 10 for self-hosted Sentry, and is backed by Postgres instead of ClickHouse + Kafka.

Web UI port: 8000 inside the container. be-platform also uses 8000, so publish it on host port 8200.
Ingestion endpoint: Same port, paths /api/<projectId>/envelope/ and /api/<projectId>/store/
Dependencies it brings:
  PostgreSQL: dedicated DB for GlitchTip
  Redis: cache + Celery broker
  Celery worker: async event processing
  Celery beat: scheduled tasks
Storage:
  /var/lib/observability/glitchtip/postgres/ (Postgres data dir)
  /var/lib/observability/glitchtip/uploads/ (attachments, minidumps)
Retention: 90 days of events
RAM: Postgres ~200MB + Redis ~50MB + 2× Celery + web = ~800MB total. Budget 1.5GB to leave headroom.
NixOS: No first-party module. Use virtualisation.oci-containers with the official glitchtip/glitchtip image. Pin a version.
Secrets needed (deploy via sops-nix or whatever you use today):
  GLITCHTIP_SECRET_KEY: Django secret (generate 64 random bytes)
  GLITCHTIP_DATABASE_URL: postgres://glitchtip:<pw>@localhost:5433/glitchtip (separate port from any existing Postgres)
  GLITCHTIP_DEFAULT_FROM_EMAIL: from address for issue notifications
  Email settings: reuse Resend (be-platform already uses it) or whatever SMTP is configured
External access: Cloudflare route for /api/*/envelope/ and /api/*/store/ only, pointed at meridian:8200. Admin UI stays Tailscale-only at meridian.<tailnet>:8200/.
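A rough sketch of the GlitchTip containers under virtualisation.oci-containers. Several things here are assumptions to verify against the GlitchTip install docs for the pinned version: the image tag placeholder, the env file path, the in-container uploads path, and the combined worker + beat script (the upstream sample compose file runs both in one container, whereas this spec lists them separately). The env file is assumed to map the secrets above onto the variable names the image actually reads, commonly SECRET_KEY, DATABASE_URL, REDIS_URL, and DEFAULT_FROM_EMAIL.

```nix
{ ... }:
let
  # Placeholder: replace with the pinned 4.x tag chosen at deploy time.
  glitchtipImage = "glitchtip/glitchtip:<pinned-4.x-tag>";
  # Hypothetical path to an env file rendered by the secrets tool.
  envFile = "/run/secrets/glitchtip.env";
  # Assumption: the image stores uploads under /code/uploads.
  uploads = "/var/lib/observability/glitchtip/uploads:/code/uploads";
in
{
  virtualisation.oci-containers.backend = "podman";

  virtualisation.oci-containers.containers = {
    glitchtip-web = {
      image = glitchtipImage;
      # Container port 8000 published on host port 8200,
      # because be-platform already uses 8000 on the host.
      ports = [ "8200:8000" ];
      environmentFiles = [ envFile ];
      volumes = [ uploads ];
    };

    glitchtip-worker = {
      image = glitchtipImage;
      # Assumption: this script starts the Celery worker with beat.
      cmd = [ "./bin/run-celery-with-beat.sh" ];
      environmentFiles = [ envFile ];
      volumes = [ uploads ];
    };
  };
}
```

Note that localhost inside a container refers to the container itself, so a DATABASE_URL pointing at localhost:5433 only works with host networking or a host alias; decide which before wiring the secret.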

Resource budget for the whole stack

Component	RAM	Disk
Loki	200MB	20GB
Prometheus	300MB	10GB
Grafana	150MB	1GB
Alloy	200MB	—
GlitchTip (all)	1.5GB	10GB
Total	~2.4GB	~41GB

Meridian currently has 16GB RAM per the project-instructions memo (the Dockerfile dep-cache notes "16384 MiB"). 2.4GB is fine.

Storage: confirm /var/lib/observability/ has 50GB+ available. If meridian's root filesystem is tight, mount a dedicated volume.
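The directory layout under /var/lib/observability/ can be declared with tmpfiles rules. This is a sketch: the loki and grafana users only exist once those modules are enabled, and the owners of the GlitchTip directories are placeholders that depend on the UIDs used inside the containers. Prometheus is omitted on the assumption that its module creates the state directory itself.

```nix
{ ... }:
{
  systemd.tmpfiles.rules = [
    # type path                                      mode user    group   age
    "d /var/lib/observability                        0755 root    root    -"
    "d /var/lib/observability/loki                   0750 loki    loki    -"
    "d /var/lib/observability/grafana                0750 grafana grafana -"
    "d /var/lib/observability/glitchtip              0750 root    root    -"
    # Placeholder owners: match these to the container UIDs.
    "d /var/lib/observability/glitchtip/postgres     0700 root    root    -"
    "d /var/lib/observability/glitchtip/uploads      0750 root    root    -"
  ];
}
```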

Dependency / boot order

The systemd ordering should be roughly:

1. postgresql-glitchtip.service + redis-glitchtip.service (or whatever you name them)
2. glitchtip-web.service + glitchtip-worker.service + glitchtip-beat.service: depend on (1)
3. loki.service
4. prometheus.service
5. alloy.service: depends on (3) + (4) being reachable (it retries, so a soft dependency is fine)
6. grafana.service: depends on (3) + (4) being reachable
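The ordering above can be expressed as systemd dependencies layered onto the generated units. A minimal sketch, assuming the GlitchTip web container is named glitchtip-web (oci-containers then generates podman-glitchtip-web.service) and that the Postgres and Redis units carry the placeholder names used in this list:

```nix
{ ... }:
{
  # Soft dependency: start after Loki and Prometheus when they are
  # present, but do not fail Grafana if one of them is down.
  systemd.services.grafana = {
    after = [ "loki.service" "prometheus.service" ];
    wants = [ "loki.service" "prometheus.service" ];
  };

  # Hard dependency: the web container is useless without its database.
  # Unit names on the right are hypothetical; use whatever you name them.
  systemd.services."podman-glitchtip-web" = {
    after = [ "postgresql-glitchtip.service" "redis-glitchtip.service" ];
    requires = [ "postgresql-glitchtip.service" "redis-glitchtip.service" ];
  };
}
```

Using wants rather than requires for Grafana and Alloy keeps the stack in line with the fail-open goal: one component being down should not take the others with it.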

The two podman containers be-platform + be-pwa keep their existing boot order — they don't depend on observability being up. Observability should fail open.

Backup policy

/var/lib/observability/grafana/grafana.db — daily snapshot, retain 7 days. Contains dashboards + users.
/var/lib/observability/glitchtip/postgres/ — pg_dump daily, retain 14 days.
/var/lib/observability/loki/ — no backup. Logs are reproducible-ish from journald; not worth the disk.
/var/lib/observability/prometheus/ — no backup. Same reasoning.

If meridian dies, dashboards + error history matter. Logs and raw metrics don't.
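The two backups could be wired as systemd timers instead of cron. The sketch below assumes a destination of /var/backup/observability (hypothetical), a PGPASSFILE rendered by the secrets tool (hypothetical path), and a pg_dump client at least as new as the GlitchTip Postgres server.

```nix
{ pkgs, ... }:
{
  systemd.services.grafana-db-backup = {
    description = "Daily snapshot of the Grafana SQLite database";
    startAt = "daily";
    path = [ pkgs.sqlite ];
    serviceConfig.Type = "oneshot";
    script = ''
      set -eu
      dest=/var/backup/observability/grafana
      mkdir -p "$dest"
      # .backup takes a consistent copy even while Grafana is running.
      sqlite3 /var/lib/observability/grafana/grafana.db \
        ".backup '$dest/grafana-$(date +%F).db'"
      # Drop snapshots older than roughly 7 days.
      find "$dest" -name 'grafana-*.db' -mtime +7 -delete
    '';
  };

  systemd.services.glitchtip-pg-backup = {
    description = "Daily pg_dump of the GlitchTip database";
    startAt = "daily";
    path = [ pkgs.postgresql ];
    serviceConfig.Type = "oneshot";
    # Hypothetical secret file in .pgpass format.
    environment.PGPASSFILE = "/run/secrets/glitchtip-pgpass";
    script = ''
      set -eu
      dest=/var/backup/observability/glitchtip
      mkdir -p "$dest"
      pg_dump --host=localhost --port=5433 --username=glitchtip \
        --format=custom --file="$dest/glitchtip-$(date +%F).dump" glitchtip
      # Drop dumps older than roughly 14 days.
      find "$dest" -name 'glitchtip-*.dump' -mtime +14 -delete
    '';
  };
}
```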

Application-side integration (informational, not your scope)

This is what the app-side tickets will do once the stack is up. Listed here so you understand which ports / paths need to exist.

be-platform:
- Will add a Prometheus metrics endpoint at /metrics on port 8000 (or whatever its existing Rocket port is). This is the Prometheus scrape target.
- Will configure tracing/logging to either (a) write structured JSON to stderr (Alloy reads it via journald), or (b) ship OTLP to Alloy at localhost:4318. Pick (a) for v1.
- Will install the sentry crate pointed at GlitchTip's project DSN.

be-pwa:
- Will install @sentry/react pointed at GlitchTip's project DSN.
- May install the Grafana Faro browser SDK pointed at Alloy's Faro endpoint (port 12347). Defer this until phase 2 if it's painful.

What I need from you

A NixOS PR (or branch on whatever repo manages meridian's config) that:

Creates the persistent volumes under /var/lib/observability/
Adds the 5 services with the configs above
Exposes them on the Tailscale interface only (except the GlitchTip event ingestion path)
Provisions Grafana datasources via config (not UI)
Wires up the backup cron jobs
Sets up the secrets via your existing secrets path
Returns:
  The URLs / Tailscale hostnames for Grafana, Loki, Prometheus, Alloy, GlitchTip
  The Grafana initial admin password (after first boot)
  The GlitchTip first-user signup URL (or seed an initial admin user)
  The Sentry DSN for the GlitchTip "be-platform" and "be-pwa" projects (after creating them in the UI)
  The Alloy OTLP HTTP endpoint (for be-platform to ship to)
  Confirmation of the Cloudflare route for /api/*/envelope/ and /api/*/store/ → meridian:8200

Once those are returned, the next step on the application side is two tickets (one per app) wiring Sentry SDKs + Prometheus /metrics + structured logging. Those land independently.

Versions to pin

Don't track latest for any of these. Initial pins (current stable as of writing — update before deploy):

Loki: 3.5.x
Prometheus: 2.55.x (or 3.x if 3.x has stabilized)
Grafana: 11.3.x
Alloy: 1.5.x
GlitchTip: 4.x

Out of scope for this delivery

Mimir / Tempo — not now
Loki object storage (S3 / MinIO) — not at this scale
Alertmanager — phase 2; we'll wire alerts after we have dashboards worth alerting on
PWA Faro RUM — phase 2 unless you find it trivial
OAuth/SSO for Grafana — Tailscale is the perimeter; revisit if you ever expose Grafana publicly
Long-term log shipping (e.g. to S3) — phase 2 if compliance ever requires it

One thing I'd appreciate

If you hit a service where the NixOS module is awkward or the container image misbehaves, flag it before forcing it. Two of these (Alloy, GlitchTip) are container-only or might have rough edges in Nix. I'd rather know early than discover it after install.