You can’t manage what you can’t see. Our observability stack handles metrics, logs, traces, and profiling — all multi-tenant, all on the same 3 NUCs.
## The LGTM Stack
LGTM stands for Loki, Grafana, Tempo, Mimir — Grafana’s open source observability stack:
- Mimir — Long-term metrics storage, Prometheus-compatible. Replaces standalone Prometheus with a scalable, multi-tenant backend.
- Loki — Log aggregation. Like Prometheus, but for logs. Label-based querying without indexing full log content.
- Tempo — Distributed tracing backend. Stores traces from OpenTelemetry, Jaeger, or Zipkin.
- Grafana — Visualization. Single pane of glass for everything.
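Wiring the four together in Grafana is mostly datasource configuration. A minimal provisioning sketch — the service URLs are assumptions based on typical in-cluster installs, not our exact values:

```yaml
# Grafana datasource provisioning (sketch; URLs are hypothetical
# in-cluster service names, adjust for your namespaces)
apiVersion: 1
datasources:
  - name: Mimir
    type: prometheus          # Mimir speaks the Prometheus API
    url: http://mimir-nginx.mimir.svc/prometheus
  - name: Loki
    type: loki
    url: http://loki-gateway.loki.svc
  - name: Tempo
    type: tempo
    url: http://tempo-query-frontend.tempo.svc:3100
```

Because Mimir is Prometheus-compatible, Grafana treats it as a plain `prometheus` datasource; no special plugin is required.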
## Grafana Alloy: The Universal Collector
Alloy is Grafana’s unified telemetry collector. It replaces the separate agents you would otherwise run (Prometheus scrapers, Promtail-style log shippers, OpenTelemetry collectors). One agent collects everything:
- Metrics from Kubernetes API, nodes, and applications
- Logs from container stdout/stderr and system journals
- Traces via OpenTelemetry protocol
- Profiles via Pyroscope integration
We deploy Alloy as a DaemonSet via ApplicationSet — it automatically runs on every node.
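To make the “one agent” claim concrete, here is a minimal Alloy pipeline that scrapes pod metrics and tails container logs. This is a sketch, not our full config; the Mimir/Loki endpoints and the tenant ID are assumptions:

```alloy
// Discover pods via the Kubernetes API
discovery.kubernetes "pods" {
  role = "pod"
}

// Scrape discovered pods and ship metrics to Mimir
prometheus.scrape "pods" {
  targets    = discovery.kubernetes.pods.targets
  forward_to = [prometheus.remote_write.mimir.receiver]
}

prometheus.remote_write "mimir" {
  endpoint {
    url     = "http://mimir-nginx.mimir.svc/api/v1/push"
    headers = { "X-Scope-OrgID" = "homelab" }
  }
}

// Tail container stdout/stderr and ship it to Loki
loki.source.kubernetes "pods" {
  targets    = discovery.kubernetes.pods.targets
  forward_to = [loki.write.default.receiver]
}

loki.write "default" {
  endpoint {
    url       = "http://loki-gateway.loki.svc/loki/api/v1/push"
    tenant_id = "homelab"
  }
}
```

Trace and profile pipelines follow the same pattern: a receiver component forwarding to a writer component, all in one config file.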
## Multi-Tenancy
Mimir and Loki are configured for multi-tenancy. Each team or application gets its own tenant ID, with isolated storage and query paths. This means:
- No noisy neighbor problems
- Per-tenant retention policies
- Tenant-scoped dashboards in Grafana
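Tenant isolation is enforced by an HTTP header on every read and write. A hedged Python sketch of a tenant-scoped instant query against Mimir’s Prometheus-compatible API — the service URL and tenant name here are made up for illustration:

```python
import urllib.parse
import urllib.request


def tenant_query(base_url: str, tenant: str, promql: str) -> urllib.request.Request:
    """Build an instant query scoped to one tenant.

    Mimir (and Loki) read the tenant from the X-Scope-OrgID header,
    so two teams can run the same query and see only their own data.
    """
    url = f"{base_url}/prometheus/api/v1/query?query={urllib.parse.quote(promql)}"
    return urllib.request.Request(url, headers={"X-Scope-OrgID": tenant})


# Hypothetical in-cluster service URL and tenant ID:
req = tenant_query("http://mimir-nginx.mimir.svc", "team-a", "up")
```

Alloy sets the same header on the write path, so a tenant’s series never mix with anyone else’s at ingest or at query time.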
## Beyond the Basics
We also run:
- Pyroscope — Continuous profiling. Find CPU and memory hotspots without reproducing issues.
- Sentry — Error tracking with a ClickHouse backend. Stack traces, breadcrumbs, and release tracking.
- k8sgpt — AI-powered cluster diagnostics. Points out issues before they become incidents.
## Storage Considerations
All observability data is backed by Ceph object storage (via Rook). Mimir, Loki, and Tempo write to S3-compatible buckets on the same Ceph cluster that serves our persistent volumes. No external S3 needed.
Current usage: ~250GB for metrics, logs, and traces combined, with configurable retention.
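Per-tenant retention is what keeps that footprint bounded. A sketch of Mimir’s runtime overrides — the tenant names and periods are examples, not our actual policy:

```yaml
# Mimir runtime overrides: per-tenant block retention (sketch)
overrides:
  team-a:
    compactor_blocks_retention_period: 30d
  team-b:
    compactor_blocks_retention_period: 90d
```

The compactor deletes blocks older than each tenant’s period, so a chatty dev tenant can be kept on a short leash without touching production’s history.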
Next up: Part 5 — Running an AI agent platform on Kubernetes.
