You can’t manage what you can’t see. Our observability stack handles metrics, logs, traces, and continuous profiles for every workload on the cluster — multi-tenant, all on the same three NUCs as the things it observes, all S3-backed by an internal Ceph cluster.
The LGTM Stack
LGTM is Grafana’s open-source observability stack:
- Mimir — long-term metrics storage, Prometheus-compatible, horizontally scalable, natively multi-tenant. Replaces a single-instance Prometheus with one that can actually grow.
- Loki — log aggregation. Prometheus-style label-based querying, no full-text index, dramatically cheaper than Elasticsearch for the same retention.
- Tempo — distributed tracing backend. Accepts OTLP, Jaeger, and Zipkin. Stores traces on object storage, indexed only by trace ID.
- Grafana — the front end. Single pane of glass for everything.
Mimir, Loki, and Tempo all write to the same S3-compatible object store, served by Rook-Ceph on the cluster itself. No external S3, no cross-region egress, no surprise AWS bill.
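Each backend gets its own bucket via a Rook ObjectBucketClaim. A minimal sketch, using the mimir-blocks bucket from below and assuming a StorageClass named ceph-bucket backed by the Rook object store:

apiVersion: objectbucket.io/v1alpha1
kind: ObjectBucketClaim
metadata:
  name: mimir-blocks
  namespace: monitoring
spec:
  bucketName: mimir-blocks       # bucket created in the Ceph object store
  storageClassName: ceph-bucket  # assumed name; points at the CephObjectStore

Rook answers the claim with a ConfigMap and Secret of the same name holding the S3 endpoint and keys, which each backend can then reference.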
Grafana Alloy: One Agent on Every Node
Alloy is Grafana’s unified telemetry collector — the successor to the Grafana Agent and a drop-in replacement for separate node_exporter, Promtail, and OTel Collector deployments. One DaemonSet, one config, one thing to upgrade:
// The writer components (prometheus.remote_write "mimir", loki.write "loki",
// otelcol.exporter.otlp "tempo") are defined elsewhere in the same config.
discovery.kubernetes "nodes" { role = "node" }
discovery.kubernetes "pods"  { role = "pod" }

prometheus.scrape "kubelet" {
  targets         = discovery.kubernetes.nodes.targets
  forward_to      = [prometheus.remote_write.mimir.receiver]
  scrape_interval = "30s"
}

loki.source.kubernetes "pods" {
  targets    = discovery.kubernetes.pods.targets
  forward_to = [loki.write.loki.receiver]
}

otelcol.receiver.otlp "default" {
  grpc { endpoint = "0.0.0.0:4317" }
  http { endpoint = "0.0.0.0:4318" }
  output { traces = [otelcol.exporter.otlp.tempo.input] }
}
Alloy runs as an ApplicationSet-managed DaemonSet — as soon as a new node joins the cluster, Alloy is on it, collecting, forwarding. No manual step.
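A sketch of that ApplicationSet, assuming the upstream grafana/alloy Helm chart and Argo CD's cluster generator (names and the chart version are illustrative):

apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: alloy
  namespace: argocd
spec:
  generators:
    - clusters: {}              # one Application per registered cluster
  template:
    metadata:
      name: "alloy-{{name}}"
    spec:
      project: default
      source:
        repoURL: https://grafana.github.io/helm-charts
        chart: alloy
        targetRevision: 0.9.2   # illustrative version
      destination:
        server: "{{server}}"
        namespace: monitoring
      syncPolicy:
        automated:
          prune: true
          selfHeal: true

The DaemonSet inside the chart is what puts a pod on every node; the ApplicationSet just keeps the chart deployed and up to date.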
Multi-Tenancy, in Practice
Mimir and Loki are configured for multi-tenancy from day one. Every team, every environment, and every noisy subsystem gets its own tenant ID, with isolated ingestion, storage, and query paths.
The tenant list lives in a single ConfigMap:
overrides:
  platform:
    ingestion_rate: 100000
    max_global_series_per_user: 5000000
    compactor_blocks_retention_period: 90d
  agents:
    ingestion_rate: 50000
    compactor_blocks_retention_period: 14d
  dev:
    ingestion_rate: 20000
    compactor_blocks_retention_period: 7d
What this buys us:
- No noisy-neighbour problems — a chatty agent can’t starve platform metrics out of ingestion.
- Per-tenant retention — we keep platform metrics for 90 days, agent telemetry for two weeks, dev for a week. Storage bill stays flat.
- Tenant-scoped dashboards — Grafana data sources are pre-scoped to a tenant, so a dev dashboard can never accidentally query the whole cluster.
Loki’s tenant split mirrors Mimir’s exactly. We route by the X-Scope-OrgID header Alloy attaches based on a pod label.
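The Grafana side of that scoping is just a provisioned data source with the tenant header baked in. A sketch, with the URL and tenant value standing in for real ones:

apiVersion: 1
datasources:
  - name: Mimir (dev)
    type: prometheus
    url: http://mimir-nginx.monitoring.svc/prometheus  # assumed in-cluster endpoint
    jsonData:
      httpHeaderName1: X-Scope-OrgID
    secureJsonData:
      httpHeaderValue1: dev  # every query through this data source hits one tenant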
Beyond Metrics and Logs
Three more pieces complete the picture:
Pyroscope — continuous CPU and memory profiling across the cluster. Every Go binary we build exports net/http/pprof; Alloy scrapes it the same way it scrapes Prometheus. When a service gets slow, we don’t try to reproduce — we go to Pyroscope and look at the flame graph from the incident window.
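In Alloy terms that scrape is one more pipeline next to the ones above. A sketch, assuming Pyroscope's default in-cluster service (component names are ours):

pyroscope.scrape "pprof" {
  targets    = discovery.kubernetes.pods.targets
  forward_to = [pyroscope.write.pyroscope.receiver]
}

pyroscope.write "pyroscope" {
  endpoint { url = "http://pyroscope.monitoring.svc:4040" }  // assumed service address
}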
Sentry — error tracking with a ClickHouse backend. Every frontend, backend, and worker sends stack traces, breadcrumbs, and release markers. Integrated with Kargo so a new deploy appears as a release in Sentry automatically.
k8sgpt — AI-powered cluster diagnostics. Runs as a CronJob every ten minutes, scans the cluster for unhealthy objects, and posts plain-English explanations to Slack. It has caught misconfigured NetworkPolicies, failing readiness probes, and a PVC that had been pending for three hours before anyone looked at the namespace.
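A sketch of that CronJob, assuming the upstream k8sgpt CLI image; the Slack delivery and API-key wiring are elided, and names are placeholders:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: k8sgpt-scan
  namespace: monitoring
spec:
  schedule: "*/10 * * * *"       # every ten minutes
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          serviceAccountName: k8sgpt             # needs read access to cluster objects
          containers:
            - name: k8sgpt
              image: ghcr.io/k8sgpt-ai/k8sgpt:latest   # illustrative tag
              args: ["analyze", "--explain"]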
Storage: Everything on Ceph
Every byte of observability data lands on Rook-Ceph object storage running on the same three nodes:
- Mimir blocks — compacted every two hours, uploaded to bucket mimir-blocks.
- Loki chunks — flushed every five minutes, uploaded to bucket loki-chunks.
- Tempo blocks — uploaded every 30 seconds for recent traces, compacted hourly.
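All three point at the in-cluster RGW endpoint. Loki's storage block, for instance, looks roughly like this (the service name assumes a default Rook object store; credentials come from the bucket Secret):

storage_config:
  aws:
    endpoint: rook-ceph-rgw-objectstore.rook-ceph.svc  # assumed Rook RGW service
    bucketnames: loki-chunks
    access_key_id: ${S3_ACCESS_KEY}        # expanded with -config.expand-env=true
    secret_access_key: ${S3_SECRET_KEY}
    s3forcepathstyle: true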
Current storage sits at ~250 GB combined across all three backends, with the retention configuration above. Ceph’s replication factor of 3 means we pay for that capacity three times — which is fine at these volumes, and it is the price of not having a SAN.
What We’d Do Differently
Turn on tenancy from day one. We ran single-tenant for the first three months “to keep things simple” and then had to migrate. Migration is annoying — new bucket paths, new data-source config in Grafana, backfill gymnastics. Start multi-tenant; you can collapse tenants later if you really need to.
Budget for Ceph. Replicated object storage is not free. Sizing the cluster for “metrics, logs, traces × 3 replicas × 90 days” is a real number; do it before you tell anyone the retention policy.
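With our figures that number is concrete: roughly 250 GB of logical data × 3 replicas ≈ 750 GB of raw Ceph capacity, before any headroom for compaction and growth.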
Alerts on the collector, not just the backends. A dead Alloy pod is a silent observability outage. We alert on Alloy’s own heartbeat series — if Alloy on node X stops reporting, we hear about it before the dashboards go dark.
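A sketch of that heartbeat alert as a Prometheus-style rule; the job label is an assumption, the count of three matches our three NUCs, and the rule assumes Alloy self-scrapes an up series:

groups:
  - name: alloy-heartbeat
    rules:
      - alert: AlloyAgentDown
        # Fires if fewer than three Alloy pods report, or if the series vanishes
        # entirely (a dead agent stops pushing, so its series simply goes stale).
        expr: (count(up{job="alloy"} == 1) < 3) or absent(up{job="alloy"})
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: An Alloy agent has stopped reporting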
Bare Metal K8s series: Part 1: Why · Part 2: Bootstrap · Part 3: GitOps · Part 4: Observability · Part 5: AI Platform
Cloud Native Solutions builds and operates Kubernetes platforms end-to-end. Talk to us if you want this for your team.