The hardest part of managing 84 applications isn’t deploying them. It’s keeping them from turning into a dumpster fire of configuration drift, secret sprawl, and “I swear I only changed one thing.”
This is how we structured a single GitOps repository that manages everything on the cluster — from Istio mesh configs to AI agent deployments — without losing our minds.
The Repo Structure
Forget flat directories. At 84 apps you need hierarchy, or you drown in an `ls` of `apps/`:
```
gitops/
├── bootstrap/                # Tinkerbell + CAPI + Ansible (Day 0)
├── catalog/
│   └── helm-charts/          # 9 custom charts we maintain
├── environments/
│   └── mgmt/                 # Per-cluster values
├── mgmt-cluster/
│   └── apps/
│       ├── agentic/          # 21 apps — 18 AI agents + LiteLLM + Langfuse + Open WebUI
│       ├── devops/           # 12 apps — Argo CD, Argo Workflows, Harbor, Gitea, Kargo
│       ├── observability/    # 12 apps — Grafana, Mimir, Loki, Tempo, Sentry, Pyroscope
│       ├── security/         # 7 apps — Vault, Falco, Kyverno, External Secrets
│       ├── identity/         # 4 apps — Keycloak, CloudNativePG, Bank-Vaults
│       ├── infrastructure/   # 8 apps — Traefik, cert-manager, MetalLB, Postfix
│       ├── service-mesh/     # 4 apps — Istio, Kiali
│       ├── networking/       # 3 apps — Tailscale
│       ├── storage/          # 2 apps — Rook-Ceph operator + cluster
│       ├── collaboration/    # 3 apps — Nextcloud, Synapse, n8n
│       ├── resilience/       # 3 apps — VPA, descheduler, kured
│       └── backstage/        # 1 app — Developer portal
└── scripts/                  # Operational tooling
```
That directory tree covers 80 applications. The remaining four are cluster-wide DaemonSets deployed through Argo CD ApplicationSets — Alloy, kube-state-metrics, metrics-server, and the Prometheus CRDs — which don’t have a natural home in any single category. Total: 84.
Every category has a kustomization.yaml that lists its Argo CD Application manifests. A root app-of-apps points to all categories. Argo CD recurses from the top and discovers everything automatically.
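A minimal sketch of how that wiring can look. The file names come from the tree above, but the exact contents of the root kustomization and the repo URL are assumptions, not copied from the repo:

```yaml
# mgmt-cluster/apps/kustomization.yaml -- the root lists each category directory
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - agentic
  - devops
  - observability
  - security
  # ...one entry per category

---
# mgmt-cluster/apps/security/kustomization.yaml -- a category lists its Application manifests
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - vault.yaml
  - falco.yaml
  - kyverno.yaml
  - external-secrets.yaml

---
# The root app-of-apps that points Argo CD at the tree (repo URL is a placeholder)
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/<our-org>/gitops
    targetRevision: main
    path: mgmt-cluster/apps
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```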
A Real Argo CD Application
Here is what an app actually looks like in the repo — not a tutorial example, but how we deploy LiteLLM:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: litellm
  namespace: argocd
spec:
  project: default
  sources:
    - repoURL: https://github.com/BerriAI/litellm
      targetRevision: litellm-helm-v6.24.6
      path: deploy/charts/litellm-helm
      helm:
        valueFiles:
          - $values/environments/mgmt/agentic/litellm-values.yaml
    - repoURL: https://github.com/<our-org>/gitops   # the GitOps repo itself (placeholder URL)
      targetRevision: main
      ref: values                                    # exposes $values to the chart source above
  destination:
    server: https://kubernetes.default.svc
    namespace: litellm
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
      - ServerSideApply=true
```
Three decisions in there earned their place the hard way:
- `selfHeal: true` on every app. Non-negotiable. We learned this when someone ran `kubectl set resources` on Velero at 1 AM and Argo CD reverted it in three seconds. Annoying in the moment, lifesaving long-term.
- `ServerSideApply=true` for apps with large CRDs (Istio, cert-manager, Traefik). Without it, the `last-applied-configuration` annotation blows past the 256 KB limit and the sync fails with a cryptic error about annotation size.
- `CreateNamespace=true` — Argo CD owns namespace creation. No pre-creating namespaces by hand, no drift between cluster state and repo state.
Nine Custom Helm Charts
Not everything has a good upstream chart. We maintain our own for:
| Chart | Why |
|---|---|
| capt-cluster | Cluster API + Tinkerbell cluster definition |
| mission-control | Internal project tracker (Convex + Next.js) |
| devops-ai-web | DevOps AI platform frontend |
| o8s-cloner | Infrastructure cloning tool |
| o8s-agents | AI agent fleet deployment |
| synapse | Matrix homeserver with our SSO + federation config |
| ops-dashboard | Operations dashboard |
| postfix | SMTP relay (Postfix + iCloud SASL auth) |
| common | Shared templates and helpers used by the others |
Each chart lives in catalog/helm-charts/ and is referenced by Argo CD apps. We version them via git tags and use Argo CD’s multi-source feature to combine upstream charts with our custom values.
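For the charts we own, the wiring is simpler: the Application points at the chart directory inside the GitOps repo and pins a git tag. A sketch, with a hypothetical release tag and a placeholder repo URL:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: postfix
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/<our-org>/gitops   # placeholder: the GitOps repo itself
    targetRevision: postfix-0.3.0                  # hypothetical chart release tag, never a branch
    path: catalog/helm-charts/postfix
  destination:
    server: https://kubernetes.default.svc
    namespace: postfix
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
```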
The Velero Incident: Why Self-Heal Matters
2 AM. Velero’s node agent OOMKills during the nightly backup — second night in a row. The fix is obvious: bump memory from 512 Mi to 1 Gi.
The wrong way — what we tried first:
```sh
kubectl set resources deployment/velero -n velero --limits=memory=1Gi
```
Argo CD reverted it in three seconds flat. The deployment rolled out twice — once for our change, once for Argo CD’s revert. Net effect: nothing, plus two rollouts worth of log noise.
The right way:
```sh
vim environments/mgmt/cluster-management/velero-values.yaml
# nodeAgent.resources.limits.memory: 1Gi
git commit -am "fix(velero): bump nodeAgent memory 512Mi→1Gi"
git push
```
Argo CD synced within thirty seconds. Fixed permanently. Tracked in git history. No more OOMKills.
This is the entire point of GitOps. The pain of “I can’t just kubectl it” pays off every single time something breaks at 2 AM and you need to know exactly what changed.
Kargo: Promotion Pipelines
We use Kargo for version promotion across environments. When Harbor builds a new container image:
- Kargo detects the new tag via a `Warehouse` subscription.
- It opens a `Promotion` that rewrites the tag in the GitOps values file.
- Argo CD syncs the new version into the target `Stage`.
- If health checks fail, Kargo blocks further promotion and rolls back.
Kargo is still pre-1.0 and has sharp edges, but it already handles our most tedious workflow: “a new image got pushed, now update three values files and make sure nothing breaks.”
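For context, the detection half of that workflow is a Warehouse subscribed to the image repository. A minimal sketch; the registry path, namespace, and constraint are illustrative:

```yaml
apiVersion: kargo.akuity.io/v1alpha1
kind: Warehouse
metadata:
  name: litellm
  namespace: kargo-agentic                              # illustrative Kargo project namespace
spec:
  subscriptions:
    - image:
        repoURL: harbor.example.com/agentic/litellm     # illustrative Harbor image path
        semverConstraint: ">=0.0.0"                      # accept any semver-tagged build
```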
What We’d Do Differently
Start with app-of-apps from day one. We initially created Argo CD apps through the UI. By app #15 it was chaos. Migrating to app-of-apps retroactively meant exporting every Application as YAML, importing existing resources, and reconciling drift — a weekend that should have been avoided.
Use ApplicationSets for repeated patterns. We have four ApplicationSets for things deployed identically to every node (Alloy, kube-state-metrics, metrics-server, Prometheus CRDs). We should have more — anything that fans out over a list of clusters, namespaces, or teams is an ApplicationSet in waiting.
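A sketch of the shape those take: a list generator fanning one Application template out over a set of apps. The element names, path layout, and repo URL are illustrative, not the repo's real ones:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: node-daemons
  namespace: argocd
spec:
  generators:
    - list:
        elements:
          - app: metrics-server
          - app: kube-state-metrics
  template:
    metadata:
      name: '{{app}}'
    spec:
      project: default
      source:
        repoURL: https://github.com/<our-org>/gitops   # placeholder
        targetRevision: main
        path: 'mgmt-cluster/daemons/{{app}}'           # illustrative layout
      destination:
        server: https://kubernetes.default.svc
        namespace: '{{app}}'
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
        syncOptions:
          - CreateNamespace=true
```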
Pin Helm chart versions aggressively. We had an incident where an unpinned upstream chart auto-updated and broke Istio’s mesh config. Now every targetRevision is an explicit tag or SHA — no HEAD, no main, no *.
Keep values files small and environment-scoped. The first pass put everything in one values.yaml per app. By month three the files were 400 lines of half-conditional templating. Splitting into common-values.yaml + environments/<env>/<app>-values.yaml made diffs readable again.
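In a multi-source Application like the LiteLLM one above, that split just means stacking value files, shared defaults first so the environment file wins. The exact locations here are illustrative:

```yaml
# Later files override earlier ones, so environment-specific values take precedence.
helm:
  valueFiles:
    - $values/common-values.yaml                               # shared defaults (placement illustrative)
    - $values/environments/mgmt/agentic/litellm-values.yaml    # environment-specific overrides
```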
Bare Metal K8s series: Part 1: Why · Part 2: Bootstrap · Part 3: GitOps · Part 4: Observability · Part 5: AI Platform
Cloud Native Solutions builds and operates Kubernetes platforms end-to-end. Talk to us if you want this for your team.
