High Availability Deployment¶
This guide covers high-availability deployment for the operator, managed Keycloak, and PostgreSQL.
Use Helm values as the primary configuration path. Raw manifests remain useful for supporting resources such as PodDisruptionBudgets, but the managed Keycloak configuration itself should start from the chart values.
What HA Means Here¶
```mermaid
flowchart TD
    ingress[Ingress]
    opA[Operator Pod A\nactive leader]
    opB[Operator Pod B\nstandby follower]
    kc1[Keycloak Pod 1]
    kc2[Keycloak Pod 2]
    kc3[Keycloak Pod 3]
    pgp[PostgreSQL Primary]
    pgr1[PostgreSQL Replica 1]
    pgr2[PostgreSQL Replica 2]
    opA -. reconciles .-> kc1
    opA -. reconciles .-> kc2
    opA -. reconciles .-> kc3
    opB -. ready to take over .-> opA
    ingress --> kc1
    ingress --> kc2
    ingress --> kc3
    kc1 --> pgp
    kc2 --> pgp
    kc3 --> pgp
    pgp --> pgr1
    pgp --> pgr2
```
High availability has three separate layers:
- operator availability through multiple operator replicas with leader election
- Keycloak availability through multiple managed Keycloak replicas
- database availability through a replicated PostgreSQL topology, ideally CNPG
Do not treat those as interchangeable.
- extra operator replicas improve control-plane availability, not Keycloak request capacity
- extra Keycloak replicas improve application availability and throughput
- database HA is mandatory if you want real failover instead of just more Keycloak pods
Managed Keycloak HA Basics¶
For managed Keycloak instances, the primary HA control is `keycloak.replicas`.

```yaml
keycloak:
  replicas: 3
  ingress:
    enabled: true
    className: nginx
    host: keycloak.example.com
    annotations:
      nginx.ingress.kubernetes.io/affinity: cookie
      nginx.ingress.kubernetes.io/session-cookie-name: keycloak-affinity
      nginx.ingress.kubernetes.io/session-cookie-hash: sha1
  resources:
    requests:
      cpu: 500m
      memory: 1Gi
    limits:
      cpu: 2000m
      memory: 2Gi
```
When `replicas` is greater than 1, the operator automatically configures JGroups clustering for the managed Keycloak deployment:
- a headless discovery Service is created
- `KC_CACHE_STACK=kubernetes` is configured
- JGroups DNS_PING discovery is wired automatically
- port `7800` is exposed for cluster communication
All currently supported Keycloak versions already satisfy the clustering stack requirement for this path.
Session affinity is still recommended at the ingress or load balancer layer even though distributed caching is enabled. It reduces avoidable churn during login flows and failover events.
Operator HA¶
Operator HA is configured separately from the managed Keycloak instance.
Use at least two operator replicas in production so reconciliation continues during a pod restart or node failure.
Operator HA is active-passive:
- one replica holds leadership and performs reconciliation work
- other replicas stay ready and take over if the leader disappears
- adding replicas improves availability, not reconciliation throughput
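A minimal values sketch for the active-passive setup, assuming the chart exposes the operator replica count under an `operator.replicas` key (the key name is an assumption; confirm the exact path in your chart's values reference):

```yaml
# Hypothetical values key: verify against your chart's values.yaml.
operator:
  replicas: 2   # one active leader, one standby follower
```

Because leader election serializes reconciliation, the second replica does no work until it takes over.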
Database HA With CNPG¶
CNPG is the strongest HA path because it gives you failover, backups, and recovery primitives that the operator can integrate with.
Typical production-oriented values:
```yaml
keycloak:
  database:
    type: postgresql
    cnpg:
      enabled: true
      clusterName: keycloak-postgres
      instances: 3
      storage:
        size: 100Gi
        storageClass: fast-ssd
      postgresql:
        maxConnections: "200"
        sharedBuffers: "512MB"
```
Key CNPG concepts:
- `instances: 3` gives one primary and two replicas
- `minSyncReplicas` and `maxSyncReplicas` are CNPG-level durability controls when you manage the database cluster directly
- connection limits must be sized for the Keycloak replica count, admin activity, and background jobs
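When you manage the database cluster directly, the durability controls named above live on the upstream CNPG `Cluster` resource. A minimal sketch, reusing the cluster name and storage settings from the values example:

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: keycloak-postgres
  namespace: keycloak-system
spec:
  instances: 3          # one primary, two replicas
  minSyncReplicas: 1    # refuse commits unless at least one replica confirms
  maxSyncReplicas: 1    # cap synchronous replication fan-out
  storage:
    size: 100Gi
    storageClass: fast-ssd
```

Synchronous replication trades commit latency for durability; leave both fields unset if asynchronous replication meets your recovery objectives.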
For full CNPG examples, see Database Setup.
Resource Sizing Guidance¶
Replica count, JVM memory, and database capacity move together.
Practical rules:
- raising `keycloak.replicas` increases total database connection demand
- larger realms and login bursts usually need both more CPU and more heap
- ingress or load-balancer stickiness reduces cross-pod cache chatter during hot paths
- do not raise Keycloak replicas without checking PostgreSQL connection capacity
A reasonable starting point for production is:
- Keycloak: `3` replicas, `500m` CPU request, `1Gi` memory request
- CNPG: `3` instances, storage sized for backups and WAL growth, connection limits reviewed for the expected concurrency
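The connection-capacity rule can be sanity-checked with quick arithmetic. The numbers below are illustrative; the pool size per pod in particular is an assumption you must verify against your actual Keycloak datasource configuration:

```shell
# Illustrative connection budget; POOL_PER_POD is an assumed value,
# not a default you should rely on.
REPLICAS=3           # keycloak.replicas
POOL_PER_POD=20      # assumed datasource pool size per Keycloak pod
ADMIN_OVERHEAD=10    # CNPG operator, backups, ad-hoc admin sessions
MAX_CONNECTIONS=200  # postgresql.maxConnections from the values above

DEMAND=$(( REPLICAS * POOL_PER_POD + ADMIN_OVERHEAD ))
echo "peak demand: $DEMAND of $MAX_CONNECTIONS"   # peak demand: 70 of 200
```

If peak demand approaches `maxConnections`, raise the limit or add a pooler before adding Keycloak replicas.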
Tune from there with real metrics, not cargo-cult numbers.
Upgrade Strategy For HA¶
If you want the lowest-disruption upgrade path, configure blue-green upgrades.
```yaml
keycloak:
  replicas: 3
  cacheIsolation:
    autoRevision: true
  upgradePolicy:
    strategy: BlueGreen
    backupTimeout: 600
    autoTeardown: true
```
This gives you:
- pre-upgrade backups for supported database tiers
- isolated JGroups cluster identity during the cutover
- traffic switch only after the green deployment is ready
See Migration & Upgrade Guide.
Scheduling And Disruption Controls¶
The managed Keycloak CR does not expose arbitrary pod affinity or anti-affinity fields today. Do not document unsupported affinity examples as if they were part of the CRD.
For disruption control:
- use cluster scheduling policy and node topology consciously
- manage PodDisruptionBudgets as separate GitOps-managed manifests if you need them
- validate that your ingress or load balancer is distributing traffic the way you expect
Example PDB managed alongside the Helm release:
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: keycloak-pdb
  namespace: keycloak-system
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: keycloak
```
Failover Validation¶
Keycloak Pod Failure¶
```shell
kubectl get pods -n keycloak-system
kubectl delete pod -n keycloak-system <one-keycloak-pod>
kubectl get pods -n keycloak-system -w
```
What to verify:
- replacement pod becomes ready
- ingress still serves login requests during the disruption
- an active session or repeated token request continues to work
CNPG Primary Failover¶
```shell
kubectl get cluster keycloak-postgres -n keycloak-system \
  -o jsonpath='{.status.currentPrimary}{"\n"}'
kubectl delete pod -n keycloak-system <current-primary-pod>
kubectl get cluster keycloak-postgres -n keycloak-system -w
```
What to verify:
- a new primary is elected
- Keycloak reconnects automatically
- authentication and token issuance recover without manual reconfiguration
Traffic Continuity Check¶
A useful HA test is not just pod replacement. Verify a real flow:
- authenticate through the ingress
- keep issuing token refreshes or authenticated requests
- delete a Keycloak pod or trigger CNPG failover
- confirm the client experience remains acceptable
If you only watch pod status, you are testing Kubernetes cosmetics, not service continuity.
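The flow above can be scripted as a simple probe loop. This is a hedged sketch: the URL is a placeholder, and a stricter test would exercise token refreshes with a real client rather than the unauthenticated discovery document:

```shell
# Run this in one terminal while deleting a Keycloak pod or triggering
# CNPG failover in another. Counts failed requests over a window.
probe_loop() {
  url=$1
  attempts=$2
  failures=0
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP errors, so a 5xx counts as a failure
    curl -sf -o /dev/null "$url" || failures=$(( failures + 1 ))
    i=$(( i + 1 ))
    sleep 1
  done
  echo "$failures failed out of $attempts"
}

# Example invocation (placeholder host):
# probe_loop "https://keycloak.example.com/realms/master/.well-known/openid-configuration" 120
```

A short burst of failures during failover may be acceptable; sustained failures mean your affinity, caching, or database failover needs attention.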
Monitoring And Alerting¶
Start with these checks:
```shell
kubectl get keycloak -n keycloak-system
kubectl get pods -n keycloak-system
kubectl get cluster -n keycloak-system
```
Useful Prometheus queries:
```promql
sum(up{job="keycloak"}) / count(up{job="keycloak"})
max(cnpg_pg_replication_lag_seconds) by (pod)
rate(kube_pod_container_status_restarts_total{namespace="keycloak-system"}[1h])
```
Reasonable starting alert thresholds:
- availability below `1` for the managed Keycloak target set
- CNPG replication lag consistently above `1s` for latency-sensitive setups
- restart rates that are non-zero for a sustained window instead of a single rollout blip
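The first two thresholds can be expressed as alert rules. A sketch assuming the prometheus-operator `PrometheusRule` CRD is installed; metric names are the ones from the queries above, and the `for` windows are starting points, not recommendations:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: keycloak-ha
  namespace: keycloak-system
spec:
  groups:
    - name: keycloak-availability
      rules:
        - alert: KeycloakAvailabilityDegraded
          expr: sum(up{job="keycloak"}) / count(up{job="keycloak"}) < 1
          for: 5m
        - alert: CNPGReplicationLagHigh
          expr: max(cnpg_pg_replication_lag_seconds) by (pod) > 1
          for: 10m
```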
Tune the thresholds to your own traffic and SLOs.
S3 Backup Prerequisites¶
If you use CNPG object-store backups, verify the prerequisites before treating them as HA protection:
- object storage bucket exists and is reachable
- credentials secret exists in the CNPG namespace
- lifecycle or retention rules match your recovery objectives
- restore procedures have been exercised, not just configured
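The checklist above maps onto the CNPG `Cluster` backup stanza. A minimal sketch, with the bucket, endpoint, and secret names as placeholders you must replace:

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: keycloak-postgres
  namespace: keycloak-system
spec:
  instances: 3
  backup:
    retentionPolicy: "30d"    # align with your recovery objectives
    barmanObjectStore:
      destinationPath: s3://keycloak-backups/   # placeholder bucket
      endpointURL: https://s3.example.com       # placeholder endpoint
      s3Credentials:
        accessKeyId:
          name: cnpg-backup-creds   # secret must exist in this namespace
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: cnpg-backup-creds
          key: ACCESS_SECRET_KEY
```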
Backups you have never restored are optimism with YAML attached.