High Availability Deployment¶
Configure Keycloak and PostgreSQL for high availability with automatic failover.
Architecture Overview¶
```mermaid
%%{init: {'theme':'base', 'themeVariables': { 'primaryColor':'#00b8d9','primaryTextColor':'#fff','primaryBorderColor':'#0097a7','lineColor':'#00acc1','secondaryColor':'#006064','tertiaryColor':'#fff'}}}%%
graph TD
    ingress["⚡ Ingress"]
    kc1["🔐 Keycloak<br/>Pod 1"]
    kc2["🔐 Keycloak<br/>Pod 2"]
    kc3["🔐 Keycloak<br/>Pod 3"]
    pg_primary["🗄️ PostgreSQL<br/>Primary"]
    pg_replica1["🗄️ PostgreSQL<br/>Replica 1"]
    pg_replica2["🗄️ PostgreSQL<br/>Replica 2"]

    ingress --> kc1
    ingress --> kc2
    ingress --> kc3
    kc1 --> pg_primary
    kc2 --> pg_primary
    kc3 --> pg_primary
    pg_primary --> pg_replica1
    pg_primary --> pg_replica2

    style ingress fill:#00838f,stroke:#006064,color:#fff
    style kc1 fill:#00838f,stroke:#006064,color:#fff
    style kc2 fill:#00838f,stroke:#006064,color:#fff
    style kc3 fill:#00838f,stroke:#006064,color:#fff
    style pg_primary fill:#1565c0,stroke:#0d47a1,color:#fff
    style pg_replica1 fill:#1565c0,stroke:#0d47a1,color:#fff
    style pg_replica2 fill:#1565c0,stroke:#0d47a1,color:#fff
```
Quick Start: HA Configuration¶
Keycloak HA¶
```yaml
apiVersion: vriesdemichael.github.io/v1
kind: Keycloak
metadata:
  name: keycloak
  namespace: keycloak-system
spec:
  replicas: 3  # HA: minimum 3 replicas
  resources:
    requests:
      cpu: 500m
      memory: 1Gi
    limits:
      cpu: 2000m
      memory: 2Gi
  # Pod anti-affinity: spread across nodes
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app: keycloak
          topologyKey: kubernetes.io/hostname
  database:
    type: cnpg
    cluster: keycloak-db
    namespace: keycloak-db
  ingress:
    enabled: true
    className: nginx
    hostname: keycloak.example.com
```
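After applying the manifest, a quick way to confirm the rollout (the file name is illustrative; the `app: keycloak` label is the one used throughout this page):

```bash
# Apply the HA manifest
kubectl apply -f keycloak-ha.yaml

# All three replicas should reach Running, each on a different node
kubectl get pods -n keycloak-system -l app=keycloak -o wide -w
```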
PostgreSQL HA¶
```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: keycloak-db
  namespace: keycloak-db
spec:
  instances: 3  # 1 primary + 2 replicas
  primaryUpdateStrategy: unsupervised  # Automatic failover
  minSyncReplicas: 1
  maxSyncReplicas: 2
  # Spread across zones
  affinity:
    podAntiAffinityType: required
    topologyKey: topology.kubernetes.io/zone
  storage:
    size: 100Gi
    storageClass: fast-ssd
  backup:
    barmanObjectStore:
      destinationPath: s3://backups/keycloak-db
      s3Credentials:
        accessKeyId:
          name: backup-s3-credentials
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: backup-s3-credentials
          key: ACCESS_SECRET_KEY
      wal:
        compression: gzip
      data:
        compression: gzip
    retentionPolicy: "30d"
```
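Once the cluster is up, confirm the topology and replication state (the second command assumes the `kubectl cnpg` plugin is installed):

```bash
# Cluster-level status: instances, current primary, readiness
kubectl get cluster keycloak-db -n keycloak-db

# Detailed view, including streaming replication and sync state
kubectl cnpg status keycloak-db -n keycloak-db
```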
Component Configuration¶
1. Operator HA¶
```bash
helm install keycloak-operator ./charts/keycloak-operator \
  --namespace keycloak-operator-system \
  --create-namespace \
  --set replicas=2 \
  --set resources.requests.cpu=200m \
  --set resources.requests.memory=512Mi
```
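With two replicas, only one operator instance reconciles at a time and the standby takes over on failure. A quick check, assuming the operator uses standard Kubernetes leader-election Leases:

```bash
# Both replicas should be Running; the Lease holder is the active reconciler
kubectl get pods -n keycloak-operator-system
kubectl get lease -n keycloak-operator-system
```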
2. Load Balancer Configuration¶
Ingress (Recommended):
```yaml
ingress:
  enabled: true
  className: nginx
  annotations:
    nginx.ingress.kubernetes.io/affinity: cookie
    nginx.ingress.kubernetes.io/session-cookie-name: keycloak-affinity
    nginx.ingress.kubernetes.io/session-cookie-hash: sha1
  hostname: keycloak.example.com
```
Service LoadBalancer (Cloud):
```yaml
service:
  type: LoadBalancer
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: nlb
    service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: "true"
```
3. Pod Disruption Budget¶
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: keycloak-pdb
  namespace: keycloak-system
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: keycloak
```
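With `replicas: 3` and `minAvailable: 2`, voluntary disruptions (node drains, upgrades) can evict at most one Keycloak pod at a time:

```bash
# ALLOWED DISRUPTIONS should show 1 (3 running - 2 minAvailable)
kubectl get pdb keycloak-pdb -n keycloak-system
```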
4. Resource Limits¶
Keycloak: reuse the `resources` block from the Quick Start manifest above (requests: 500m CPU / 1Gi memory; limits: 2000m CPU / 2Gi memory).

PostgreSQL:

```yaml
resources:
  requests:
    cpu: 1000m
    memory: 2Gi
  limits:
    cpu: 2000m
    memory: 4Gi
postgresql:
  parameters:
    max_connections: "200"
    shared_buffers: "512MB"
    effective_cache_size: "2GB"
```
Failover Testing¶
Simulate Keycloak Pod Failure¶
```bash
# Kill a Keycloak pod
kubectl delete pod keycloak-0 -n keycloak-system

# Watch the replacement roll out
kubectl get pods -n keycloak-system -w

# Verify traffic continues (2/3 pods handling requests)
# A new pod starts automatically
```
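To confirm availability during the test, run a simple probe loop in a second terminal (a sketch; the hostname matches the ingress configured above):

```bash
# Poll the master realm once a second; every response should stay 200
while true; do
  curl -s -o /dev/null -w '%{http_code}\n' \
    https://keycloak.example.com/realms/master
  sleep 1
done
```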
Simulate Database Primary Failure¶
```bash
# Identify the current primary
kubectl get cluster keycloak-db -n keycloak-db \
  -o jsonpath='{.status.currentPrimary}'

# Delete the primary pod (substitute the name returned above)
kubectl delete pod <current-primary> -n keycloak-db

# Watch failover (30-60 seconds)
kubectl get cluster keycloak-db -n keycloak-db -w

# Verify a new primary was elected
kubectl get cluster keycloak-db -n keycloak-db \
  -o jsonpath='{.status.currentPrimary}'
```
Simulate Node Failure¶
```bash
# Cordon the node, then drain it (cordon alone only blocks new scheduling;
# drain evicts the pods, respecting the PodDisruptionBudget)
kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# Watch pods reschedule
kubectl get pods -n keycloak-system -o wide -w

# Uncordon when done
kubectl uncordon <node-name>
```
Monitoring HA Health¶
Check Keycloak Replicas¶
```bash
kubectl get pods -n keycloak-system -l app=keycloak
# All pods should be Running

# Check endpoints
kubectl get endpoints keycloak-keycloak -n keycloak-system
# Should show all pod IPs
```
Check Database Replication¶
```bash
# Check cluster status
kubectl get cluster keycloak-db -n keycloak-db

# Check replication lag (run on the current primary;
# pg_stat_replication is empty on replicas)
kubectl exec -it -n keycloak-db keycloak-db-1 -- \
  psql -U postgres -c "SELECT * FROM pg_stat_replication;"
```
Prometheus Queries¶
```promql
# Keycloak availability (should be 1.0 = 100%)
sum(up{job="keycloak"}) / count(up{job="keycloak"})

# Database replication lag (should be < 1s)
max(cnpg_pg_replication_lag_seconds) by (pod)

# Pod restart rate (should be near 0)
rate(kube_pod_container_status_restarts_total{namespace="keycloak-system"}[1h])
```
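These queries can back alert rules. A minimal sketch, assuming the Prometheus Operator's `PrometheusRule` CRD is installed (name and threshold are illustrative):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: keycloak-ha-alerts
  namespace: keycloak-system
spec:
  groups:
    - name: keycloak-ha
      rules:
        - alert: KeycloakAvailabilityDegraded
          # Fires when any Keycloak target has been down for 5 minutes
          expr: sum(up{job="keycloak"}) / count(up{job="keycloak"}) < 1
          for: 5m
          labels:
            severity: warning
```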
Zone Distribution¶
Multi-Zone Keycloak¶
```yaml
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: keycloak
        topologyKey: topology.kubernetes.io/zone  # Spread across zones
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
            - key: topology.kubernetes.io/zone
              operator: In
              values: ["us-east-1a", "us-east-1b", "us-east-1c"]
```
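To verify the spread, list each pod's node alongside the nodes' zone labels:

```bash
# Each Keycloak pod should land on a node in a different zone
kubectl get pods -n keycloak-system -l app=keycloak \
  -o custom-columns='POD:.metadata.name,NODE:.spec.nodeName'
kubectl get nodes -L topology.kubernetes.io/zone
```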
Multi-Zone Database¶
```yaml
affinity:
  podAntiAffinityType: required
  topologyKey: topology.kubernetes.io/zone
  nodeSelector:
    node.kubernetes.io/instance-type: m5.2xlarge
```
Performance Tuning¶
Connection Pooling¶
Keycloak Database Pool:
```yaml
env:
  - name: KC_DB_POOL_INITIAL_SIZE
    value: "10"
  - name: KC_DB_POOL_MIN_SIZE
    value: "10"
  - name: KC_DB_POOL_MAX_SIZE
    value: "50"
```
PostgreSQL Connections: size `max_connections` above the worst-case pool demand. With 3 Keycloak replicas at a 50-connection max pool each, peak demand is 150 connections, so the `max_connections: "200"` set earlier leaves headroom; see the sketch below.
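A minimal sketch of the matching CNPG setting (same value as the Resource Limits section above):

```yaml
postgresql:
  parameters:
    # 3 Keycloak pods x 50 max pool = 150 connections, plus headroom
    # for replication, monitoring, and ad-hoc sessions
    max_connections: "200"
```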
Caching¶
```yaml
env:
  - name: KC_CACHE
    value: "ispn"  # Infinispan (default, supports clustering)
  - name: KC_CACHE_STACK
    value: "kubernetes"  # Auto-discovery in K8s
```
Disaster Recovery¶
Backup Strategy¶
```yaml
apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
  name: keycloak-db-backup
  namespace: keycloak-db
spec:
  schedule: "0 0 2 * * *"  # Daily at 2 AM (CNPG cron has a leading seconds field)
  backupOwnerReference: self
  cluster:
    name: keycloak-db
```
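For the "restore from backup" RTO path below, a new cluster can bootstrap straight from the object store. A sketch reusing the credentials secret from the Quick Start (the restored cluster name is illustrative):

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: keycloak-db-restored
  namespace: keycloak-db
spec:
  instances: 3
  storage:
    size: 100Gi
  bootstrap:
    recovery:
      source: keycloak-db
  externalClusters:
    - name: keycloak-db
      barmanObjectStore:
        destinationPath: s3://backups/keycloak-db
        s3Credentials:
          accessKeyId:
            name: backup-s3-credentials
            key: ACCESS_KEY_ID
          secretAccessKey:
            name: backup-s3-credentials
            key: ACCESS_SECRET_KEY
```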
Multi-Region Backup¶
```yaml
backup:
  barmanObjectStore:
    destinationPath: s3://backups-us-east/keycloak-db
    # Replicate this bucket to the secondary region
    # via AWS S3 cross-region replication
```
Recovery Time Objective (RTO)¶
| Component | Failure | RTO | Procedure |
|---|---|---|---|
| Keycloak Pod | Single pod failure | < 1 min | Automatic (K8s recreates) |
| Keycloak Node | Node failure | 2-5 min | Automatic (pods reschedule) |
| Database Primary | Primary failure | 30-60s | Automatic (CNPG failover) |
| Database Corruption | Data corruption | 10-30 min | Manual restore from backup |
| Full Cluster Loss | Region outage | 1-4 hours | Manual restore in new region |
Troubleshooting HA¶
Split Brain Detection¶
```bash
# Check if multiple primaries exist (should never happen)
kubectl get pods -n keycloak-db -l role=primary
# Should show exactly 1 pod

# Check cluster status
kubectl describe cluster keycloak-db -n keycloak-db
```
Replication Lag Too High¶
```bash
# Check lag (run on the current primary; pg_stat_replication is empty on replicas)
kubectl exec -it -n keycloak-db keycloak-db-1 -- \
  psql -U postgres -c "
    SELECT
      client_addr,
      state,
      sent_lsn,
      write_lsn,
      flush_lsn,
      replay_lsn,
      sync_state,
      EXTRACT(EPOCH FROM replay_lag) AS lag_seconds
    FROM pg_stat_replication;
  "
```

If lag stays high:

- Check network latency between pods
- Verify the replica has sufficient CPU, memory, and disk I/O
- Increase `wal_sender_timeout` if replication connections are being dropped
Session Affinity Issues¶
```bash
# Verify ingress session affinity
kubectl get ingress keycloak-ingress -n keycloak-system -o yaml | grep affinity

# Test with curl
curl -v -c cookies.txt https://keycloak.example.com/realms/master
curl -v -b cookies.txt https://keycloak.example.com/realms/master
# Both requests should hit the same pod (check response headers/logs)
```