ADR-074: Keycloak Scalability and High Availability Strategy

Category: architecture
Provenance: human

Decision

Horizontal scaling of the managed Keycloak instance MUST be handled by the Keycloak application itself through proper clustering (JGroups/Infinispan), NOT by the Operator simply creating more unclustered replicas. For the current version of the Operator:

1. Operator-managed Keycloak instances are optimized for vertical scaling or simple Active-Standby (if supported).
2. For high-throughput horizontal scaling, users should configure an external Keycloak (managed outside the Operator or by a dedicated Helm chart) that is properly clustered.
3. "Naive scaling" (setting spec.replicas > 1 without clustering configuration) is explicitly unsupported in production because it leads to a split-brain state in which sessions are not shared between replicas.

NOTE: This decision has been superseded by ADR-075, which implements automatic JGroups DNS_PING configuration.

Rationale

Keycloak is a stateful application that relies on distributed caching (Infinispan) to share sessions and user state across replicas. Launching multiple Pods (replicas) without a discovery mechanism results in isolated instances and breaks authentication flows: for example, a login handled by Pod A followed by a code exchange routed to Pod B fails. The Operator handles Kubernetes resources, but configuring robust JGroups discovery (DNS_PING, KUBE_PING) and tuning the caches is a complex application-level concern that, for high-scale needs, is currently best handled by dedicated Helm charts or external management.
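The failure mode described above can be illustrated with a small simulation. This is a minimal sketch, not Keycloak code: the `Pod` class, `run_flow` helper, and the plain dict standing in for a replicated Infinispan cache are all hypothetical names introduced for illustration.

```python
import secrets


class Pod:
    """A Keycloak-like replica that keeps authorization codes in a cache."""

    def __init__(self, name, cache=None):
        self.name = name
        # Without clustering each pod owns a private cache; with clustering
        # every pod is handed the same (replicated) store.
        self.cache = cache if cache is not None else {}

    def login(self, user):
        """Issue a one-time authorization code for `user`."""
        code = secrets.token_urlsafe(8)
        self.cache[code] = user
        return code

    def exchange(self, code):
        """Redeem the code; return the user, or None if the code is unknown."""
        return self.cache.pop(code, None)


def run_flow(pod_a, pod_b, user="alice"):
    """Log in on pod_a, then route the code exchange to pod_b."""
    code = pod_a.login(user)
    return pod_b.exchange(code)


# Unclustered: two replicas, two private caches.
isolated = run_flow(Pod("a"), Pod("b"))

# Clustered: one shared dict stands in for the distributed cache.
shared = {}
clustered = run_flow(Pod("a", shared), Pod("b", shared))

print(isolated)   # None
print(clustered)  # alice
```

In the unclustered case Pod B has never seen the code issued by Pod A, so the exchange returns nothing; once both pods read from the same store the flow succeeds regardless of which replica receives each request.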

Agent Instructions

This decision has been SUPERSEDED by ADR-075. The operator now automatically configures JGroups DNS_PING discovery for all Keycloak instances. Users can simply set spec.replicas to scale horizontally. No manual clustering configuration is needed. See ADR-075 for current guidance.
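ADR-075 is not reproduced here, so the following is only a rough sketch of what DNS_PING discovery settings can look like for a Keycloak container. It assumes a Quarkus-based Keycloak distribution in which the `kubernetes` cache stack uses DNS_PING and reads the `jgroups.dns.query` system property; the function name, the `keycloak-headless` Service name, and the `auth` namespace are hypothetical, and the Operator's actual implementation may differ.

```python
def dns_ping_env(headless_service, namespace, cluster_domain="cluster.local"):
    """Build container env vars that point JGroups DNS_PING at a headless Service.

    Assumption: the Service named `headless_service` is headless, so its DNS
    name resolves to the IPs of all Keycloak pods rather than a single VIP.
    """
    fqdn = f"{headless_service}.{namespace}.svc.{cluster_domain}"
    return [
        # Use the embedded Infinispan caches.
        {"name": "KC_CACHE", "value": "ispn"},
        # Assumed: the "kubernetes" stack selects DNS_PING for discovery.
        {"name": "KC_CACHE_STACK", "value": "kubernetes"},
        # DNS name that JGroups queries to find the other cluster members.
        {"name": "JAVA_OPTS_APPEND", "value": f"-Djgroups.dns.query={fqdn}"},
    ]


env = dns_ping_env("keycloak-headless", "auth")
for entry in env:
    print(f'{entry["name"]}={entry["value"]}')
```

The point of the sketch is that discovery depends on a DNS name resolving to every replica, which is why a change to spec.replicas alone can be enough once this wiring is generated automatically.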

Rejected Alternatives

Auto-scaling Replicas without Clustering

Leads to a broken user experience (session loss) because state is not synchronized between replicas.

Operator automatically configuring JGroups

Adds significant complexity to the Operator logic. At the time of this decision it was considered better to delegate this to the underlying Helm chart or external configuration; ADR-075 later adopted this approach.