ADR-090: Revision-oriented JGroups cache isolation with major-version identity

Category: architecture
Provenance: guided-ai

Decision

Introduce autoRevision as the recommended cache isolation strategy for blue-green Keycloak upgrades. When autoRevision=true, the operator derives the JGroups cluster name from only the major version number in the image tag (e.g. "my-kc-v26") rather than from the full image tag. This keeps the cluster identity stable across patch upgrades.

The resolution priority is: explicit clusterName > autoRevision > autoSuffix > none.

The discovery service selector is now reconciled on every handler invocation. When the desired cluster label diverges from the current selector (e.g. after a major version upgrade), the service is patched in place rather than left stale.

Non-semver image tags (":latest", digests, custom strings) are explicitly unsupported by autoRevision: the operator emits a warning and disables isolation rather than silently using the non-deterministic tag string. Users with moving or non-semver tags must use an explicit clusterName.
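The resolution priority and the major-version derivation can be sketched as follows. This is a minimal illustration, not the operator's actual code: the IsolationSpec class, the function names, the simplified semver pattern, the "<name>-<tag>" format for autoSuffix, and the choice to use an explicit clusterName verbatim are all assumptions made for the example.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class IsolationSpec:
    """Hypothetical spec shape; fields mirror the options named in this ADR."""
    cluster_name: Optional[str] = None
    auto_revision: bool = False
    auto_suffix: bool = False


# Simplified semver pattern (assumption): MAJOR.MINOR.PATCH with an optional
# pre-release or build suffix. Tags such as "latest" or "26.0" do not match.
_SEMVER_TAG = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)(?:[-+][0-9A-Za-z.+-]+)?$")


def image_tag(image: str) -> Optional[str]:
    """Return the tag of an image reference, or None if it has a digest or no tag."""
    if "@" in image:
        return None  # digest references are treated as untagged in this sketch
    last_segment = image.rsplit("/", 1)[-1]  # skip any registry host:port part
    if ":" not in last_segment:
        return None
    return last_segment.rsplit(":", 1)[1]


def major_version(image: str) -> Optional[int]:
    """Return the semver major version of the image tag, or None if not semver."""
    tag = image_tag(image)
    if tag is None:
        return None
    match = _SEMVER_TAG.match(tag)
    return int(match.group(1)) if match else None


def resolve_cluster_name(
    name: str, image: str, spec: IsolationSpec
) -> Tuple[Optional[str], Optional[str]]:
    """Return (cluster_name, warning) following clusterName > autoRevision > autoSuffix > none."""
    if spec.cluster_name:
        return spec.cluster_name, None
    if spec.auto_revision:
        major = major_version(image)
        if major is None:
            # No fallback to the raw tag: warn and disable isolation instead.
            return None, "autoRevision requires a semver image tag; isolation disabled"
        return f"{name}-v{major}", None
    if spec.auto_suffix:
        tag = image_tag(image)
        return (f"{name}-{tag}" if tag else None), None
    return None, None
```

Under these assumptions, resolve_cluster_name("my-kc", "quay.io/keycloak/keycloak:26.0.1", IsolationSpec(auto_revision=True)) yields the name "my-kc-v26" with no warning, while the same call with a ":latest" image yields no name and a warning.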

Rationale

The original autoSuffix strategy appended the full image tag to the cluster name. This caused two problems:

  1. Patch upgrades (26.0.0 → 26.0.1) produce a new cluster name, triggering a JGroups cluster split and a brief discovery window in which pods cannot form a cluster. The Infinispan cache format does not change between patch releases, so this isolation is unnecessary overhead.

  2. The discovery service selector was written once at creation time and never updated. After a patch upgrade, new pods received the new label but the service still selected on the old label, making the new pods invisible to JGroups DNS discovery.

autoRevision addresses both: the derived cluster name changes only on major upgrades (where isolation is actually required), and the reconciler now patches the service selector to match the desired cluster name on every pass.

Non-semver tags are explicitly unsupported rather than silently falling back to the tag string. A silent fallback gives false confidence: the user believes isolation is active when it cannot be deterministic. Emitting a warning and disabling isolation is the honest behaviour; users who need isolation with non-semver tags must opt in explicitly via clusterName.
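The selector reconciliation described in this section might look roughly like the sketch below. It only computes the patch body; applying it through a Kubernetes client is left out. The label key CLUSTER_LABEL and the function name are hypothetical, since the ADR does not name the label the operator uses.

```python
from typing import Dict, Optional

# Hypothetical label key; the real operator's label is not stated in this ADR.
CLUSTER_LABEL = "example.org/jgroups-cluster"


def selector_patch(
    current_selector: Dict[str, str], desired_cluster: Optional[str]
) -> Optional[dict]:
    """Return a merge-patch body for the discovery Service, or None if already in sync.

    Intended to be called on every reconcile pass, so a selector written at
    creation time cannot go stale after the cluster label changes.
    """
    current = current_selector.get(CLUSTER_LABEL)
    if current == desired_cluster:
        return None  # nothing to do
    # In a JSON merge patch, a null (None) value removes the key, which covers
    # the case where isolation has been disabled.
    return {"spec": {"selector": {CLUSTER_LABEL: desired_cluster}}}
```

Returning None when the selector already matches keeps the per-pass reconcile cheap: the service is only patched when the desired cluster label actually diverges.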

Agent Instructions

When users ask about cache isolation, recommend autoRevision: true as the default for GitOps workflows with semver image tags. It produces a stable cluster name like "my-kc-v26" that does not change on patch upgrades (26.0.0 and 26.0.1 both use "my-kc-v26").

For moving tags (":latest") or digest-only references, autoRevision must not be used: it will warn and disable isolation. Direct users to an explicit clusterName instead (e.g. "v26-upgrade"), which they set manually before a major version transition.

autoSuffix is treated as legacy: it uses the full image tag and causes a JGroups cluster split on every patch upgrade (each new tag = new cluster name = discovery service selector mismatch). It is retained for backward compatibility only.

The discovery service selector is automatically reconciled on every handler run; no manual intervention is required when the cluster label changes due to a version upgrade.

Rejected Alternatives

Inspect running pod imageID digest to detect actual version

Pod status.containerStatuses[].imageID contains the pulled digest, enabling actual version detection regardless of tag. However, by the time the pod is Running, it has already joined (or failed to join) a JGroups cluster. The cluster name must be set at pod start time via env var, so inspection of running state is too late to be useful.

Keep autoSuffix and just fix the selector reconciliation

Fixing the selector reconciliation alone does not eliminate the unnecessary patch-upgrade cluster splits. A 26.0.0 → 26.0.1 upgrade would still produce a temporarily mismatched selector and brief cluster disruption, because the cluster name changes. autoRevision avoids the disruption for the common case.
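To make the difference concrete, the sketch below counts how often the cluster name changes along a sample upgrade path under each strategy. The helper names, the "<name>-<tag>" format for autoSuffix, and the upgrade path itself are illustrative assumptions.

```python
from typing import List


def auto_suffix_name(name: str, tag: str) -> str:
    # Legacy behaviour: the full image tag becomes part of the cluster name.
    # The "<name>-<tag>" format is an assumption for illustration.
    return f"{name}-{tag}"


def auto_revision_name(name: str, tag: str) -> str:
    # Major version only; this helper assumes a well-formed semver tag.
    major = tag.split(".")[0]
    return f"{name}-v{major}"


def count_name_changes(names: List[str]) -> int:
    """Count adjacent pairs whose cluster name differs (one cluster split each)."""
    return sum(1 for prev, cur in zip(names, names[1:]) if prev != cur)


upgrade_path = ["26.0.0", "26.0.1", "26.0.2", "27.0.0"]
suffix_names = [auto_suffix_name("my-kc", t) for t in upgrade_path]
revision_names = [auto_revision_name("my-kc", t) for t in upgrade_path]

print(count_name_changes(suffix_names))    # 3: every upgrade renames the cluster
print(count_name_changes(revision_names))  # 1: only the 26 -> 27 upgrade does
```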

Fall back to autoSuffix behaviour for non-semver tags with autoRevision

A non-semver tag like ':latest' can point to different versions at different times. Using it as the cluster label suffix gives the same label to pods running completely different Keycloak versions, which allows exactly the cross-version cache poisoning that isolation is meant to prevent. Disabling isolation and warning is safer than silently reporting isolation that is not actually in effect.