Drift Detection

The operator can periodically compare Keycloak state with the Kubernetes custom resources (CRs) that are supposed to own it.

Drift detection helps with:

orphaned resources left behind after CR deletion
unmanaged resources that were not created by any operator instance
configuration drift between a CR and the live Keycloak object

How It Works

Drift detection runs as a background timer inside the operator.

Each time the timer fires, it:

scans managed realms and clients
checks whether the corresponding CR still exists
evaluates whether the live Keycloak state has drifted from the CR
reports drift through metrics and logs
optionally remediates eligible drift when auto-remediation is enabled

flowchart TD
    start[Timer fires] --> phase{Resource phase Ready or Degraded?}
    phase -- No --> skip[Skip resource]
    phase -- Yes --> scan[Inspect Keycloak state]
    scan --> owned{Managed by this operator instance?}
    owned -- No --> unmanaged[Mark unmanaged]
    owned -- Yes --> exists{CR still exists?}
    exists -- No --> orphan[Mark orphaned]
    exists -- Yes --> diff{Spec drift detected?}
    diff -- No --> done[Record healthy]
    diff -- Yes --> drift[Mark config drift]
    orphan --> remediate{Auto-remediate enabled and old enough?}
    drift --> remediate
    remediate -- No --> metrics[Expose metrics and logs]
    remediate -- Yes --> recheck[Re-check CR existence and safety gates]
    recheck --> action[Delete orphan or reconcile config]
    action --> metrics
    unmanaged --> metrics
    skip --> metrics
    done --> metrics
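
The classification steps in the flowchart can be sketched in Python as follows. This is an illustrative sketch, not the operator's actual code: the `Snapshot` type, its field names, the `Outcome` enum, and the `classify` function are all hypothetical, and the inputs are assumed to have been computed before the function is called.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    SKIPPED = "skipped"
    UNMANAGED = "unmanaged"
    ORPHANED = "orphaned"
    CONFIG_DRIFT = "config_drift"
    HEALTHY = "healthy"


# Phases that the flowchart lets through to the scan step.
SCANNED_PHASES = {"Ready", "Degraded"}


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical, precomputed view of one resource for a single scan."""
    phase: str
    owned_by_this_instance: bool
    cr_exists: bool
    spec_matches_live_state: bool


def classify(s: Snapshot) -> Outcome:
    """Walk the decision nodes in the same order as the flowchart."""
    if s.phase not in SCANNED_PHASES:
        return Outcome.SKIPPED
    if not s.owned_by_this_instance:
        return Outcome.UNMANAGED
    if not s.cr_exists:
        return Outcome.ORPHANED
    if not s.spec_matches_live_state:
        return Outcome.CONFIG_DRIFT
    return Outcome.HEALTHY
```

Only the `ORPHANED` and `CONFIG_DRIFT` outcomes continue to the remediation decision; every outcome ends in metrics and logs.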

Important Requirement: Admin Events

Admin events are mandatory for drift detection.

When drift detection is enabled, the reconciler enforces these realm settings:

adminEventsEnabled: true
adminEventsDetailsEnabled: true

Drift detection depends on admin events to determine what changed and when.
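
As a rough illustration of what this enforcement amounts to, the sketch below forces both flags on a realm payload before it is sent to Keycloak. The function name `enforce_admin_events` and the idea of patching a plain dictionary are assumptions made for this example; only the two field names come from the settings listed above.

```python
from typing import Any


def enforce_admin_events(
    realm: dict[str, Any], drift_detection_enabled: bool
) -> dict[str, Any]:
    """Return a copy of the realm payload with the admin-event flags forced on.

    Hypothetical helper: when drift detection is disabled, the payload is
    copied unchanged.
    """
    if not drift_detection_enabled:
        return dict(realm)
    return {
        **realm,
        "adminEventsEnabled": True,
        "adminEventsDetailsEnabled": True,
    }
```

Returning a copy rather than mutating the input keeps the original spec available for comparison, which is a design choice of this sketch rather than a documented behavior.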

Helm Configuration

Configure drift detection in the operator chart under monitoring.driftDetection.

monitoring:
  driftDetection:
    enabled: true
    intervalSeconds: 300
    autoRemediate: false
    minimumAgeHours: 24
    scope:
      realms: true
      clients: true
      identityProviders: true
      roles: true

Notes:

minimumAgeHours defaults to 24 when it is not set.
auto-remediation acts only on drift that is at least minimumAgeHours old.

Environment Variables

If you deploy the operator without Helm, set the equivalent environment variables:

DRIFT_DETECTION_ENABLED=true
DRIFT_DETECTION_INTERVAL_SECONDS=300
DRIFT_DETECTION_AUTO_REMEDIATE=false
DRIFT_DETECTION_MINIMUM_AGE_HOURS=24
DRIFT_DETECTION_SCOPE_REALMS=true
DRIFT_DETECTION_SCOPE_CLIENTS=true
DRIFT_DETECTION_SCOPE_IDENTITY_PROVIDERS=true
DRIFT_DETECTION_SCOPE_ROLES=true
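
To make the mapping between these variables and the settings concrete, here is a minimal Python sketch that reads them into a typed configuration object. The `DriftConfig` class, the `load_drift_config` function, and the accepted boolean spellings are assumptions for illustration. Apart from the 24-hour minimum age, the fallback values are placeholders chosen for this sketch, not documented defaults.

```python
import os
from dataclasses import dataclass
from typing import Mapping

PREFIX = "DRIFT_DETECTION_"
SCOPE_KEYS = ("REALMS", "CLIENTS", "IDENTITY_PROVIDERS", "ROLES")


def _flag(env: Mapping[str, str], name: str, default: bool) -> bool:
    """Parse a boolean variable; the accepted spellings are an assumption."""
    raw = env.get(PREFIX + name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def _number(env: Mapping[str, str], name: str, default: int) -> int:
    """Parse an integer variable, falling back when it is unset."""
    raw = env.get(PREFIX + name)
    return default if raw is None else int(raw)


@dataclass(frozen=True)
class DriftConfig:
    enabled: bool
    interval_seconds: int
    auto_remediate: bool
    minimum_age_hours: int
    scope: dict[str, bool]


def load_drift_config(env: Mapping[str, str] = os.environ) -> DriftConfig:
    return DriftConfig(
        enabled=_flag(env, "ENABLED", False),  # placeholder fallback
        interval_seconds=_number(env, "INTERVAL_SECONDS", 300),  # placeholder
        auto_remediate=_flag(env, "AUTO_REMEDIATE", False),  # placeholder
        minimum_age_hours=_number(env, "MINIMUM_AGE_HOURS", 24),  # documented
        scope={k.lower(): _flag(env, "SCOPE_" + k, True) for k in SCOPE_KEYS},
    )
```

Passing a plain dictionary instead of `os.environ` makes the loader easy to exercise in tests.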

Phases That Are Scanned

Drift detection only processes resources in these phases:

Ready
Degraded

Resources in other phases are skipped, including:

Pending
Reconciling
Failed
Updating
Paused

Skipping these phases keeps drift detection from interfering with in-progress reconciliation, error handling, or intentional pauses.

Ownership and Multi-Operator Behavior

Each operator instance manages exactly one Keycloak instance.

Ownership markers written into Keycloak resources allow the operator to distinguish:

resources created by this operator instance
resources created by another operator instance
resources not managed by any operator instance

This matters in multi-operator deployments because orphan cleanup and config remediation are scoped to the current operator instance.
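
The three-way distinction above could be derived from an ownership marker roughly as follows. This is a sketch under stated assumptions: the attribute key `OWNER_ATTRIBUTE` is a made-up placeholder (this page does not name the real marker), and `ownership_of` is a hypothetical helper.

```python
from enum import Enum
from typing import Mapping, Optional

# Hypothetical attribute key; the real marker name is not given on this page.
OWNER_ATTRIBUTE = "example.invalid/operator-instance"


class Ownership(Enum):
    THIS_INSTANCE = "this_instance"
    OTHER_INSTANCE = "other_instance"
    UNMANAGED = "unmanaged"


def ownership_of(attributes: Mapping[str, str], this_instance_id: str) -> Ownership:
    """Classify a Keycloak resource by the operator instance recorded on it."""
    owner: Optional[str] = attributes.get(OWNER_ATTRIBUTE)
    if not owner:
        # No marker (or an empty one): no operator instance claims it.
        return Ownership.UNMANAGED
    if owner == this_instance_id:
        return Ownership.THIS_INSTANCE
    return Ownership.OTHER_INSTANCE
```

Only resources classified as `THIS_INSTANCE` would be candidates for orphan cleanup or config remediation by the current operator.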

Auto-Remediation

When autoRemediate=true:

orphaned resources can be deleted after they exceed minimumAgeHours
configuration drift can be reconciled back toward the CR spec

Safety checks include:

minimum age gate
CR existence re-check before deletion
operator-instance ownership verification
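
The way these gates combine for orphan deletion can be sketched as below. The function `may_delete_orphan` and its parameters are hypothetical. In particular, `observed_since` stands in for whatever timestamp the operator measures age from, which this page does not specify, and the `>=` comparison follows the "at least minimumAgeHours old" wording in the Helm notes.

```python
from datetime import datetime, timedelta
from typing import Callable


def may_delete_orphan(
    *,
    auto_remediate: bool,
    owned_by_this_instance: bool,
    observed_since: datetime,
    now: datetime,
    minimum_age_hours: int,
    cr_exists: Callable[[], bool],
) -> bool:
    """Return True only when every safety gate for orphan deletion passes."""
    if not auto_remediate:
        return False
    # Ownership gate: never touch resources owned by another instance.
    if not owned_by_this_instance:
        return False
    # Minimum age gate: the resource must be at least minimum_age_hours old.
    if now - observed_since < timedelta(hours=minimum_age_hours):
        return False
    # Re-check last, so a CR recreated since the scan cancels the deletion.
    return not cr_exists()
```

Taking `cr_exists` as a callable and invoking it last keeps the existence check as close as possible to the deletion itself, which narrows the window for a race with a newly created CR.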

Metrics

Drift detection exports these metrics:

keycloak_operator_orphaned_resources{resource_type,operator_instance}
keycloak_operator_config_drift{resource_type,cr_namespace}
keycloak_operator_unmanaged_resources{resource_type}
keycloak_operator_remediation_total{resource_type,action,reason}
keycloak_operator_remediation_errors_total{resource_type,action}
keycloak_operator_drift_check_duration_seconds{resource_type}
keycloak_operator_drift_check_errors_total{resource_type}
keycloak_operator_drift_check_last_success_timestamp

Example queries:

keycloak_operator_orphaned_resources > 0

increase(keycloak_operator_drift_check_errors_total[5m]) > 0

(time() - keycloak_operator_drift_check_last_success_timestamp) > 900

Alerts

An example PrometheusRule built from the queries above:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: keycloak-operator-drift-alerts
spec:
  groups:
    - name: keycloak-drift
      rules:
        - alert: KeycloakOrphanedResources
          expr: keycloak_operator_orphaned_resources > 0
          for: 30m
        - alert: KeycloakDriftCheckFailure
          expr: increase(keycloak_operator_drift_check_errors_total[5m]) > 3
          for: 5m
        - alert: KeycloakDriftCheckStale
          expr: (time() - keycloak_operator_drift_check_last_success_timestamp) > 900
          for: 5m

Operational Notes

drift detection is background protection, not a replacement for normal reconciliation
paused resources are intentionally skipped
enable auto-remediation in development first, and turn it on in production only after you have verified the ownership and age semantics