Drift Detection¶
The Keycloak Operator includes drift detection to monitor the actual state of Keycloak resources and compare them with Kubernetes Custom Resources (CRs). This helps identify:
- Orphaned resources: Resources created by the operator but whose CR has been deleted
- Configuration drift: Resources whose actual state differs from the CR specification
- Unmanaged resources: Resources in Keycloak not managed by any operator instance
Features¶
Resource Ownership Tracking¶
Every Keycloak resource (realm, client, etc.) created by the operator is tagged with ownership attributes:
{
"attributes": {
"io.kubernetes.managed-by": "keycloak-operator",
"io.kubernetes.operator-instance": "keycloak-operator-production",
"io.kubernetes.cr-namespace": "team-a",
"io.kubernetes.cr-name": "my-realm",
"io.kubernetes.created-at": "2025-10-28T12:00:00Z"
}
}
These attributes enable: - Multi-operator deployments (each operator tracks its own resources) - Orphan detection (identify resources whose CR was deleted) - Drift detection (verify CR still matches actual state)
Periodic Drift Scanning¶
The operator runs periodic background scans to check for drift:
- Fetch all resources from Keycloak
- Check ownership using attributes
- Verify CR existence for operator-managed resources
- Compare configuration (future: detect spec drift)
- Emit Prometheus metrics for monitoring
Auto-Remediation (Optional)¶
When enabled, the operator can automatically fix drift:
- Orphaned resources: Delete from Keycloak if older than minimum age (default: 24 hours)
- Configuration drift: Update Keycloak to match CR spec (future feature)
Safety mechanisms: - Minimum age check (default: 24 hours) prevents accidental deletion of newly created resources - Re-check CR existence before deletion to avoid race conditions - Only touches resources with this operator's instance ID
Configuration¶
Configure drift detection via Helm values:
monitoring:
driftDetection:
# Enable drift detection
enabled: true
# Scan interval in seconds (default: 300 = 5 minutes)
intervalSeconds: 300
# Auto-remediate detected drift (default: false)
# WARNING: When enabled, orphaned resources will be automatically deleted
autoRemediate: false
# Minimum age in hours before deleting orphaned resources (default: 24)
# Safety mechanism to prevent accidental deletion
minimumAgeHours: 24
# Scope of drift detection
scope:
realms: true
clients: true
identityProviders: true # Future feature
roles: true # Future feature
Environment Variables¶
If you're not using Helm, configure via environment variables:
DRIFT_DETECTION_ENABLED=true
DRIFT_DETECTION_INTERVAL_SECONDS=300
DRIFT_DETECTION_AUTO_REMEDIATE=false
DRIFT_DETECTION_MINIMUM_AGE_HOURS=24
DRIFT_DETECTION_SCOPE_REALMS=true
DRIFT_DETECTION_SCOPE_CLIENTS=true
DRIFT_DETECTION_SCOPE_IDENTITY_PROVIDERS=true
DRIFT_DETECTION_SCOPE_ROLES=true
Prometheus Metrics¶
The operator exposes the following metrics for drift detection:
Drift Detection Metrics¶
# Number of orphaned resources (created by this operator, CR deleted)
keycloak_operator_orphaned_resources{resource_type, resource_name, operator_instance}
# Number of resources with configuration drift
keycloak_operator_config_drift{resource_type, resource_name, cr_namespace, cr_name}
# Number of unmanaged resources (not created by any operator)
keycloak_unmanaged_resources{resource_type, resource_name}
Remediation Metrics¶
# Total remediation actions performed
keycloak_operator_remediation_total{resource_type, action, reason}
# Total remediation errors
keycloak_operator_remediation_errors_total{resource_type, action}
Health Metrics¶
# Duration of drift detection scans
keycloak_operator_drift_check_duration_seconds{resource_type}
# Total drift check errors
keycloak_operator_drift_check_errors_total{resource_type}
# Unix timestamp of last successful drift check
keycloak_operator_drift_check_last_success_timestamp
Example Prometheus Alerts¶
Create alerts to notify when drift is detected:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: keycloak-operator-drift-alerts
spec:
groups:
- name: keycloak-drift
interval: 30s
rules:
# Alert on orphaned resources
- alert: KeycloakOrphanedResources
expr: keycloak_operator_orphaned_resources > 0
for: 30m
labels:
severity: warning
component: keycloak-operator
annotations:
summary: "Orphaned Keycloak resources detected"
description: |
{{ $value }} orphaned {{ $labels.resource_type }} resource(s) detected.
Resource: {{ $labels.resource_name }}
Operator: {{ $labels.operator_instance }}
# Alert on configuration drift
- alert: KeycloakConfigurationDrift
expr: keycloak_operator_config_drift > 0
for: 15m
labels:
severity: info
component: keycloak-operator
annotations:
summary: "Keycloak configuration drift detected"
description: |
Configuration drift detected for {{ $labels.resource_type }}: {{ $labels.resource_name }}
CR: {{ $labels.cr_namespace }}/{{ $labels.cr_name }}
# Alert on drift check failures
- alert: KeycloakDriftCheckFailure
expr: increase(keycloak_operator_drift_check_errors_total[5m]) > 3
for: 5m
labels:
severity: warning
component: keycloak-operator
annotations:
summary: "Drift detection checks are failing"
description: |
Drift detection for {{ $labels.resource_type }} has failed {{ $value }} times in the last 5 minutes.
# Alert if drift checks haven't run recently
- alert: KeycloakDriftCheckStale
expr: (time() - keycloak_operator_drift_check_last_success_timestamp) > 900
for: 5m
labels:
severity: warning
component: keycloak-operator
annotations:
summary: "Drift detection checks are not running"
description: |
Drift detection has not run successfully in {{ $value | humanizeDuration }}.
Usage Examples¶
Scenario 1: Detect Orphaned Realms¶
-
Create a realm:
-
Delete the CR (simulating accidental deletion):
-
Check metrics (after next drift scan):
-
Manual cleanup (if auto-remediation is disabled):
-
Auto-cleanup (if auto-remediation is enabled and age > 24h):
Scenario 2: Multi-Operator Deployments¶
When running multiple operator instances:
# Operator 1 in production namespace
operator:
instanceId: "keycloak-operator-production"
# Operator 2 in staging namespace
operator:
instanceId: "keycloak-operator-staging"
Each operator only manages resources it created: - Production operator ignores resources created by staging operator - Prevents conflicts and accidental deletions - Clear ownership boundaries
Scenario 3: Identify Unmanaged Resources¶
Find Keycloak resources not managed by any operator:
# Query metrics
curl http://localhost:8081/metrics | grep unmanaged_resources
# Example output:
# keycloak_unmanaged_resources{resource_type="realm",resource_name="legacy-realm"} 1
# keycloak_unmanaged_resources{resource_type="client",resource_name="manual-client"} 1
These are resources that existed before the operator or were created manually.
Options: - Leave them as-is (operator won't touch them) - Manually add ownership attributes to adopt them (not recommended) - Create matching CRs to bring them under operator management (recommended)
Troubleshooting¶
Drift detection is not running¶
Symptoms: keycloak_operator_drift_check_last_success_timestamp is stale
Causes: 1. Drift detection is disabled in Helm values 2. Operator is not running or crashing
Solutions:
# Check if enabled
helm get values keycloak-operator | grep driftDetection
# Check operator logs
kubectl logs -n keycloak-system deployment/keycloak-operator | grep drift
# Ensure at least one realm CR exists
kubectl get keycloakrealms -A
Orphaned resources not being deleted¶
Symptoms: keycloak_operator_orphaned_resources > 0 but resources not deleted
Causes:
1. Auto-remediation is disabled (check autoRemediate: false)
2. Resource age < minimum age (default 24h)
3. Remediation errors (check error metrics)
Solutions:
# Check auto-remediation setting
helm get values keycloak-operator | grep autoRemediate
# Check resource age (must be > minimumAgeHours)
# Resource created_at is in the attributes
# Check for remediation errors
curl http://localhost:8081/metrics | grep remediation_errors_total
# Check operator logs for errors
kubectl logs -n keycloak-system deployment/keycloak-operator | grep remediation
False orphan detection¶
Symptoms: Resources marked as orphaned but CR exists
Causes: 1. CR is in different namespace than expected 2. Ownership attributes don't match actual CR name/namespace 3. Permissions issue (operator can't read CR)
Solutions:
# Verify CR exists and matches ownership attributes
kubectl get keycloakrealm my-realm -n expected-namespace -o yaml
# Check operator RBAC permissions
kubectl auth can-i get keycloakrealms --as=system:serviceaccount:keycloak-system:keycloak-operator-keycloak-system
# Check operator logs for permission errors
kubectl logs -n keycloak-system deployment/keycloak-operator | grep -i "permission\|rbac"
Migration from Existing Resources¶
Breaking Change Notice¶
⚠️ Resources created before this version will NOT be managed for drift detection.
Existing realms and clients lack ownership attributes and will be treated as "unmanaged" resources.
Migration Options¶
Option 1: Recreate Resources (Recommended)¶
- Export existing resource configuration
- Delete the resource from Keycloak
- Recreate via CR (operator will add ownership attributes)
# Backup realm config
kubectl get keycloakrealm my-realm -o yaml > my-realm-backup.yaml
# Delete and recreate
kubectl delete keycloakrealm my-realm
kubectl apply -f my-realm-backup.yaml
Option 2: Manual Attribute Addition (Advanced)¶
Manually add ownership attributes to existing Keycloak resources via Admin API:
# Get current realm
GET /admin/realms/{realm-name}
# Add attributes
PATCH /admin/realms/{realm-name}
{
"attributes": {
"io.kubernetes.managed-by": "keycloak-operator",
"io.kubernetes.operator-instance": "keycloak-operator-<namespace>",
"io.kubernetes.cr-namespace": "<cr-namespace>",
"io.kubernetes.cr-name": "<cr-name>",
"io.kubernetes.created-at": "2025-10-28T12:00:00Z"
}
}
⚠️ Risks: - Incorrect attributes can cause drift detection to malfunction - Easy to make mistakes with namespace/name mapping - Not recommended unless you know what you're doing
Option 3: Leave As Unmanaged (Simplest)¶
Do nothing. Existing resources will show up as "unmanaged" in metrics but won't be affected by drift detection or auto-remediation.
Security Considerations¶
Ownership Attribute Tampering¶
Threat: Someone manually modifies ownership attributes in Keycloak to evade drift detection
Mitigation: - Keycloak Admin API should be restricted (not publicly accessible) - Use Keycloak RBAC to limit who can modify realms/clients - Audit logs should track attribute changes
Unauthorized Resource Deletion¶
Threat: Auto-remediation deletes resources that shouldn't be deleted
Mitigation:
- Auto-remediation is disabled by default
- 24-hour minimum age prevents accidental deletions
- Operator logs all deletions for audit trail
- Monitor remediation_total metric for unexpected deletions
Information Disclosure¶
Threat: Prometheus metrics expose sensitive information about tenants
Mitigation: - Metrics only expose resource names (not secrets, passwords, etc.) - Unmanaged resources are visible (could reveal what exists) - Use Prometheus authentication/authorization to restrict metric access - Consider disabling unmanaged resource metrics if needed
Future Enhancements¶
- Config drift detection: Compare actual Keycloak state with CR spec
- Identity provider drift detection: Track IDP configuration changes
- Role drift detection: Monitor role assignments
- Drift remediation for config changes: Auto-update Keycloak when CR changes
- Grafana dashboard: Pre-built dashboard for drift visualization
- Webhook notifications: Send alerts to Slack/Teams when drift detected
- Dry-run mode: Log what would be remediated without actually doing it
- Per-resource remediation control: Annotation to disable auto-remediation for specific resources