Observability¶
This document describes the observability features available in the Keycloak operator, including status conditions, metrics, and monitoring capabilities.
Status Conditions¶
All custom resources (Keycloak, KeycloakRealm, KeycloakClient) expose Kubernetes-standard status conditions that can be used by GitOps tools like Argo CD and Flux CD to determine resource health.
Standard Conditions¶
Each resource implements the following condition types:
Ready¶
Indicates whether the resource is fully reconciled and operational.
- Status: `True`, `False`, or `Unknown`
- Reason: `ReconciliationSucceeded`, `ReconciliationFailed`, `ReconciliationInProgress`
- Usage: Primary health indicator for GitOps tools
Available¶
Indicates whether the resource is available for use (Kubernetes standard).
- Status: `True` or `False`
- Reason: `ReconciliationSucceeded`, `ReconciliationFailed`
- Usage: Determines if the resource can serve its purpose
Progressing¶
Indicates an ongoing reconciliation operation (Kubernetes standard).
- Status: `True` or `False`
- Reason: `ReconciliationInProgress`
- Usage: Shows active reconciliation work
Degraded¶
Indicates the resource is operational but not in optimal state.
- Status: `True` or `False`
- Reason: `PartialFunctionality`, `ReconciliationFailed`
- Usage: Alerts about suboptimal conditions
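Because these are standard Kubernetes condition types, you can also block on them directly with `kubectl wait`. A minimal sketch for a Keycloak resource named `my-keycloak` (the name is an example; the same pattern applies to realm and client resources):

```bash
# Block until the Ready condition reports True, or fail after the timeout
kubectl wait keycloak/my-keycloak --for=condition=Ready --timeout=300s
```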
Checking Resource Status¶
View the status of a resource:
# Get resource with status
kubectl get keycloak my-keycloak -o yaml
# Check conditions specifically
kubectl get keycloak my-keycloak -o jsonpath='{.status.conditions}' | jq
# Check if a resource is ready
kubectl get keycloak my-keycloak -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
Example Status Output¶
status:
  phase: Ready
  message: Keycloak instance is ready
  lastUpdated: "2025-10-15T20:00:00Z"
  observedGeneration: 5
  conditions:
    - type: Ready
      status: "True"
      reason: ReconciliationSucceeded
      message: Reconciliation completed successfully
      lastTransitionTime: "2025-10-15T20:00:00Z"
      observedGeneration: 5
    - type: Available
      status: "True"
      reason: ReconciliationSucceeded
      message: Resource is available
      lastTransitionTime: "2025-10-15T20:00:00Z"
      observedGeneration: 5
  deployment: my-keycloak-keycloak
  service: my-keycloak-keycloak
  endpoints:
    admin: http://my-keycloak-keycloak.default.svc.cluster.local:8080
    public: http://my-keycloak-keycloak.default.svc.cluster.local:8080
    management: http://my-keycloak-keycloak.default.svc.cluster.local:9000
ObservedGeneration¶
All resources track observedGeneration which indicates the generation of the spec that was last reconciled. This is crucial for GitOps workflows:
- Match: When `status.observedGeneration` equals `metadata.generation`, the resource is fully reconciled
- Mismatch: When they differ, reconciliation is pending or in progress
- Usage: GitOps tools use this to detect drift and sync status
Example check:
# Check if resource is fully synced
kubectl get keycloak my-keycloak -o json | \
jq 'if .status.observedGeneration == .metadata.generation then "Synced" else "OutOfSync" end'
Resource-Specific Status Fields¶
Keycloak Status¶
status:
  deployment: my-keycloak-keycloak            # Name of the deployment
  service: my-keycloak-keycloak               # Name of the service
  adminSecret: my-keycloak-admin-credentials  # Admin credentials secret
  endpoints:
    admin: http://...       # Admin API endpoint
    public: http://...      # Public endpoint
    management: http://...  # Management endpoint (health checks)
KeycloakRealm Status¶
status:
  realmName: my-realm                 # Actual realm name in Keycloak
  keycloakInstance: default/keycloak  # Referenced Keycloak instance
  features:
    userRegistration: true
    passwordReset: true
    identityProviders: 2
    userFederationProviders: 1
    customThemes: true
KeycloakClient Status¶
status:
  client_id: my-client                       # Client ID
  client_uuid: abc-123                       # UUID in Keycloak
  realm: my-realm                            # Realm name
  keycloak_instance: default/keycloak        # Keycloak instance reference
  credentials_secret: my-client-credentials  # Client credentials secret
  public_client: false                       # Whether this is a public client
  endpoints:
    auth: https://keycloak.example.com/realms/my-realm
    token: https://keycloak.example.com/realms/my-realm/protocol/openid-connect/token
    userinfo: https://keycloak.example.com/realms/my-realm/protocol/openid-connect/userinfo
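Applications usually consume the Secret referenced by `credentials_secret`. A quick way to inspect it (a sketch; the key name `client-secret` is an assumption, so list the keys first to confirm what the operator actually writes):

```bash
# List the keys the operator stored in the credentials Secret
kubectl get secret my-client-credentials -o json | jq '.data | keys'

# Decode one key (the key name "client-secret" is illustrative)
kubectl get secret my-client-credentials -o jsonpath='{.data.client-secret}' | base64 -d
```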
Prometheus Metrics¶
The operator exposes Prometheus metrics on port 8081 at /metrics.
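To inspect the endpoint directly, port-forward the operator and fetch the metrics page (a sketch; the deployment name and namespace are assumptions based on the examples elsewhere in this document):

```bash
# Forward the metrics port and list the operator's metric families
kubectl port-forward -n keycloak-operator-system deploy/keycloak-operator 8081:8081 &
curl -s http://localhost:8081/metrics | grep '^keycloak_operator_'
```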
Available Metrics¶
Reconciliation Metrics¶
# Reconciliation operations counter
keycloak_operator_reconciliation_total{resource_type="keycloak|realm|client", namespace="...", result="success|failure"}
# Reconciliation duration histogram
keycloak_operator_reconciliation_duration_seconds{resource_type="...", namespace="...", operation="reconcile|update|delete"}
# Active resources gauge
keycloak_operator_active_resources{resource_type="...", namespace="...", phase="Ready|Failed|Pending"}
Resource Status Metrics¶
# Resource status by phase
keycloak_operator_active_resources{resource_type="keycloak|realm|client", namespace="...", phase="Ready|Failed|Pending"}
Error Metrics¶
# Error counter by type
keycloak_operator_reconciliation_errors_total{error_type="...", resource_type="...", namespace="...", retryable="true|false"}
Scraping Metrics¶
Configure Prometheus to scrape the operator:
apiVersion: v1
kind: Service
metadata:
  name: keycloak-operator-metrics
  labels:
    app: keycloak-operator
spec:
  ports:
    - name: metrics
      port: 8081
      targetPort: 8081
  selector:
    app: keycloak-operator
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: keycloak-operator
spec:
  selector:
    matchLabels:
      app: keycloak-operator
  endpoints:
    - port: metrics
      interval: 30s
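Once the metrics are scraped, the counters above can be queried for failure rates. A sketch using the Prometheus HTTP API (the Prometheus service URL is an assumption for your environment):

```bash
# Per-resource-type reconciliation failure rate over the last 5 minutes
curl -sG 'http://prometheus.monitoring:9090/api/v1/query' \
  --data-urlencode 'query=sum by (resource_type) (rate(keycloak_operator_reconciliation_total{result="failure"}[5m]))' \
  | jq '.data.result'
```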
Logging¶
The operator uses structured logging with correlation IDs for request tracing.
Log Levels¶
- DEBUG: Detailed operational information
- INFO: General operational messages
- WARNING: Warning conditions (degraded but functioning)
- ERROR: Error conditions requiring attention
Viewing Logs¶
# Follow operator logs
kubectl logs -f -l app=keycloak-operator -n keycloak-operator-system
# View logs with correlation ID
kubectl logs -l app=keycloak-operator -n keycloak-operator-system | grep "correlation_id=abc-123"
# Check reconciliation logs for specific resource
kubectl logs -l app=keycloak-operator -n keycloak-operator-system | \
grep "resource_name=my-keycloak"
Log Format¶
Logs include structured fields:
{
  "timestamp": "2025-10-15T20:00:00Z",
  "level": "INFO",
  "logger": "KeycloakReconciler",
  "message": "Reconciliation completed successfully",
  "resource_type": "keycloak",
  "resource_name": "my-keycloak",
  "namespace": "default",
  "correlation_id": "abc-123",
  "duration": 2.5
}
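Because each line is JSON, logs can be filtered on any structured field with `jq` instead of `grep`. A sketch (non-JSON startup lines are silently skipped by `fromjson?`):

```bash
# Keep only entries for one resource, preserving the full structured record
kubectl logs -l app=keycloak-operator -n keycloak-operator-system --tail=2000 \
  | jq -R 'fromjson? | select(.resource_name == "my-keycloak")'
```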
Health Checks¶
The operator pod exposes health endpoints:
- Liveness: HTTP GET on `/healthz` (port 8081)
- Readiness: HTTP GET on `/ready` (port 8081)
GitOps Integration¶
Argo CD Health Assessment¶
Argo CD automatically uses the Ready condition to determine resource health:
# Argo CD will show:
# - Healthy: Ready=True
# - Progressing: Progressing=True or observedGeneration mismatch
# - Degraded: Ready=False or Degraded=True
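With that mapping in place, a sync can simply wait for the application to become healthy. A sketch using the Argo CD CLI (the application name is a placeholder):

```bash
# Block until the Argo CD application containing these resources reports Healthy
argocd app wait keycloak-resources --health --timeout 300
```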
Flux CD Health Assessment¶
Flux CD checks the Ready condition and observedGeneration:
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: keycloak-resources
spec:
  healthChecks:
    - apiVersion: vriesdemichael.github.io/v1
      kind: Keycloak
      name: my-keycloak
      namespace: default
Circuit Breaker Status¶
The operator uses a circuit breaker to protect the Keycloak API from overload. When the circuit breaker opens:
- The operator logs: `Circuit breaker open for Keycloak at http://...`
- API calls return HTTP 503 (Service Unavailable)
- Reconciliation is retried with exponential backoff
- The circuit resets after 60 seconds of no failures
Check circuit breaker state in logs:
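For example (the message text follows the log line quoted above; exact wording may differ):

```bash
# Surface circuit breaker transitions in the operator logs
kubectl logs -l app=keycloak-operator -n keycloak-operator-system | grep -i "circuit breaker"
```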
Troubleshooting with Status¶
Resource Stuck in Pending¶
# Check status conditions
kubectl describe keycloak my-keycloak
# Look for the message in status
kubectl get keycloak my-keycloak -o jsonpath='{.status.message}'
# Check if generation matches (sync status)
kubectl get keycloak my-keycloak -o json | \
jq '{generation: .metadata.generation, observedGeneration: .status.observedGeneration}'
Reconciliation Failures¶
# Check Ready condition for reason
kubectl get keycloak my-keycloak -o json | \
jq '.status.conditions[] | select(.type=="Ready")'
# View recent events
kubectl get events --field-selector involvedObject.name=my-keycloak
# Check operator logs for this resource
kubectl logs -l app=keycloak-operator | grep "resource_name=my-keycloak"
Performance Issues¶
# Query Prometheus for slow reconciliations (p95 duration per resource type)
histogram_quantile(0.95,
  sum by (le, resource_type) (
    rate(keycloak_operator_reconciliation_duration_seconds_bucket[5m])
  )
)
# Check the number of managed resources in each phase
keycloak_operator_active_resources
Distributed Tracing¶
The Keycloak operator supports OpenTelemetry distributed tracing for end-to-end visibility into reconciliation operations. When enabled, traces are exported to an OTLP collector and can be viewed in tools like Jaeger, Tempo, or any OTEL-compatible backend.
Enabling Tracing¶
Configure tracing in your Helm values:
operator:
  tracing:
    # Enable OpenTelemetry tracing
    enabled: true
    # OTLP collector endpoint (gRPC protocol)
    # Examples:
    #   - "http://otel-collector.monitoring:4317" (in-cluster)
    #   - "http://tempo.monitoring:4317" (Grafana Tempo)
    #   - "http://jaeger-collector.monitoring:4317" (Jaeger)
    endpoint: "http://otel-collector.monitoring:4317"
    # Service name for traces (identifies the operator)
    serviceName: "keycloak-operator"
    # Trace sampling rate (0.0-1.0)
    # 1.0 = 100% of traces, 0.1 = 10% of traces
    # Lower values reduce overhead in high-throughput environments
    sampleRate: 1.0
    # Use insecure connection to OTLP collector (no TLS)
    insecure: true
    # Propagate tracing to managed Keycloak instances
    # Enables end-to-end distributed tracing
    propagateToKeycloak: true
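After upgrading the release with these values, you can confirm the settings reached the operator by checking its rendered environment. A sketch (the deployment name and namespace are assumptions, and the mapping from Helm values to `OTEL_*` variables follows the table in the Environment Variables section below):

```bash
# Show the OTEL_* environment variables on the operator container
kubectl get deploy keycloak-operator -n keycloak-operator-system -o json \
  | jq '.spec.template.spec.containers[0].env[]? | select(.name | startswith("OTEL_"))'
```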
What Gets Traced¶
When tracing is enabled, the operator creates spans for:
- Kopf Handlers: Reconciliation operations for Keycloak, KeycloakRealm, and KeycloakClient resources
- HTTP Requests: All outgoing HTTP requests to Keycloak are automatically instrumented
- Keycloak API Calls: Admin API operations include trace context
Each span includes semantic attributes:
k8s.namespace: default
k8s.resource.name: my-keycloak
k8s.resource.type: keycloak
kopf.handler: handle_keycloak_create
End-to-End Tracing with Keycloak¶
When propagateToKeycloak: true, the operator configures managed Keycloak instances to export traces to the same collector. This enables:
- Visibility into Keycloak internal operations (authentication, token issuance)
- Trace correlation between operator reconciliation and Keycloak processing
- Full request lifecycle from operator to Keycloak database
Requirements: Keycloak 26.x or later (has built-in OpenTelemetry support via Quarkus)
The Keycloak CR will automatically include:
apiVersion: vriesdemichael.github.io/v1
kind: Keycloak
metadata:
  name: example
spec:
  tracing:
    enabled: true
    endpoint: "http://otel-collector.monitoring:4317"
    serviceName: "keycloak"
    sampleRate: 1.0
Viewing Traces¶
Jaeger¶
# Port-forward Jaeger UI
kubectl port-forward -n monitoring svc/jaeger-query 16686:16686
# Open in browser: http://localhost:16686
# Search for service: keycloak-operator
Grafana Tempo¶
# Access Grafana
kubectl port-forward -n monitoring svc/grafana 3000:3000
# Navigate to Explore > Tempo
# Search by service name: keycloak-operator
Trace Propagation¶
The operator uses W3C Trace Context (traceparent header) for trace propagation. This is automatically added to:
- Keycloak Admin API requests
- Any HTTP requests made via httpx or aiohttp clients
Example trace context header:
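The header follows the W3C `version-trace_id-parent_id-trace_flags` layout; the value below is the illustrative example from the W3C specification:

```
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
```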
Debugging with Traces¶
Traces are particularly useful for debugging:
- Slow Reconciliations: Identify which Keycloak API calls are slow
- Failures: See the exact sequence of operations before an error
- Cross-Service Issues: Trace requests from operator through Keycloak to database
Example: Finding slow realm reconciliations
- Search for traces with `service.name = keycloak-operator`
- Filter by operation: `reconcile_realm`
- Sort by duration to find outliers
- Drill into spans to see individual API calls
Environment Variables¶
The following environment variables control tracing:
| Variable | Description | Default |
|---|---|---|
| `OTEL_TRACING_ENABLED` | Enable tracing | `false` |
| `OTEL_EXPORTER_OTLP_ENDPOINT` | OTLP collector endpoint | `http://localhost:4317` |
| `OTEL_SERVICE_NAME` | Service name for traces | `keycloak-operator` |
| `OTEL_SAMPLE_RATE` | Sampling rate (0.0-1.0) | `1.0` |
| `OTEL_EXPORTER_OTLP_INSECURE` | Use insecure connection | `true` |
| `OTEL_PROPAGATE_TO_KEYCLOAK` | Propagate to Keycloak | `true` |
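For ad-hoc experiments you can set these directly on the deployment (a sketch; the deployment name and namespace are assumptions, and a Helm-managed release will revert the change on its next upgrade, so prefer chart values for anything persistent):

```bash
# Temporarily enable tracing with a reduced sampling rate
kubectl set env deployment/keycloak-operator -n keycloak-operator-system \
  OTEL_TRACING_ENABLED=true \
  OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector.monitoring:4317 \
  OTEL_SAMPLE_RATE=0.1
```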
Integration Examples¶
With OpenTelemetry Collector¶
Deploy the OpenTelemetry Collector to receive and export traces:
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
  namespace: monitoring
data:
  config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
    processors:
      batch:
        timeout: 1s
    exporters:
      # The dedicated "jaeger" exporter was removed from recent collector releases;
      # export OTLP directly to Jaeger's OTLP gRPC port instead.
      otlp/jaeger:
        endpoint: jaeger-collector.monitoring:4317
        tls:
          insecure: true
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [otlp/jaeger]
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      containers:
        - name: collector
          image: otel/opentelemetry-collector-contrib:0.96.0
          ports:
            - containerPort: 4317
              name: otlp-grpc
          volumeMounts:
            - name: config
              mountPath: /etc/otelcol-contrib/config.yaml
              subPath: config.yaml
      volumes:
        - name: config
          configMap:
            name: otel-collector-config
---
apiVersion: v1
kind: Service
metadata:
  name: otel-collector
  namespace: monitoring
spec:
  selector:
    app: otel-collector
  ports:
    - port: 4317
      name: otlp-grpc
With Grafana Tempo¶
operator:
  tracing:
    enabled: true
    endpoint: "http://tempo.monitoring:4317"
    serviceName: "keycloak-operator"
Performance Considerations¶
- Sampling: For high-throughput environments, reduce `sampleRate` (e.g., 0.1 for 10%)
- Batch Processing: The operator uses `BatchSpanProcessor` for efficient trace export
- Overhead: With 1.0 sampling, expect ~5-10% overhead on reconciliation time
- Storage: Traces consume storage in your backend; configure retention appropriately
Debugging Test Failures with Traces¶
The operator's integration test infrastructure includes trace collection for post-mortem debugging of test failures.
How It Works¶
- OTEL Collector Deployment: The test cluster includes an OpenTelemetry Collector that writes traces to JSONL files
- Test Context Markers: Each test is logged with `[TRACE_CONTEXT]` markers that include test names and timestamps
- Trace Retrieval: After tests complete, traces are extracted from the collector pod and saved as artifacts
Analyzing Traces After CI Failures¶
When integration tests fail in CI:
- Download the `test-logs-*` artifact from the failed GitHub Actions run
- Look in `test-logs/traces/` for `traces.jsonl`
- Use the analysis tool to find relevant traces:
# Show summary of all traces
python scripts/analyze-trace.py test-logs/traces/traces.jsonl --summary
# Show only error spans
python scripts/analyze-trace.py test-logs/traces/traces.jsonl --errors-only
# Filter by test name
python scripts/analyze-trace.py test-logs/traces/traces.jsonl --filter "test_create_realm"
# Show traces in tree format
python scripts/analyze-trace.py test-logs/traces/traces.jsonl --tree
# Filter by time range (use timestamps from test logs)
python scripts/analyze-trace.py test-logs/traces/traces.jsonl \
--time-range "2024-01-01T10:00:00" "2024-01-01T10:05:00"
Correlating Traces with Tests¶
Test logs include markers like:
[TRACE_CONTEXT] START tests/integration/test_realm.py::test_create 2024-01-01T10:00:00.123456+00:00
[TRACE_CONTEXT] END tests/integration/test_realm.py::test_create 2024-01-01T10:00:05.654321+00:00 duration=5531ms outcome=passed
Use these timestamps with --time-range to find traces for specific tests.
Local Debugging with Traces¶
When running tests locally with `task test:all`, traces are collected to `.tmp/traces/`:
# Run tests
task test:all
# Analyze traces from the test run
python scripts/analyze-trace.py .tmp/traces/traces.jsonl --summary
python scripts/analyze-trace.py .tmp/traces/traces.jsonl --errors-only
Trace Content¶
Traces capture:
- Reconciliation loops: Start/end of each reconcile operation
- Keycloak API calls: HTTP method, endpoint, status code, duration
- Resource operations: Create, update, delete of Keycloak resources
- Errors: Exception details and stack traces
- Context: Namespace, resource name, reconciliation phase