Observability

This document describes the observability features available in the Keycloak operator, including status conditions, metrics, and monitoring capabilities.

Status Conditions

All custom resources (Keycloak, KeycloakRealm, KeycloakClient) expose Kubernetes-standard status conditions that can be used by GitOps tools like Argo CD and Flux CD to determine resource health.

Standard Conditions

Each resource implements the following condition types:

Ready

Indicates whether the resource is fully reconciled and operational.

  • Status: True, False, or Unknown
  • Reason: ReconciliationSucceeded, ReconciliationFailed, ReconciliationInProgress
  • Usage: Primary health indicator for GitOps tools

Available

Indicates whether the resource is available for use (Kubernetes standard).

  • Status: True or False
  • Reason: ReconciliationSucceeded, ReconciliationFailed
  • Usage: Determines if the resource can serve its purpose

Progressing

Indicates an ongoing reconciliation operation (Kubernetes standard).

  • Status: True or False
  • Reason: ReconciliationInProgress
  • Usage: Shows active reconciliation work

Degraded

Indicates the resource is operational but not in optimal state.

  • Status: True or False
  • Reason: PartialFunctionality, ReconciliationFailed
  • Usage: Alerts about suboptimal conditions
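
Because these are standard Kubernetes conditions, kubectl wait can block on them directly. A minimal sketch, using the resource names from the examples in this document:

# Block until the Ready condition is True
kubectl wait --for=condition=Ready keycloak/my-keycloak --timeout=300s

# The same works for the other resource kinds
kubectl wait --for=condition=Ready keycloakrealm/my-realm --timeout=120s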

Checking Resource Status

View the status of a resource:

# Get resource with status
kubectl get keycloak my-keycloak -o yaml

# Check conditions specifically
kubectl get keycloak my-keycloak -o jsonpath='{.status.conditions}' | jq

# Check if a resource is ready
kubectl get keycloak my-keycloak -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'

Example Status Output

status:
  phase: Ready
  message: Keycloak instance is ready
  lastUpdated: "2025-10-15T20:00:00Z"
  observedGeneration: 5
  conditions:
    - type: Ready
      status: "True"
      reason: ReconciliationSucceeded
      message: Reconciliation completed successfully
      lastTransitionTime: "2025-10-15T20:00:00Z"
      observedGeneration: 5
    - type: Available
      status: "True"
      reason: ReconciliationSucceeded
      message: Resource is available
      lastTransitionTime: "2025-10-15T20:00:00Z"
      observedGeneration: 5
  deployment: my-keycloak-keycloak
  service: my-keycloak-keycloak
  endpoints:
    admin: http://my-keycloak-keycloak.default.svc.cluster.local:8080
    public: http://my-keycloak-keycloak.default.svc.cluster.local:8080
    management: http://my-keycloak-keycloak.default.svc.cluster.local:9000

ObservedGeneration

All resources track observedGeneration, which records the generation of the spec that was last reconciled. This is crucial for GitOps workflows:

  • Match: When status.observedGeneration equals metadata.generation, the resource is fully reconciled
  • Mismatch: When they differ, reconciliation is pending or in progress
  • Usage: GitOps tools use this to detect drift and sync status

Example check:

# Check if resource is fully synced
kubectl get keycloak my-keycloak -o json | \
  jq 'if .status.observedGeneration == .metadata.generation then "Synced" else "OutOfSync" end'
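
For scripting outside a GitOps tool, the same comparison can be polled. A minimal sketch, using the resource name from the examples above:

# Poll until the current spec generation has been reconciled
until [ "$(kubectl get keycloak my-keycloak -o jsonpath='{.status.observedGeneration}')" = \
        "$(kubectl get keycloak my-keycloak -o jsonpath='{.metadata.generation}')" ]; do
  echo "waiting for reconciliation..."
  sleep 5
done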

Resource-Specific Status Fields

Keycloak Status

status:
  deployment: my-keycloak-keycloak  # Name of the deployment
  service: my-keycloak-keycloak      # Name of the service
  adminSecret: my-keycloak-admin-credentials  # Admin credentials secret
  endpoints:
    admin: http://...    # Admin API endpoint
    public: http://...   # Public endpoint
    management: http://... # Management endpoint (health checks)

KeycloakRealm Status

status:
  realmName: my-realm  # Actual realm name in Keycloak
  keycloakInstance: default/keycloak  # Referenced Keycloak instance
  features:
    userRegistration: true
    passwordReset: true
    identityProviders: 2
    userFederationProviders: 1
    customThemes: true

KeycloakClient Status

status:
  client_id: my-client  # Client ID
  client_uuid: abc-123  # UUID in Keycloak
  realm: my-realm  # Realm name
  keycloak_instance: default/keycloak  # Keycloak instance reference
  credentials_secret: my-client-credentials  # Client credentials secret
  public_client: false  # Whether this is a public client
  endpoints:
    auth: https://keycloak.example.com/realms/my-realm
    token: https://keycloak.example.com/realms/my-realm/protocol/openid-connect/token
    userinfo: https://keycloak.example.com/realms/my-realm/protocol/openid-connect/userinfo
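
These status fields are useful for wiring applications to the operator's outputs. A sketch that resolves the client credentials secret via the status; the secret key name client-secret is an assumption, so inspect the secret for the actual keys:

# Look up the credentials secret name from the KeycloakClient status
SECRET_NAME=$(kubectl get keycloakclient my-client -o jsonpath='{.status.credentials_secret}')

# Decode the client secret (key name "client-secret" is assumed here)
kubectl get secret "$SECRET_NAME" -o jsonpath='{.data.client-secret}' | base64 -d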

Prometheus Metrics

The operator exposes Prometheus metrics on port 8081 at /metrics.

Available Metrics

Reconciliation Metrics

# Reconciliation operations counter
keycloak_operator_reconciliation_total{resource_type="keycloak|realm|client", namespace="...", result="success|failure"}

# Reconciliation duration histogram
keycloak_operator_reconciliation_duration_seconds{resource_type="...", namespace="...", operation="reconcile|update|delete"}

Resource Status Metrics

# Resource status by phase
keycloak_operator_active_resources{resource_type="keycloak|realm|client", namespace="...", phase="Ready|Failed|Pending"}

Error Metrics

# Error counter by type
keycloak_operator_reconciliation_errors_total{error_type="...", resource_type="...", namespace="...", retryable="true|false"}
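
These metrics compose into useful queries. Two examples built from the counters above, using the label values documented in this section:

# Failure rate per resource type over the last 5 minutes
sum by (resource_type) (
  rate(keycloak_operator_reconciliation_total{result="failure"}[5m])
)

# Fraction of reconciliations that fail
  sum(rate(keycloak_operator_reconciliation_total{result="failure"}[5m]))
/
  sum(rate(keycloak_operator_reconciliation_total[5m]))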

Scraping Metrics

Configure Prometheus to scrape the operator:

apiVersion: v1
kind: Service
metadata:
  name: keycloak-operator-metrics
  labels:
    app: keycloak-operator
spec:
  ports:
    - name: metrics
      port: 8081
      targetPort: 8081
  selector:
    app: keycloak-operator
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: keycloak-operator
spec:
  selector:
    matchLabels:
      app: keycloak-operator
  endpoints:
    - port: metrics
      interval: 30s

Logging

The operator uses structured logging with correlation IDs for request tracing.

Log Levels

  • DEBUG: Detailed operational information
  • INFO: General operational messages
  • WARNING: Warning conditions (degraded but functioning)
  • ERROR: Error conditions requiring attention

Viewing Logs

# Follow operator logs
kubectl logs -f -l app=keycloak-operator -n keycloak-operator-system

# View logs with correlation ID
kubectl logs -l app=keycloak-operator -n keycloak-operator-system | grep '"correlation_id": "abc-123"'

# Check reconciliation logs for specific resource
kubectl logs -l app=keycloak-operator -n keycloak-operator-system | \
  grep '"resource_name": "my-keycloak"'

Log Format

Logs include structured fields:

{
  "timestamp": "2025-10-15T20:00:00Z",
  "level": "INFO",
  "logger": "KeycloakReconciler",
  "message": "Reconciliation completed successfully",
  "resource_type": "keycloak",
  "resource_name": "my-keycloak",
  "namespace": "default",
  "correlation_id": "abc-123",
  "duration": 2.5
}

Health Checks

The operator pod exposes health endpoints:

  • Liveness: HTTP GET on /healthz (port 8081)
  • Readiness: HTTP GET on /ready (port 8081)
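
To probe them manually, port-forward to the operator and curl the endpoints. A sketch, assuming the operator runs as a deployment named keycloak-operator:

# Port-forward the operator's HTTP port
kubectl port-forward -n keycloak-operator-system deploy/keycloak-operator 8081:8081 &

# Probe liveness and readiness
curl -fsS http://localhost:8081/healthz
curl -fsS http://localhost:8081/ready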

GitOps Integration

Argo CD Health Assessment

Argo CD automatically uses the Ready condition to determine resource health:

  • Healthy: Ready=True
  • Progressing: Progressing=True or an observedGeneration mismatch
  • Degraded: Ready=False or Degraded=True

Flux CD Health Assessment

Flux CD checks the Ready condition and observedGeneration:

apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: keycloak-resources
spec:
  healthChecks:
    - apiVersion: vriesdemichael.github.io/v1
      kind: Keycloak
      name: my-keycloak
      namespace: default

Circuit Breaker Status

The operator uses a circuit breaker to protect the Keycloak API from overload. When the circuit breaker opens:

  1. The operator logs: Circuit breaker open for Keycloak at http://...
  2. Subsequent API calls fail fast with HTTP 503 (Service Unavailable)
  3. Reconciliation is retried with exponential backoff
  4. The circuit resets after 60 seconds of no failures

Check circuit breaker state in logs:

kubectl logs -l app=keycloak-operator | grep "circuit breaker"

Troubleshooting with Status

Resource Stuck in Pending

# Check status conditions
kubectl describe keycloak my-keycloak

# Look for the message in status
kubectl get keycloak my-keycloak -o jsonpath='{.status.message}'

# Check if generation matches (sync status)
kubectl get keycloak my-keycloak -o json | \
  jq '{generation: .metadata.generation, observedGeneration: .status.observedGeneration}'

Reconciliation Failures

# Check Ready condition for reason
kubectl get keycloak my-keycloak -o json | \
  jq '.status.conditions[] | select(.type=="Ready")'

# View recent events
kubectl get events --field-selector involvedObject.name=my-keycloak

# Check operator logs for this resource
kubectl logs -l app=keycloak-operator | grep '"resource_name": "my-keycloak"'

Performance Issues

# Query Prometheus for slow reconciliations (p95 by resource type)
histogram_quantile(0.95,
  sum by (le, resource_type) (
    rate(keycloak_operator_reconciliation_duration_seconds_bucket[5m])
  )
)

# Count resources in each phase
sum by (resource_type, phase) (keycloak_operator_active_resources)

Distributed Tracing

The Keycloak operator supports OpenTelemetry distributed tracing for end-to-end visibility into reconciliation operations. When enabled, traces are exported to an OTLP collector and can be viewed in tools like Jaeger, Tempo, or any OTEL-compatible backend.

Enabling Tracing

Configure tracing in your Helm values:

operator:
  tracing:
    # Enable OpenTelemetry tracing
    enabled: true

    # OTLP collector endpoint (gRPC protocol)
    # Examples:
    # - "http://otel-collector.monitoring:4317" (in-cluster)
    # - "http://tempo.monitoring:4317" (Grafana Tempo)
    # - "http://jaeger-collector.monitoring:4317" (Jaeger)
    endpoint: "http://otel-collector.monitoring:4317"

    # Service name for traces (identifies the operator)
    serviceName: "keycloak-operator"

    # Trace sampling rate (0.0-1.0)
    # 1.0 = 100% of traces, 0.1 = 10% of traces
    # Lower values reduce overhead in high-throughput environments
    sampleRate: 1.0

    # Use insecure connection to OTLP collector (no TLS)
    insecure: true

    # Propagate tracing to managed Keycloak instances
    # Enables end-to-end distributed tracing
    propagateToKeycloak: true
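
Applying the values is a normal Helm upgrade. A sketch with placeholder release and chart references; substitute your own:

# Release name, chart reference, and values file are placeholders
helm upgrade keycloak-operator <chart-ref> \
  -n keycloak-operator-system \
  -f tracing-values.yaml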

What Gets Traced

When tracing is enabled, the operator creates spans for:

  1. Kopf Handlers: Reconciliation operations for Keycloak, KeycloakRealm, and KeycloakClient resources
  2. HTTP Requests: All outgoing HTTP requests to Keycloak are automatically instrumented
  3. Keycloak API Calls: Admin API operations include trace context

Each span includes semantic attributes:

k8s.namespace: default
k8s.resource.name: my-keycloak
k8s.resource.type: keycloak
kopf.handler: handle_keycloak_create

End-to-End Tracing with Keycloak

When propagateToKeycloak: true, the operator configures managed Keycloak instances to export traces to the same collector. This enables:

  • Visibility into Keycloak internal operations (authentication, token issuance)
  • Trace correlation between operator reconciliation and Keycloak processing
  • Full request lifecycle from operator to Keycloak database

Requirements: Keycloak 26.x or later (has built-in OpenTelemetry support via Quarkus)

The Keycloak CR will automatically include:

apiVersion: vriesdemichael.github.io/v1
kind: Keycloak
metadata:
  name: example
spec:
  tracing:
    enabled: true
    endpoint: "http://otel-collector.monitoring:4317"
    serviceName: "keycloak"
    sampleRate: 1.0

Viewing Traces

Jaeger

# Port-forward Jaeger UI
kubectl port-forward -n monitoring svc/jaeger-query 16686:16686

# Open in browser: http://localhost:16686
# Search for service: keycloak-operator

Grafana Tempo

# Access Grafana
kubectl port-forward -n monitoring svc/grafana 3000:3000

# Navigate to Explore > Tempo
# Search by service name: keycloak-operator

Trace Propagation

The operator uses W3C Trace Context (traceparent header) for trace propagation. This is automatically added to:

  • Keycloak Admin API requests
  • Any HTTP requests made via httpx or aiohttp clients

Example trace context header:

traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
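
One way to exploit this during debugging is to send a request with a known trace ID, then search for that ID in the tracing backend. A sketch; the span is only recorded if the receiving service has tracing enabled and samples the request:

# Generate a 16-byte trace ID and 8-byte span ID in W3C format
TRACE_ID=$(openssl rand -hex 16)
curl -s -H "traceparent: 00-${TRACE_ID}-$(openssl rand -hex 8)-01" \
  https://keycloak.example.com/realms/my-realm/.well-known/openid-configuration
echo "search the tracing backend for trace ID: ${TRACE_ID}"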

Debugging with Traces

Traces are particularly useful for debugging:

  1. Slow Reconciliations: Identify which Keycloak API calls are slow
  2. Failures: See the exact sequence of operations before an error
  3. Cross-Service Issues: Trace requests from operator through Keycloak to database

Example: Finding slow realm reconciliations

  1. Search for traces with service.name = keycloak-operator
  2. Filter by operation: reconcile_realm
  3. Sort by duration to find outliers
  4. Drill into spans to see individual API calls

Environment Variables

The following environment variables control tracing:

Variable                      Description                  Default
OTEL_TRACING_ENABLED          Enable tracing               false
OTEL_EXPORTER_OTLP_ENDPOINT   OTLP collector endpoint      http://localhost:4317
OTEL_SERVICE_NAME             Service name for traces      keycloak-operator
OTEL_SAMPLE_RATE              Sampling rate (0.0-1.0)      1.0
OTEL_EXPORTER_OTLP_INSECURE   Use insecure connection      true
OTEL_PROPAGATE_TO_KEYCLOAK    Propagate to Keycloak        true
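
For a quick experiment without editing Helm values, the variables can be set directly on the operator workload. A sketch, assuming the deployment is named keycloak-operator:

kubectl set env deployment/keycloak-operator -n keycloak-operator-system \
  OTEL_TRACING_ENABLED=true \
  OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector.monitoring:4317 \
  OTEL_SAMPLE_RATE=0.1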

Integration Examples

With OpenTelemetry Collector

Deploy the OpenTelemetry Collector to receive and export traces:

apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
  namespace: monitoring
data:
  config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317

    processors:
      batch:
        timeout: 1s

    exporters:
      # The dedicated Jaeger exporter was removed from recent
      # collector-contrib releases; Jaeger 1.35+ accepts OTLP natively
      otlp/jaeger:
        endpoint: jaeger-collector.monitoring:4317
        tls:
          insecure: true

    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [otlp/jaeger]
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      containers:
      - name: collector
        image: otel/opentelemetry-collector-contrib:0.96.0
        ports:
        - containerPort: 4317
          name: otlp-grpc
        volumeMounts:
        - name: config
          mountPath: /etc/otelcol-contrib/config.yaml
          subPath: config.yaml
      volumes:
      - name: config
        configMap:
          name: otel-collector-config
---
apiVersion: v1
kind: Service
metadata:
  name: otel-collector
  namespace: monitoring
spec:
  selector:
    app: otel-collector
  ports:
  - port: 4317
    name: otlp-grpc
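
Before pointing the operator at the collector, confirm that it is running and that the OTLP receiver started:

# Wait for the rollout, then look for the collector's readiness log line
kubectl -n monitoring rollout status deploy/otel-collector
kubectl -n monitoring logs deploy/otel-collector | grep -i "everything is ready"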

With Grafana Tempo

operator:
  tracing:
    enabled: true
    endpoint: "http://tempo.monitoring:4317"
    serviceName: "keycloak-operator"

Performance Considerations

  • Sampling: For high-throughput environments, reduce sampleRate (e.g., 0.1 for 10%)
  • Batch Processing: The operator uses BatchSpanProcessor for efficient trace export
  • Overhead: With 1.0 sampling, expect ~5-10% overhead on reconciliation time
  • Storage: Traces consume storage in your backend; configure retention appropriately

Debugging Test Failures with Traces

The operator's integration test infrastructure includes trace collection for post-mortem debugging of test failures.

How It Works

  1. OTEL Collector Deployment: The test cluster includes an OpenTelemetry Collector that writes traces to JSONL files
  2. Test Context Markers: Each test is logged with [TRACE_CONTEXT] markers that include test names and timestamps
  3. Trace Retrieval: After tests complete, traces are extracted from the collector pod and saved as artifacts

Analyzing Traces After CI Failures

When integration tests fail in CI:

  1. Download the test-logs-* artifact from the failed GitHub Actions run
  2. Look in test-logs/traces/ for traces.jsonl
  3. Use the analysis tool to find relevant traces:
# Show summary of all traces
python scripts/analyze-trace.py test-logs/traces/traces.jsonl --summary

# Show only error spans
python scripts/analyze-trace.py test-logs/traces/traces.jsonl --errors-only

# Filter by test name
python scripts/analyze-trace.py test-logs/traces/traces.jsonl --filter "test_create_realm"

# Show traces in tree format
python scripts/analyze-trace.py test-logs/traces/traces.jsonl --tree

# Filter by time range (use timestamps from test logs)
python scripts/analyze-trace.py test-logs/traces/traces.jsonl \
    --time-range "2024-01-01T10:00:00" "2024-01-01T10:05:00"

Correlating Traces with Tests

Test logs include markers like:

[TRACE_CONTEXT] START tests/integration/test_realm.py::test_create 2024-01-01T10:00:00.123456+00:00
[TRACE_CONTEXT] END tests/integration/test_realm.py::test_create 2024-01-01T10:00:05.654321+00:00 duration=5531ms outcome=passed

Use these timestamps with --time-range to find traces for specific tests.
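
A quick way to pull those timestamps out of a captured log; the file name here is illustrative, so use whichever file holds the test output:

# Find the START/END markers for one test
grep 'TRACE_CONTEXT' test-logs/pytest.log | grep 'test_create'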

Local Debugging with Traces

When running tests locally with task test:all, traces are collected to .tmp/traces/:

# Run tests
task test:all

# Analyze traces from the test run
python scripts/analyze-trace.py .tmp/traces/traces.jsonl --summary
python scripts/analyze-trace.py .tmp/traces/traces.jsonl --errors-only

Trace Content

Traces capture:

  • Reconciliation loops: Start/end of each reconcile operation
  • Keycloak API calls: HTTP method, endpoint, status code, duration
  • Resource operations: Create, update, delete of Keycloak resources
  • Errors: Exception details and stack traces
  • Context: Namespace, resource name, reconciliation phase