
ADR-084: Prometheus Metric Hygiene Policy

Category: architecture
Provenance: guided-ai

Decision

All Prometheus metrics emitted by the operator must follow these hygiene rules:

  1. Naming: Every metric must use the keycloak_operator_ prefix; subsystem-specific prefixes (e.g. keycloak_api_) are not permitted. A single prefix enables simple service discovery via {__name__=~"keycloak_operator_.*"} and consistent dashboarding.

  2. Cardinality: Labels must have bounded cardinality. A label is bounded when its value set is determined by a finite enum or a constrained system property (like namespace count), not by the number of managed resources. Labels whose values grow with CRs (resource names, client names, pod names) are prohibited because each unique combination creates a permanent time series in Prometheus.

  3. Per-resource observability: Use structured logging, not metrics, for per-resource visibility. Logs are ephemeral and searchable; metrics are aggregated and expensive to store per-identity.

  4. No dead metrics: Every registered metric must be emitted somewhere in the codebase. Metrics that exist only as declarations without any .labels().set()/.inc()/.observe() calls are misleading and must be removed or wired up.

  5. Terminology accuracy: Metric names must reflect what they actually measure. Avoid overloaded terms (e.g. "token" can mean JWT, rate limiter permit, or session cookie). Prefer domain-specific names like admin_session over token when referring to Keycloak admin auth sessions, and budget over tokens when referring to rate limiter capacity.
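These rules combine into a small sketch, assuming the standard prometheus_client Python library; the metric name and label values below are hypothetical illustrations, not definitions from the operator:

```python
from prometheus_client import Counter

# Hypothetical metric obeying the rules above: keycloak_operator_ prefix
# (rule 1) and only bounded, enum-valued labels (rule 2).
RECONCILE_TOTAL = Counter(
    "keycloak_operator_reconcile_total",
    "Total reconcile operations, by resource type and result.",
    ["resource_type", "result"],  # both finite enums, never resource names
)

# Rule 4: every declared metric must be backed by at least one emission
# call somewhere in the codebase.
RECONCILE_TOTAL.labels(resource_type="realm", result="success").inc()
```

Note that per-realm detail (which realm reconciled) belongs in a log line with a correlation ID (rule 3), not in a label.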

Rationale

High-cardinality labels are the primary cause of Prometheus performance degradation. Each unique label combination creates a separate time series that Prometheus must store, index, and query. With N managed resources and a resource_name label, a single metric alone creates N time series, and those series persist even after the resources are deleted. At scale (hundreds of realms, thousands of clients), this produces millions of time series, leading to out-of-memory failures, slow queries, and increased storage costs.
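The arithmetic is worth making concrete. A toy calculation with assumed fleet sizes (the numbers are illustrative):

```python
# Series count for one metric = product of distinct values across its labels.
realms = 500                       # grows with the managed fleet
phases, results = 4, 2             # finite enums, fixed by the operator

bounded_series = phases * results              # independent of fleet size
unbounded_series = realms * phases * results   # scales with every new realm

print(bounded_series)    # 8
print(unbounded_series)  # 4000
```

Doubling the fleet doubles the unbounded count while the bounded count stays at 8; that asymmetry is the whole cardinality argument.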

A consistent naming prefix is a standard Prometheus convention (see prometheus.io/docs/practices/naming). Mixed prefixes make it impossible to write simple relabeling rules, scrape configs, or PromQL queries that target all operator metrics. Users should not need to know internal subsystem boundaries to find metrics.

Dead metrics (declared but never emitted) create false expectations. Users build dashboards and alerts referencing metrics that always return zero, then file bug reports. Every metric in the registry must be backed by actual instrumentation.
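Rule 4 can be enforced mechanically. A naive sketch of a CI guard that flags declared-but-never-emitted metrics; the regex-based text scan is an assumption for illustration, not the project's actual tooling:

```python
import re
from pathlib import Path

METRIC_DECL = re.compile(
    r"^([A-Z0-9_]+)\s*=\s*(?:Counter|Gauge|Histogram|Summary)\(", re.M
)
EMIT_CALLS = ("labels(", "inc(", "set(", "observe(")

def find_dead_metrics(metrics_file: Path, src_root: Path) -> set[str]:
    """Return metric variables declared in metrics_file but never emitted
    anywhere under src_root. A text scan, not AST analysis -- good enough
    as a cheap guard against registry-only declarations."""
    declared = set(METRIC_DECL.findall(metrics_file.read_text()))
    emitted: set[str] = set()
    for source in src_root.rglob("*.py"):
        text = source.read_text()
        for name in declared - emitted:
            if any(f"{name}.{call}" in text for call in EMIT_CALLS):
                emitted.add(name)
    return declared - emitted
```

Wiring this into the test suite turns "no dead metrics" from a review convention into a failing build.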

Agent Instructions

When adding or modifying Prometheus metrics, enforce these rules:

ALLOWED labels (bounded cardinality):

  - resource_type - finite enum: keycloak, realm, client
  - namespace - bounded by cluster namespace count, typically <100
  - phase/status - finite enum: Ready, Failed, Pending, Degraded, etc.
  - operation - finite enum: reconcile, update, delete
  - error_type - finite enum: circuit_breaker, api_error, timeout, etc.
  - result - finite enum: success, failure
  - retryable - boolean: true, false
  - limit_type - finite enum: global, namespace
  - action/reason - finite enums for remediation actions
  - operator_instance - single value per deployment
  - instance_id - single value per pod

PROHIBITED labels (unbounded cardinality):

  - name / resource_name / cr_name - grows with managed resources
  - client_name / federation_name - grows with managed resources
  - instance_name / cluster_name - arbitrary user-chosen strings
  - previous_leader / new_leader - ephemeral pod names
  - any label whose distinct value count scales with the number of CRs

Before adding a new metric, check:

  1. Does it use the keycloak_operator_ prefix?
  2. Are ALL labels bounded? Ask: "if a user has 10,000 clients, how many time series?"
  3. Is there an actual emission call in the codebase, or is it just a declaration?
  4. Could this information be served better by a log line with correlation IDs?
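The checklist can be partially automated. A hypothetical review helper encoding the allow/deny lists above (check_metric is illustrative, not part of the operator's API):

```python
# Allow/deny sets mirror the label tables in this ADR.
ALLOWED = {"resource_type", "namespace", "phase", "status", "operation",
           "error_type", "result", "retryable", "limit_type", "action",
           "reason", "operator_instance", "instance_id"}
PROHIBITED = {"name", "resource_name", "cr_name", "client_name",
              "federation_name", "instance_name", "cluster_name",
              "previous_leader", "new_leader"}

def check_metric(name: str, labels: list[str]) -> list[str]:
    """Return a list of hygiene violations; empty means the metric passes."""
    problems = []
    if not name.startswith("keycloak_operator_"):
        problems.append(f"{name}: missing keycloak_operator_ prefix")
    for label in labels:
        if label in PROHIBITED:
            problems.append(f"{name}: unbounded label {label!r} is prohibited")
        elif label not in ALLOWED:
            problems.append(f"{name}: label {label!r} is not on the allowlist; "
                            "prove it is bounded before merging")
    return problems
```

Unknown labels are flagged rather than silently accepted, which matches the policy's default-deny posture on cardinality.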

The canonical metric definitions live in src/keycloak_operator/observability/metrics.py. The MetricsCollector class provides the interface; handlers must not use raw metric objects directly.
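A minimal sketch of what such a facade might look like, assuming prometheus_client; the method names and signatures here are assumptions, and the real MetricsCollector interface in observability/metrics.py may differ:

```python
from prometheus_client import REGISTRY, CollectorRegistry, Counter, Histogram

class MetricsCollector:
    """Facade handlers call instead of touching raw metric objects."""

    def __init__(self, registry: CollectorRegistry = REGISTRY) -> None:
        self._reconciles = Counter(
            "keycloak_operator_reconcile_total",
            "Reconcile operations by resource type and result.",
            ["resource_type", "result"],
            registry=registry,
        )
        self._duration = Histogram(
            "keycloak_operator_reconcile_duration_seconds",
            "Reconcile duration by resource type.",
            ["resource_type"],
            registry=registry,
        )

    def record_reconcile(self, resource_type: str, result: str, seconds: float) -> None:
        # Callers pass enum-valued strings only; resource names never
        # reach a label, keeping cardinality bounded by construction.
        self._reconciles.labels(resource_type=resource_type, result=result).inc()
        self._duration.labels(resource_type=resource_type).observe(seconds)
```

Centralizing emission behind one class means the hygiene rules are enforced at a single choke point instead of in every handler.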

Rejected Alternatives

Keep resource_name labels but add recording rules to pre-aggregate

Recording rules reduce query cost but not ingestion or storage cost. The raw high-cardinality series still consume memory and disk. Also adds operational complexity for users who must deploy the recording rules alongside the operator.

Use OpenTelemetry for per-resource metrics and Prometheus for aggregates

Adds an infrastructure dependency (OTel collector, backend). Structured logging with correlation IDs already provides per-resource observability without the cardinality problem and without requiring additional infrastructure.

Allow subsystem-specific prefixes (keycloak_api_, keycloak_operator_)

Users should not need internal knowledge of operator architecture to discover metrics. A single prefix enables {__name__=~"keycloak_operator_.*"} to match all operator metrics.

Use metric label allowlists enforced at scrape time

Pushes the problem to the user's Prometheus config. The operator should emit clean metrics by default, not require users to filter out cardinality bombs.