ADR-053: Error categorization - temporary vs permanent
Category: architecture
Provenance: human
Decision
All operator errors are categorized as either temporary (retryable) or permanent (requiring manual intervention). Temporary errors are retried with exponential backoff and jitter. Permanent errors immediately move the resource to the Failed phase.
Rationale
Clear error categorization prevents infinite retry loops on unfixable issues while still allowing automatic recovery from transient failures. Temporary errors (network glitches, API throttling) retry automatically with exponential backoff and jitter, which prevents a thundering herd and API overload during mass reconciliation after an outage. Permanent errors (spec validation, RBAC denial) fail fast, and their user action messages give operators actionable guidance for manual resolution. This matches Kopf's error handling model (kopf.TemporaryError vs kopf.PermanentError).
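The backoff behaviour described above can be sketched as follows. This is a minimal illustration, not the operator's actual implementation: the function name backoff_delay, the "full jitter" variant (a uniform draw between zero and the exponential ceiling), and the base and cap values are all assumptions made for the example.

```python
import random
from typing import Optional


def backoff_delay(
    attempt: int,
    base: float = 1.0,
    cap: float = 300.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Return the delay in seconds before retry number `attempt` (0-based).

    Hypothetical helper: the ceiling doubles with every attempt up to `cap`,
    and the actual delay is drawn uniformly from [0, ceiling] ("full jitter").
    """
    if attempt < 0:
        raise ValueError("attempt must be >= 0")
    source = rng if rng is not None else random
    # Clamp the exponent so very large attempt counts cannot overflow a float.
    ceiling = min(cap, base * 2 ** min(attempt, 30))
    return source.uniform(0.0, ceiling)
```

Because each reconciler draws its own random delay, retries that were triggered by the same outage spread out over the whole interval instead of arriving together.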
Agent Instructions
Use the error classes from src/keycloak_operator/errors/operator_errors.py; all of them inherit from the OperatorError base class. Temporary errors (network issues and timeouts, rate limits, transient API failures) auto-retry with exponential backoff. Permanent errors (validation failures, RBAC issues, invalid configuration) require a manual fix. Use as_kopf_error() to convert an error to kopf.TemporaryError or kopf.PermanentError. Include user_action in error messages to guide resolution. For HTTP responses: 4xx = permanent, 5xx = temporary; the exception is a rate-limit response such as 429, which is a 4xx status but falls under the temporary rate-limit rule above.
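A rough sketch of how these rules could fit together is shown below. It is not the contents of operator_errors.py: the constructor parameters, the retry_delay attribute, the subclass names TemporaryOperatorError and PermanentOperatorError, and the classify_http_status helper are hypothetical, and the real classes may look different. Treating HTTP 429 as temporary is an assumption derived from the "rate limits" rule. The fallback classes exist only so the sketch runs where Kopf is not installed.

```python
from typing import Optional

try:
    import kopf

    TemporaryError = kopf.TemporaryError
    PermanentError = kopf.PermanentError
except ImportError:
    # Stand-ins mirroring the two Kopf exception types (illustration only).
    class PermanentError(Exception):
        pass

    class TemporaryError(Exception):
        def __init__(self, msg: str = "", delay: Optional[float] = 60.0):
            super().__init__(msg)
            self.delay = delay


class OperatorError(Exception):
    """Hypothetical base class; the real signature may differ."""

    retryable = False

    def __init__(self, message: str, user_action: Optional[str] = None,
                 retry_delay: float = 30.0):
        super().__init__(message)
        self.message = message
        self.user_action = user_action
        self.retry_delay = retry_delay

    def as_kopf_error(self) -> Exception:
        # Append the user action so the guidance reaches the operator.
        text = self.message
        if self.user_action:
            text = f"{text} (action: {self.user_action})"
        if self.retryable:
            return TemporaryError(text, delay=self.retry_delay)
        return PermanentError(text)


class TemporaryOperatorError(OperatorError):  # hypothetical name
    retryable = True


class PermanentOperatorError(OperatorError):  # hypothetical name
    retryable = False


def classify_http_status(status: int) -> type:
    """Map an HTTP error status to an error class (hypothetical helper)."""
    if status == 429 or 500 <= status <= 599:  # assumption: 429 is retryable
        return TemporaryOperatorError
    if 400 <= status <= 499:
        return PermanentOperatorError
    raise ValueError(f"not an HTTP error status: {status}")
```

A handler would then typically catch OperatorError and re-raise the converted exception, for example `raise err.as_kopf_error() from err`, so that Kopf applies its own retry or fail-fast behaviour.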
Rejected Alternatives
Retry all errors indefinitely
Would waste resources retrying unfixable issues like validation errors or RBAC denials. No user feedback on permanent problems.
Never retry errors automatically
Would require manual intervention for transient network issues. Poor operational experience during temporary outages.
Fixed retry delays without backoff
Creates a thundering herd when many reconcilers retry simultaneously once connectivity is restored.
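The effect can be illustrated with a small simulation. The numbers here (1000 reconcilers, a 30 second delay, one-second buckets) and the helper name peak_simultaneous_retries are made up for the example and are not measurements from the operator.

```python
import random
from collections import Counter
from typing import Iterable


def peak_simultaneous_retries(delays: Iterable[float]) -> int:
    """Largest number of retries that fall into the same whole second."""
    buckets = Counter(int(delay) for delay in delays)
    return max(buckets.values())


rng = random.Random(42)  # seeded so the run is repeatable
reconcilers = 1000

# Fixed delay: every reconciler that failed together retries together.
fixed = [30.0] * reconcilers
# Jittered delay: each reconciler picks its own point in the interval.
jittered = [rng.uniform(0.0, 30.0) for _ in range(reconcilers)]

fixed_peak = peak_simultaneous_retries(fixed)
jittered_peak = peak_simultaneous_retries(jittered)
print(f"fixed delay peak: {fixed_peak}, jittered peak: {jittered_peak}")
```

With the fixed delay all retries land in the same second, while the jittered delays spread the same load across the interval, so the API sees a much lower peak.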