Engineering for predictable failure
In complex systems, failure is a given. What matters is how it behaves. Predictable failure means errors occur within known boundaries, are detected quickly, and can be recovered from without system-wide damage.
Why predictable failure matters
Predictability changes the nature of failure. Each dimension below shifts from its unpredictable form to its predictable one:
- Incident impact: from cascading to contained
- Diagnosis speed: from ambiguous to directional
- Recovery cost: from variable to planned
- Team morale: from erosion to reinforcement
Anatomy of trustworthy failure
- Visibility: failures are easy to detect and diagnose
- Localization: errors stay confined to defined areas
- Degradation: reduced service continues safely
- Recoverability: restoration paths are designed and rehearsed
Patterns for predictable failure
- Fail fast, not fragile: stop early rather than corrupt silently
- Degradation modes: document and design partial service intentionally
- Tight error contracts: make failure handling part of your interface
- Controlled chaos: validate assumptions with structured failure tests
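The four patterns above can be sketched together in one small example. This is a minimal illustration, not a reference implementation. Every name in it (`quote_price`, `PriceResult`, `Mode`, `InvalidRequest`, `UpstreamUnavailable`, `failing_lookup`) is hypothetical, and the pricing scenario is assumed purely for demonstration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Mode(Enum):
    FULL = "full"
    DEGRADED = "degraded"


@dataclass(frozen=True)
class PriceResult:
    """Error contract: callers always learn which mode produced the value."""
    amount_cents: int
    mode: Mode


class InvalidRequest(ValueError):
    """Raised before any work is done when the input is unusable."""


class UpstreamUnavailable(RuntimeError):
    """The only dependency failure this interface promises to absorb."""


def quote_price(sku: str, quantity: int,
                live_lookup: Callable[[str], int],
                cached_prices: dict[str, int]) -> PriceResult:
    # Fail fast: reject bad input before touching any dependency.
    if not sku or quantity <= 0:
        raise InvalidRequest(f"bad request: sku={sku!r}, quantity={quantity}")
    try:
        return PriceResult(live_lookup(sku) * quantity, Mode.FULL)
    except UpstreamUnavailable:
        # Documented degradation mode: serve the last known price, labeled as such.
        if sku not in cached_prices:
            raise  # no safe fallback, so surface the failure rather than guess
        return PriceResult(cached_prices[sku] * quantity, Mode.DEGRADED)


def failing_lookup(sku: str) -> int:
    # Controlled chaos: inject the outage on purpose to test the fallback path.
    raise UpstreamUnavailable("injected fault")


normal = quote_price("A1", 2, lambda sku: 500, {})
degraded = quote_price("A1", 2, failing_lookup, {"A1": 450})
```

Under the injected fault, `degraded.mode` is `Mode.DEGRADED`, so the caller can tell a stale price from a live one. Any exception other than the two declared types propagates unchanged, which keeps the contract narrow.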
From robustness to resilience
Robust systems try to prevent failure. Resilient systems expect it and recover.
- Robustness: static defense, assumes known threats
- Resilience: adaptive design, accepts unknown conditions
Engineering signals to track
- Time to detect vs. time to recover (TTD/TTR)
- Percentage of systems with clear degradation plans
- Frequency of recovery drills and chaos tests
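As a rough sketch, the first two signals can be computed from simple records. The `Incident` fields and the function names here are hypothetical. The sketch also assumes that time to recover is measured from detection to restoration; some teams measure it from the start of the fault instead.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass(frozen=True)
class Incident:
    started: datetime    # when the fault began
    detected: datetime   # when monitoring or a person noticed it
    recovered: datetime  # when service was restored


def mean_ttd_ttr(incidents: list[Incident]) -> tuple[timedelta, timedelta]:
    """Return (mean time to detect, mean time to recover)."""
    if not incidents:
        raise ValueError("need at least one incident")
    ttd = mean((i.detected - i.started).total_seconds() for i in incidents)
    # Assumption: recovery time is counted from detection, not from fault start.
    ttr = mean((i.recovered - i.detected).total_seconds() for i in incidents)
    return timedelta(seconds=ttd), timedelta(seconds=ttr)


def plan_coverage(has_plan: dict[str, bool]) -> float:
    """Percentage of systems that have a degradation plan."""
    if not has_plan:
        raise ValueError("need at least one system")
    return 100 * sum(has_plan.values()) / len(has_plan)
```

Tracking the two durations separately shows whether a slow recovery stems from late detection or from the restoration work itself.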