Self-Healing Systems: OTP Supervision in Practice
How Erlang/OTP supervision trees enable autonomous fault recovery
Prismatic Engineering
Prismatic Platform
The Let-It-Crash Philosophy
Erlang's "let it crash" philosophy is often misunderstood as recklessness. In reality, it is one of the most disciplined approaches to fault tolerance in software engineering. Instead of trying to anticipate every possible failure and write defensive code for each one, you build a supervision hierarchy that recovers from failures automatically.
The Prismatic Platform takes this philosophy further with three specialized systems: the SupervisionIntrospector, the RemediationRegistry, and the LessonsLearned system.
Supervision Tree Architecture
The platform's supervision tree is organized into domain supervisors, each responsible for a bounded context:
PrismaticSupervisor (top-level, :one_for_one)
+-- CoreSupervisor (:rest_for_one)
|   +-- StorageCoordinator
|   +-- CacheManager
|   +-- TelemetryCollector
+-- OsintSupervisor (:one_for_one)
|   +-- AdapterRegistry
|   +-- SourceManager
|   +-- MonitoringEngine
+-- SecuritySupervisor (:one_for_one)
|   +-- ThreatAnalyzer
|   +-- PerimeterScanner
|   +-- ComplianceChecker
+-- DDSupervisor (:one_for_one)
|   +-- CaseManager
|   +-- EntityResolver
|   +-- PipelineCoordinator
+-- EvolutionSupervisor (:rest_for_one)
    +-- MendelEngine
    +-- MycelialNetwork
    +-- FitnessEvaluator
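Expressed as code, the top of this tree might look like the following structural sketch. Only the root and one domain supervisor are shown, and the leaf worker modules are assumed to exist with standard child specs; this is an illustration of the shape, not the platform's actual source:

```elixir
defmodule PrismaticSupervisor do
  # Top-level supervisor: :one_for_one, so a crash in one domain
  # subtree never touches its sibling domains.
  use Supervisor

  def start_link(opts) do
    Supervisor.start_link(__MODULE__, opts, name: __MODULE__)
  end

  @impl true
  def init(_opts) do
    children = [
      CoreSupervisor,
      OsintSupervisor,
      SecuritySupervisor,
      DDSupervisor,
      EvolutionSupervisor
    ]

    Supervisor.init(children, strategy: :one_for_one)
  end
end

defmodule CoreSupervisor do
  # :rest_for_one - if StorageCoordinator dies, CacheManager and
  # TelemetryCollector (started after it) are restarted as well.
  use Supervisor

  def start_link(opts) do
    Supervisor.start_link(__MODULE__, opts, name: __MODULE__)
  end

  @impl true
  def init(_opts) do
    children = [StorageCoordinator, CacheManager, TelemetryCollector]
    Supervisor.init(children, strategy: :rest_for_one)
  end
end
```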
Restart Strategy Selection
The choice of restart strategy is not arbitrary:
| Strategy | Restart Behavior | Used For |
|----------|------------------|----------|
| :one_for_one | Only the crashed child is restarted | Independent siblings, such as the domain supervisors |
| :rest_for_one | The crashed child and every child started after it are restarted | Children with startup-order dependencies, such as CoreSupervisor's storage-then-cache chain |
| :one_for_all | Every child is restarted when any one crashes | Tightly coupled children that cannot run without each other |

SupervisionIntrospector
The SupervisionIntrospector is a diagnostic GenServer that continuously monitors the health of the supervision tree:
defmodule Prismatic.Singularity.SupervisionIntrospector do
  use GenServer

  def start_link(opts \\ []) do
    GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  end

  def get_tree_health do
    GenServer.call(__MODULE__, :tree_health)
  end

  @impl true
  def init(_opts), do: {:ok, %{history: []}}

  @impl true
  def handle_call(:tree_health, _from, state) do
    health = %{
      total_processes: count_processes(),
      restart_counts: collect_restart_counts(),
      memory_per_supervisor: memory_breakdown(),
      hotspots: identify_restart_hotspots(state.history)
    }

    {:reply, health, state}
  end

  # Private helpers (count_processes/0, collect_restart_counts/0,
  # memory_breakdown/0, identify_restart_hotspots/1) elided.
end
It tracks restart frequency per child, identifies hotspots (children that restart more than 3 times in 5 minutes), and reports memory consumption per supervisor subtree. This data feeds into the platform's health score calculation.
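A minimal sketch of how the hotspot check could work, assuming restart history is a list of `{child_id, timestamp_ms}` events (the platform's actual representation may differ); it flags any child with more than 3 restarts inside the trailing five-minute window:

```elixir
defmodule HotspotDetector do
  @window_ms 5 * 60 * 1_000
  @threshold 3

  # history: list of {child_id, unix_ms} restart events.
  # Returns the ids of children that restarted more than
  # @threshold times within the last @window_ms.
  def identify_restart_hotspots(history, now_ms) do
    history
    |> Enum.filter(fn {_child, ts} -> now_ms - ts <= @window_ms end)
    |> Enum.frequencies_by(fn {child, _ts} -> child end)
    |> Enum.filter(fn {_child, count} -> count > @threshold end)
    |> Enum.map(fn {child, _count} -> child end)
  end
end
```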
RemediationRegistry
The RemediationRegistry maps known failure patterns to remediation actions. When a process crashes, the registry checks whether the crash signature matches a known pattern and applies the appropriate fix:
Pattern Matching
def lookup_remediation(%RuntimeError{message: "connection refused"}) do
{:circuit_break, target: :external_api, duration_ms: 30_000}
end
def lookup_remediation(%DBConnection.ConnectionError{}) do
{:restart_with_backoff, initial_ms: 1_000, max_ms: 30_000}
end
def lookup_remediation(_unknown) do
{:escalate, reason: :unknown_failure_pattern}
end
Cascade Failure Prevention
The most dangerous failure mode in distributed systems is the cascade -- one component fails, which overloads another, which fails, and so on until the entire system is down.
The platform prevents cascades through three mechanisms:
1. Bulkheads
Each domain supervisor acts as a bulkhead. If the entire OSINT subsystem crashes, it does not affect DD, Security, or any other domain. The top-level supervisor uses :one_for_one precisely for this isolation.
2. Load Shedding
Under extreme load, GenServers can shed non-critical work. The platform implements this through message priority queues -- when the mailbox exceeds 1000 messages, low-priority messages are dropped with a structured warning log.
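That check might look like the following, assuming a hypothetical maybe_shed/2 helper invoked from a GenServer callback (the 1,000-message limit matches the threshold above; this is an illustration, not the platform's code):

```elixir
defmodule Shedder do
  require Logger

  @mailbox_limit 1_000

  # Called from a GenServer callback before doing the work:
  # drops low-priority messages when this process's mailbox
  # is saturated, emitting a structured warning.
  def maybe_shed(priority, state) do
    {:message_queue_len, depth} = Process.info(self(), :message_queue_len)

    if priority == :low and depth > @mailbox_limit do
      Logger.warning("shedding low-priority message", mailbox_depth: depth)
      {:shed, state}
    else
      {:process, state}
    end
  end
end
```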
3. Circuit Breakers
External API calls are wrapped in circuit breakers with the standard three states: closed (requests flow normally while failures are counted), open (requests fail fast for a cool-down period), and half-open (a single trial request decides whether the breaker closes again or re-opens).
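Sketched as a pure transition function over those states (illustrative; a real breaker would also track timestamps for the cool-down window):

```elixir
defmodule Breaker do
  # :closed    -> calls flow; consecutive failures are counted
  # :open      -> calls fail fast until the cool-down elapses
  # :half_open -> one trial call decides open vs closed
  @failure_threshold 5

  def transition({:closed, failures}, :failure)
      when failures + 1 >= @failure_threshold,
      do: {:open, 0}

  def transition({:closed, failures}, :failure), do: {:closed, failures + 1}
  def transition({:closed, _failures}, :success), do: {:closed, 0}
  def transition({:open, _}, :cooldown_elapsed), do: {:half_open, 0}
  def transition({:half_open, _}, :success), do: {:closed, 0}
  def transition({:half_open, _}, :failure), do: {:open, 0}
end
```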
The LessonsLearned System
The most innovative component is the LessonsLearned system, which turns every crash into institutional knowledge:
1. Capture: The crash report is captured, including the stacktrace and the process state at the time of failure
2. Classify: The crash is classified by type (state corruption, external dependency, resource exhaustion, logic error)
3. Correlate: The system checks whether similar crashes have occurred before and identifies patterns
4. Record: A structured lesson is recorded with the failure mode, root cause, and successful remediation
5. Apply: Future crashes matching the same pattern are automatically remediated using the recorded lesson
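The lesson record from steps 4 and 5 might be modeled as a struct keyed by crash signature. The field names and lookup function here are illustrative, not the platform's actual schema:

```elixir
defmodule Lesson do
  @enforce_keys [:signature, :classification, :remediation]
  defstruct [:signature, :classification, :root_cause, :remediation,
             occurrences: 1]

  # Apply step: look up a previously recorded lesson by crash
  # signature; unknown signatures escalate to a human.
  def remediation_for(lessons, signature) do
    case Map.fetch(lessons, signature) do
      {:ok, %Lesson{remediation: remediation}} -> {:apply, remediation}
      :error -> {:escalate, :unknown_failure_pattern}
    end
  end
end
```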
Knowledge Accumulation
Over time, the LessonsLearned database grows into a comprehensive failure knowledge base:
Total lessons recorded: 847
Auto-remediated crashes: 73% (of known patterns)
Mean time to recovery: 1.2 seconds (auto-remediated)
Escalation rate: 12% (unknown patterns requiring human review)
False positive rate: 2.1% (incorrect remediation applied)
Monitoring and Alerting
The self-healing system is observable through:
- /admin/supervision
- :telemetry events for every restart, remediation, and escalation

The Compound Effect
Self-healing is not a feature -- it is a compound investment. Every crash that is automatically remediated saves developer time. Every lesson learned prevents future escalations. Every circuit breaker that opens prevents a cascade that would have required manual intervention.
After 19 generations of evolution, the platform's self-healing capabilities have reduced manual incident response by an estimated 85%. The remaining 15% are genuinely novel failure modes that require human creativity to resolve -- and once resolved, they too become lessons in the knowledge base.
Let it crash. Let it heal. Let it learn.