Evolution | March 21, 2026 | 10 min read

Self-Healing Systems: OTP Supervision in Practice

How Erlang/OTP supervision trees enable autonomous fault recovery

Prismatic Engineering


The Let-It-Crash Philosophy


Erlang's "let it crash" philosophy is often misunderstood as recklessness. In reality, it is one of the most disciplined approaches to fault tolerance in software engineering. Instead of trying to anticipate every possible failure and write defensive code for each one, you build a supervision hierarchy that automatically recovers from any failure.


The Prismatic Platform takes this philosophy further with three specialized systems: the SupervisionIntrospector, the RemediationRegistry, and the LessonsLearned system.


Supervision Tree Architecture


The platform's supervision tree is organized into domain supervisors, each responsible for a bounded context:



PrismaticSupervisor (top-level, :one_for_one)
+-- CoreSupervisor (:rest_for_one)
|   +-- StorageCoordinator
|   +-- CacheManager
|   +-- TelemetryCollector
+-- OsintSupervisor (:one_for_one)
|   +-- AdapterRegistry
|   +-- SourceManager
|   +-- MonitoringEngine
+-- SecuritySupervisor (:one_for_one)
|   +-- ThreatAnalyzer
|   +-- PerimeterScanner
|   +-- ComplianceChecker
+-- DDSupervisor (:one_for_one)
|   +-- CaseManager
|   +-- EntityResolver
|   +-- PipelineCoordinator
+-- EvolutionSupervisor (:rest_for_one)
    +-- MendelEngine
    +-- MycelialNetwork
    +-- FitnessEvaluator


Restart Strategy Selection


The choice of restart strategy is not arbitrary:


| Strategy | When to Use | Platform Example |
|----------|-------------|------------------|
| :one_for_one | Children are independent | OsintSupervisor: adapters do not depend on each other |
| :rest_for_one | Children have ordered dependencies | CoreSupervisor: CacheManager depends on StorageCoordinator |
| :one_for_all | Children are tightly coupled | Not used in production (too aggressive) |
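
To make the :rest_for_one row concrete, here is a minimal sketch of how a supervisor like CoreSupervisor might declare its ordered children. The explicit child-spec maps and the `CoreSupervisorSketch` module name are illustrative assumptions, not the platform's actual code:

```elixir
defmodule CoreSupervisorSketch do
  use Supervisor

  def start_link(opts) do
    Supervisor.start_link(__MODULE__, opts, name: __MODULE__)
  end

  @impl true
  def init(_opts) do
    # Explicit child specs; in the real tree these would point at the
    # actual StorageCoordinator, CacheManager, and TelemetryCollector.
    children = [
      %{id: StorageCoordinator, start: {StorageCoordinator, :start_link, [[]]}},
      %{id: CacheManager, start: {CacheManager, :start_link, [[]]}},
      %{id: TelemetryCollector, start: {TelemetryCollector, :start_link, [[]]}}
    ]

    # :rest_for_one -- if a child crashes, it and every child started
    # after it are restarted, preserving the dependency order.
    Supervisor.init(children, strategy: :rest_for_one)
  end
end
```

With :rest_for_one, a StorageCoordinator crash also restarts CacheManager and TelemetryCollector, but a TelemetryCollector crash restarts only itself.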

SupervisionIntrospector


The SupervisionIntrospector is a diagnostic GenServer that continuously monitors the health of the supervision tree:



defmodule Prismatic.Singularity.SupervisionIntrospector do
  use GenServer

  def start_link(opts \\ []) do
    GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  end

  @impl true
  def init(_opts), do: {:ok, %{history: []}}

  def get_tree_health do
    GenServer.call(__MODULE__, :tree_health)
  end

  @impl true
  def handle_call(:tree_health, _from, state) do
    health = %{
      total_processes: count_processes(),
      restart_counts: collect_restart_counts(),
      memory_per_supervisor: memory_breakdown(),
      hotspots: identify_restart_hotspots(state.history)
    }

    {:reply, health, state}
  end
end


It tracks restart frequency per child, identifies hotspots (children that restart more than 3 times in 5 minutes), and reports memory consumption per supervisor subtree. This data feeds into the platform's health score calculation.
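
The hotspot rule ("more than 3 restarts in 5 minutes") can be sketched as a pure function over the restart history. The module name, the `{child_id, timestamp_ms}` event shape, and passing `now_ms` explicitly are assumptions made here for testability:

```elixir
defmodule HotspotSketch do
  @window_ms 5 * 60 * 1000
  @threshold 3

  # history: list of {child_id, restart_timestamp_ms} events
  def identify_restart_hotspots(history, now_ms) do
    history
    # keep only restarts inside the 5-minute window
    |> Enum.filter(fn {_id, ts} -> now_ms - ts <= @window_ms end)
    # count restarts per child
    |> Enum.frequencies_by(fn {id, _ts} -> id end)
    # a hotspot restarted more than @threshold times in the window
    |> Enum.filter(fn {_id, count} -> count > @threshold end)
    |> Enum.map(fn {id, _count} -> id end)
  end
end
```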


RemediationRegistry


The RemediationRegistry maps known failure patterns to remediation actions. When a process crashes, the registry checks whether the crash signature matches a known pattern and applies the appropriate fix:


Remediation Actions


  • Restart with backoff: The default. Exponential backoff prevents rapid restart loops.
  • Restart with state reset: Some crashes are caused by corrupted state. Resetting to a known-good state often fixes the issue permanently.
  • Redirect to fallback: If a primary service is unavailable, redirect to a degraded-mode fallback.
  • Circuit break: If a downstream dependency is failing, open a circuit breaker to prevent cascade failures.
  • Escalate: If no remediation is known, escalate to the parent supervisor and log a structured alert.
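
The "restart with backoff" delay curve can be reduced to one line. The parameter names mirror the `initial_ms`/`max_ms` options used by the registry; the module itself is a hypothetical sketch:

```elixir
defmodule BackoffSketch do
  # attempt is 0-based; the delay doubles each retry, capped at max_ms
  def delay_ms(attempt, initial_ms \\ 1_000, max_ms \\ 30_000) do
    min(initial_ms * Integer.pow(2, attempt), max_ms)
  end
end
```

The cap is what prevents a persistently failing child from being retried at ever-longer intervals forever; after a few attempts the delay settles at `max_ms`.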

Pattern Matching


    
    

def lookup_remediation(%RuntimeError{message: "connection refused"}) do
  {:circuit_break, target: :external_api, duration_ms: 30_000}
end

def lookup_remediation(%DBConnection.ConnectionError{}) do
  {:restart_with_backoff, initial_ms: 1_000, max_ms: 30_000}
end

def lookup_remediation(_unknown) do
  {:escalate, reason: :unknown_failure_pattern}
end


Cascade Failure Prevention


The most dangerous failure mode in distributed systems is the cascade: one component fails, which overloads another, which then fails, and so on until the entire system is down.


The platform prevents cascades through three mechanisms:


1. Bulkheads


Each domain supervisor acts as a bulkhead. If the entire OSINT subsystem crashes, it does not affect DD, Security, or any other domain. The top-level supervisor uses :one_for_one precisely for this isolation.


2. Load Shedding


Under extreme load, GenServers can shed non-critical work. The platform implements this through message priority queues: when the mailbox exceeds 1,000 messages, low-priority messages are dropped with a structured warning log.
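
The shedding decision itself is a small guard clause. The `{:high, msg}` / `{:low, msg}` priority tagging and the module name are assumptions made for this sketch; only the 1,000-message threshold comes from the text:

```elixir
defmodule ShedSketch do
  @max_mailbox 1000

  # Drop low-priority work once the mailbox is over the threshold.
  # The real system would also emit a structured warning log here.
  def handle_message({:low, _msg}, mailbox_len) when mailbox_len > @max_mailbox do
    :dropped
  end

  # Everything else (high priority, or a calm mailbox) is processed.
  def handle_message({_priority, msg}, _mailbox_len), do: {:process, msg}
end
```

Inside a real GenServer the mailbox length would come from `Process.info(self(), :message_queue_len)`.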


3. Circuit Breakers


External API calls are wrapped in circuit breakers with three states:


  • Closed (normal operation): requests flow through
  • Open (failure detected): requests are rejected immediately without calling the external service
  • Half-Open (probing): a single request is allowed through to test whether the service has recovered
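
The three states form a small state machine. This sketch captures the transitions; the failure threshold and the struct fields are illustrative defaults, not the platform's real settings, and the cool-down timer that would call `half_open/1` is left out:

```elixir
defmodule BreakerSketch do
  defstruct state: :closed, failures: 0, threshold: 5

  # Any success closes the breaker and clears the failure count
  # (this is also how a successful half-open probe recovers).
  def record(%__MODULE__{} = b, :ok), do: %{b | state: :closed, failures: 0}

  # A failed probe while half-open re-opens the breaker.
  def record(%__MODULE__{state: :half_open} = b, :error),
    do: %{b | state: :open}

  # Crossing the failure threshold trips the breaker open.
  def record(%__MODULE__{failures: f, threshold: t} = b, :error) when f + 1 >= t,
    do: %{b | state: :open, failures: f + 1}

  def record(%__MODULE__{failures: f} = b, :error), do: %{b | failures: f + 1}

  # Called when the cool-down timer fires: allow one probe through.
  def half_open(%__MODULE__{state: :open} = b), do: %{b | state: :half_open}

  def allow?(%__MODULE__{state: :open}), do: false
  def allow?(%__MODULE__{}), do: true
end
```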

The LessonsLearned System


The most innovative component is the LessonsLearned system, which turns every crash into institutional knowledge:


  1. Capture: When a process crashes, the crash report (reason, stacktrace, state snapshot) is captured
  2. Classify: The crash is classified by type (state corruption, external dependency, resource exhaustion, logic error)
  3. Correlate: The system checks whether similar crashes have occurred before and identifies patterns
  4. Record: A structured lesson is recorded with the failure mode, root cause, and successful remediation
  5. Apply: Future crashes matching the same pattern are automatically remediated using the recorded lesson
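
One possible shape for a recorded lesson and the final "Apply" step, following the five stages above. The struct fields and the signature-keyed map are assumptions for illustration, not the platform's actual schema:

```elixir
defmodule LessonSketch do
  defstruct [:signature, :class, :root_cause, :remediation, occurrences: 1]

  # Step 5 (Apply): look up a captured crash signature in the lesson
  # store; an unknown signature falls through to escalation.
  def apply_lesson(lessons, signature) do
    case Map.fetch(lessons, signature) do
      {:ok, %__MODULE__{remediation: r}} -> {:remediate, r}
      :error -> {:escalate, :unknown_failure_pattern}
    end
  end
end
```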


Knowledge Accumulation


Over time, the LessonsLearned database grows into a comprehensive failure knowledge base:


Total lessons recorded: 847
Auto-remediated crashes: 73% (of known patterns)
Mean time to recovery: 1.2 seconds (auto-remediated)
Escalation rate: 12% (unknown patterns requiring human review)
False positive rate: 2.1% (incorrect remediation applied)


Monitoring and Alerting


The self-healing system is observable through:


  • Health dashboard: real-time supervision tree visualization at /admin/supervision
  • Restart metrics: :telemetry events for every restart, remediation, and escalation
  • Structured logs: every crash and remediation is logged with correlation IDs for tracing
  • PubSub events: components can subscribe to "system_events" for real-time health updates

The Compound Effect


Self-healing is not a feature -- it is a compound investment. Every crash that is automatically remediated saves developer time. Every lesson learned prevents future escalations. Every circuit breaker that opens prevents a cascade that would have required manual intervention.


After 19 generations of evolution, the platform's self-healing capabilities have reduced manual incident response by an estimated 85%. The remaining 15% are genuinely novel failure modes that require human creativity to resolve -- and once resolved, they too become lessons in the knowledge base.


Let it crash. Let it heal. Let it learn.


Tags

otp supervision self-healing fault-tolerance erlang