Evolution | March 21, 2026 | 10 min read

Self-Healing Systems: OTP Supervision in Practice

How Erlang/OTP supervision trees enable autonomous fault recovery

Prismatic Engineering


The Let-It-Crash Philosophy


Erlang's "let it crash" philosophy is often misunderstood as recklessness. In reality, it is one of the most disciplined approaches to fault tolerance in software engineering. Instead of trying to anticipate every possible failure and write defensive code for each one, you build a supervision hierarchy that automatically recovers from any failure.


The Prismatic Platform takes this philosophy further with three specialized systems: the SupervisionIntrospector, the RemediationRegistry, and the LessonsLearned system.


Supervision Tree Architecture


The platform's supervision tree is organized into domain supervisors, each responsible for a bounded context:



PrismaticSupervisor (top-level, :one_for_one)
+-- CoreSupervisor (:rest_for_one)
|   +-- StorageCoordinator
|   +-- CacheManager
|   +-- TelemetryCollector
+-- OsintSupervisor (:one_for_one)
|   +-- AdapterRegistry
|   +-- SourceManager
|   +-- MonitoringEngine
+-- SecuritySupervisor (:one_for_one)
|   +-- ThreatAnalyzer
|   +-- PerimeterScanner
|   +-- ComplianceChecker
+-- DDSupervisor (:one_for_one)
|   +-- CaseManager
|   +-- EntityResolver
|   +-- PipelineCoordinator
+-- EvolutionSupervisor (:rest_for_one)
    +-- MendelEngine
    +-- MycelialNetwork
    +-- FitnessEvaluator


Restart Strategy Selection


The choice of restart strategy is not arbitrary:


| Strategy | When to Use | Platform Example |
|----------|-------------|------------------|
| :one_for_one | Children are independent | OsintSupervisor: adapters do not depend on each other |
| :rest_for_one | Children have ordered dependencies | CoreSupervisor: CacheManager depends on StorageCoordinator |
| :one_for_all | Children are tightly coupled | Not used in production (too aggressive) |
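
To make the :rest_for_one row concrete, here is a minimal sketch of how a supervisor like CoreSupervisor might declare its ordered children. The explicit child-spec maps and the `CoreSupervisorSketch` module name are illustrative assumptions, not the platform's actual code:

```elixir
defmodule CoreSupervisorSketch do
  use Supervisor

  def start_link(opts) do
    Supervisor.start_link(__MODULE__, opts, name: __MODULE__)
  end

  @impl true
  def init(_opts) do
    # Explicit child specs; in the real tree these would point at the
    # actual StorageCoordinator, CacheManager, and TelemetryCollector.
    children = [
      %{id: StorageCoordinator, start: {StorageCoordinator, :start_link, [[]]}},
      %{id: CacheManager, start: {CacheManager, :start_link, [[]]}},
      %{id: TelemetryCollector, start: {TelemetryCollector, :start_link, [[]]}}
    ]

    # :rest_for_one -- if a child crashes, it and every child started
    # after it are restarted, preserving the dependency order.
    Supervisor.init(children, strategy: :rest_for_one)
  end
end
```

With :rest_for_one, a StorageCoordinator crash also restarts CacheManager and TelemetryCollector, but a TelemetryCollector crash restarts only itself.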

SupervisionIntrospector


The SupervisionIntrospector is a diagnostic GenServer that continuously monitors the health of the supervision tree:



defmodule Prismatic.Singularity.SupervisionIntrospector do
  use GenServer

  def start_link(opts \\ []) do
    GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  end

  @impl true
  def init(_opts), do: {:ok, %{history: []}}

  def get_tree_health do
    GenServer.call(__MODULE__, :tree_health)
  end

  @impl true
  def handle_call(:tree_health, _from, state) do
    health = %{
      total_processes: count_processes(),
      restart_counts: collect_restart_counts(),
      memory_per_supervisor: memory_breakdown(),
      hotspots: identify_restart_hotspots(state.history)
    }

    {:reply, health, state}
  end
end


It tracks restart frequency per child, identifies hotspots (children that restart more than 3 times in 5 minutes), and reports memory consumption per supervisor subtree. This data feeds into the platform's health score calculation.
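
The hotspot rule ("more than 3 restarts in 5 minutes") can be sketched as a pure function over the restart history. The module name, the `{child_id, timestamp_ms}` event shape, and passing `now_ms` explicitly are assumptions made here for testability:

```elixir
defmodule HotspotSketch do
  @window_ms 5 * 60 * 1000
  @threshold 3

  # history: list of {child_id, restart_timestamp_ms} events
  def identify_restart_hotspots(history, now_ms) do
    history
    # keep only restarts inside the 5-minute window
    |> Enum.filter(fn {_id, ts} -> now_ms - ts <= @window_ms end)
    # count restarts per child
    |> Enum.frequencies_by(fn {id, _ts} -> id end)
    # a hotspot restarted more than @threshold times in the window
    |> Enum.filter(fn {_id, count} -> count > @threshold end)
    |> Enum.map(fn {id, _count} -> id end)
  end
end
```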


RemediationRegistry


The RemediationRegistry maps known failure patterns to remediation actions. When a process crashes, the registry checks whether the crash signature matches a known pattern and applies the appropriate fix:


Remediation Actions


  • Restart with backoff: The default. Exponential backoff prevents rapid restart loops.
  • Restart with state reset: Some crashes are caused by corrupted state. Resetting to a known-good state often fixes the issue permanently.
  • Redirect to fallback: If a primary service is unavailable, redirect to a degraded-mode fallback.
  • Circuit break: If a downstream dependency is failing, open a circuit breaker to prevent cascade failures.
  • Escalate: If no remediation is known, escalate to the parent supervisor and log a structured alert.
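
The "restart with backoff" delay curve can be reduced to one line. The parameter names mirror the `initial_ms`/`max_ms` options used by the registry; the module itself is a hypothetical sketch:

```elixir
defmodule BackoffSketch do
  # attempt is 0-based; the delay doubles each retry, capped at max_ms
  def delay_ms(attempt, initial_ms \\ 1_000, max_ms \\ 30_000) do
    min(initial_ms * Integer.pow(2, attempt), max_ms)
  end
end
```

The cap is what prevents a persistently failing child from being retried at ever-longer intervals forever; after a few attempts the delay settles at `max_ms`.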

Pattern Matching


    
    

def lookup_remediation(%RuntimeError{message: "connection refused"}) do
  {:circuit_break, target: :external_api, duration_ms: 30_000}
end

def lookup_remediation(%DBConnection.ConnectionError{}) do
  {:restart_with_backoff, initial_ms: 1_000, max_ms: 30_000}
end

def lookup_remediation(_unknown) do
  {:escalate, reason: :unknown_failure_pattern}
end


Cascade Failure Prevention


The most dangerous failure mode in distributed systems is the cascade: one component fails, which overloads another, which then fails, and so on until the entire system is down.


The platform prevents cascades through three mechanisms:


1. Bulkheads


Each domain supervisor acts as a bulkhead. If the entire OSINT subsystem crashes, it does not affect DD, Security, or any other domain. The top-level supervisor uses :one_for_one precisely for this isolation.


2. Load Shedding


Under extreme load, GenServers can shed non-critical work. The platform implements this through message priority queues: when the mailbox exceeds 1,000 messages, low-priority messages are dropped with a structured warning log.
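
The shedding decision itself is a small guard clause. The `{:high, msg}` / `{:low, msg}` priority tagging and the module name are assumptions made for this sketch; only the 1,000-message threshold comes from the text:

```elixir
defmodule ShedSketch do
  @max_mailbox 1000

  # Drop low-priority work once the mailbox is over the threshold.
  # The real system would also emit a structured warning log here.
  def handle_message({:low, _msg}, mailbox_len) when mailbox_len > @max_mailbox do
    :dropped
  end

  # Everything else (high priority, or a calm mailbox) is processed.
  def handle_message({_priority, msg}, _mailbox_len), do: {:process, msg}
end
```

Inside a real GenServer the mailbox length would come from `Process.info(self(), :message_queue_len)`.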


3. Circuit Breakers


External API calls are wrapped in circuit breakers with three states:


  • Closed (normal operation): requests flow through
  • Open (failure detected): requests are rejected immediately without calling the external service
  • Half-Open (probing): a single request is allowed through to test whether the service has recovered
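
The three states form a small state machine. This sketch captures the transitions; the failure threshold and the struct fields are illustrative defaults, not the platform's real settings, and the cool-down timer that would call `half_open/1` is left out:

```elixir
defmodule BreakerSketch do
  defstruct state: :closed, failures: 0, threshold: 5

  # Any success closes the breaker and clears the failure count
  # (this is also how a successful half-open probe recovers).
  def record(%__MODULE__{} = b, :ok), do: %{b | state: :closed, failures: 0}

  # A failed probe while half-open re-opens the breaker.
  def record(%__MODULE__{state: :half_open} = b, :error),
    do: %{b | state: :open}

  # Crossing the failure threshold trips the breaker open.
  def record(%__MODULE__{failures: f, threshold: t} = b, :error) when f + 1 >= t,
    do: %{b | state: :open, failures: f + 1}

  def record(%__MODULE__{failures: f} = b, :error), do: %{b | failures: f + 1}

  # Called when the cool-down timer fires: allow one probe through.
  def half_open(%__MODULE__{state: :open} = b), do: %{b | state: :half_open}

  def allow?(%__MODULE__{state: :open}), do: false
  def allow?(%__MODULE__{}), do: true
end
```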

The LessonsLearned System


The most innovative component is the LessonsLearned system, which turns every crash into institutional knowledge:


  1. Capture: When a process crashes, the crash report (reason, stacktrace, state snapshot) is captured
  2. Classify: The crash is classified by type (state corruption, external dependency, resource exhaustion, logic error)
  3. Correlate: The system checks whether similar crashes have occurred before and identifies patterns
  4. Record: A structured lesson is recorded with the failure mode, root cause, and successful remediation
  5. Apply: Future crashes matching the same pattern are automatically remediated using the recorded lesson
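
One possible shape for a recorded lesson and the final "Apply" step, following the five stages above. The struct fields and the signature-keyed map are assumptions for illustration, not the platform's actual schema:

```elixir
defmodule LessonSketch do
  defstruct [:signature, :class, :root_cause, :remediation, occurrences: 1]

  # Step 5 (Apply): look up a captured crash signature in the lesson
  # store; an unknown signature falls through to escalation.
  def apply_lesson(lessons, signature) do
    case Map.fetch(lessons, signature) do
      {:ok, %__MODULE__{remediation: r}} -> {:remediate, r}
      :error -> {:escalate, :unknown_failure_pattern}
    end
  end
end
```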


Knowledge Accumulation


Over time, the LessonsLearned database grows into a comprehensive failure knowledge base:


Total lessons recorded: 847
Auto-remediated crashes: 73% (of known patterns)
Mean time to recovery: 1.2 seconds (auto-remediated)
Escalation rate: 12% (unknown patterns requiring human review)
False positive rate: 2.1% (incorrect remediation applied)


Monitoring and Alerting


The self-healing system is observable through:


  • Health dashboard: real-time supervision tree visualization at /admin/supervision
  • Restart metrics: :telemetry events for every restart, remediation, and escalation
  • Structured logs: every crash and remediation is logged with correlation IDs for tracing
  • PubSub events: components can subscribe to "system_events" for real-time health updates

The Compound Effect


Self-healing is not a feature -- it is a compound investment. Every crash that is automatically remediated saves developer time. Every lesson learned prevents future escalations. Every circuit breaker that opens prevents a cascade that would have required manual intervention.


After 19 generations of evolution, the platform's self-healing capabilities have reduced manual incident response by an estimated 85%. The remaining 15% are genuinely novel failure modes that require human creativity to resolve -- and once resolved, they too become lessons in the knowledge base.


Let it crash. Let it heal. Let it learn.


Tags

otp supervision self-healing fault-tolerance erlang