Supervision Overview

Supervision enables building resilient, self-healing actor systems through Erlang/OTP-style supervision trees. Kameo provides two complementary features: supervision trees for automatic parent-child restart management, and actor linking for peer-to-peer monitoring.

Supervision Trees

Supervision trees establish parent-child relationships where supervisors automatically manage child actor lifecycles. When a child fails (panics, returns an error, or exits), the supervisor decides whether and how to restart it based on configured policies.

Supervision Strategies

Strategies determine which actors restart when a failure occurs. Configure by overriding supervision_strategy() in your supervisor:

use kameo::actor::{Actor, ActorRef};
use kameo::supervision::SupervisionStrategy;

struct MySupervisor;

impl Actor for MySupervisor {
    type Args = ();
    type Error = Box<dyn std::error::Error + Send + Sync>;

    fn supervision_strategy() -> SupervisionStrategy {
        SupervisionStrategy::OneForOne  // default
    }

    async fn on_start(_: Self::Args, _: ActorRef<Self>) -> Result<Self, Self::Error> {
        Ok(MySupervisor)
    }
}

Available Strategies:

Strategy              Behavior                                        Use When
OneForOne (default)   Restart only the failed child                   Workers are independent
OneForAll             Restart all children                            Children are tightly coupled
RestForOne            Restart the failed child and younger siblings   Later stages depend on earlier ones
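To make the semantics above concrete, here is a small illustrative sketch (a hypothetical model, not Kameo's actual implementation) that computes which children would restart under each strategy, identifying children by their index in spawn order:

```rust
// Illustrative sketch only: a hypothetical model of the three strategies,
// not Kameo's internals. Children are identified by spawn order (0 = oldest).
#[derive(Clone, Copy)]
enum Strategy {
    OneForOne,
    OneForAll,
    RestForOne,
}

/// Returns the indices of children to restart when child `failed` fails,
/// out of `count` children spawned in order.
fn children_to_restart(strategy: Strategy, failed: usize, count: usize) -> Vec<usize> {
    match strategy {
        Strategy::OneForOne => vec![failed],               // only the failed child
        Strategy::OneForAll => (0..count).collect(),       // every child
        Strategy::RestForOne => (failed..count).collect(), // failed child + younger siblings
    }
}

fn main() {
    // With 4 children and child 1 failing:
    assert_eq!(children_to_restart(Strategy::OneForOne, 1, 4), vec![1]);
    assert_eq!(children_to_restart(Strategy::OneForAll, 1, 4), vec![0, 1, 2, 3]);
    assert_eq!(children_to_restart(Strategy::RestForOne, 1, 4), vec![1, 2, 3]);
}
```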

Creating Supervised Children

Spawn children using supervise() or supervise_with():

use kameo::actor::{Actor, ActorRef, Spawn};
use kameo::error::Infallible;
use kameo::supervision::RestartPolicy;
use std::time::Duration;

struct Supervisor;

impl Actor for Supervisor {
    type Args = ();
    type Error = Infallible;

    async fn on_start(_: Self::Args, supervisor_ref: ActorRef<Self>) -> Result<Self, Self::Error> {
        // Spawn with cloneable args
        let _worker = Worker::supervise(&supervisor_ref, Worker { count: 0 })
            .restart_policy(RestartPolicy::Permanent)
            .restart_limit(5, Duration::from_secs(10))
            .spawn()
            .await;

        // Or use a factory function for non-Clone args
        let _task = Task::supervise_with(&supervisor_ref, || Task::new())
            .restart_policy(RestartPolicy::Transient)
            .spawn()
            .await;

        Ok(Supervisor)
    }
}

#[derive(Clone)]
struct Worker {
    count: u32,
}

impl Actor for Worker {
    type Args = Self;
    type Error = Infallible;
    async fn on_start(state: Self::Args, _: ActorRef<Self>) -> Result<Self, Self::Error> {
        Ok(state)
    }
}

struct Task;
impl Task {
    fn new() -> Self { Task }
}

impl Actor for Task {
    type Args = Self;
    type Error = Infallible;
    async fn on_start(state: Self::Args, _: ActorRef<Self>) -> Result<Self, Self::Error> {
        Ok(state)
    }
}

Restart Policies

Policies determine when a child should restart:

Policy                Panics       Errors       Normal Exits   Use For
Permanent (default)   Restart      Restart      Restart        Critical services
Transient             Restart      Restart      No restart     Tasks that can complete
Never                 No restart   No restart   No restart     One-shot tasks, externally managed actors
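The table above reduces to a simple decision function. The sketch below is only an illustration of those semantics (a hypothetical model, not Kameo's implementation), mapping each policy and exit kind to a restart decision:

```rust
// Illustrative sketch only: models the restart-decision table,
// not Kameo's actual implementation.
#[derive(Clone, Copy)]
enum RestartPolicy { Permanent, Transient, Never }

#[derive(Clone, Copy)]
enum Exit { Panicked, Errored, Normal }

fn should_restart(policy: RestartPolicy, exit: Exit) -> bool {
    match (policy, exit) {
        (RestartPolicy::Permanent, _) => true,             // always restart
        (RestartPolicy::Transient, Exit::Normal) => false, // completed normally
        (RestartPolicy::Transient, _) => true,             // restart on failure
        (RestartPolicy::Never, _) => false,                // never restart
    }
}

fn main() {
    assert!(should_restart(RestartPolicy::Permanent, Exit::Normal));
    assert!(should_restart(RestartPolicy::Transient, Exit::Panicked));
    assert!(!should_restart(RestartPolicy::Transient, Exit::Normal));
    assert!(!should_restart(RestartPolicy::Never, Exit::Errored));
}
```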

Restart Limits

Prevent restart storms by limiting restart frequency. Default: 5 restarts within 5 seconds.

let worker = Worker::supervise(&supervisor_ref, Worker { count: 0 })
    .restart_limit(3, Duration::from_secs(10))  // Max 3 restarts in 10s
    .spawn()
    .await;

If a child exceeds this limit, the supervisor stops restarting it.
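A restart limit like this is naturally modeled as a sliding window. The following self-contained sketch illustrates those semantics (a hypothetical model of restart_limit's behavior, not Kameo's internals):

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

// Illustrative sketch of sliding-window restart limiting: allow at most
// `max_restarts` within the trailing `window`. A hypothetical model of
// restart_limit's semantics, not Kameo's actual implementation.
struct RestartLimiter {
    max_restarts: usize,
    window: Duration,
    restarts: VecDeque<Instant>,
}

impl RestartLimiter {
    fn new(max_restarts: usize, window: Duration) -> Self {
        Self { max_restarts, window, restarts: VecDeque::new() }
    }

    /// Record a restart at `now`; returns false once the limit is exceeded.
    fn try_restart(&mut self, now: Instant) -> bool {
        // Drop restarts that have aged out of the window.
        while let Some(&oldest) = self.restarts.front() {
            if now.duration_since(oldest) > self.window {
                self.restarts.pop_front();
            } else {
                break;
            }
        }
        if self.restarts.len() >= self.max_restarts {
            return false; // limit exceeded: the child stays stopped
        }
        self.restarts.push_back(now);
        true
    }
}

fn main() {
    let mut limiter = RestartLimiter::new(3, Duration::from_secs(10));
    let start = Instant::now();
    assert!(limiter.try_restart(start));
    assert!(limiter.try_restart(start + Duration::from_secs(1)));
    assert!(limiter.try_restart(start + Duration::from_secs(2)));
    // Fourth restart inside the window is rejected.
    assert!(!limiter.try_restart(start + Duration::from_secs(3)));
    // After the window passes, restarts are allowed again.
    assert!(limiter.try_restart(start + Duration::from_secs(20)));
}
```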

Spawn Options

use kameo::mailbox;

// Default mailbox (capacity 64)
.spawn().await

// Custom mailbox
.spawn_with_mailbox(mailbox::bounded(1000)).await

// Dedicated thread for blocking operations
.spawn_in_thread().await

// Thread + custom mailbox
.spawn_in_thread_with_mailbox(mailbox::bounded(500)).await

For a complete example, see examples/supervision.rs.

Actor Linking

Linking establishes peer-to-peer monitoring between actors, complementing supervision trees. Use this when supervised children need to monitor each other.

Creating and Handling Links

use std::ops::ControlFlow;
use kameo::actor::{ActorId, ActorRef, WeakActorRef};
use kameo::error::ActorStopReason;

// Link two actors
let worker_a = WorkerA::supervise(&supervisor_ref, WorkerA).spawn().await;
let worker_b = WorkerB::supervise(&supervisor_ref, WorkerB).spawn().await;
worker_a.link(&worker_b).await;

// Handle link failures
impl Actor for WorkerA {
    type Args = Self;
    type Error = Box<dyn std::error::Error + Send + Sync>;

    async fn on_start(state: Self::Args, _: ActorRef<Self>) -> Result<Self, Self::Error> {
        Ok(state)
    }

    async fn on_link_died(
        &mut self,
        _actor_ref: WeakActorRef<Self>,
        id: ActorId,
        reason: ActorStopReason,
    ) -> Result<ControlFlow<ActorStopReason>, Self::Error> {
        tracing::warn!("linked actor {id} died: {reason:?}");
        Ok(ControlFlow::Continue(()))  // Keep running
    }
}

Default behavior: Actors stop when a link dies (except on normal shutdown). Return ControlFlow::Continue(()) to keep running or ControlFlow::Break(reason) to stop.
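The ControlFlow type here is std::ops::ControlFlow from the standard library. As a minimal sketch of the continue-vs-stop decision, independent of Kameo (both the is_critical predicate and the String stand-in for the stop reason are hypothetical):

```rust
use std::ops::ControlFlow;

// Illustrative sketch of the decision an on_link_died handler makes,
// using std::ops::ControlFlow. `is_critical` is a hypothetical predicate
// and String stands in for the stop reason; neither is a Kameo API.
fn link_died_decision(is_critical: bool, reason: &str) -> ControlFlow<String> {
    if is_critical {
        // Propagate the failure: this actor stops too.
        ControlFlow::Break(format!("critical peer died: {reason}"))
    } else {
        // Tolerate the failure and keep running.
        ControlFlow::Continue(())
    }
}

fn main() {
    assert_eq!(link_died_decision(false, "panicked"), ControlFlow::Continue(()));
    assert!(matches!(link_died_decision(true, "panicked"), ControlFlow::Break(_)));
}
```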

Remote Linking and Unlinking

// Link remote actors
local_actor.link_remote(&remote_actor_ref).await;

// Unlink when no longer needed
worker_a.unlink(&worker_b).await;
local_actor.unlink_remote(&remote_actor_ref).await;

When a remote node disconnects, linked actors receive ActorStopReason::PeerDisconnected.

Best Practices

Supervision vs Linking:

  • Use supervision for automatic restart and lifecycle management
  • Use linking for peer notification without restart, or cross-node monitoring
  • Combine both when supervised children need to coordinate

Designing Trees:

  • Keep trees shallow (2-3 levels)
  • Group related actors under one supervisor
  • Use OneForOne for independent workers, OneForAll for coupled services
  • Place critical services higher in the tree

Restart Configuration:

  • Use Permanent for long-running services (default)
  • Use Transient for tasks that can complete normally
  • Use Never for one-shot tasks or actors whose lifecycle is managed externally
  • Adjust limits based on restart patterns: relaxed (10/60s) for flaky dependencies, strict (2/5s) for fast-failing actors

Summary

Supervision in Kameo provides two mechanisms for resilient systems: supervision trees manage parent-child relationships with automatic restart using configurable policies and strategies, while actor linking enables peer-to-peer monitoring with custom failure handling. Together, they embody the "let it crash" philosophy for building self-healing applications.