Supervision Overview
Supervision enables building resilient, self-healing actor systems through Erlang/OTP-style supervision trees. Kameo provides two complementary features: supervision trees for automatic parent-child restart management, and actor linking for peer-to-peer monitoring.
Supervision Trees
Supervision trees establish parent-child relationships where supervisors automatically manage child actor lifecycles. When a child fails (panics, returns an error, or exits), the supervisor decides whether and how to restart it based on configured policies.
Supervision Strategies
Strategies determine which actors restart when a failure occurs. Configure by overriding supervision_strategy() in your supervisor:
use kameo::actor::{Actor, ActorRef};
use kameo::supervision::SupervisionStrategy;

struct MySupervisor;

impl Actor for MySupervisor {
    type Args = ();
    type Error = Box<dyn std::error::Error + Send + Sync>;

    // Decides which children restart when one of them fails.
    fn supervision_strategy() -> SupervisionStrategy {
        SupervisionStrategy::OneForOne // default
    }

    async fn on_start(_: Self::Args, _: ActorRef<Self>) -> Result<Self, Self::Error> {
        Ok(MySupervisor)
    }
}
Available Strategies:
| Strategy | Behavior | Use When |
|---|---|---|
| OneForOne (default) | Only restart the failed child | Workers are independent |
| OneForAll | Restart all children | Children are tightly coupled |
| RestForOne | Restart failed child + younger siblings | Later stages depend on earlier ones |
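To make the table concrete, here is a minimal standalone sketch of how each strategy selects the children to restart. It is an illustrative model only and does not use or reproduce Kameo's internals: the `Strategy` enum and `children_to_restart` function are hypothetical names, and the sketch assumes children are kept in start order, so "younger siblings" are the ones at higher indices.

```rust
/// Hypothetical stand-in for the three strategies in the table above
/// (not Kameo's `SupervisionStrategy` type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Strategy {
    OneForOne,
    OneForAll,
    RestForOne,
}

/// Returns the indices of the children to restart.
///
/// Assumes children are stored in start order (index 0 is the oldest) and
/// that `failed` is the index of the child that just failed.
fn children_to_restart(strategy: Strategy, child_count: usize, failed: usize) -> Vec<usize> {
    assert!(failed < child_count, "failed index must refer to an existing child");
    match strategy {
        // Only the failed child.
        Strategy::OneForOne => vec![failed],
        // Every child, regardless of which one failed.
        Strategy::OneForAll => (0..child_count).collect(),
        // The failed child plus every child started after it.
        Strategy::RestForOne => (failed..child_count).collect(),
    }
}
```

With four children and a failure at index 1, this model restarts only child 1 under OneForOne, all four under OneForAll, and children 1 through 3 under RestForOne.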
Creating Supervised Children
Spawn children using supervise() or supervise_with():
use kameo::actor::{Actor, ActorRef, Spawn};
use kameo::error::Infallible;
use kameo::supervision::RestartPolicy;
use std::time::Duration;

struct Supervisor;

impl Actor for Supervisor {
    type Args = ();
    type Error = Infallible;

    async fn on_start(_: Self::Args, supervisor_ref: ActorRef<Self>) -> Result<Self, Self::Error> {
        // Spawn with cloneable args
        let worker = Worker::supervise(&supervisor_ref, Worker { count: 0 })
            .restart_policy(RestartPolicy::Permanent)
            .restart_limit(5, Duration::from_secs(10))
            .spawn()
            .await;

        // Or use a factory function for non-Clone args
        let task = Task::supervise_with(&supervisor_ref, || Task::new())
            .restart_policy(RestartPolicy::Transient)
            .spawn()
            .await;

        Ok(Supervisor)
    }
}

#[derive(Clone)]
struct Worker {
    count: u32,
}

impl Actor for Worker {
    type Args = Self;
    type Error = Infallible;

    async fn on_start(state: Self::Args, _: ActorRef<Self>) -> Result<Self, Self::Error> {
        Ok(state)
    }
}

struct Task;

impl Task {
    fn new() -> Self {
        Task
    }
}

impl Actor for Task {
    type Args = Self;
    type Error = Infallible;

    async fn on_start(state: Self::Args, _: ActorRef<Self>) -> Result<Self, Self::Error> {
        Ok(state)
    }
}
Restart Policies
Policies determine when a child should restart:
| Policy | Panics | Errors | Normal Exits | Use For |
|---|---|---|---|---|
| Permanent (default) | ✅ Restart | ✅ Restart | ✅ Restart | Critical services |
| Transient | ✅ Restart | ✅ Restart | ❌ No restart | Tasks that can complete |
| Never | ❌ No restart | ❌ No restart | ❌ No restart | One-shot tasks, externally managed actors |
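The policy table can be read as a single decision function. The sketch below is a standalone model of that table, not Kameo code: the `Policy` and `Exit` enums and the `should_restart` function are hypothetical names introduced only for illustration.

```rust
/// Hypothetical mirror of the three restart policies in the table above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Policy {
    Permanent,
    Transient,
    Never,
}

/// Hypothetical classification of how a child stopped.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Exit {
    Panic,
    Error,
    Normal,
}

/// Encodes the table row by row: does this policy restart after this exit?
fn should_restart(policy: Policy, exit: Exit) -> bool {
    match (policy, exit) {
        // Permanent children always come back, even after a clean exit.
        (Policy::Permanent, _) => true,
        // Transient children are left alone once they finish normally.
        (Policy::Transient, Exit::Normal) => false,
        (Policy::Transient, _) => true,
        // Never means the supervisor does not restart the child at all.
        (Policy::Never, _) => false,
    }
}
```

The only row with mixed outcomes is Transient, which is why it suits tasks that are expected to finish on their own but should be retried if they crash.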
Restart Limits
Prevent restart storms by limiting restart frequency. Default: 5 restarts within 5 seconds.
let worker = Worker::supervise(&supervisor_ref, Worker { count: 0 })
    .restart_limit(3, Duration::from_secs(10)) // Max 3 restarts in 10s
    .spawn()
    .await;
If a child exceeds this limit, the supervisor stops restarting it.
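One way to picture a limit such as "3 restarts in 10 seconds" is a sliding window over recent restart timestamps. The sketch below is a standalone model under that assumption. The text above does not say how Kameo tracks the window internally, and `RestartWindow` and `try_restart` are hypothetical names. Once the limit is exceeded, the model gives up permanently, which is one reading of "stops restarting it".

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

/// Hypothetical sliding-window restart limiter (not Kameo's implementation).
struct RestartWindow {
    max_restarts: usize,
    window: Duration,
    recent: VecDeque<Instant>,
    gave_up: bool,
}

impl RestartWindow {
    fn new(max_restarts: usize, window: Duration) -> Self {
        Self {
            max_restarts,
            window,
            recent: VecDeque::new(),
            gave_up: false,
        }
    }

    /// Records a restart attempt at `now` and returns whether it is allowed.
    /// After the limit has been exceeded once, every later attempt is refused.
    fn try_restart(&mut self, now: Instant) -> bool {
        if self.gave_up {
            return false;
        }
        // Forget restarts that have aged out of the window.
        while let Some(&oldest) = self.recent.front() {
            if now.duration_since(oldest) >= self.window {
                self.recent.pop_front();
            } else {
                break;
            }
        }
        if self.recent.len() < self.max_restarts {
            self.recent.push_back(now);
            true
        } else {
            self.gave_up = true;
            false
        }
    }
}
```

Passing the timestamp in as a parameter, rather than calling `Instant::now()` inside the method, keeps the model deterministic and easy to test.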
Spawn Options
use kameo::mailbox;
// Default mailbox (capacity 64)
.spawn().await
// Custom mailbox
.spawn_with_mailbox(mailbox::bounded(1000)).await
// Dedicated thread for blocking operations
.spawn_in_thread().await
// Thread + custom mailbox
.spawn_in_thread_with_mailbox(mailbox::bounded(500)).await
For a complete example, see examples/supervision.rs.
Actor Linking
Linking establishes peer-to-peer monitoring between actors, complementing supervision trees. Use this when supervised children need to monitor each other.
Creating and Handling Links
use std::ops::ControlFlow;
use kameo::actor::{Actor, ActorId, ActorRef, WeakActorRef};
use kameo::error::ActorStopReason;

// Link two actors (run this where supervisor_ref is in scope)
let worker_a = WorkerA::supervise(&supervisor_ref, WorkerA).spawn().await;
let worker_b = WorkerB::supervise(&supervisor_ref, WorkerB).spawn().await;
worker_a.link(&worker_b).await;

// Handle link failures
impl Actor for WorkerA {
    type Args = Self;
    type Error = Box<dyn std::error::Error + Send + Sync>;

    async fn on_start(state: Self::Args, _: ActorRef<Self>) -> Result<Self, Self::Error> {
        Ok(state)
    }

    // Called when an actor linked to this one stops.
    async fn on_link_died(
        &mut self,
        _actor_ref: WeakActorRef<Self>,
        id: ActorId,
        reason: ActorStopReason,
    ) -> Result<ControlFlow<ActorStopReason>, Self::Error> {
        tracing::warn!("linked actor {id} died: {reason:?}");
        Ok(ControlFlow::Continue(())) // Keep running
    }
}
Default behavior: an actor stops when one of its links dies, unless the linked actor shut down normally. To change this, override on_link_died and return ControlFlow::Continue(()) to keep running or ControlFlow::Break(reason) to stop.
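The default rule described above can be sketched as a small function over `std::ops::ControlFlow`. This is a standalone model, not Kameo's actual default handler: `StopReason` is a hypothetical, simplified stand-in for `ActorStopReason`, and the sketch assumes the stop reason is passed through unchanged when the actor breaks.

```rust
use std::ops::ControlFlow;

/// Hypothetical, simplified stand-in for a linked actor's stop reason.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum StopReason {
    Normal,
    Panicked,
}

/// Models the default rule: keep running after a normal shutdown,
/// stop for any other reason.
fn default_on_link_died(reason: StopReason) -> ControlFlow<StopReason> {
    match reason {
        StopReason::Normal => ControlFlow::Continue(()),
        other => ControlFlow::Break(other),
    }
}
```

A handler that always returns `ControlFlow::Continue(())` opts out of this rule entirely, so the actor survives any link failure.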
Remote Linking and Unlinking
// Link remote actors
local_actor.link_remote(&remote_actor_ref).await;
// Unlink when no longer needed
worker_a.unlink(&worker_b).await;
local_actor.unlink_remote(&remote_actor_ref).await;
When a remote node disconnects, linked actors receive ActorStopReason::PeerDisconnected.
Best Practices
Supervision vs Linking:
- Use supervision for automatic restart and lifecycle management
- Use linking for peer notification without restart, or cross-node monitoring
- Combine both when supervised children need to coordinate
Designing Trees:
- Keep trees shallow (2-3 levels)
- Group related actors under one supervisor
- Use OneForOne for independent workers, OneForAll for coupled services
- Place critical services higher in the tree
Restart Configuration:
- Use Permanent for long-running services (default)
- Use Transient for tasks that can complete normally
- Use Never for one-shot tasks or actors whose lifecycle is managed externally
- Adjust limits based on restart patterns: relaxed (10/60s) for flaky dependencies, strict (2/5s) for fast-failing actors
Summary
Supervision in Kameo provides two mechanisms for resilient systems: supervision trees manage parent-child relationships with automatic restart using configurable policies and strategies, while actor linking enables peer-to-peer monitoring with custom failure handling. Together, they embody the "let it crash" philosophy for building self-healing applications.
