Distributed Supervisors
ExUnit tags
I’m working toward my goal of respawning processes across the cluster and I want to make small changes while continuously running my test suite to check for regressions.
I have a failing test from the other day that will help steer me in the direction I need to go, but I don’t want the noise of this failing test whenever I run the full suite.
I add a @moduletag attribute that will tell ExUnit to skip all tests in this module:
defmodule Minotaur.Cluster.GameSessionTest do
  @moduledoc false
  use ExUnit.ClusteredCase

  @moduletag :skip

  # ...
end
When I want to include the cluster tests in my test run, I can either remove the tag or run mix test --include skip:true.
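As a side note, a common alternative (just a sketch, not what this project does) is to give these tests a dedicated tag, for example a hypothetical :cluster tag, exclude it by default in test_helper.exs, and opt back in with mix test --include cluster:
# test_helper.exs (sketch; :cluster is a made-up tag name)
ExUnit.start(exclude: [:cluster])

# In the clustered test module
defmodule Minotaur.Cluster.GameSessionTest do
  use ExUnit.ClusteredCase

  @moduletag :cluster
  # ...
end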
Child specification
In order to use the distributed supervisor module provided by the Horde library, I’ll need to adapt the way game session processes are started in the application. The processes are currently started in continue_game/1, which calls start_child/2 from the DynamicSupervisor module and passes a child_spec argument using the shorthand tuple syntax:
def continue_game(%Game{} = game) do
  spec = {SessionServer, game_state: game}
  DynamicSupervisor.start_child(SessionSupervisor, spec)
end
This format of child_spec is unwrapped under the hood to the map version:
%{
  id: SessionServer,
  start: {SessionServer, :start_link, [[game_state: game]]}
}
The :id value is ignored by the Elixir DynamicSupervisor when starting child processes, so the default value (the module which defines start_link) worked fine. However, the Horde.DynamicSupervisor module uses :id to track processes across the cluster, so the child_spec will need to be updated to make :id unique to each game session.
I create a child_spec/1 function within the SessionServer module to keep the logic close to the other related process initialization functions.
defmodule Minotaur.GameEngine.SessionServer do
  # ...

  def child_spec(opts) do
    game = Keyword.get(opts, :game_state)

    %{
      id: "#{__MODULE__}_#{game.id}",
      start: {__MODULE__, :start_link, [[game_state: game]]}
    }
  end
end
I then update continue_game/1 to use the return value of the new child_spec/1 function as the child_spec argument in start_child/2:
defmodule Minotaur.GameEngine do
  # ...

  def continue_game(%Game{} = game) do
    spec = SessionServer.child_spec(game_state: game)
    DynamicSupervisor.start_child(SessionSupervisor, spec)
  end
end
This shouldn’t change anything with how the existing session supervisor starts dynamic game session processes. I run my test suite and confirm nothing breaks.
Distributed Supervisors
It is time to pull in the Horde library. I add the latest version 0.9.0 to mix.exs and install with mix deps.get.
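For reference, the dependency entry is a one-liner (a sketch; the rest of my deps list is omitted):
# mix.exs
defp deps do
  [
    # ...
    {:horde, "~> 0.9.0"}
  ]
end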
In Minotaur.Application, I update the existing SessionSupervisor in the startup supervision tree to use Horde.DynamicSupervisor:
defmodule Minotaur.Application do
  # ...

  def start(_, _) do
    children = [
      # ...
      {Horde.DynamicSupervisor, strategy: :one_for_one, name: Minotaur.GameEngine.SessionSupervisor},
      # ...
    ]

    opts = [strategy: :one_for_one, name: Minotaur.Supervisor]
    Supervisor.start_link(children, opts)
  end
end
Now I update continue_game/1 to use the Horde supervisor:
defmodule Minotaur.GameEngine do
  # ...

  def continue_game(%Game{} = game) do
    spec = SessionServer.child_spec(game_state: game)
    Horde.DynamicSupervisor.start_child(SessionSupervisor, spec)
  end
end
I run a simple test case for continue_game/1 to make sure it’s still working properly on a single node.
Unfortunately, the process hangs and times out after 1 minute.
I spin up an Elixir REPL to see what might be causing the issue, but the process starts up without a problem.
iex(1)> alias Minotaur.GameEngine, as: GE
Minotaur.GameEngine
iex(2)> round_end = DateTime.add(DateTime.utc_now(), 1, :day)
~U[2024-06-13 23:18:19.518430Z]
iex(3)> game = %GE.Game{id: "ABC", round_end_time: round_end}
%Minotaur.GameEngine.Game{
id: "ABC",
round_end_time: ~U[2024-06-13 23:18:19.518430Z],
world: nil,
players: nil,
registered_actions: %{},
round: 1
}
iex(4)> GE.continue_game(game)
{:ok, #PID<0.804.0>}
iex(5)> GE.get_game("ABC")
{:ok,
%Minotaur.GameEngine.Game{
id: "ABC",
round_end_time: ~U[2024-06-13 23:18:19.518430Z],
world: nil,
players: nil,
registered_actions: %{},
round: 1
}}
I drop an IEx.pry() inside the test case that is hanging and immediately notice the node name in the shell prompt, which means I’m connected as a node within a cluster:
Interactive Elixir (1.16.2) - press Ctrl+C to exit (type h() ENTER for help)
iex([email protected])1>
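For anyone following along, IEx.pry/0 only takes over the shell when the suite runs inside an IEx session, so the setup looks roughly like this (the test name below is hypothetical; --trace keeps the pried test from timing out):
# Run the suite inside IEx:
#   iex -S mix test --trace
require IEx  # at the top of the test module

test "continues a game session" do  # hypothetical test name
  IEx.pry()
  # ...
end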
I go to test_helper.exs and comment out the line that was added to make the running node distributed for cluster testing.
# Setup for cluster testing
{_, 0} = System.cmd("epmd", ["-daemon"])
# Node.start(:"[email protected]", :longnames)
ExUnit.start()
I run the test again and it passes, but I’m not sure why this is the case.
I try to isolate the issue by toggling a few variables that are working together here.
- Remove all Horde code - Tests all pass
- Remove ex_unit_clustered_case code - Some tests fail
- Start iex first and run Node.start(:"[email protected]", :longnames) - Tests all pass (except cluster tests). Probably because Node.start() fails in test_helper.exs
- Start iex, then call continue_game within the shell - Returns successfully
- Start iex, run Node.start(:"[email protected]", :longnames), then call continue_game within the shell - Still hangs like in the test
There is something going wrong with Horde only when the current node is in distributed mode. I check the Horde docs again to see if there is something I missed about how the supervisor works within a cluster.
Following a lead
I find a few seemingly contradictory notes about clusters in the docs. For example, the Getting Started page states:
Horde assumes that you will manage Erlang clustering yourself. There are libraries that will help you do this.
However, the docs for Horde.DynamicSupervisor state:
Cluster membership is managed with Horde.Cluster. Joining a cluster can be done with Horde.Cluster.set_members/2. To take a node out of the cluster, call Horde.Cluster.set_members/2 without that node in the list. Alternatively, setting the members startup option to :auto will make Horde auto-manage cluster membership so that all (and only) visible nodes are members of the cluster.
Initially these pages seem to be contradictory, but I believe there is a distinction between cluster management and Horde’s cluster membership management. If I’m understanding this correctly, nodes can be added to an Elixir/Erlang cluster which makes them “visible” to Horde for cluster membership.
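In other words, the manual path would be to tell each Horde supervisor explicitly who its peers are (a sketch with a hypothetical remote node name):
# Manually set cluster membership for the session supervisor
Horde.Cluster.set_members(
  Minotaur.GameEngine.SessionSupervisor,
  [
    {Minotaur.GameEngine.SessionSupervisor, node()},
    {Minotaur.GameEngine.SessionSupervisor, :"minotaur@other-node"}
  ]
)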
The Getting Started docs also had a confusing line about “static membership” in the cluster, which made me think it did not apply to my dynamic cluster case:
This is also where you can set additional options for Horde.DynamicSupervisor:
- :members, a list of members (if your cluster will have static membership)
Another section on the Setting up a Cluster page makes it clearer that I need to set members: :auto on the supervisors so they are added to the cluster membership. This seems to still apply even though I’m using dns_cluster instead of libcluster to manage dynamic nodes in the cluster.
Horde doesn’t provide functionality to set up your cluster, we recommend you use libcluster for this purpose.
There are three strategies you can use to integrate libcluster with Horde:
Automatic Cluster Membership
When starting a Horde.Registry or Horde.DynamicSupervisor, setting the members option to have a value of :auto will automate membership management. In this mode, all visible nodes will be initially added to the cluster. In addition, any new nodes that become visible will be automatically added and any nodes that shut down will be automatically removed.
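For context, the Erlang cluster itself is still formed outside of Horde; in my case that is dns_cluster’s job, which sits in the same supervision tree as a child along these lines (the query configuration shown is an assumption):
# In Minotaur.Application's children list (sketch)
{DNSCluster, query: Application.get_env(:minotaur, :dns_cluster_query) || :ignore}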
I update my application supervisor tree with this change:
defmodule Minotaur.Application do
  # ...

  def start(_, _) do
    children = [
      # ...
      {Horde.DynamicSupervisor,
       strategy: :one_for_one,
       name: Minotaur.GameEngine.SessionSupervisor,
       members: :auto},
      # ...
    ]

    # ...
  end
end
And all the existing tests are now passing!
Back to the cluster test
Now that the distributed supervisor is managing game session processes, the failing test case I wrote yesterday can be used to validate that the session is respawned on a second node after the first is shut down.
I remove the @moduletag for skipping tests since I now want to include the module in my test suite runs. Here is the test case:
scenario "when a node shuts down while having an active game session", @cluster_opts do
setup [:start_apps, :start_game_on_node1]
test "game session is restarted on another node", ctx do
Cluster.stop_node(ctx.cluster, ctx.node1)
res =
Cluster.call(ctx.node2, fn ->
Minotaur.GameEngine.get_game(ctx.game.id)
end)
assert {:ok, %{id: game_id}} = res
assert ctx.game.id == game_id
end
end
The test is still failing, showing the {:error, :game_not_alive} result as before.
Nowhere in the test case do I allow time for the supervisor to do its job before the assertions are run.
I throw in a :timer.sleep(100) line just before the assertions to see if I’m on the right path.
Sure enough, the test passes!
I’m not a fan of hard-coding a sleep timer since the amount of time it takes for the process to spawn could fluctuate over time.
I know the idiomatic way to wait for async tasks to run is to use assert_receive and listen for a :DOWN message, but in this case I’m waiting for work that is triggered by an internal supervisor to complete.
I do find a solution on Stack Overflow: create a helper module that loops over the assertion until it either passes or a timeout expires. It may not be idiomatic Elixir, but I strongly prefer it over a hard-coded timer.
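The idea looks roughly like this (a sketch; the module name and 50ms poll interval are my own choices):
defmodule Minotaur.TimeHelpers do
  # Re-run `fun` until its assertions stop raising or `timeout` ms elapse.
  def wait_until(fun, timeout \\ 500)

  def wait_until(fun, timeout) when timeout <= 0 do
    # Out of time: run once more and let any assertion error fail the test
    fun.()
  end

  def wait_until(fun, timeout) do
    fun.()
  rescue
    ExUnit.AssertionError ->
      :timer.sleep(50)
      wait_until(fun, timeout - 50)
  end
end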
I create the time helper module and update the test:
test "game session is restarted on another node", ctx do
Cluster.stop_node(ctx.cluster, ctx.node1)
wait_until(fn ->
res =
Cluster.call(ctx.node2, fn ->
Minotaur.GameEngine.get_game(ctx.game.id)
end)
assert {:ok, %{id: game_id}} = res
assert ctx.game.id == game_id
end)
end
The test passes!
I run the test a few more times, but I get intermittent failures with {:error, :game_not_alive} using the default 500ms timeout for wait_until/2.
I keep increasing the wait time until I get to 10_000ms, but I still see intermittent cases where the test fails.
It seems that Horde is not always starting the process on the new node as it should.
I’ll have to dive deeper into this mystery next time.