Distributed Supervisors
ExUnit tags
I’m working toward my goal of respawning processes across the cluster and I want to make small changes while continuously running my test suite to check for regressions.
I have a failing test from the other day that will help steer me in the direction I need to go, but I don’t want the noise of this failing test whenever I run the full suite.
I add a @moduletag attribute that will tell ExUnit to skip all tests in this module:
defmodule Minotaur.Cluster.GameSessionTest do
  @moduledoc false
  use ExUnit.ClusteredCase

  @moduletag :skip

  # ...
end
When I want to include the cluster tests in my test run, I can either remove the tag or run mix test --include skip:true.
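As a side note, a common alternative (just a sketch, not what this project does) is to give these tests a dedicated tag, for example a hypothetical :cluster tag, exclude it by default in test_helper.exs, and opt back in with mix test --include cluster:
# test_helper.exs (sketch; :cluster is a made-up tag name)
ExUnit.start(exclude: [:cluster])

# In the clustered test module
defmodule Minotaur.Cluster.GameSessionTest do
  use ExUnit.ClusteredCase

  @moduletag :cluster
  # ...
end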
Child specification
In order to use the distributed supervisor module provided by the Horde library, I’ll need to adapt the way game session processes are started in the application. The processes are currently started in continue_game/1, which calls start_child/2 from the DynamicSupervisor module and passes a child_spec argument using the shorthand tuple syntax:
def continue_game(%Game{} = game) do
  spec = {SessionServer, game_state: game}
  DynamicSupervisor.start_child(SessionSupervisor, spec)
end
This format of child_spec is unwrapped under the hood to the map version:
%{
  id: SessionServer,
  start: {SessionServer, :start_link, [[game_state: game]]}
}
The :id value is ignored by the Elixir DynamicSupervisor when starting child processes, so the default value (the module which defines start_link) worked fine. However, the Horde.DynamicSupervisor module uses :id to track processes across the cluster, so the child_spec will need to be updated to make :id unique to each game session.
I create a child_spec/1 function within the SessionServer module to keep the logic close to the other related process initialization functions.
defmodule Minotaur.GameEngine.SessionServer do
  # ...

  def child_spec(opts) do
    game = Keyword.get(opts, :game_state)

    %{
      id: "#{__MODULE__}_#{game.id}",
      start: {__MODULE__, :start_link, [[game_state: game]]}
    }
  end
end
I then update continue_game/1 to use the return value of the new child_spec/1 function as the child_spec argument in start_child/2:
defmodule Minotaur.GameEngine do
  # ...

  def continue_game(%Game{} = game) do
    spec = SessionServer.child_spec(game_state: game)
    DynamicSupervisor.start_child(SessionSupervisor, spec)
  end
end
This shouldn’t change anything with how the existing session supervisor starts dynamic game session processes. I run my test suite and confirm nothing breaks.
Distributed Supervisors
It is time to pull in the Horde library. I add the latest version 0.9.0 to mix.exs and install with mix deps.get.
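For reference, the dependency entry is a one-liner (a sketch; the rest of my deps list is omitted):
# mix.exs
defp deps do
  [
    # ...
    {:horde, "~> 0.9.0"}
  ]
end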
In Minotaur.Application, I update the existing SessionSupervisor in the startup supervision tree to use Horde.DynamicSupervisor:
defmodule Minotaur.Application do
  # ...

  def start(_, _) do
    children = [
      # ...
      {Horde.DynamicSupervisor, strategy: :one_for_one, name: Minotaur.GameEngine.SessionSupervisor},
      # ...
    ]

    opts = [strategy: :one_for_one, name: Minotaur.Supervisor]
    Supervisor.start_link(children, opts)
  end
end
Now I update continue_game/1 to use the Horde supervisor:
defmodule Minotaur.GameEngine do
  # ...

  def continue_game(%Game{} = game) do
    spec = SessionServer.child_spec(game_state: game)
    Horde.DynamicSupervisor.start_child(SessionSupervisor, spec)
  end
end
I run a simple test case for continue_game/1 to make sure it’s still working properly on a single node.
Unfortunately, the process hangs and times out after 1 minute.
I spin up an Elixir REPL to see what might be causing the issue, but the process starts up without a problem.
iex(1)> alias Minotaur.GameEngine, as: GE
Minotaur.GameEngine
iex(2)> round_end = DateTime.add(DateTime.utc_now(), 1, :day)
~U[2024-06-13 23:18:19.518430Z]
iex(3)> game = %GE.Game{id: "ABC", round_end_time: round_end}
%Minotaur.GameEngine.Game{
id: "ABC",
round_end_time: ~U[2024-06-13 23:18:19.518430Z],
world: nil,
players: nil,
registered_actions: %{},
round: 1
}
iex(4)> GE.continue_game(game)
{:ok, #PID<0.804.0>}
iex(5)> GE.get_game("ABC")
{:ok,
%Minotaur.GameEngine.Game{
id: "ABC",
round_end_time: ~U[2024-06-13 23:18:19.518430Z],
world: nil,
players: nil,
registered_actions: %{},
round: 1
}}
I drop an IEx.pry() inside the test case that is hanging and immediately notice the node name in the shell prompt, which means I’m connected as a node within a cluster:
Interactive Elixir (1.16.2) - press Ctrl+C to exit (type h() ENTER for help)
iex([email protected])1>
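For anyone following along, IEx.pry/0 only takes over the shell when the suite runs inside an IEx session, so the setup looks roughly like this (the test name below is hypothetical; --trace keeps the pried test from timing out):
# Run the suite inside IEx:
#   iex -S mix test --trace
require IEx  # at the top of the test module

test "continues a game session" do  # hypothetical test name
  IEx.pry()
  # ...
end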
I go to test_helper.exs and comment out the line that was added to make the running node distributed for cluster testing.
# Setup for cluster testing
{_, 0} = System.cmd("epmd", ["-daemon"])
# Node.start(:"[email protected]", :longnames)
ExUnit.start()
I run the test again and it passes, but I’m not sure why this is the case.
I try to isolate the issue by toggling a few variables that are working together here.
- Remove all Horde code - Tests all pass
- Remove ex_unit_clustered_case code - Some tests fail
- Start iex first and run Node.start(:"[email protected]", :longnames) - Tests all pass (except cluster tests). Probably because Node.start() fails in test_helper.exs
- Start iex, then call continue_game within the shell - Returns successfully
- Start iex, run Node.start(:"[email protected]", :longnames), then call continue_game within the shell - Still hangs like in the test
There is something going wrong with Horde only when the current node is in distributed mode. I check the Horde docs again to see if there is something I missed about how the supervisor works within a cluster.
Following a lead
I find a few seemingly contradictory notes about clusters in the docs. For example, the Getting Started page states:
Horde assumes that you will manage Erlang clustering yourself. There are libraries that will help you do this.
However, the docs for Horde.DynamicSupervisor state:
Cluster membership is managed with Horde.Cluster. Joining a cluster can be done with Horde.Cluster.set_members/2. To take a node out of the cluster, call Horde.Cluster.set_members/2 without that node in the list. Alternatively, setting the members startup option to :auto will make Horde auto-manage cluster membership so that all (and only) visible nodes are members of the cluster.
Initially these pages seem to be contradictory, but I believe there is a distinction between cluster management and Horde’s cluster membership management. If I’m understanding this correctly, nodes can be added to an Elixir/Erlang cluster which makes them “visible” to Horde for cluster membership.
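In other words, the manual path would be to tell each Horde supervisor explicitly who its peers are (a sketch with a hypothetical remote node name):
# Manually set cluster membership for the session supervisor
Horde.Cluster.set_members(
  Minotaur.GameEngine.SessionSupervisor,
  [
    {Minotaur.GameEngine.SessionSupervisor, node()},
    {Minotaur.GameEngine.SessionSupervisor, :"minotaur@other-node"}
  ]
)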
The Getting Started docs also had a confusing line about “static membership” in the cluster, which made me think it did not apply to my dynamic cluster case:
This is also where you can set additional options for Horde.DynamicSupervisor:
- :members, a list of members (if your cluster will have static membership)
Another section on the Setting up a Cluster page makes it clearer that I need to set members: :auto on the supervisors so they are added to the cluster membership. This seems to still apply even though I’m using dns_cluster instead of libcluster to manage dynamic nodes in the cluster.
Horde doesn’t provide functionality to set up your cluster, we recommend you use libcluster for this purpose.
There are three strategies you can use to integrate libcluster with Horde:
Automatic Cluster Membership
When starting a Horde.Registry or Horde.DynamicSupervisor, setting the members option to have a value of :auto will automate membership management. In this mode, all visible nodes will be initially added to the cluster. In addition, any new nodes that become visible will be automatically added and any nodes that shut down will be automatically removed.
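For context, the Erlang cluster itself is still formed outside of Horde; in my case that is dns_cluster’s job, which sits in the same supervision tree as a child along these lines (the query configuration shown is an assumption):
# In Minotaur.Application's children list (sketch)
{DNSCluster, query: Application.get_env(:minotaur, :dns_cluster_query) || :ignore}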
I update my application supervisor tree with this change:
defmodule Minotaur.Application do
  # ...

  def start(_, _) do
    children = [
      # ...
      {Horde.DynamicSupervisor,
       strategy: :one_for_one,
       name: Minotaur.GameEngine.SessionSupervisor,
       members: :auto},
      # ...
    ]

    # ...
  end
end
And all the existing tests are now passing!
Back to the cluster test
Now that the distributed supervisor is managing game session processes, the failing test case I wrote yesterday can be used to validate that the session is respawned on a second node after the first is shut down.
I remove the @moduletag for skipping tests since I now want to include the module in my test suite runs. Here is the test case:
scenario "when a node shuts down while having an active game session", @cluster_opts do
setup [:start_apps, :start_game_on_node1]
test "game session is restarted on another node", ctx do
Cluster.stop_node(ctx.cluster, ctx.node1)
res =
Cluster.call(ctx.node2, fn ->
Minotaur.GameEngine.get_game(ctx.game.id)
end)
assert {:ok, %{id: game_id}} = res
assert ctx.game.id == game_id
end
end
The test is still failing, showing the {:error, :game_not_alive} result as before.
Nowhere in the test case do I allow time for the supervisor to do its job before the assertions are run.
I throw in a :timer.sleep(100) line just before the assertions to see if I’m on the right path.
Sure enough, the test passes!
I’m not a fan of hard-coding a sleep timer since the amount of time it takes for the process to spawn could fluctuate over time.
I know the idiomatic way to wait for async tasks to run is to use assert_receive and listen for a :DOWN message, but in this case I’m waiting for work that is triggered by an internal supervisor to complete.
I do find a solution on Stack Overflow: create a helper module that loops over the assertion until it either passes or a timeout expires. It may not be idiomatic Elixir, but I strongly prefer it over a hard-coded timer.
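The idea looks roughly like this (a sketch; the module name and 50ms poll interval are my own choices):
defmodule Minotaur.TimeHelpers do
  # Re-run `fun` until its assertions stop raising or `timeout` ms elapse.
  def wait_until(fun, timeout \\ 500)

  def wait_until(fun, timeout) when timeout <= 0 do
    # Out of time: run once more and let any assertion error fail the test
    fun.()
  end

  def wait_until(fun, timeout) do
    fun.()
  rescue
    ExUnit.AssertionError ->
      :timer.sleep(50)
      wait_until(fun, timeout - 50)
  end
end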
I create the time helper module and update the test:
test "game session is restarted on another node", ctx do
Cluster.stop_node(ctx.cluster, ctx.node1)
wait_until(fn ->
res =
Cluster.call(ctx.node2, fn ->
Minotaur.GameEngine.get_game(ctx.game.id)
end)
assert {:ok, %{id: game_id}} = res
assert ctx.game.id == game_id
end)
end
The test passes!
I run the test a few more times, but I get intermittent failures with {:error, :game_not_alive} using the default 500ms timeout for wait_until/2.
I keep increasing the wait time until I get to 10_000ms, but I still see intermittent cases where the test fails.
It seems that Horde is not always starting the process on the new node as it should.
I’ll have to dive deeper into this mystery next time.