Cluster Testing
What’s Next?
The overall milestone I’m working toward with Minotaur development is to create a workflow which allows me to push out frequent changes to the development server without causing any current games in session to be affected. Let’s review the individual problems needed to reach that goal:
- Launch the application as nodes in an Erlang Cluster on deploy.
- Start new game session processes on a new version node to replace the sessions of the old version node.
- LiveView processes continue to communicate with the game sessions after session processes have migrated to new nodes.
- State for each game session is stashed before the old session process terminates.
- New game session processes initialize with the stashed state for the respective session.
Yesterday, I solve the problem of automatic clustering of Elixir nodes when deploying a new version of Minotaur to the the Docker swarm service. The next item to tackle is migrating the active game session processes to the new node after deploy. The complicated piece of this migration will be moving the game session state, but for now I’m only focusing on getting the game session process to start on the new node after the node for the old version is signaled to shutdown.
My initial research into how to solve this problem brought me to the Horde library which offers distributed Supervisor and Registry modules to sync state across the cluster.
Minotaur currently uses the native DynamicSupervisor
module to supervise LobbySession and GameSession server processes and Registry
to track processes by a given id.
It seems like I can swap out the native modules with Horde.DynamicSupervisor
and Horde.Registry
to make this work.
I’m excited to try it out, but one question that comes to mind at this point is how can I write tests for this?
Writing Tests for Distributed Elixir
I found two libraries which offer ways to write tests with clustered applications, ex_unit_clustered_case and local-cluster.
local-cluster
seems to be built with simplicity in mind and has a very minimal API.
ex_unit_clustered_case
, on the other hand, offers more options for simulating cluster scenarios like partitioning and parition healing.
I like the options provided by ex_unit_clustered_case
so we’ll take that for a spin next.
I follow the installation instructions and create a simple test case to ensure the library is configured correctly:
defmodule Minotaur.Cluster.GameSessionTest do
@moduledoc false
use ExUnit.ClusteredCase
# use ExUnit.Case
scenario "cluster setup", cluster_size: 2 do
test "example test", %{cluster: cluster} do
[node1, node2] = Cluster.members(cluster)
assert :pong == Node.ping(node1)
assert :pong == Node.ping(node2)
end
end
end
But there is an error somewhere in the test setup:
** (throw) {:error, {:cluster_start, ["[email protected]": :boot_timeout, "[email protected]": :boot_timeout]}}
stacktrace:
(ex_unit_clustered_case 0.5.0) lib/cluster/supervisor.ex:35: ExUnit.ClusteredCase.Cluster.Supervisor.init_cluster_for_scenario!/3
test/cluster/game_session_test.exs:6: Minotaur.Cluster.GameSessionTest.__ex_unit_setup_0_0/1
Minotaur.Cluster.GameSessionTest.__ex_unit_describe_0/1
The error seems to show a list with node names and the associated error when trying to start them in the cluster.
In the docs, I see :boot_timeout
is one of the cluster options that can be passed to scenario
so I want to try to increase the default timeout.
I don’t find the default timeout in the docs, but I’m able to track it down in the source code for the library which turns out to be 2,000 milliseconds.
I update the scenario args with cluster_size: 2, boot_timeout: 3_000
and this time the test passes!
I update the timeout to 4_000
just to give myself extra buffer so this doesn’t catch me off guard down the road if the boot time slows down for some reason.
With the cluster testing foundation laid down, I am now in a good position to start tackling actual problems.