When the solution becomes its own problem

My current goal is to fix my cluster test so that it more closely reflects the scenario of restarting a game session on a new node during a rolling deploy. I have been tinkering with spawning a slave node and manually joining it to the cluster provided by ex_unit_clustered_case:

# Start a slave node on the local host
{:ok, node2} = :slave.start_link(:"127.0.0.1", :node2)

# Copy this node's code paths to the new node
:code.get_path()
|> Enum.each(fn path ->
  :rpc.call(node2, :code, :add_path, [path])
end)

# Join node2 to the cluster
:rpc.call(ctx.node1, Node, :connect, [node2])

# Start the application on the new node
:rpc.call(node2, Application, :ensure_all_started, [:minotaur])

However, I am getting an error when starting up the application on the new node.

21:27:59.296 [notice] Application ex_unit_clustered_case exited: exited in: ExUnit.ClusteredCase.App.start(:normal, [])
    ** (EXIT) an exception was raised:
        ** (MatchError) no match of right hand side value: {:error, {{:badmatch, {:error, :eaddrinuse}}, [{:erl_boot_server, :init, 1, [file: ~c"erl_boot_server.erl", line: 188]}, {:gen_server, :init_it, 2, [file: ~c"gen_server.erl", line: 980]}, {:gen_server, :init_it, 6, [file: ~c"gen_server.erl", line: 935]}, {:proc_lib, :init_p_do_apply, 3, [file: ~c"proc_lib.erl", line: 241]}]}}
            (ex_unit_clustered_case 0.5.0) lib/app.ex:13: ExUnit.ClusteredCase.App.start/2
            (kernel 9.2.4) application_master.erl:293: :application_master.start_it_old/4

The issue seems to be that ex_unit_clustered_case gets started along with the application in the test environment, and its erl_boot_server fails with :eaddrinuse because another instance is already running on the current node. Running my own custom nodes does not play well with this library. After spending far too much time looking for workarounds, I realize that this library isn’t really providing much value for what I’m currently trying to test and is now getting in my way.

Taking a look at LocalCluster

Now that I’ve gone through a deeper dive on manually creating nodes and joining them to a cluster, I could just handle this logic in my own code. There is also the alternative library LocalCluster, which is a very simple implementation of this same logic and provides a nice interface to work with. I will take this option for a spin and, if it proves to be troublesome, I’ll roll my own solution along the lines of the sketch below.
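
If I do end up rolling my own, it would look roughly like the manual steps above wrapped in a helper. The following is only a sketch under my current assumptions; the module and function names are mine and error handling is omitted:

# Hypothetical helper for hand-rolled cluster tests (not part of the test suite)
defmodule ClusterHelper do
  # Starts a slave node on localhost, copies this node's code paths to it,
  # and boots the given OTP application there.
  def start_app_node(name, app) do
    {:ok, node} = :slave.start_link(:"127.0.0.1", name)

    # Make the parent node's compiled code available on the new node
    Enum.each(:code.get_path(), fn path ->
      :rpc.call(node, :code, :add_path, [path])
    end)

    # Boot the application on the new node
    {:ok, _apps} = :rpc.call(node, Application, :ensure_all_started, [app])

    {:ok, node}
  end
end

Connecting the new node to an existing cluster would still be a separate Node.connect/1 call issued on one of the cluster nodes, as in the earlier snippet.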

I start with a very simple case to see if the tool supports spawning new nodes to join after a cluster has been started.

test "demo local_cluster" do
  [node1] = LocalCluster.start_nodes(:v1_, 1)
  [node2] = LocalCluster.start_nodes(:v2_, 1)
  
  res = :rpc.call(node1, Node, :list, [])
  IO.inspect(res)
  
  res = :rpc.call(node2, Node, :list, [])
  IO.inspect(res)
end

And the logging result:

22:37:24.127 [error] Running MinotaurWeb.Endpoint with Bandit 1.5.0 at http failed, port 4002 already in use

22:37:26.892 [error] Running MinotaurWeb.Endpoint with Bandit 1.5.0 at http failed, port 4002 already in use
[:"[email protected]", :"[email protected]"]
[:"[email protected]", :"[email protected]"]

There are a few things to note here. First, the new node does join the existing cluster! It seems the current node where the tests are run is named manager and is also joined to the cluster. This is different from the experience with ex_unit_clustered_case, but that library was using a more complex setup to keep the manager node from joining the cluster.

Another thing to note in the result is the error logs about the port already being in use. By default, a Phoenix application running in the test environment doesn’t have the server option enabled, so the port wouldn’t be in use during tests. However, I enabled the server option to support feature tests, which require sending requests to the endpoint within test cases. I might want to find a way to isolate the feature tests so they run in a separate environment from the unit tests, where the endpoint server is not needed, but for now I can get around this issue by disabling the server for these new nodes.
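
For context, the endpoint config in the test environment looks roughly like the excerpt below; the exact values are my approximation, but port 4002 matches the logs above:

# config/test.exs (approximate excerpt)
import Config

config :minotaur, MinotaurWeb.Endpoint,
  http: [ip: {127, 0, 0, 1}, port: 4002],
  server: true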

LocalCluster provides an option to override the application environment of the cluster nodes before the applications are started. This will take care of the issue with the duplicate endpoint ports by disabling the server option completely. An alternative would be to generate a unique port for each node using the same cluster environment option, but I will wait until I find a need to run endpoint servers within the cluster tests before I go that route.
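
For reference, that per-node port alternative might look something like the sketch below; the base port and the index-based helper are hypothetical and not something I’m using:

# Hypothetical: give each spawned node its own endpoint port
# instead of disabling the server
base_port = 4100

cluster_env_for = fn index ->
  endpoint_env =
    Application.get_env(:minotaur, MinotaurWeb.Endpoint, [])
    # note: this replaces any existing :http options with just the port
    |> Keyword.put(:http, port: base_port + index)

  [minotaur: [{MinotaurWeb.Endpoint, endpoint_env}]]
end

[node1] = LocalCluster.start_nodes(:v1_, 1, environment: cluster_env_for.(1))
[node2] = LocalCluster.start_nodes(:v2_, 1, environment: cluster_env_for.(2))

For now, the test simply disables the server: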

test "demo local_cluster" do
  endpoint_env = Application.get_env(:minotaur, MinotaurWeb.Endpoint, [])
  new_env = Keyword.put(endpoint_env, :server, false)
  cluster_env = [minotaur: [{MinotaurWeb.Endpoint, new_env}]]

  [node1] = LocalCluster.start_nodes(:v1_, 1, environment: cluster_env)
  [node2] = LocalCluster.start_nodes(:v2_, 1, environment: cluster_env)

  res = :rpc.call(node1, Minotaur.GameEngine, :get_game, ["ABCD"])
  IO.inspect(res)
end

The logging output shows {:error, :game_not_alive}, which is expected since I am only checking that the code is running on the “remote” node.

The next thing I need to check is whether the manager node (the current VM) is going to have its own Minotaur application started, which would be a problem. Horde.DynamicSupervisor distributes its child processes across the nodes in the cluster, and I want to ensure there is only one node in the cluster in the test setup. I suspect the manager node is starting the application, since both of the spawned nodes hit the error where the hardcoded port 4002 was already in use (the manager had already claimed it). I confirm this by running :rpc.call(:"[email protected]", Minotaur.GameEngine, :get_game, ["ABCD"]) and getting the same {:error, :game_not_alive} result. I’ll have to find a solution to this before I can rely on this option for cluster testing.
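
To keep an eye on this while I look for a solution, a quick check like the one below (my own sketch, not part of the test suite) reports which nodes in the cluster have the application running:

# Which nodes (including the manager) are running :minotaur?
for node <- [Node.self() | Node.list()] do
  started? =
    node
    |> :rpc.call(Application, :started_applications, [])
    |> Enum.any?(fn {app, _desc, _vsn} -> app == :minotaur end)

  IO.inspect({node, started?}, label: "minotaur started")
end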