Writing the first test

Excited to start writing tests for the cluster behavior, I wrote up the skeleton for my first test case:

@cluster_opts [cluster_size: 2, boot_timeout: 4_000]

scenario "when a node shuts down while having an active game session", @cluster_opts do
  test "game session is restarted on another node", _ctx do
    assert false
  end
end

The code compiles and the test fails as expected. Next, I check that I can interact with the nodes using the interface provided by ex_unit_clustered_case. I pick a function in the Minotaur.GameEngine module to run on a node and confirm that I get the expected result. With a clean application state, there shouldn’t be any game processes.

test "game session is restarted on another node", %{cluster: cluster} do
  [node1, _node2] = Cluster.members(cluster)
  res = Cluster.call(node1, Minotaur.GameEngine, :get_game, ["ABCD"])
  assert {:error, :game_not_alive} == res
end

The test fails.

1) test when a node shuts down while having an active game session game session is restarted on another node (Minotaur.Cluster.GameSessionTest)
   test/cluster/game_session_test.exs:9
   Assertion with == failed
   code:  assert {:error, :game_not_alive} == res
   left:  {:error, :game_not_alive}
   right: {:error, %ArgumentError{message: "unknown registry: Minotaur.GameEngine.SessionRegistry"}}
   stacktrace:
     test/cluster/game_session_test.exs:12: (test)

The result of calling :get_game on the node is {:error, %ArgumentError{message: "unknown registry: Minotaur.GameEngine.SessionRegistry"}}, which means the registry is likely not started. This makes sense: the nodes have been started, but the Minotaur application has not been started on each node. The README for ex_unit_clustered_case shows exactly how to do this with a convenience helper, node_setup, which works like the standard setup callback in ExUnit but applies the callback to each node in the cluster. I update the scenario block just like the provided README example:

scenario "when a node shuts down while having an active game session", @cluster_opts do
  node_setup [:start_apps]

  test "game session is restarted on another node", %{cluster: cluster} do
    [node1, _node2] = Cluster.members(cluster)
    res = Cluster.call(node1, Minotaur.GameEngine, :get_game, ["ABCD"])
    assert {:error, :game_not_alive} == res
  end
end

defp start_apps(ctx) do
  Application.ensure_all_started(:minotaur)
end

However, running the test runs into a compilation error:

== Compilation error in file test/cluster/game_session_test.exs ==
** (FunctionClauseError) no function clause matching in ExUnit.ClusteredCase.node_setup/1    
    (ex_unit_clustered_case 0.5.0) expanding macro: ExUnit.ClusteredCase.node_setup/1
    test/cluster/game_session_test.exs:9: Minotaur.Cluster.GameSessionTest (module)
    (ex_unit_clustered_case 0.5.0) expanding macro: ExUnit.ClusteredCase.node_setup/1
    test/cluster/game_session_test.exs:9: Minotaur.Cluster.GameSessionTest (module)
    (ex_unit 1.16.2) expanding macro: ExUnit.Case.describe/2
    test/cluster/game_session_test.exs:8: Minotaur.Cluster.GameSessionTest (module)
    (ex_unit_clustered_case 0.5.0) expanding macro: ExUnit.ClusteredCase.scenario/3
    test/cluster/game_session_test.exs:8: Minotaur.Cluster.GameSessionTest (module)

It doesn’t like the use of node_setup, a macro that is imported by the __using__ macro of the ExUnit.ClusteredCase module. I can’t find this particular issue online, and the repo has been quiet for the past two years. I poke around the source code and believe the issue comes from the recursive call within the macro clause that handles a list of callbacks:

defmacro node_setup(callbacks) when is_list(callbacks) do
  quote bind_quoted: [callbacks: callbacks] do
    for cb <- callbacks do
      unless is_atom(cb) do
        raise ArgumentError, "expected list of callbacks as atoms, but got: #{callbacks}"
      end

      node_setup(cb)
    end
  end
end
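One plausible explanation, offered as an assumption rather than something confirmed against the library, is that a macro receives the quoted form of its argument, not the runtime value. Inside the for comprehension, node_setup(cb) is called with the variable cb, so the macro would see the AST of a variable (a three-element tuple) instead of an atom or a list. If no clause matches that shape, expansion fails with a FunctionClauseError. The sketch below uses hypothetical module names to show the difference between passing a literal and passing a variable to a macro:

```elixir
defmodule MacroArgSketch do
  # Hypothetical macro: reports what kind of term it received at compile time
  # and returns that alongside the evaluated argument.
  defmacro classify(arg) do
    kind =
      cond do
        is_atom(arg) -> :atom_literal
        is_list(arg) -> :list_literal
        # A variable reference arrives as an AST tuple such as {:cb, meta, context}
        true -> :other_ast
      end

    quote do
      {unquote(kind), unquote(arg)}
    end
  end
end

defmodule MacroArgSketchUser do
  require MacroArgSketch

  # The macro sees the atom itself.
  def with_literal, do: MacroArgSketch.classify(:start_apps)

  # The macro sees the AST of the variable `cb`, not the atom it holds.
  def with_variable do
    cb = :start_apps
    MacroArgSketch.classify(cb)
  end
end

IO.inspect(MacroArgSketchUser.with_literal())
# → {:atom_literal, :start_apps}
IO.inspect(MacroArgSketchUser.with_variable())
# → {:other_ast, :start_apps}
```

Both calls evaluate to the same atom at runtime, but only the literal is visible to the macro as an atom during expansion.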

As much as my curiosity is pushing me to find a fix or reach out to the library maintainer, I’d be taking on another side quest that I don’t need right now. Instead, I find the node_setup clause that the macro was attempting to call and extract its underlying logic to use directly in my test module. The solution is very simple:

scenario "when a node shuts down while having an active game session", @cluster_opts do
  setup [:start_apps]

  test "game session is restarted on another node", %{cluster: cluster} do
    # test content remains the same
    # ...
    assert {:error, :game_not_alive} == res
  end
end

defp start_apps(%{cluster: cluster}) do
  Cluster.map(cluster, fn ->
    Application.ensure_all_started(:minotaur)
  end)
end

This time, the Minotaur application is started on each node and the call to :get_game returns the expected :game_not_alive error result. I believe (or really hope) that I have everything I need to start writing my test case to assert the desired behavior before I implement anything in the application code.

Describing behavior through tests

I clear out the inner block of the test and replace it with some placeholder comments for the code I’ll need to complete the test. I also add another comment after the setup callback to tell me I’ll need to add another function that starts a game on node1.

scenario "when a node shuts down while having an active game session", @cluster_opts do
  setup [:start_apps]
  # Start game on node1

  test "game session is restarted on another node", _ctx do
    # Stop node1
    # Assert game is alive on node2
  end
end

Let’s tackle the setup first. I add another callback reference to the setup block with the most obvious name, start_game_on_node1. Then, I implement the function:

import Minotaur.GameEngineFixtures

defp start_game_on_node1(%{cluster: cluster}) do
  [node1, _node2] = Cluster.members(cluster)

  game = game_fixture()

  Cluster.call(node1, fn ->
    Minotaur.GameEngine.continue_game(game)
  end)

  [game: game]
end

game_fixture/1 is a helper function imported from Minotaur.GameEngineFixtures that builds a valid Game structure for tests. continue_game/1 is an existing function that attempts to start a new game session process with existing state and register it with Minotaur.GameEngine.SessionRegistry. The setup function returns game so that it is added to the test context and the game id can be referenced in the tests. I validate that the game session is live by writing a temporary check in the test case:

test "game session is restarted on another node", ctx do
  [node1, _] = Cluster.members(ctx.cluster)
  {:ok, game} = Cluster.call(node1, fn -> Minotaur.GameEngine.get_game(ctx.game.id) end)
  assert false == game.id

  # Stop node1
  # Assert game is alive on node2
end

I like to use assert to quickly validate things as I’m building out the test code. I can see the value of game.id is “XYZA” which is the default value from the game fixture. Now that I can see the node is running as expected with the game session process, I want to go back and do a bit of cleanup with the small bit of test code I’ve written.

Let’s create a reference to the individual nodes within the test context instead of calling Cluster.members/1 whenever we need to access a single node.

defp start_apps(%{cluster: cluster}) do
  Cluster.map(cluster, fn ->
    Application.ensure_all_started(:minotaur)
  end)

  [node1, node2] = Cluster.members(cluster)

  [node1: node1, node2: node2]
end

defp start_game_on_node1(%{node1: node1}) do
  game = game_fixture()

  Cluster.call(node1, fn ->
    Minotaur.GameEngine.continue_game(game)
  end)

  [game: game]
end

test "game session is restarted on another node", ctx do
  {:ok, game} = Cluster.call(ctx.node1, fn -> Minotaur.GameEngine.get_game(ctx.game.id) end)
  assert false == game.id
end
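The refactor above relies on how ExUnit threads the context through a list of setup callbacks: each callback's returned keyword list is merged into the context before the next callback runs, so a later callback can pattern-match on keys added by an earlier one. Here is a minimal, cluster-free sketch of that mechanism; the module name, callback names, and placeholder values are all hypothetical:

```elixir
ExUnit.start(autorun: false)

defmodule SetupChainSketchTest do
  use ExUnit.Case

  # Callbacks run in order; each one receives the context built so far.
  setup [:add_nodes, :add_game]

  # Stand-in for start_apps/1: adds placeholder node names to the context.
  defp add_nodes(_ctx), do: [node1: :node_a, node2: :node_b]

  # Stand-in for start_game_on_node1/1: matches on a key added by add_nodes/1.
  defp add_game(%{node1: node1}), do: [game: %{id: "XYZA", host: node1}]

  test "later callbacks and tests see keys from earlier callbacks", ctx do
    assert ctx.game.host == ctx.node1
    assert ctx.node2 == :node_b
  end
end

%{failures: failures} = ExUnit.run()
IO.inspect(failures, label: "failures")
```

Because the order in the setup list matters, a callback that needs node1 has to come after the callback that adds it.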

Next, node1 needs to be stopped, which can be accomplished with Cluster.stop_node/2. I can verify that it works as expected by checking the number of elements returned by Cluster.members/1.

test "game session is restarted on another node", ctx do
  Cluster.stop_node(ctx.cluster, ctx.node1)

  res = Cluster.members(ctx.cluster)
  assert res == false

  # Assert game is alive on node2
end

Assertion with == failed
code:  assert res == false
left:  [:"[email protected]"]
right: false

The last placeholder comment to implement is to check that the game process is accessible from node2. The logic for this respawn behavior has not yet been implemented, but creating this failing test first will give me a way to validate when I’ve achieved the minimum desired behavior by running the test case. I will also have confidence that I don’t have a false positive in my test because I’m making sure it is in a failing state when I expect it to not yet pass.

test "game session is restarted on another node", ctx do
  Cluster.stop_node(ctx.cluster, ctx.node1)

  res =
    Cluster.call(ctx.node2, fn ->
      Minotaur.GameEngine.get_game(ctx.game.id)
    end)

  assert {:ok, %{id: game_id}} = res
  assert ctx.game.id == game_id
end

match (=) failed
code:  assert {:ok, %{id: game_id}} = res
left:  {:ok, %{id: game_id}}
right: {:error, :game_not_alive}

Next Steps

The Horde DynamicSupervisor appears to be a drop-in replacement for the native Elixir DynamicSupervisor and will automatically restart supervised processes on another node if the host node is shut down. Unfortunately, just swapping the modules out was not enough to make things work. The dynamic GenServer processes for game sessions are currently started using DynamicSupervisor.start_child/2 without an :id value, which I believe will need to be set in order to use Horde.DynamicSupervisor. I’m glad to be back in the application code now that the tests are set up, but I’ll have to tackle this problem another day.
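As a rough sketch of what setting an :id could look like, the example below overrides child_spec/1 on a GenServer so that the id is derived from the game id. Everything here is an assumption for illustration: SessionSketch is a hypothetical stand-in for the real session module, the game is a plain map, and the supervisor is the standard library DynamicSupervisor rather than Horde, so this shows only the shape of the child spec and not the cross-node restart behavior.

```elixir
defmodule SessionSketch do
  use GenServer

  # Hypothetical child spec with an explicit :id derived from the game id,
  # so each session has a distinct identifier.
  def child_spec(game) do
    %{
      id: {__MODULE__, game.id},
      start: {__MODULE__, :start_link, [game]},
      restart: :transient
    }
  end

  def start_link(game), do: GenServer.start_link(__MODULE__, game)

  @impl true
  def init(game), do: {:ok, game}
end

# Standard library supervisor used here only to keep the sketch self-contained.
{:ok, sup} = DynamicSupervisor.start_link(strategy: :one_for_one)

game = %{id: "XYZA"}
spec = SessionSketch.child_spec(game)
{:ok, pid} = DynamicSupervisor.start_child(sup, spec)

IO.inspect(spec.id, label: "child id")
IO.inspect(Process.alive?(pid), label: "session alive?")
```

Whether this is sufficient for Horde.DynamicSupervisor is untested here; it only illustrates where an :id would be supplied.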