New difficulty unlocked

On June 17th, my wife gave birth to our beautiful and healthy baby girl! I am genuinely excited for this new chapter in our lives. This is our second child, so I know my time for personal projects will be even more constrained than it was with just one kid. I will have to find the right balance of family time, work, a healthy sleep schedule, and project work where I can spare the time.

Today I found just such an opportunity: the whole household is napping and I can crack open this project for a short time.

Finishing the cluster test (for real this time)

With the new cluster test setup implemented, I can now update the initial cluster test case to assert that the game process is spawned on a second node after the initial node is stopped.

Here is the updated code:

defmodule Minotaur.Cluster.GameSessionTest do
  @moduledoc false

  use ExUnit.Case
  @moduletag :cluster_test

  import Minotaur.GameEngineFixtures
  import Minotaur.TimeHelper

  alias Minotaur.ClusterHelper, as: Cluster
  alias Minotaur.GameEngine

  setup ctx do
    cluster = start_supervised!({Cluster.Manager, ctx})

    [cluster: cluster]
  end

  describe "when a node shuts down while having an active game session" do
    setup [:start_cluster, :start_game, :add_node2, :stop_node1]

    test "game session is restarted on another node", ctx do
      wait_until(fn ->
        res = Cluster.call(ctx.cluster, ctx.node2, GameEngine, :get_game, [ctx.game.id])

        assert {:ok, %{id: game_id}} = res
        assert ctx.game.id == game_id
      end)
    end
  end

  defp stop_node1(%{cluster: cluster, node1: node1}) do
    Cluster.stop_node(cluster, node1)
  end

  defp add_node2(%{cluster: cluster}) do
    node2 = Cluster.start_node(cluster, environment: cluster_env())

    [node2: node2]
  end

  defp start_game(%{cluster: cluster, node1: node1}) do
    game = game_fixture()
    {:ok, _pid} = Cluster.call(cluster, node1, GameEngine, :continue_game, [game])

    [game: game]
  end

  defp start_cluster(%{cluster: cluster}) do
    node1 = Cluster.start_node(cluster, environment: cluster_env())

    [node1: node1]
  end

  defp cluster_env do
    endpoint_env =
      Application.get_env(:minotaur, MinotaurWeb.Endpoint, [])
      |> Keyword.put(:server, false)

    [minotaur: [{MinotaurWeb.Endpoint, endpoint_env}]]
  end
end

As a quick sanity check, I comment out the code for stopping node1 to verify that the test does not produce a false positive. The test fails because the game process is not alive on node2, which is exactly what I expect.
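For clarity, the temporary tweak is nothing more than dropping the last step from the setup pipeline (sketched below; the committed test keeps :stop_node1 in place):

describe "when a node shuts down while having an active game session" do
  # Temporarily skip stopping node1 to confirm the test can actually fail
  setup [:start_cluster, :start_game, :add_node2]
  # setup [:start_cluster, :start_game, :add_node2, :stop_node1]

  # ...
end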

I can finally call this first test case done!

Retrospective

It took me much longer than I expected to get to this point of writing the first cluster test. I could have simply run some quick manual tests with my Docker deployments to confirm that everything works as expected and jumped back to solving the next problems sooner. So, was it worth the 8+ hours of fiddling with three different libraries to get here?

I definitely think it was worth it! I learned much more about distributed nodes in Elixir/Erlang than if I had opted for manual testing. I certainly didn’t expect to contribute to an open source project when I started down this rabbit hole. My pull request was even merged already!

I’ve also laid the foundation to write more cluster tests which can easily be run locally with the rest of my test suite. Had I taken the shortcut of only doing manual testing, I would not have the assurance of automated tests to detect unexpected breaking changes or regressions.

Failing and learning is a critical part of the development process. I consider this series of side quests an overall win for my personal growth as a developer, and the project is now in a better position for continuous integration.

Next steps

Let’s review the milestones for persisting game processes through a rolling code deploy:

  • Launch the application as nodes in an Erlang Cluster on deploy.
  • Start new game session processes on a new version node to replace the sessions of the old version node.
  • LiveView processes continue to communicate with the game sessions after session processes have migrated to new nodes.
  • State for each game session is stashed before the old session process terminates.
  • New game session processes initialize with the stashed state for the respective session.

The next problem to tackle is making sure any processes that depend on game processes know how to find them after the game processes have migrated to a new node. I will start with a failing test to outline what I am trying to accomplish, and I will know I’ve achieved the desired behavior once the test passes.

Distributed registry

I create a new describe block since the scenario I am simulating is one where a new node has started, but the old node hasn’t shut down yet. I want to be sure that any new web connections coming to the new node are able to access the existing game process on the first node until the process moves to the new node.

describe "when a new node is added to the cluster" do
  setup [:start_cluster, :start_game, :add_node2]

  test "game session process is available across the cluster", ctx do
    wait_until(fn ->
      res = Cluster.call(ctx.cluster, ctx.node2, GameEngine, :get_game, [ctx.game.id])

      assert {:ok, %{id: game_id}} = res
      assert ctx.game.id == game_id
    end)
  end
end

And I get a failing test as expected:

1) test when a new node is added to the cluster game session process is available across the cluster (Minotaur.Cluster.GameSessionTest)
   test/cluster/game_session_test.exs:22
   match (=) failed
   code:  assert {:ok, %{id: game_id}} = res
   left:  {:ok, %{id: game_id}}
   right: {:error, :game_not_alive}

The Horde library offers a distributed registry module that tracks processes across the cluster, which should be exactly what I need to make this test pass.

I update the native Elixir Registry references for game sessions to Horde.Registry:

defmodule Minotaur.Application do
  def start(_type, _args) do
    children = [
      # ...
+     {Horde.Registry, keys: :unique, name: Minotaur.GameEngine.SessionRegistry, members: :auto},
-     {Registry, keys: :unique, name: Minotaur.GameEngine.SessionRegistry},
      # ...
    ]
  
    opts = [strategy: :one_for_one, name: Minotaur.Supervisor]
    Supervisor.start_link(children, opts)
  end
end

defmodule Minotaur.GameEngine.SessionServer do
  # ...

  def start_link(game_state: game) do
+   name = {:via, Horde.Registry, {SessionRegistry, game.id}}
-   name = {:via, Registry, {SessionRegistry, game.id}}
    GenServer.start_link(__MODULE__, game, name: name)
  end
end

defmodule Minotaur.GameEngine do
  # ...

  def get_game(game_id) do
    game = GenServer.call(via_tuple(game_id), :get_game)
    {:ok, game}
  catch
    :exit, {:noproc, _} -> {:error, :game_not_alive}
  end

  defp via_tuple(game_id) do
+   {:via, Horde.Registry, {SessionRegistry, game_id}}
-   {:via, Registry, {SessionRegistry, game_id}}
  end
end

I run the test and it passes! The full test suite also passes with this change.

One thing I notice with this change is that if I go back to my first test and comment out the logic for stopping node1, that test still passes. Since game session processes are tracked via a distributed registry, it doesn’t matter which node is used to look up the game session. I am glad I am introducing these changes methodically, one at a time, so I know exactly which problem each component is solving.
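To see why, I can resolve the registration by hand. Horde.Registry.lookup/2 mirrors the standard Registry API, and because the registry data is replicated, an IEx shell on either node returns the same result (the pid below is made up, and game_id stands for the fixture game’s id):

iex> Horde.Registry.lookup(Minotaur.GameEngine.SessionRegistry, game_id)
[{#PID<12345.456.0>, nil}]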

Persisting game session state

Implementing Horde’s distributed supervisor and registry allows game session processes to be restarted on new nodes; however, a restarted session comes back with its original state unless some form of handoff logic is implemented. I broke this state handoff problem into two parts (a rough sketch follows the list):

  • Stashing state somewhere (TBD) on process shutdown
  • Importing existing state on process initialization
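To make those two parts concrete, here is the shape I have in mind for the session server. This is only a sketch: Minotaur.GameEngine.StateStash is a hypothetical placeholder for whichever storage option I end up choosing, not an existing module.

defmodule Minotaur.GameEngine.SessionServer do
  use GenServer

  # ...

  @impl true
  def init(game) do
    # Trap exits so terminate/2 runs when the supervisor shuts this process down
    Process.flag(:trap_exit, true)

    # Part 2: prefer previously stashed state over the state from the child spec
    game =
      case Minotaur.GameEngine.StateStash.fetch(game.id) do
        {:ok, stashed_game} -> stashed_game
        :error -> game
      end

    {:ok, game}
  end

  @impl true
  def terminate(_reason, game) do
    # Part 1: stash the latest state so the replacement process can pick it up
    Minotaur.GameEngine.StateStash.stash(game.id, game)
  end
end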

The state will need to be stored somewhere that is accessible from both the terminating node and the new node that respawns the process. I still haven’t implemented a database for persistent storage, which I could do now to back up game session state.

Another option could be a distributed in-memory storage solution like Redis. The Horde library itself uses delta Conflict-free Replicated Data Types (CRDTs) for its distributed supervisor and registry, which is another option I could use for state handoff.
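To get a feel for the CRDT route, the delta_crdt package that Horde builds on exposes a small map-like API. The snippet below is just me poking at two local replicas in isolation; the keys, values, and sync interval are arbitrary, and this is not a handoff design:

# Start two replicas of a last-write-wins map and point them at each other
{:ok, crdt_a} = DeltaCrdt.start_link(DeltaCrdt.AWLWWMap, sync_interval: 50)
{:ok, crdt_b} = DeltaCrdt.start_link(DeltaCrdt.AWLWWMap, sync_interval: 50)

DeltaCrdt.set_neighbours(crdt_a, [crdt_b])
DeltaCrdt.set_neighbours(crdt_b, [crdt_a])

# A write to one replica shows up on the other after a sync cycle
DeltaCrdt.put(crdt_a, "game-123", %{id: "game-123"})
Process.sleep(200)
DeltaCrdt.to_map(crdt_b)
# => %{"game-123" => %{id: "game-123"}}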

I’ll be spending some time exploring these options.