I Got It Wrong
You know nothing
I’m not sure what is causing the intermittent failure in the distributed supervisor test.
About one out of three runs will fail.
Extending the wait time for node2 to start the game session does not seem to make a difference, so I doubt the issue is the supervisor needing more time to handle the restart.
I wonder if the problem could be that the first node is being shut down too quickly, before the supervised process can be synced across all nodes.
I add a `:timer.sleep(300)` before the call to shut down node1 and find the test passes 100% of the time across about 20 test runs.
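For reference, this is roughly where the sleep lands relative to the shutdown call in my test:

```elixir
# Give Horde time to sync supervisor state across the cluster before the
# first node goes away (300ms is an arbitrary value that happened to work)
:timer.sleep(300)

Cluster.stop_node(ctx.cluster, ctx.node1)
```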
This is an interesting outcome.
It seems that giving more time before node1 is shut down allows the process to successfully start on node2.
Since I still don’t fully understand why the test passes with this change, I want to test my test a bit more.
I comment out the line `Cluster.stop_node(ctx.cluster, ctx.node1)` to see if the test will fail, since the game session should only be started on node2 after node1 goes down.
The test still passes.
I am clearly missing some context about how Horde’s `DynamicSupervisor` works across nodes in a cluster.
I create a helper function to look up the child references for a given node in the cluster:
```elixir
defp get_sup_children(node) do
  Cluster.call(node, fn ->
    Horde.DynamicSupervisor.which_children(Minotaur.GameEngine.SessionSupervisor)
  end)
end
```
And in my test block, I add these lines to see the local state of the supervisor on each node:
IO.inspect("Node1 Children")
IO.inspect(get_sup_children(ctx.node1))
IO.inspect("Node2 Children")
IO.inspect(get_sup_children(ctx.node2))
The result is that both nodes share the same supervisor state.
"Node1 Children"
[{:undefined, #PID<27225.458.0>, :worker, [Minotaur.GameEngine.SessionServer]}]
"Node2 Children"
[{:undefined, #PID<27225.458.0>, :worker, [Minotaur.GameEngine.SessionServer]}]
What I did not realize before this is that the supervisor tracks the pid of the game process in every node’s local state, even though the process only lives on a single node.
This can be helpful for finding out which node the process belongs to from any node in the cluster while I’m debugging.
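A pid encodes which node it lives on, so `node/1` can resolve the owner from anywhere in the cluster. A quick sketch using the helper above (the node name in the result is made up):

```elixir
# node/1 works on remote pids too, returning the owning node's name
[{_id, pid, _type, _modules}] = get_sup_children(ctx.node1)
node(pid)
#=> :"node2@127.0.0.1"
```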
Once I get the supervisor working, I’ll be using Horde’s distributed registry to track processes on the cluster.
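As a preview, Horde’s registry mirrors Elixir’s built-in `Registry` API, so a lookup might eventually look something like this sketch (the registry name and key are assumptions, not code from my project):

```elixir
# Hypothetical Horde.Registry lookup; returns the pid wherever it lives
case Horde.Registry.lookup(Minotaur.GameEngine.Registry, game_id) do
  [{pid, _value}] -> {:ok, node(pid)}
  [] -> {:error, :not_registered}
end
```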
I poke around my test more to see what other assumptions I may be making with this tool.
Since node1 is no longer being shut down, I wonder what the state of the game is on node1 when node2 seems to be able to return the state from `get_game/1`.
IO.inspect("Node1 Game:")
IO.inspect(
Cluster.call(ctx.node1, fn ->
Minotaur.GameEngine.get_game(ctx.game.id)
end)
)
"Node1 Game:"
{:error, :game_not_alive}
Even though node1 is not being shut down, the game process is still being moved to node2. I was not expecting this and need to review the docs to find out why. On the docs page for `Horde.UniformDistribution`, which is the default strategy for distributing processes, it states:
> Distributes processes to nodes uniformly using a hash ring. Given the same set of members, it will always start the same process on the same node.
Digging into the `DynamicSupervisor` source code for `start_child/2`, I see that whenever a new child process is started under the distributed supervisor, it relies on the distribution strategy to determine which node it should be placed on.
I had made the wrong assumption that the node calling `start_child/2` would be the default placement, and that the distribution logic would only come into play when that node was shut down.
I likely made that assumption based on my use case of using Horde for rolling deploys with a single-node cluster.
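To make the placement rule concrete, here is a rough sketch of hash-ring lookup using the `HashRing` API from libring, which Horde builds on. This illustrates the idea rather than reproducing Horde’s actual code:

```elixir
# Build a ring from the cluster members (node names are illustrative)
ring =
  HashRing.new()
  |> HashRing.add_node(:"node1@127.0.0.1")
  |> HashRing.add_node(:"node2@127.0.0.1")

# A given key always maps to the same node for the same member set,
# no matter which node asks; the caller of start_child/2 is irrelevant
HashRing.key_to_node(ring, {Minotaur.GameEngine.SessionServer, game_id})
```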
Aha!
I update my helper function to see if my new understanding of Horde can help explain the current test behavior. The return value now shows whether the supervised child is owned by the specified node.
```elixir
defp get_sup_children(node) do
  children =
    Cluster.call(node, fn ->
      Horde.DynamicSupervisor.which_children(Minotaur.GameEngine.SessionSupervisor)
    end)

  Enum.map(children, fn {_, pid, _, _} ->
    owned? = node(pid) == node
    [pid: pid, owned_by_node?: owned?]
  end)
end
```
The logging output shows where this game session process actually lives:
"Node1 Children"
[[pid: #PID<27257.458.0>, owned_by_node?: false]]
"Node2 Children"
[[pid: #PID<27257.458.0>, owned_by_node?: true]]
Things are making sense now. The test still passed when node1 wasn’t even shut down because the process was being spawned on node2 the whole time.
When I rerun the tests until one hits the intermittent failure, I can see from the same debug logs that the pid is owned by node1 in these cases:
"Node1 Children"
[[pid: #PID<27257.459.0>, owned_by_node?: true]]
"Node2 Children"
[[pid: #PID<27257.459.0>, owned_by_node?: false]]
```
  1) test when a node shuts down while having an active game session game session is restarted on another node (Minotaur.Cluster.GameSessionTest)
     test/cluster/game_session_test.exs:16
     match (=) failed
     code:  assert {:ok, %{id: game_id}} = res
     left:  {:ok, %{id: game_id}}
     right: {:error, :game_not_alive}
```
The source of the flaky tests is the non-deterministic nature of Horde’s distribution strategy!
I also realize that my test setup doesn’t represent the scenario I’m trying to test anyway. I want to verify that a rolling deploy for a single-node cluster will restart existing game processes on a new node once the original is shut down, but my test setup starts with 2 nodes in the cluster before the game process is started. To more closely resemble the target scenario and also remove the non-deterministic variable from the test, I should amend my setup as follows:
- Start with a single node in the cluster
- Start the game process
- Add a second node to the cluster
My test logic should remain the same: stop the first node and check that the game process is alive on node2.
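In code, the amended flow would look roughly like this sketch. `start_game/1` and `start_node/1` are stand-in names for the pieces I’d need, not functions I have today:

```elixir
# Start the game while node1 is the only member, so node1 must own it
{:ok, _pid} =
  Cluster.call(ctx.node1, fn ->
    Minotaur.GameEngine.start_game(ctx.game)
  end)

# Only now add a second node to the cluster (start_node/1 is hypothetical)
node2 = start_node(ctx.cluster)

# Same test logic as before: stop node1, then expect the game on node2
Cluster.stop_node(ctx.cluster, ctx.node1)
res = Cluster.call(node2, fn -> Minotaur.GameEngine.get_game(ctx.game.id) end)
assert {:ok, %{id: _game_id}} = res
```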
The cluster testing library I’m using, ex_unit_clustered_case, doesn’t support adding new nodes to a cluster once it has been set up in the `scenario` block.
I need to spin up a new node and join it to the cluster manually, but I don’t believe I can use the library’s cluster helper interfaces to interact with manually joined nodes.
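If I have to go fully manual, OTP’s `:peer` module (OTP 25+) can spin up an extra node by hand. A sketch under that assumption; wiring it into the library-managed cluster is the part I still need to figure out:

```elixir
# Start a fresh peer node; returns a control pid and the node's name
{:ok, _peer_pid, new_node} = :peer.start_link(%{name: :extra_node})

# Share our code paths so the new node can load application modules
:rpc.call(new_node, :code, :add_paths, [:code.get_path()])

# Join the cluster by connecting to an existing member
true = :rpc.call(new_node, Node, :connect, [ctx.node1])
```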
I’ll dig into the clustered case library’s code to see how nodes are spun up and managed in the cluster, and see if it offers some clues for how to approach this problem.