Recovering From Full Cluster Shutdown (Part 2)
Test-Driven Design
I’m building the feature that automatically resumes active game sessions the session supervisor can’t restart on its own, such as when the entire cluster is shut down. I start by writing a test for the behavior I want, which gives me a focused feedback loop for iterating on the feature. As usual, I write a non-compiling test structured the way I want the interface to behave, which guides the function design from the outside in.
defmodule Minotaur.GameEngine.ResumeActiveSessionsTest do
  use ExUnit.Case

  alias Minotaur.GameEngine

  describe "a saved game has an active status but it has no running process" do
    setup [:start_game, :stop_game]

    test "game session is started", ctx do
      :ok = GameEngine.resume_active_sessions()
      assert {:ok, game} = GameEngine.get_game(ctx.game.id)
    end
  end
end
I update the test module with all the plumbing required to make the test compile and add a dummy implementation for resume_active_sessions.
The test still fails, which is what I want: starting from a failing test confirms that the test itself isn’t faulty and giving a false positive.
defmodule Minotaur.GameEngine.ResumeActiveSessionsTest do
  use Minotaur.DataCase

  import Minotaur.GameEngineFixtures

  alias Minotaur.GameEngine
  alias Minotaur.GameEngine.SessionSupervisor

  describe "a saved game has an active status but it has no running process" do
    setup [:start_game, :stop_game]

    test "game session is started", ctx do
      :ok = GameEngine.resume_active_sessions()
      assert {:ok, game} = GameEngine.get_game(ctx.game.id)
      assert game == ctx.game
    end
  end

  # Starts a session process for a saved game and adds both to the test context
  defp start_game(_ctx) do
    game = game_fixture_with_users()
    {:ok, pid} = GameEngine.continue_game("GAME_ABC", game)
    [game: game, session_pid: pid]
  end

  # Terminates the session process so the game is saved as active but not running
  defp stop_game(ctx) do
    :ok = Horde.DynamicSupervisor.terminate_child(SessionSupervisor, ctx.session_pid)
    assert {:error, :game_not_alive} = GameEngine.get_game(ctx.game.id)
    :ok
  end
end
Now the test is in place for me to run it for fast feedback as I build the actual feature. Once the test is passing, I’ll know I’ve fulfilled the specific behavior for this case.
I update resume_active_sessions to look up any saved game summaries that have an active game status and extract the join_code and game_state.
I pipe this list into Enum.each/2, which uses the existing continue_game function to start a new game session process under a dynamic supervisor, with the join_code and game_state arguments as the GenServer state.
defmodule Minotaur.GameEngine do
  # …

  def resume_active_sessions do
    GameSessionSummary
    |> where(game_status: :active)
    |> select([g], {g.join_code, g.game_state})
    |> Repo.all()
    |> Enum.each(fn {join_code, game} ->
      continue_game(join_code, game)
    end)

    :ok
  end
end
The test is still failing because game_state is never saved to the GameSessionSummary record, so it defaults to an empty map.
continue_game raises a function clause error since it expects a Game struct and not the map being passed to it.
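This failure mode is easy to reproduce in isolation. The sketch below uses hypothetical stand-ins (ClauseSketch.Game and ClauseSketch.continue_game/2 are not the Minotaur modules) to show that a function head matching on a struct rejects a plain map, even an empty one that could later be filled with the same keys.

```elixir
defmodule ClauseSketch.Game do
  # Hypothetical stand-in for the real Game struct
  defstruct [:id, round: 1]
end

defmodule ClauseSketch do
  alias ClauseSketch.Game

  # Only matches when the second argument is a %Game{} struct
  def continue_game(join_code, %Game{} = game) when is_binary(join_code) do
    {:ok, {join_code, game.id}}
  end

  # Wraps the call to make the missing-clause case visible as a return value
  def try_continue(join_code, game) do
    continue_game(join_code, game)
  rescue
    FunctionClauseError -> {:error, :no_matching_clause}
  end
end

# A struct matches the function head
ClauseSketch.try_continue("GAME_ABC", %ClauseSketch.Game{id: "g1"})
# => {:ok, {"GAME_ABC", "g1"}}

# A plain map, like the default for a :map field, does not
ClauseSketch.try_continue("GAME_ABC", %{})
# => {:error, :no_matching_clause}
```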
I now need to save the game state struct with the summary record when the game session is initialized and implement the Ecto encoding I was experimenting with the other day.
Ecto Embedded Schemas
I update the GameSessionSummary schema to use embeds_one with a Game struct type for the :game_state field instead of the previous :map type.
I also need to add put_embed to the changeset so the game_state is included in the changes.
defmodule Minotaur.GameEngine.GameSessionSummary do
  # …

  schema "game_session_summaries" do
    field :game_id, :binary_id
    field :join_code, :string
    field :game_status, Ecto.Enum, values: [:active, :concluded, :canceled]
    field :latest_round, :integer
    embeds_one :game_state, Game

    timestamps()

    has_many :player_game_session_summaries, PlayerGameSessionSummary,
      foreign_key: :game_session_id
  end

  def changeset(game_session_summary, attrs) do
    game_session_summary
    |> cast(attrs, @required_fields)
    |> validate_required(@required_fields)
    |> put_embed(:game_state, attrs[:game_state])
    |> put_assoc(:player_game_session_summaries, attrs.player_summaries)
  end
end
I then convert the Game struct module to use Ecto’s embedded_schema definition so Ecto will automatically convert the stored game state map to the proper struct.
defmodule Minotaur.GameEngine.Game do
  use Ecto.Schema

  @current_version 1

  @primary_key false
  embedded_schema do
    field :id, :binary_id
    field :round_end_time, :utc_datetime
    field :world, :map
    field :players, :map
    field :registered_actions, :map, default: %{}
    field :round, :integer, default: 1
    field :version, :integer, default: @current_version
  end
end
Inserting the GameSessionSummary record now fails when I run the test because Ecto doesn’t know how to encode the nested data structures within the Game struct.
The first of these is the World struct, whose :grid attribute is a map keyed by Coord structs.
Ecto can’t encode this automatically, so I implement a custom encode function to override the default struct encoder that Ecto uses with the Jason library.
This function converts the Coord keys to JSON strings, which makes them valid keys when the outer map is converted to a JSON object.
defmodule Minotaur.GameEngine.World do
  defstruct grid: %{}, player_characters: %{}, dead_characters: %{}

  defimpl Jason.Encoder do
    def encode(struct, opts) do
      # JSON object keys must be strings, so encode each Coord key first
      grid =
        struct.grid
        |> Enum.map(fn {coord, hex} -> {Jason.encode!(coord), hex} end)
        |> Enum.into(%{})

      map =
        struct
        |> Map.take([:player_characters, :dead_characters])
        |> Map.put(:grid, grid)

      Jason.Encode.map(map, opts)
    end
  end
end
The World struct data can now be encoded to a JSON string for storing in the database, but Ecto doesn’t know how to convert the :grid data back to the original Coord struct keys after the value has been retrieved.
I can get this behavior by converting World to an embedded schema within Game and creating a custom Ecto.Type for :grid.
This new type defines a cast callback used by Ecto to convert the data back to the format of the domain.
defmodule Minotaur.GameEngine.EctoTypes.GridMap do
  use Ecto.Type

  alias Minotaur.GameEngine.{Coord, GridHex}

  def type, do: :map

  def cast(value) when is_map(value) do
    grid =
      value
      |> Enum.map(fn {k, v} -> {decode_coord(k), convert_hex(v)} end)
      |> Enum.into(%{})

    {:ok, grid}
  end

  def cast(_), do: :error

  defp decode_coord(value) do
    map = Jason.decode!(value)
    attrs = convert_to_atom_keys(map)
    struct(Coord, attrs)
  end

  defp convert_hex(string_key_map) do
    attrs = convert_to_atom_keys(string_key_map)
    struct(GridHex, attrs)
  end

  defp convert_to_atom_keys(map) do
    Enum.reduce(map, %{}, fn {key, value}, acc ->
      new_key = convert_key(key)
      Map.put(acc, new_key, value)
    end)
  end

  defp convert_key(key) when is_binary(key), do: String.to_existing_atom(key)
  defp convert_key(key), do: key

  # load and dump are not used for embeds, but required for Ecto.Type behaviour
  def load(_), do: :error
  def dump(_), do: :error
end
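The key-conversion step in that type can be tried without Ecto or Jason. This is a minimal sketch using a hypothetical KeySketch.Coord stand-in rather than the real Coord module. It shows how a decoded JSON object with string keys becomes a struct, and why String.to_existing_atom/1 is a reasonable choice here: it never creates new atoms from stored data, and it raises on any key that isn’t already a known atom.

```elixir
defmodule KeySketch.Coord do
  # Hypothetical stand-in for Minotaur.GameEngine.Coord
  defstruct [:q, :r]
end

defmodule KeySketch do
  alias KeySketch.Coord

  # Converts a decoded JSON object (string keys) into a Coord struct.
  # The atoms :q and :r already exist because the struct defines them.
  def to_coord(string_key_map) when is_map(string_key_map) do
    attrs =
      Map.new(string_key_map, fn {key, value} ->
        {String.to_existing_atom(key), value}
      end)

    # struct/2 silently drops any keys the struct does not define
    struct(Coord, attrs)
  end
end

KeySketch.to_coord(%{"q" => 1, "r" => -2})
# => %KeySketch.Coord{q: 1, r: -2}
```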
After changing World to an embedded schema, Ecto no longer uses my custom encode function for the World module and instead calls to_string on the Coord struct keys, which is not implemented.
This behavior is puzzling to me based on my experience with Ecto encoding and what I can find online, but the quick solution is to implement the String.Chars protocol and have to_string run the JSON encode on the struct.
Since the custom encoder for World is no longer called, I can delete that code.
defmodule Minotaur.GameEngine.Coord do
  alias __MODULE__, as: Coord

  @derive Jason.Encoder
  defstruct [:q, :r]

  defimpl String.Chars do
    def to_string(%Coord{} = struct) do
      Jason.encode!(struct)
    end
  end
end
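To see what implementing String.Chars changes, here is a dependency-free sketch. The CharsSketch modules are hypothetical, and the string format is a plain "q,r" instead of the Jason.encode!/1 output used above, purely so the example runs without Jason. A struct without an implementation raises Protocol.UndefinedError when passed to to_string/1, while one with an implementation returns whatever the protocol function produces.

```elixir
defmodule CharsSketch.Coord do
  defstruct [:q, :r]

  # Assumed format for illustration only; the post encodes the struct as JSON
  defimpl String.Chars do
    def to_string(%{q: q, r: r}) do
      Integer.to_string(q) <> "," <> Integer.to_string(r)
    end
  end
end

defmodule CharsSketch.Plain do
  # Same fields, but no String.Chars implementation
  defstruct [:q, :r]
end

defmodule CharsSketch do
  # Returns a tagged tuple instead of raising when the protocol is missing
  def safe_to_string(term) do
    {:ok, to_string(term)}
  rescue
    Protocol.UndefinedError -> {:error, :not_implemented}
  end
end

CharsSketch.safe_to_string(%CharsSketch.Coord{q: 1, r: -2})
# => {:ok, "1,-2"}

CharsSketch.safe_to_string(%CharsSketch.Plain{q: 1, r: -2})
# => {:error, :not_implemented}
```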
I follow the same steps for a few more nested maps with struct values so that the game state returned from Ecto queries is automatically converted to the correct struct format. The test is now passing, so I will add a few more test cases, such as when the game process is already alive.
Not so fast…
I notice that the state being set in the resumed game GenServer process is not coming from the game state being passed to continue_game.
The in-memory state stash, StateHandoff, which I previously created for migrating state to restarting processes across the cluster, is taking precedence when the state is set during GenServer initialization.
This could mask a hidden bug in the decoding logic, since the tests aren’t validating against the correct data.
I wonder if I should just have all state handoff be done with the database to simplify this part of the code, since the GameSessionSummary records now cover a similar use case.
For now, I update the test to clear this state after the game session process is stopped to simulate a fresh startup for the cluster. I also update the original test case to not match on the entire game state due to a datetime value losing precision in the database insert.
describe "a saved game has an active status but it has no running process" do
  setup [:start_game, :stop_game, :clear_stash]

  test "game session is started", ctx do
    :ok = GameEngine.resume_active_sessions()
    assert {:ok, game} = GameEngine.get_game(ctx.game.id)
    assert game.id == ctx.game.id
    assert game.players == ctx.game.players
    assert game.world == ctx.game.world
  end
end
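The precision loss can be illustrated with the standard library alone. This sketch assumes the mismatch comes from the :utc_datetime field type, which has second precision, while a freshly generated DateTime carries microseconds. PrecisionSketch is a hypothetical helper that simulates the round trip with DateTime.truncate/2 rather than touching a database.

```elixir
defmodule PrecisionSketch do
  # Simulates storing a datetime in a second-precision field and reading it back
  def round_trip(%DateTime{} = dt), do: DateTime.truncate(dt, :second)

  # Compares two datetimes while ignoring anything finer than a second
  def same_second?(%DateTime{} = a, %DateTime{} = b) do
    DateTime.compare(DateTime.truncate(a, :second), DateTime.truncate(b, :second)) == :eq
  end
end

original = ~U[2024-01-01 12:00:00.123456Z]
stored = PrecisionSketch.round_trip(original)

# Struct equality fails because the microsecond field differs
original == stored
# => false

# Comparing at second precision succeeds
PrecisionSketch.same_second?(original, stored)
# => true
```

A comparison like same_second?/2 would be one way to keep asserting on the datetime field instead of leaving it out of the test.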
Next, I should run the new resume_active_sessions function during application startup to close out this feature.
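One possible shape for that startup hook, sketched under assumptions: a one-off Task placed last in the supervision tree so it runs after its siblings have started. StartupSketch and its resume_active_sessions/1 are hypothetical stand-ins that only send a message so the effect is observable; this is not how the application is actually wired.

```elixir
defmodule StartupSketch do
  # Hypothetical stand-in for GameEngine.resume_active_sessions/0.
  # It notifies `notify` so the caller can observe that it ran.
  def resume_active_sessions(notify) do
    send(notify, :sessions_resumed)
    :ok
  end

  # Starts a supervisor whose last child is a one-off Task.
  # Task children are :temporary by default, so the task is not
  # restarted once it finishes.
  def start(notify) do
    children = [
      # In a real application, the Repo and session supervisor would come first
      {Task, fn -> resume_active_sessions(notify) end}
    ]

    Supervisor.start_link(children, strategy: :one_for_one)
  end
end

{:ok, _sup} = StartupSketch.start(self())

receive do
  :sessions_resumed -> IO.puts("sessions resumed on startup")
after
  1_000 -> IO.puts("timed out waiting for resume")
end
```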