Recovering From Full Cluster Shutdown (Part 1)
Out of memory!
Today, I noticed my development droplet in DigitalOcean does not have enough memory to build Docker images after adding the Timex dependency to the project.
The running state on the server is usually around 50% memory usage, but it spikes during builds and deploys.
I have kept the resource size low to save a few bucks, but the time has finally come to upgrade the droplet to 2GB of memory.
I have to power off the droplet to perform the upgrade. Since Minotaur uses Docker Swarm to cluster containers on this single droplet, all active game session processes will be stopped and their in-memory state will be lost. Preserving active games when the full cluster is brought down and back up will be the next issue I tackle.
Cold start game session recovery
For the game processes to be resilient in a single-node configuration, the game state will need to be stashed in persistent storage that can survive a full system crash. The supervisor processes can restart child processes that crash during runtime, but I’ll need to implement a mechanism to recover active games from a cold start.
When a game session process is started, some details about the session are already saved to the database, but now I want to persist the entire game state so it can be restored after a cold application start. I also need to update this persisted state at the end of every round so the latest state can be recovered.
I add a new version attribute to the Game struct since the design of the game state data will surely change in the future, and knowing which version of the data is being used will be incredibly useful.
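A minimal sketch of what this looks like (the module name and the other fields here are assumptions, not the actual game state design):

defmodule Minotaur.Games.Game do
  # The new :version attribute tracks which revision of the game state
  # data shape a stashed game was saved with.
  defstruct version: 1,
            id: nil,
            round: 0,
            world: nil,
            players: %{}
end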
I create a migration file to update the game_session_summaries table with a new field for the game state, which is a Game struct serialized into a JSON string.
Since there are already game session summary records in the database, I need to set a default value for the migration.
I will delete any records with an empty map for the state field in the future.
defmodule Minotaur.Repo.Migrations.AddGameStateToSessionSummary do
  use Ecto.Migration

  def change do
    alter table(:game_session_summaries) do
      add :game_state, :map, null: false, default: %{}
    end
  end
end
I run the migration in my local dev and test environments, then update the GameSessionSummary schema to add game_state as a field.
schema "game_session_summaries" do
field :game_id, :binary_id
field :join_code, :string
field :game_status, Ecto.Enum, values: [:active, :concluded, :canceled]
field :latest_round, :integer
field :game_state, :map
timestamps()
has_many :player_game_session_summaries, PlayerGameSessionSummary,
foreign_key: :game_session_id
end
Encoding game state
Before a Game struct can be inserted into the database, it needs to be encoded as a JSON string.
One way to do this is to convert the Game struct and all its nested structs to Ecto embedded schemas.
This option comes with a lot of built-in benefits for converting data from the database format to application domain structures, but it would require updating all struct modules related to game state with Ecto boilerplate.
Another option is to build the encoding and decoding logic outside of Ecto and manage the conversion in application code. This could keep the struct modules cleaner by defining the encoding at the “edges” of the application where database interaction occurs, but it doesn’t have some of the nice built-in features of Ecto schemas.
I spend some time playing with both options to get a feel for the code and the tradeoffs.
The custom encoding ends up being pretty simple since I can hook into the Jason.Encoder protocol, which Ecto uses for embeds.
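For example, deriving the protocol on a struct module is enough for Jason to encode it (the module and fields here are placeholders rather than the real game structs):

defmodule Minotaur.Games.Player do
  # Deriving Jason.Encoder makes the struct encodable with Jason.encode!/1.
  @derive Jason.Encoder
  defstruct [:id, :name, :alive]
end

# Jason.encode!(%Minotaur.Games.Player{id: "p1", name: "Theseus", alive: true})
# returns a JSON string containing the struct's fields.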
However, due to the number of nested structs in the game state data, I prefer not to have to manage the conversion of database data to structs myself.
The option to use Ecto embedded_schema for all nested game state structs requires a lot more experimentation and learning.
Some of the fields in the game state structs will require custom Ecto.Type definitions to allow Ecto to map the data back to structs from the database.
For example, the World struct has a :grid field, which is a map that uses Coord structs for its keys.
The docs for Ecto.Type specify four callbacks that need to be defined to fulfill the requirements of the behaviour: type, cast, load, and dump.
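As a rough sketch, a custom type for the :grid field could look something like this (the Coord fields and the string key format are assumptions, not the final implementation):

defmodule Minotaur.Games.EctoTypes.Grid do
  # Custom Ecto.Type for a map keyed by Coord structs.
  use Ecto.Type

  alias Minotaur.Games.Coord

  @impl true
  def type, do: :map

  # Convert external or database data back into a Coord-keyed map.
  @impl true
  def cast(grid) when is_map(grid) do
    {:ok, Map.new(grid, fn {key, cell} -> {to_coord(key), cell} end)}
  end

  def cast(_), do: :error

  @impl true
  def load(grid) when is_map(grid), do: cast(grid)

  # Flatten Coord keys into strings so the map can be stored as JSON.
  @impl true
  def dump(grid) when is_map(grid) do
    {:ok, Map.new(grid, fn {%Coord{q: q, r: r}, cell} -> {"#{q},#{r}", cell} end)}
  end

  def dump(_), do: :error

  defp to_coord(%Coord{} = coord), do: coord

  defp to_coord(key) when is_binary(key) do
    [q, r] = key |> String.split(",") |> Enum.map(&String.to_integer/1)
    %Coord{q: q, r: r}
  end
end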
I spent far too long trying to understand why the embedded schemas were not working as expected with these custom types before I found an important note in the embeds_one/3 docs:
Because many databases do not support direct encoding and decoding of embeds, it is often emulated by Ecto by using specific encoding and decoding rules. For example, PostgreSQL will store embeds on top of JSONB columns, which means types in embedded schemas won’t go through the usual dump->DB->load cycle but rather encode->DB->decode->cast. This means that, when using embedded schemas with databases like PG or MySQL, make sure all of your types can be JSON encoded/decoded correctly. Ecto provides this guarantee for all built-in types.
For my case with embeds, I don’t actually need to use dump or load for converting values, which is where I was spending my time. I only need to configure the cast function for my custom types to convert the data returned from the database to the proper structs.
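Concretely, an embedded schema field backed by the custom type only needs cast to round-trip through the database (again a sketch with assumed module names):

defmodule Minotaur.Games.World do
  use Ecto.Schema

  @primary_key false
  embedded_schema do
    # With embeds, this field goes through encode->DB->decode->cast,
    # so only cast/1 on the custom type is exercised when loading.
    field :grid, Minotaur.Games.EctoTypes.Grid
  end
end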
Ecto also won’t handle encoding the struct, so I’ll have to provide the encoding logic.
Fortunately, I had already handled this when experimenting with manually converting data from the database, so doing it a second time will be much faster.
I now have sufficient knowledge of how to build Ecto embeds with custom types so I will stash these tinkering changes for future reference and start again with the feature I want to build. I don’t feel bad about writing throwaway code for learning purposes since I can rewrite it much faster and with a better understanding than the first time.