Adding Ecto

When I generated a new Phoenix project for Minotaur, I ran the phx.new command with the --no-ecto switch since I didn’t want to worry about any database details until I had built out an initial concept. It is now time to introduce some form of persistence to the application, as everything so far has been stored in memory.
Adding Ecto to an existing project isn’t too complicated. I generate a new Phoenix app with the default Ecto setup (PostgreSQL adapter) and compare the differences against my application.
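For reference, the pieces the generator adds are small: the mix.exs dependencies (:ecto_sql and :postgrex), a Repo module, a child in the application supervision tree, and per-environment database config. Here is a minimal sketch of the Repo, assuming the OTP app is named :minotaur (my actual module names and config values may differ):

```elixir
# lib/minotaur/repo.ex
defmodule Minotaur.Repo do
  use Ecto.Repo,
    otp_app: :minotaur,
    adapter: Ecto.Adapters.Postgres
end
```

The Repo then needs to be started under the application supervisor and pointed at a database with `config :minotaur, Minotaur.Repo, ...` entries in the environment config files.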
Configuring Docker Swarm

To implement rolling deploys for my hosted Minotaur development environment, I need to configure the Docker Swarm service. When updating the service with a new image, the new containers should start before all of the old ones are torn down so that active game session processes can be migrated to the nodes running the new containers.
I update the Docker Compose file for the service stack to ensure the new containers are started first and the old containers get a 10s window to shut down gracefully.
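Roughly, the relevant settings look like the following sketch (the service name app is a placeholder and the rest of the stack file is omitted):

```yaml
services:
  app:
    # ...
    stop_grace_period: 10s      # allow 10s for old containers to shut down cleanly
    deploy:
      update_config:
        order: start-first      # start new containers before stopping old ones
```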
Stashing state on game process termination

I’ve had slow but steady progress toward my goal of having rolling deploys that preserve running game session state. I sometimes only have an hour or two of free time in a day to spend on this project, but breaking the problems down into smaller pieces and writing this dev journal has helped me stay focused and collect my thoughts.
Here is my task list:
Replicating data with CRDTs

To replicate state across nodes, the StateHandoff GenServer on each node will need to add references to the CRDT instances of all other nodes in the cluster. The DeltaCrdt library provides the function set_neighbours/2, which configures a CRDT on a node with a list of DeltaCrdt processes on other nodes with which it can sync state. The docs state this is a unidirectional sync, so a call to set_neighbours will need to be made on each node in the cluster in order to fully sync data across all nodes.
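As a rough sketch of that wiring, assuming each node registers its CRDT process under a shared name such as Minotaur.StateHandoff.Crdt (a placeholder, not the final design), each node would point its own CRDT at the replicas on every other node:

```elixir
defmodule Minotaur.StateHandoff.Neighbours do
  @crdt_name Minotaur.StateHandoff.Crdt

  # Run on every node whenever cluster membership changes. Because syncing is
  # unidirectional, each node must add all of the other nodes as neighbours.
  def update do
    neighbours = for node <- Node.list(), do: {@crdt_name, node}
    DeltaCrdt.set_neighbours(@crdt_name, neighbours)
  end
end
```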
Testing state handoff

Before implementing the DeltaCrdt library to create a replicated data store, I first want to build the interface for the data store. Yesterday, I created the StateHandoff module to play around with cluster node join/leave events, but it doesn’t actually do anything yet.
I create a new test module which will help me think through the public interface for this module and guide my design decisions as I implement the desired behavior.
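A first sketch of what I have in mind, assuming a simple store/retrieve style API (the function names here are hypothetical, not the final interface):

```elixir
defmodule Minotaur.StateHandoffTest do
  use ExUnit.Case

  alias Minotaur.StateHandoff

  test "stored game state can be retrieved by session id" do
    state = %{round: 1, players: []}

    # Hypothetical API: store state under a session id, then read it back.
    :ok = StateHandoff.store("session-123", state)

    assert {:ok, ^state} = StateHandoff.retrieve("session-123")
  end
end
```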
Delta CRDT

I need to decide how game session state will be handed off to a process on a new node during a rolling deploy. I want to delay using external tools like a database or Redis for as long as possible, mostly to reduce infrastructure costs for this initial prototype. After learning more about how Horde implements Delta CRDTs to sync state across cluster nodes, I want to apply the same concept to process state handoffs in Minotaur.
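For a feel of what that could look like, here is a quick sketch against the DeltaCrdt library (assuming a recent version of the library; the keys and values are illustrative):

```elixir
# Start a local replica backed by the add-wins last-write-wins map CRDT.
{:ok, crdt} = DeltaCrdt.start_link(DeltaCrdt.AWLWWMap, sync_interval: 100)

# Each node writes game session state into its local replica...
DeltaCrdt.put(crdt, "session-123", %{round: 1, players: []})

# ...and once replicas are set as neighbours, any node can read it back.
DeltaCrdt.to_map(crdt)
#=> %{"session-123" => %{round: 1, players: []}}
```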
New difficulty unlocked

On June 17th, my wife gave birth to our beautiful and healthy baby girl! I am genuinely excited for this new chapter in our lives. This is our second child, so I know my time for personal projects will be even more constrained than with just one kid. I will have to find the right balance of family time, work, a healthy sleep schedule, and project work where I can spare it.
A third option

I am still working to find a reliable way to test process handoff in a cluster. LocalCluster seemed promising, but its internal mechanism for spawning remote nodes uses :slave.start_link, which requires converting the calling process's node into a distributed node, and that is not going to work for the scenario I’m testing. I’m going to need to hack the library code or create my own implementation to support my use case.
When the solution becomes its own problem

My current goal is to fix my cluster test so that it more closely reflects the scenario of restarting a game session on a new node during a rolling deploy. I have been tinkering with spawning a slave node and manually joining it to the cluster provided by ex_unit_clustered_case:
```elixir
{:ok, node2} = :slave.start_link(:"127.0.0.1", :node2)

# Add code paths to new node
:code.get_path()
|> Enum.each(fn path -> :rpc.call(node2, :code, :add_path, [path]) end)
```
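From here, my rough idea is to connect the spawned node to the existing cluster members and boot the application on it. This part is only an assumption of how the manual join could look (the :minotaur app name is a placeholder):

```elixir
# Connect node2 to every node already in the cluster, then start the app on it
# so the distributed supervisor can place processes there.
Enum.each(Node.list(), fn node ->
  :rpc.call(node2, Node, :connect, [node])
end)

{:ok, _apps} = :rpc.call(node2, Application, :ensure_all_started, [:minotaur])
```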
You know nothing

I’m not sure what is causing the intermittent failure in the distributed supervisor test. About one out of three runs will fail. Extending the wait time for node2 to start the game session does not seem to make a difference, so I doubt the issue is the supervisor needing more time to handle the restart. I’m curious whether the problem could be that the first node is being shut down too quickly, before the supervised process can be synced across all nodes.