Rolling Deploys: The Finish Line

June 25, 2024 - 4 mins read

Configuring Docker Swarm

To implement rolling deploys for my hosted Minotaur development environment, I need to configure the Docker Swarm service. When updating the service with a new image, the new containers should start before all the old ones are torn down so active game session processes are migrated to nodes of the new containers.

I update the docker compose file for the service stack to ensure the new containers are started first and there is a 10s delay for the old containers to gracefully shutdown.

version: '3.8'

services:
  minotaur:
    image: minotaur:initial
    deploy:
      replicas: 1
      update_config:
        parallelism: 1
        delay: 10s
        order: start-first
      restart_policy:
        condition: on-failure
    environment:
      SECRET_KEY_BASE: ${MINOTAUR_SECRET_KEY_BASE}
      PHX_HOST: ${MINOTAR_PHX_HOST}
      DNS_CLUSTER_QUERY: tasks.minotaur-stack_minotaur

I pull the latest Minotaur code to my dev server and build a new image, using the initial image tag used by the compose file to start the service. I remove the old service and deploy it again.

docker service rm minotaur-stack
docker stack deploy -c docker-compose.yml minotaur-stack

I connect to the web server through two different browsers, create new accounts, and a start a new game. I force a new deploy to the Docker service, but the game does not persist to the new container instance.

Docker is not waiting for the new Elixir application to be fully started before it terminates the old container. I configure Docker to use a health check command which will cause Docker to wait for a healthy response before stopping the old containers. I add the following to my compose file:

    healthcheck:
       test: ["CMD", "wget", "--spider", "-q", "http://localhost:4000/health_check"]
       interval: 30s
       timeout: 10s
       retries: 3
       start_period: 10s

I then add a simple health check Plug to the Phoenix endpoint which will short circuit requests to return an empty 200 status response if the path is /health_check.

I repeat the steps to manually test the process hand off and this time the game session remains active through the deploy! Unfortuantely, the game is resetting to the very beginning. After some quick debugging, I realize this is because I have two entry points to launch a game session server and only one of them implements the game state stash logic.

Refactoring SessionServer

The original method of starting a game session GenServer was to pass to the start_link function a game id and a list of users, but I later added another variation of start_link that accepted a game struct to use as the initial server state. I’ve been building my tests off the second option because the original method involves randomization with player character starting locations and I wanted my tests to be as simple as possible. I always intended to refactor the GenServer code so the game struct intialization for new games happened outside the GenServer and the game state was passed in to start_link. It looks like now is the time to make that change if I want to gain the benefit of the state handoff logic.

After a simple change to the GenServer initialization, my test suite still passes. I pull these changes to my dev server, build a new image, and deploy to the Docker Swarm service. I run another manual test by creating a game and deploying a new replica to the service and this time the stash is picked up and the game retains its state!

The logs for the Docker service show the old game session GenServer shutting down and the new one picking up the state during initialization.

minotaur-stack_minotaur.1.z4hmqths8jyo@cork-dev-debian    | 03:53:49.701 [notice] SIGTERM received - shutting down
minotaur-stack_minotaur.1.z4hmqths8jyo@cork-dev-debian    | 03:53:49.701 [info] Shutting down 3 sockets in 1 rounds of 2000ms
minotaur-stack_minotaur.1.z4hmqths8jyo@cork-dev-debian    | 03:53:49.709 [info] Game Session KEPS stopping due to: shutdown
minotaur-stack_minotaur.1.e0g2ys1whdfj@cork-dev-debian    | 03:53:49.766 [info] Game Session KEPS - Using stashed state

Mission Accomplished

I’ve finally reached my goal of implementing rolling deploys for my Elixir application! It took longer than I expected, but I feel overwhelmingly accomplished for overcoming all the challenges along the way. I greatly appreciate all of the people who blazed the trail for me with libraries or application demos that helped me accomplish this mission.

Thanks to: