Distributed Elixir

When I deploy a new version of Minotaur, I will need to migrate active game processes from the existing application instance to new processes in the newly deployed instance. I’ll be trying to accomplish this using the Horde library, but Horde assumes Erlang clustering is already setup in order to use it. I am looking into libcluster to manage dynamically adding newly deployed Minotaur applications to an Erlang cluster in order to leverage the distributed features of Horde.

I am using Docker Swarm to place containers of the Minotaur application within a Docker service. Docker swarm will act as a load-balancer for traffic sent to the published port of the service. When the docker service is updated to use a different image, it will manage spinning up containers for the new image and terminating the existing containers. Docker swarm also manages the networking for the service and its containers which are automatically assigned private IP addresses. I need a way to get the IPs of the service running Minotaur containers to configure the list of hosts for an Erlang cluster.

I found an article which explains that there is an undocumented cluster strategy in the libcluter source code which can be used to poll the Docker swarm DNS service to obtain a list of addresses within a service. The article example uses Distillery to create an Elixir release which was a common strategy at the time when the article was written 5 years ago. As of Elixir v1.9, most of the functionality that Distillery provided has been rolled into Mix, the official build tool that ships with Elixir. A already have mix release configured in the Dockerfile for Minotaur so I can ignore any Distillery specific references in the article.

Do I need this dependency?

As I am editing my mix.exs file to add libcluster as a dependency, I notice dns_cluster in the list. I remember adding this as part of upgrading Phoenix from 1.6 to 1.7 a month ago, but I didn’t really investigate what it was doing for me. I figured if it was being shipped by default in the latest Phoenix version, it must be needed. Investigating what this library can do for me, I find an ElixirConf talk by Phoenix creator Chris McCord that includes a section explaining why this library is shipped with Phoenix by default. Apparently it was added to lower the barrier to entry to get started with clustering which users would avoid in favor of a dependency on Redis which is more familiar than the Elixir primitives.

Can this library do what I’m trying to achieve with libcluster? I can see dns_cluster is already used in Minotaur’s application supervision tree, but unless a DNS query is configured, it will will fallback to do nothing.

{DNSCluster, query: Application.get_env(:minotaur, :dns_cluster_query) || :ignore}

Checking the app’s config/runtime.exs file, I see it is already configured to accept the query as a runtime environment variable.

config :minotaur, :dns_cluster_query, System.get_env("DNS_CLUSTER_QUERY")

Git blame shows that I added this line during the Phoenix upgrade, but I paid little attention to what it was for at the time. At least I now know what it does and I’ll probably be glad I went through the effort if all I need to do is plug in an environment variable at runtime!

Side Quest: Docker, DNS, documentation

The query example from the libcluster article uses tasks.<your_swarm_service_name> for DNS polling. I can’t track down in the Docker Swarm documentation where this convention is defined, so I turn to my friend ChatGPT which I’ve been using lately to help me parse through dense documentation for specific things like this. The AI confirms this is the correct way to query DNS for Docker swarm service, but it can’t provide a satisfying source for this assertion. I eventually find a commit from 2018 in the docker/docs repo which adds this reference, but I don’t see anything like this in the current documentation in 2024. Sure enough, I can perform a DNS lookup using nslookup tasks.minotaur-stack_minotaur from a container within the network.

Activate the cluster

All I should need to have the application start the cluster automatically is to set the environment variable to query Docker swarm service DNS and redeploy the service. I update the compose file for the service:

environment:
  DNS_CLUSTER_QUERY: tasks.minotaur-stack_minotaur

I restart the service with the command docker stack deploy -c docker-compose.yml minotaur-stack_minotaur and connect to a remote REPL shell for the running Elixir application using docker exec -it <container_id> bin/minotaur remote. I run Node.list to see if the application was joined as a node in the cluster. Unfortunately, it returns an empty list.

What went wrong?

I’m not too surprised because getting something to work the first time is a rare occurence. I likely missed something in the documentation for dns_cluster since I had to do very little configuration so far.

Checking the logs for the new container shows the following:

23:15:59.026 [warning] node not running with longnames which are required for DNS discovery.
Ensure the following exports are set in your rel/env.sh.eex file:

    #!/bin/sh

    export RELEASE_DISTRIBUTION=name
    export RELEASE_NODE="myapp@fully-qualified-host-or-ip"

23:15:59.039 [error] ** System NOT running to use fully qualified hostnames **
** Hostname 10.0.1.6 is illegal **

The short and concise README for dns_cluster has the same notice about using Elixir releases which shows I should have fully read the docs:

If you are deploying with Elixir releases, the release must be set to support longnames and the node must be named.

I don’t have a rel/env.sh.eex file in my project, but running mix release.init will generate it along with a few other files to customize releases. The env.sh.eex file contains commented out code as a starting point so I uncomment the following lines:

# Set the release to work across nodes.
# RELEASE_DISTRIBUTION must be "sname" (local), "name" (distributed) or "none".
export RELEASE_DISTRIBUTION=name
export RELEASE_NODE=<%= @release.name %>

I know that the contents of rel/ are copied in the build steps of the project Dockerfile and it appears to take the .eex file to generate an env.sh file witin the release. I rebuild the Docker image locally and see that env.sh contains the line export RELEASE_NODE=minotaur. The warning I previously got from dns_cluster mentioned I need to use longnames which should be in the format of <name>@<ip_addr> so the configuration I have right now won’t be sufficient. I’m guessing I can update the script to something like export RELEASE_NODE=<%= @release.name %>@$(hostname -I) to generate what I need. I’ll try that out next, but I’ll have to leave my progress here for today.