Debugging the Cluster

I updated rel/env.sh.eex so RELEASE_NODE sets the correct longname for the node when the server is started. I noticed that hostname -I can return multiple IPs since the swarm service containers are part of multiple networks. It seems that the first IP in the list is always in the 10.0.0.0/24 subnet, which I think is what’s needed for the longname (SPOILERS: It’s the wrong network). I’ll just keep my eval expression simple and grab the first IP:

export RELEASE_DISTRIBUTION=name
export RELEASE_NODE=<%= @release.name %>@$(hostname -I | awk '{print $1}')
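For context, here’s roughly what that looks like inside a container. The addresses below are made up for illustration; the point is just that the first field is the one the awk expression grabs:

# Example output only; actual addresses will differ per container
$ hostname -I
10.0.0.12 10.0.1.30 172.18.0.5

# The expression in env.sh.eex keeps only the first field
$ hostname -I | awk '{print $1}'
10.0.0.12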

I build a new Docker image with this change, start a single container, and open a remote Elixir shell with docker exec -it <container_id> bin/minotaur remote to check whether the longname is set correctly. Running node() shows :"[email protected]", which is the longname I’m going for (I still don’t notice anything off about which subnet the node is on).
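For reference, the round trip for this check looks something like the following; the image tag matches the one I use for the service update below, and the build context is assumed to be the project root:

# Build the release image used for this test
docker build -t minotaur:cluster-test .

# Find the running container's ID, then attach a remote IEx shell to the release
docker ps --filter "ancestor=minotaur:cluster-test" --format "{{.ID}}"
docker exec -it <container_id> bin/minotaur remote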

I update the running service with docker service update --replicas 2 --image minotaur:cluster-test minotaur-stack_minotaur to run 2 instances on the new image. Connecting to the first container, node() shows the correct longname, but Node.list() comes back empty, so dns_cluster has not connected the nodes automatically.
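As an aside, a quick way to confirm that both replicas actually came up on the new image is to list the service’s tasks (service name taken from the update command above):

# Show each task, the image it runs, and its current state
docker service ps minotaur-stack_minotaur --format "{{.Name}}  {{.Image}}  {{.CurrentState}}"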

I try to connect to the second container and get a no connection error. The container logs show an error repeating in a loop:

** Cannot get connection id for node :"[email protected]"

I cycle the whole service with 2 fresh replicas, and now the first node is connected to the second (I still haven’t noticed that these addresses are on different subnets):

iex([email protected])2> Node.list()
[:"[email protected]"]

iex([email protected])3> Node.ping(:"[email protected]")
:pong

However, the second container is still logging ** Cannot get connection id for node :"[email protected]" and I can’t even start a remote Elixir shell on it.

In the shell for the first container I run the DNS lookup statement that dns_cluster is using under the hood:

iex([email protected])4> :inet_res.getbyname(~c"tasks.minotaur-stack_minotaur", :a)
{:ok,
 {:hostent, ~c"tasks.minotaur-stack_minotaur", [], :inet, 4,
  [{10, 0, 1, 24}, {10, 0, 1, 25}]}}

Oh wait…

Whoopsie!

Looking back at what I previously wrote, I can see how I could have caught this issue sooner! I made a big assumption that the 10.0.0.0/24 subnet was the overlay network I needed to use, but the overlay is actually 10.0.1.0/24.
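One way I could have caught this sooner is to ask Docker directly which subnet each network uses. The stack’s overlay network name below is a guess based on the stack name (docker network ls shows the real one), and as far as I know 10.0.0.0/24 is the default subnet of the swarm ingress network, which is likely where that first address was coming from:

# Subnet of the swarm ingress network (probably the source of the 10.0.0.x address)
docker network inspect ingress --format '{{(index .IPAM.Config 0).Subnet}}'

# Subnet of the stack's overlay network that the replicas use to talk to each other
docker network inspect minotaur-stack_default --format '{{(index .IPAM.Config 0).Subnet}}'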

Looks like I’ll need a slightly more sophisticated script in env.sh.eex to pull out the matching IP address:

# Get the IP address for the Docker service overlay network
get_overlay_ip() {
  IPS=$(hostname -I)

  for IP in $IPS; do
    # Check if the IP address is within the expected range (10.0.1.0/24)
    if echo "$IP" | grep -q "^10\.0\.1\."; then
      echo "$IP"
      return
    fi
  done
}

IP_ADDRESS=$(get_overlay_ip)

# Set the release to work across nodes.
# RELEASE_DISTRIBUTION must be "sname" (local), "name" (distributed) or "none".
export RELEASE_DISTRIBUTION=name
export RELEASE_NODE=<%= @release.name %>@${IP_ADDRESS}
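One gap I may harden later: if no address matches, get_overlay_ip prints nothing and RELEASE_NODE ends up with an empty host, which fails in a much more confusing way. A small sketch of a fail-fast variant (same matching logic, just an explicit error):

# Alternative helper that exits loudly when no overlay IP is found
get_overlay_ip() {
  for IP in $(hostname -I); do
    case "$IP" in
      10.0.1.*) echo "$IP"; return 0 ;;
    esac
  done
  echo "No IP address in 10.0.1.0/24 found for this container" >&2
  return 1
}

IP_ADDRESS=$(get_overlay_ip) || exit 1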

I rebuild the Docker image with the same tag and force a redeploy of the running service with docker service update --force minotaur-stack_minotaur.

Let’s connect to the two containers and see if we have a cluster:

# Node 1
iex([email protected])1> Node.list()
[:"[email protected]"]
iex([email protected])2> Node.ping(:"[email protected]")
:pong

# Node 2
iex([email protected])1> Node.list()
[:"[email protected]"]
iex([email protected])2> Node.ping(:"[email protected]")
:pong

Success! I spin up a third replica just for fun and verify that it is automatically joined to the cluster:

iex([email protected])3> Node.list()
[:"[email protected]", :"[email protected]"]

This is finally the breakthrough I’ve been waiting for. I feel silly for missing some things along the way, but that is part of the process, and I’ll keep my mistakes visible here. Keeping this journal actually helped me retrace how I went down the wrong path, which is pretty cool!