
Swarm quorum loss during Docker for AWS Upgrade

Open hairyhenderson opened this issue 8 years ago • 2 comments

I'm not sure of the best way to log this, or even what the root issue was, but it's happened twice so far while upgrading to the latest Docker for AWS (17.06.0-ce-aws2): once in a pre-production environment in eu-west-1, and once in a production environment in us-east-1.

Unfortunately I wasn't around when the upgrade happened (yesterday), but I've been troubleshooting and attempting to recover today.

This was a 5-manager/5-worker swarm, and the CFN logs indicated that each of the launches succeeded, with messages like Received SUCCESS signal with UniqueId i-0a9a78820a77ded35.

Somewhere, I suspect around the 4th termination, quorum was lost (with 5 managers, at least 3 need to remain reachable to keep quorum).

After I SSHed in and ran docker info on each of the managers, I saw the following:

  • 10.200.1.115 had IsManager: true, and 5 Manager Addresses listed
  • 10.200.0.236 had IsManager: true and 7 Manager Addresses listed
  • 10.200.1.87 had IsManager: true and 7 Manager Addresses listed
  • 10.200.2.109 had IsManager: true and 6 Manager Addresses listed
  • 10.200.0.59 had Swarm: inactive

(the above are listed in order of launch time)
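For reference, roughly the same check can be scripted per manager; this is only a sketch, and it assumes the standard docker info Go template fields (.Swarm.LocalNodeState, .Swarm.ControlAvailable, .Swarm.RemoteManagers) and the usual docker SSH user on Docker for AWS hosts:

# run this on each manager (e.g. via ssh docker@<manager-ip>)
~ $ docker info --format 'state={{.Swarm.LocalNodeState}} manager={{.Swarm.ControlAvailable}} peers={{len .Swarm.RemoteManagers}}'
# peers is the length of the Manager Addresses list, so 10.200.0.236 above would report peers=7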

All I can do is speculate about the cause, but I think what happened is that the LifecycleHook wasn't triggered properly, so the terminated nodes never got removed from all of the managers' node lists.
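One way to sanity-check that theory would be to look at the hook configuration and the recent scaling activities from the AWS CLI; this is only a sketch, and <ManagerAsg-name> is a placeholder for whatever the CFN stack actually named the manager auto-scaling group:

~ $ aws autoscaling describe-lifecycle-hooks --auto-scaling-group-name <ManagerAsg-name>
# recent launches/terminations, to see whether the hook was invoked for each terminated manager
~ $ aws autoscaling describe-scaling-activities --auto-scaling-group-name <ManagerAsg-name> --max-items 20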

Either way, quorum was lost during the upgrade.

Some other strange points:

  • each new manager took ~18mins to launch and report SUCCESS
  • after the ManagerAsg finished updating, the NodeAsg started updating, and each update reported SUCCESS (which is odd because at this point there would've been no quorum, so the workers should not have been able to join the swarm)
  • the entire upgrade took over 2.5hrs to complete - unusually long in our experience (other regions took ~30mins)

I attempted to recover the swarm with the following steps (also summarized as a command sketch after the list):

  1. ran docker swarm leave --force on managers 2, 3, and 4 (5 never joined the swarm)
  2. on manager 1, I ran docker swarm init --advertise-addr 10.200.1.115 --force-new-cluster
  3. I confirmed that docker node ls showed that manager as the Leader, with no other managers. It also listed nodes from the prior state, all of which were shown as Down
  4. I used docker swarm join-token manager to get the command to join a new manager
  5. I ran the join command on manager 2, but it took a very long time. It eventually finished:
~ $  docker swarm join --token SWMTKN-1-63jf319835r1lrtxt9b6cht6v5lojtesavdzl5mkx5n9ol68xl-e9r835v6mf6bw5wu60b4o1x5z 10.200.1.115:2377
This node joined a swarm as a manager.
  6. I attempted to run docker node ls on manager 1, but it failed:
Error response from daemon: rpc error: code = 4 desc = context deadline exceeded
  7. I attempted to run docker node ls on manager 2, and it failed the same way:
~ $ docker node ls
Error response from daemon: rpc error: code = 4 desc = context deadline exceeded
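
Summarizing the steps above as commands (a sketch only; <token> is a placeholder for the join token printed by docker swarm join-token manager):

# on managers 2-4 (manager 5 never joined the swarm):
~ $ docker swarm leave --force
# on manager 1, rebuild a single-node cluster from the existing raft state:
~ $ docker swarm init --advertise-addr 10.200.1.115 --force-new-cluster
~ $ docker node ls    # should show only this node as Leader, plus the old Down entries
~ $ docker swarm join-token manager
# on manager 2, run the printed command:
~ $ docker swarm join --token <token> 10.200.1.115:2377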

So, somehow, quorum was lost again immediately after the 2nd manager joined successfully.

I attempted to recover again: I ran docker swarm leave --force on manager 2 and docker swarm init --force-new-cluster on manager 1 again, this time also running docker node rm on each other node in the list to make sure there wasn't stale node state in the way, but I saw the same behaviour.
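That cleanup step looked roughly like this (a sketch; the node ls format fields are the standard ones, and the awk/xargs glue is just one way to do it):

# after --force-new-cluster on manager 1, remove every entry still marked Down
# (add xargs -r, if your xargs supports it, to skip the call when nothing is Down)
~ $ docker node ls --format '{{.ID}} {{.Hostname}} {{.Status}}' | awk '$3 == "Down" {print $1}' | xargs docker node rm --force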

In the end, I initialized a new swarm from one of the other managers, and was able to join all of the other managers except the first. I couldn't successfully run docker swarm leave --force on manager 1, and ended up having to terminate it.

Sorry for the long rambling report, but I'm really not sure where to start with this. I think for future upgrades we'll obviously need to babysit the swarm, but I'm worried that we'll lose quorum again.

hairyhenderson avatar Jul 27 '17 21:07 hairyhenderson

I'm dealing with this problem right now. It's happened to me multiple times... a massive pain.

OmpahDev avatar Oct 10 '17 14:10 OmpahDev

@sshorkey I should've updated this issue earlier; I have some further updates on this...

The short story is: there seems to be a bug where the Swarm state became inconsistent, and hosts that had been removed weren't properly reflected in the raft log. In my scenario I disabled the rolling upgrades and started rolling the nodes manually, and when I left one of them alone for ~4hrs it did eventually join. While it was joining, it was making lots of TCP connections to these "ghost" nodes and waiting around on timeouts.
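If you want to watch that happening while a node joins, something like this shows the stuck connection attempts to managers that no longer exist (a sketch; it assumes ss from iproute2 is available on the host, and 2377/7946 are the swarm management and gossip ports):

# connections stuck in SYN-SENT towards old manager IPs are the "ghosts"
~ $ ss -tn state syn-sent '( dport = :2377 or dport = :7946 )'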

The fix was to make raft snapshots happen more often, by setting the snapshot-interval value to 1000 (down from the default of 10000). In my particular case my swarms aren't terribly active, so it takes ~2 months for 10000 entries to accumulate in the raft log. Once I changed that setting (with the docker swarm update --snapshot-interval 1000 command), a snapshot was created almost immediately, and hosts added after that point joined and were fully communicative within a few seconds.
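For anyone else applying the workaround, the two relevant commands (run on a manager) look roughly like this; the grep is only there to pull the Raft block out of docker info, and the exact line count may vary by Docker version:

# check the current raft settings (Snapshot Interval shows up under the Swarm section)
~ $ docker info | grep -A 4 'Raft:'
# take a snapshot every 1000 raft log entries instead of every 10000
~ $ docker swarm update --snapshot-interval 1000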

This is more of a workaround than a fix, because the raft log shouldn't have gotten in that state to begin with.

Hope that helps!

hairyhenderson avatar Oct 10 '17 14:10 hairyhenderson