Swarm quorum loss during Docker for AWS Upgrade
I'm not sure what the best way is to log this, or even what the root issue was, but it's happened twice so far while upgrading to the latest Docker for AWS (17.06.0-ce-aws2): once in a pre-production environment in eu-west-1, and once in a production environment in us-east-1.
Unfortunately I wasn't around when the upgrade happened (yesterday), but I've been troubleshooting and attempting to recover today.
This was a 5 managers/5 workers swarm, and the CFN logs indicated that each of the launches succeeded, with messages like `Received SUCCESS signal with UniqueId i-0a9a78820a77ded35`.
Somewhere, I suspect around the 4th termination, quorum was lost.
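For reference, this is roughly how one could line the instance terminations up against the CloudFormation timeline to confirm when that happened (a sketch; the stack name below is a placeholder for whatever your Docker for AWS stack is actually called):

```
# Sketch: list stack events for the ASG resources to see when each replacement/termination ran.
# "Docker-for-AWS" is a placeholder stack name; substitute your own.
aws cloudformation describe-stack-events \
  --stack-name Docker-for-AWS \
  --query 'StackEvents[?ResourceType==`AWS::AutoScaling::AutoScalingGroup`].[Timestamp,LogicalResourceId,ResourceStatus]' \
  --output table
```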
After I SSHed in and ran `docker info` on each of the managers, I saw the following:
- `10.200.1.115` had `IsManager: true`, and 5 `Manager Addresses` listed
- `10.200.0.236` had `IsManager: true` and 7 `Manager Addresses` listed
- `10.200.1.87` had `IsManager: true` and 7 `Manager Addresses` listed
- `10.200.2.109` had `IsManager: true` and 6 `Manager Addresses` listed
- `10.200.0.59` had `Swarm: inactive`
(the above are listed in order of launch time)
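For anyone wanting to repeat that check, it's essentially this, sketched as a loop (the `docker` SSH user is what Docker for AWS sets up by default, and the IPs are the managers above; adjust both as needed):

```
# Sketch: pull the swarm-related fields out of `docker info` on each manager.
# (The manager address list itself is on indented lines right after "Manager Addresses:",
#  so add e.g. `-A 8` to the grep if you want to count the entries.)
for ip in 10.200.1.115 10.200.0.236 10.200.1.87 10.200.2.109 10.200.0.59; do
  echo "== $ip =="
  ssh docker@"$ip" "docker info 2>/dev/null | grep -E 'Swarm:|Is Manager:|Manager Addresses:'"
done
```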
All I can do is speculate about the cause, but I think what happened is that the LifecycleHook wasn't triggered properly, so the terminated nodes never got removed from all of the managers' node lists.
Either way, quorum was lost during the upgrade.
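If you want to check that theory on your own stack, this is the sort of thing I'd look at (a sketch; the ASG name is a placeholder for the physical name of the `ManagerAsg` resource):

```
# Sketch: confirm the lifecycle hook is still attached to the manager ASG, and review
# recent scaling activity for terminations that failed or were abandoned.
aws autoscaling describe-lifecycle-hooks \
  --auto-scaling-group-name ManagerAsg-physical-name

aws autoscaling describe-scaling-activities \
  --auto-scaling-group-name ManagerAsg-physical-name \
  --max-items 20
```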
Some other strange points:
- each new manager took ~18mins to launch and report SUCCESS
- after the `ManagerAsg` finished updating, the `NodeAsg` started updating, and each update reported SUCCESS (which is odd because at this point there would've been no quorum, so the workers should not have been able to join the swarm)
- the entire upgrade took over 2.5hrs to complete - unusually high in our experience (other regions took ~30mins)
I attempted to recover the swarm with the following steps:
- ran `docker swarm leave --force` on managers 2, 3, and 4 (5 never joined the swarm)
- on manager 1, I ran `docker swarm init --advertise-addr 10.200.1.115 --force-new-cluster`
- I confirmed that `docker node ls` returned that manager as the Leader, and no other managers. It also had entries for nodes from the prior state, which were all listed as `Down`
- I used `docker swarm join-token manager` to get the command to join a new manager
- I ran the command on manager 2, but the command took a very long time. It eventually finished:

  ```
  ~ $ docker swarm join --token SWMTKN-1-63jf319835r1lrtxt9b6cht6v5lojtesavdzl5mkx5n9ol68xl-e9r835v6mf6bw5wu60b4o1x5z 10.200.1.115:2377
  This node joined a swarm as a manager.
  ```

- I attempted to run `docker node ls` on manager 1, but it failed:

  ```
  Error response from daemon: rpc error: code = 4 desc = context deadline exceeded
  ```

- I attempted to run `docker node ls` on manager 2, and it failed the same way:

  ```
  ~ $ docker node ls
  Error response from daemon: rpc error: code = 4 desc = context deadline exceeded
  ```
So apparently quorum was somehow lost immediately after the 2nd manager joined successfully.
I attempted to recover again by running `docker swarm leave --force` on manager 2 and `docker swarm init --force-new-cluster` on manager 1, this time also running `docker node rm` on every other node in the list to make sure there wasn't stale node state in the way, but I saw the same behaviour.
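For clarity, the cleanup I was attempting looked roughly like this (a sketch; it assumes the surviving manager's CLI supports `--format` on `docker node ls`):

```
# Sketch: after `--force-new-cluster`, drop every node entry that is still listed but Down,
# so stale state can't interfere with re-joining managers.
docker node ls --format '{{.ID}} {{.Status}}' |
  awk '$2 == "Down" { print $1 }' |
  while read -r id; do docker node rm --force "$id"; done
```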
In the end, I initialized a new swarm from one of the other managers and was able to join every other manager except the first. I couldn't successfully run `docker swarm leave --force` on manager 1 and ended up having to terminate it.
Sorry for the long rambling report, but I'm really not sure where to start with this. I think for future upgrades we'll obviously need to babysit the swarm, but I'm worried that we'll lose quorum again.
I'm dealing with this problem right now. It's happened to me multiple times... a massive pain.
@sshorkey I should've updated this issue earlier, but I have some further updates for this...
The short story is that there seems to be a bug where the swarm state got inconsistent, and hosts that were removed weren't properly reflected in the raft log. In my scenario, I actually disabled the rolling upgrades and started rolling the nodes manually, and when I left one of them alone for ~4hrs it did eventually join. While it was doing so, it was making lots of TCP connections to these "ghost" nodes and waiting around on timeouts.
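I don't have the exact output any more, but this is roughly how you can watch that happening on the node that's stuck joining (a sketch; it assumes `ss` or `netstat` is available on the host):

```
# Sketch: look for connection attempts toward managers that no longer exist
# (2377 = swarm/raft management, 7946 = gossip).
sudo ss -tnp state syn-sent '( dport = :2377 or dport = :7946 )'
# or, if ss isn't available:
sudo netstat -tnp | grep -E ':2377|:7946'
```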
The fix was to configure raft snapshots to happen more often, by setting the `snapshot-interval` value to 1000 (from a default of 10000). In my particular case, my swarms aren't terribly active, so it takes ~2 months for 10000 entries to be added to the raft log. Once I changed that setting (with the `docker swarm update --snapshot-interval 1000` command), a snapshot was almost immediately created, and hosts added after that point joined and were fully communicative within a few seconds.
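Concretely, the change was just the following, run on a manager; as far as I can tell the active value also shows up in `docker info` under the `Raft:` section, which is one way to confirm it took effect:

```
# Take raft snapshots every 1000 log entries instead of the default 10000.
docker swarm update --snapshot-interval 1000

# Check the active raft settings on a manager.
docker info | grep -A 2 'Raft:'
```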
This is more of a workaround than a fix, because the raft log shouldn't have gotten in that state to begin with.
Hope that helps!