ELB Issue Killed Docker Swarm Quorum due to Auto Scaling Group
Expected behavior
Loss of connectivity from the Docker for AWS elastic load balancer should not cause the Docker Swarm to permanently lose its quorum state.
Actual behavior
In a nutshell, the interaction between the load balancer health check and the Auto Scaling Group configuration in Docker for AWS caused the manager nodes to be terminated all at once, or at least very close together, so that the replacement manager nodes could not restore the cluster state quickly enough to keep quorum across multiple restart cycles.
Information
Using Docker for AWS 17.12.1 CE
It looks like the Auto Scaling Group is configured to terminate a manager node if it becomes unhealthy and then spin up a new one to replace it. The new node should be able to fetch the state from the other managers in the quorum, so the quorum state of the cluster is maintained.
However, when the health checks on all of the manager nodes fail at once, all of the manager nodes are terminated too quickly for the quorum state to survive.
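For context, here is a minimal sketch of how to inspect the settings that matter here on the manager ASG. It assumes boto3 (Python), which is just my choice of tooling; the ASG name is the one from my stack and will differ in yours.

```python
# Minimal sketch, assuming boto3; the ASG name is from my stack and will
# differ in yours. It only reads the settings discussed in this issue.
import boto3

autoscaling = boto3.client("autoscaling")

resp = autoscaling.describe_auto_scaling_groups(
    AutoScalingGroupNames=["Docker-Swarm-Production-ManagerAsg-GGX7J7LI7L4S"]
)
asg = resp["AutoScalingGroups"][0]

# With HealthCheckType == "ELB" and MinSize == 0, a cluster-wide ELB
# health-check failure can let the ASG cycle through every manager at once.
print("HealthCheckType :", asg["HealthCheckType"])
print("MinSize         :", asg["MinSize"])
print("DesiredCapacity :", asg["DesiredCapacity"])
```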
This state was triggered when I attempted to update the load balancer security group to restrict traffic to my company's internal network. I did not realize that my update would also disallow all outbound traffic from the load balancer, which in turn made the Auto Scaling Group ELB health checks fail for the whole cluster. *shakes fist at ansible*
By the time I figured out what had happened, the cluster manager nodes had been terminated and recreated half a dozen times or so. When I fixed the ELB security group, the quorum state had already been lost, including my running stacks, my Docker secrets, and any non-Cloudstor volumes. (Thank goodness the Cloudstor volumes with the actual app data were fine.)
I can think of a simple fix for this, but I'm unsure if it would have any unintended consequences.
If the Auto Scaling Group for the manager nodes (Docker-Swarm-Production-ManagerAsg-GGX7J7LI7L4S) were set with a minimum of 1 node instead of a minimum of 0 nodes, it would be nearly impossible for the swarm to lose quorum because of the Auto Scaling Group itself. I think this would still allow autoscaling to operate properly for node-level issues, while making the cluster more resilient to a cluster-wide or networking-related failure such as this one.
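For illustration, here is a sketch of that change, again assuming boto3 (the AWS CLI or an edit to the CloudFormation template would accomplish the same thing); the ASG name is from my stack.

```python
# Sketch only, assuming boto3: raise the manager ASG's MinSize from 0 to 1.
# A lasting fix would belong in the Docker for AWS CloudFormation template,
# since stack updates can revert manual changes like this one.
import boto3

autoscaling = boto3.client("autoscaling")

autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="Docker-Swarm-Production-ManagerAsg-GGX7J7LI7L4S",
    MinSize=1,
)
```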
Really interested to hear the expert opinions on this one, and happy to follow up on any questions you might have for me.
Steps to reproduce the behavior
- Change the configuration of the Docker for AWS external load balancer security group so that it does not allow any outbound traffic (see the sketch after this list).
- Watch the mayhem of instances terminating and regenerating for something like ~20 minutes (not sure how fast this happens or whether it will always occur).
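A sketch of the first step, assuming boto3 and a placeholder security group ID (substitute the ID of your stack's external ELB security group):

```python
# Sketch only, assuming boto3 and a hypothetical security group ID:
# revoke every egress rule on the ELB security group so the ELB can no
# longer reach the instances' health-check port.
import boto3

ec2 = boto3.client("ec2")

SG_ID = "sg-0123456789abcdef0"  # hypothetical ELB security group ID

sg = ec2.describe_security_groups(GroupIds=[SG_ID])["SecurityGroups"][0]
if sg["IpPermissionsEgress"]:
    ec2.revoke_security_group_egress(
        GroupId=SG_ID,
        IpPermissions=sg["IpPermissionsEgress"],
    )

# With no outbound rules, the ELB health checks fail for every target, the
# ASG marks all managers unhealthy, and the terminate/recreate cycle begins.
```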
Changing (incorrectly) the configuration of an ELB health check will always cause the VMs to be killed and re-created. Auto Scaling Groups have no knowledge of the type of VMs they are running, nor are there rules in place to say "wait until we have quorum".
Feel free to experiment with the ASG to see if setting the MinSize does prevent this issue. Happy to add it to our template if it does.
I'm not sure what the real consequences of my decision are, but for the manager ASG I changed HealthCheckType to EC2 (instead of ELB). The reasoning behind this was the bitter experience that a service removed from the stack was still being health-checked by the ELB, which took down the managers and the cluster lost quorum. I ran 3 managers back then; now I run 5, so it's possible the cluster would survive such an operation.
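For reference, a minimal sketch of that change, assuming boto3 and a placeholder ASG name (use the manager ASG from your own stack):

```python
# Sketch only, assuming boto3: switch the manager ASG from ELB health checks
# to plain EC2 status checks, as described above.
import boto3

autoscaling = boto3.client("autoscaling")

autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="Docker-Swarm-Production-ManagerAsg-EXAMPLE",  # hypothetical name
    HealthCheckType="EC2",
)
```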
@westfood no real consequence, except that you don't know whether Docker is still alive on the machine. The idea behind the ELB check was to confirm that our docker diagnose container was still up and running; if it died, then Docker was probably done for as well. In practice the story was different, as you found out.