for-aws
stack deploy causes nodes to randomly fail health checks, making the stack unusable
Expected behavior
I followed the getting started guide and deployed a Docker swarm a few months ago. There were a few hiccups, but docker stack deploy --compose-file docker-compose.yml stack --with-registry-auth mostly worked well until June 28 (maybe this is related to Docker 17.06 being released then?). Not sure if this matters, but as per the docker-compose.yml below, we are using the same image to launch multiple services (with different commands).
Actual behavior
Since June 28, most (but not all!) deploys cause nodes to be randomly restarted by the ELB health check, making everything unusable.
Updating service stack_service1 (id: ln6mpm5gxzxa7sn6suibr78kx)
Updating service stack_service2 (id: 00aouqsrr8sm8tvbgkuvdyyql)
Updating service stack_service3 (id: k4ze8u9mj8ppkyfphjvc4v7ez)
Updating service stack_service4 (id: v8tu2gisouxewxhvn4uoiio8r)
Updating service stack_visualizer (id: jd7wlqxdxkd74ogl9uemwfhxv)
Error response from daemon: rpc error: code = 2 desc = update out of sequence
Sometimes, the swarm is left with too few managers:
Error response from daemon: rpc error: code = 2 desc = The swarm does not have a leader. It's possible that too few managers are online. Make sure more than half of the managers are online.
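For anyone who lands in this "no leader" state: once a majority of managers is permanently gone, the raft cluster cannot recover on its own and must be rebuilt from a surviving manager. A hedged sketch of that recovery, not from this issue — the advertise address is one of the manager IPs from the node list later in this thread, used here as a placeholder:

```shell
# On a manager that is still running, force a new single-manager cluster.
# This keeps existing services and tasks but resets the raft membership.
docker swarm init --force-new-cluster --advertise-addr 172.31.5.185

# Verify the swarm has a leader again.
docker node ls

# Remove stale entries for managers that are gone for good
# (placeholder ID; substitute the real one from `docker node ls`).
OLD_NODE_ID=5fyh6kxsa13be9spn7msgyyjn
docker node rm --force "$OLD_NODE_ID"
```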
Information
When this first happened, we had ~1 hour of downtime and the hot fix was to just deploy a fresh swarm using the cloudformation template. But the same thing happened when we tried to update that one. We've now switched to a manual setup.
- Full output of the diagnostics from "docker-diagnose" run from one of the instances
OK hostname=ip-172-31-9-76-ec2-internal session=1500250718-yhymVrXtQdTg84WRrOWECkVXDmRYwpDx
OK hostname=ip-172-31-29-59-ec2-internal session=1500250718-yhymVrXtQdTg84WRrOWECkVXDmRYwpDx
OK hostname=ip-172-31-17-121-ec2-internal session=1500250718-yhymVrXtQdTg84WRrOWECkVXDmRYwpDx
OK hostname=ip-172-31-42-178-ec2-internal session=1500250718-yhymVrXtQdTg84WRrOWECkVXDmRYwpDx
curl: (7) Failed to connect to 172.31.42.169 port 44554: Connection refused
OK hostname=ip-172-31-41-135-ec2-internal session=1500250718-yhymVrXtQdTg84WRrOWECkVXDmRYwpDx
Done requesting diagnostics.
Your diagnostics session ID is 1500250718-yhymVrXtQdTg84WRrOWECkVXDmRYwpDx
Please provide this session ID to the maintainer debugging your issue.
- A reproducible case if this is a bug, Dockerfiles FTW

docker-compose.yml:
version: "3"
services:
  service1:
    image: user/same_repo
    networks:
      - webnet
    ports:
      - "443:443"
    logging:
      driver: "json-file"
    deploy:
      replicas: 2
      restart_policy:
        condition: on-failure
      update_config:
        parallelism: 1
        delay: 10s
    command: gunicorn -c app/src/wsgi/gunicorn_conf.py app.src.wsgi.wsgi:app
  service2:
    image: user/same_repo
    networks:
      - webnet
    ports:
      - "8001:8001"
    logging:
      driver: "json-file"
    deploy:
      replicas: 1
      restart_policy:
        condition: on-failure
      update_config:
        parallelism: 1
        delay: 10s
    command: gunicorn -c app/jobs/wsgi/gunicorn_conf.py app.jobs.wsgi.wsgi:app
  visualizer:
    image: dockersamples/visualizer:stable
    ports:
      - "8090:8080"
    volumes:
      - "/var/run/docker.sock:/var/run/docker.sock"
    deploy:
      placement:
        constraints: [node.role == manager]
    networks:
      - webnet
  service3:
    image: user/other_repo
    deploy:
      replicas: 1
      restart_policy:
        condition: on-failure
    logging:
      driver: "json-file"
    networks:
      - webnet
    ports:
      - "7001:7001"
      - "6000:6000"
      - "6001:6001"
networks:
  webnet:
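For reference, the deploy command from the report plus a quick way to watch the rollout and see per-task errors (stack name "stack" as used above):

```shell
# Deploy (or update) the stack, forwarding registry credentials to the nodes.
docker stack deploy --compose-file docker-compose.yml stack --with-registry-auth

# Watch replica counts converge.
docker stack services stack

# Inspect individual tasks, including the error column, without truncation.
docker stack ps stack --no-trunc
```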
@grigored sorry for your trouble. A few questions.
- what version of docker for aws are you using?
- If you log into the aws console and look at the ELB, do you see any configured listeners? If so, what do you see. If not, does the ELB think the nodes attached to it are available or unavailable?
thanks for your help.
- I created the CloudFormation stack last Friday by clicking on "Deploy Docker Community Edition (CE) for AWS (stable)" here: https://docs.docker.com/docker-for-aws/#quickstart. The version is Docker CE for AWS 17.06.0-ce (17.06.0-ce-aws1).
- Yesterday I deleted the Docker for AWS CloudFormation stack and manually deployed my containers (I don't want two stacks running at the same time, as we have some cron tasks that access the same database). I just started a new stack so I can answer your questions.
Yesterday I was seeing this when running docker node ls
5fyh6kxsa13be9spn7msgyyjn ip-172-31-42-178.ec2.internal Down Active
6uwasv5kzcwer42ramyw5cvm2 ip-172-31-12-255.ec2.internal Down Active Unreachable
brmmxk57sit9gci0517vxzgns * ip-172-31-24-173.ec2.internal Ready Active Reachable
ixrhl4squouvuohat2nt4v7aj ip-172-31-9-177.ec2.internal Down Active
prajsbt7h3rs11757zdko75bu ip-172-31-44-195.ec2.internal Ready Active Reachable
vfe0aqxpg3z4twck700hqy2gq ip-172-31-5-185.ec2.internal Ready Active Leader
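As a side note on that node list: Down nodes linger in the swarm's membership until they are removed, and a Down manager still counts toward the raft quorum. A sketch of cleaning them up from a healthy manager, using the IDs shown above (this removes the stale entries; it does not fix whatever killed the instances):

```shell
# Demote the unreachable manager first, so it stops counting toward quorum.
docker node demote 6uwasv5kzcwer42ramyw5cvm2

# Then remove the Down nodes so they no longer appear in `docker node ls`.
docker node rm 5fyh6kxsa13be9spn7msgyyjn
docker node rm 6uwasv5kzcwer42ramyw5cvm2
docker node rm ixrhl4squouvuohat2nt4v7aj
```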
The ELB has these listeners configured:

Load Balancer Protocol | Load Balancer Port | Instance Protocol | Instance Port | Cipher | SSL Certificate
-- | -- | -- | -- | -- | --
TCP | 7 | TCP | 7 | N/A | N/A
TCP | 443 | TCP | 443 | N/A | N/A
TCP | 6000 | TCP | 6000 | N/A | N/A
TCP | 6001 | TCP | 6001 | N/A | N/A
TCP | 7001 | TCP | 7001 | N/A | N/A
TCP | 8002 | TCP | 8002 | N/A | N/A
TCP | 8090 | TCP | 8090 | N/A | N/A
The ELB used to think some nodes were unavailable, as I was seeing this in Auto Scaling Groups:
Description: Terminating EC2 instance: i-0cf93a5913fbed2ed
Cause: At 2017-07-17T00:24:56Z an instance was taken out of service in response to a ELB system health check failure.
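The same information can be pulled from the AWS CLI instead of the console; a sketch for the classic ELB, assuming a placeholder load balancer name (the real name is generated by the CloudFormation template and can be read from the stack's resources):

```shell
# List the listeners configured on the classic ELB (name is a placeholder).
aws elb describe-load-balancers --load-balancer-names my-swarm-elb \
  --query 'LoadBalancerDescriptions[].ListenerDescriptions[].Listener'

# See whether the ELB considers each instance InService or OutOfService.
aws elb describe-instance-health --load-balancer-name my-swarm-elb
```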
@grigored thanks for the info, this might be related to a kernel issue we found with docker for aws 17.06, and we should have an update released shortly that will fix the issue. I'll let you know when it is available.
Ok, thanks for looking into this, please keep me posted.
We have the same problem, same version of Docker (17.06.0-ce). I have had several swarms fail on me in this exact way: a manager becomes unresponsive, the ELB removes it, the auto scaling group creates a replacement, and that replacement can't register into the swarm. The only way to 'fix' it is to recreate the swarm, which is quite an inconvenience for us and our users.
Hello everyone - We've updated our latest release to have the kernel patch that ken mentioned. Please let us know if you experience the same issue.
So to use the latest update we should just run "Deploy Docker Community Edition (CE) for AWS (stable)" here: https://docs.docker.com/docker-for-aws/#quickstart ?
@grigored yes, that is correct.
Great, will give it a try tonight and let you know how it goes in the following days.
I got around this problem by banning my autoscaling group from being able to bring down nodes. (I added "ReplaceUnhealthy" and "Terminate" to the list of Suspended Processes for the CloudFormation-generated ASG).
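For anyone wanting to apply @sshorkey's workaround from the CLI rather than the console, a sketch — the ASG name is a placeholder; the real, CloudFormation-generated name can be found in the stack's resources tab:

```shell
# Stop the ASG from terminating instances that fail the ELB health check.
aws autoscaling suspend-processes \
  --auto-scaling-group-name my-swarm-asg \
  --scaling-processes ReplaceUnhealthy Terminate

# Undo the workaround once the underlying issue is fixed.
aws autoscaling resume-processes \
  --auto-scaling-group-name my-swarm-asg \
  --scaling-processes ReplaceUnhealthy Terminate
```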
I didn't switch back to docker for aws yet, but wanted to do that soon. So, @sshorkey, you're saying the problem still happens nowadays?
@grigored We'll be changing the health check to focus on the node rather than the docker service, preventing some of the churn @sshorkey is seeing.