
stack deploy causes nodes to randomly fail health checks making the stack unusable

Open grigored opened this issue 8 years ago • 12 comments

Expected behavior

I followed the getting started guide and deployed a Docker swarm a few months ago. There were a few hiccups, but docker stack deploy --compose-file docker-compose.yml stack --with-registry-auth mostly worked well until June 28 (maybe related to Docker 17.06 being released then?). Not sure if this matters, but as per the compose file below, we are using the same image to launch multiple services (with different commands).

Actual behavior

Since June 28, most (but not all!) deploys cause nodes to be randomly restarted by the ELB health check, making everything unusable.

Updating service stack_service1 (id: ln6mpm5gxzxa7sn6suibr78kx)
Updating service stack_service2 (id: 00aouqsrr8sm8tvbgkuvdyyql)
Updating service stack_service3 (id: k4ze8u9mj8ppkyfphjvc4v7ez)
Updating service stack_service4 (id: v8tu2gisouxewxhvn4uoiio8r)
Updating service stack_visualizer (id: jd7wlqxdxkd74ogl9uemwfhxv)
Error response from daemon: rpc error: code = 2 desc = update out of sequence
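The "update out of sequence" error is usually transient, and re-running the same docker stack deploy generally converges. A minimal retry wrapper (a sketch, not an official fix; retry and RETRY_DELAY are hypothetical names):

```shell
# Sketch: retry a command a few times before giving up, for transient
# "update out of sequence" failures. RETRY_DELAY defaults to 5 seconds.
retry() {
  max=$1; shift
  n=1
  while ! "$@"; do
    if [ "$n" -ge "$max" ]; then
      echo "failed after $max attempts: $*" >&2
      return 1
    fi
    n=$((n + 1))
    sleep "${RETRY_DELAY:-5}"
  done
}

# Usage (the deploy invocation from the report above):
# retry 3 docker stack deploy --compose-file docker-compose.yml stack --with-registry-auth
```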

Sometimes, the swarm is left with too few managers:

Error response from daemon: rpc error: code = 2 desc = The swarm does not have a leader. It's possible that too few managers are online. Make sure more than half of the managers are online.
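This error reflects Raft quorum loss: a swarm with N managers needs more than half of them, i.e. floor(N/2) + 1, online to elect a leader. A tiny helper (hypothetical name, for illustration only) makes the threshold concrete:

```shell
# Sketch: Raft quorum size for N swarm managers is floor(N/2) + 1.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

quorum 3   # -> 2: a 3-manager swarm tolerates only 1 manager failure
quorum 5   # -> 3: a 5-manager swarm tolerates 2 failures
```

So with the default 3 managers, losing 2 of them (as in the docker node ls output later in this thread) leaves the swarm leaderless.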

Information

When this first happened, we had ~1 hour of downtime, and the hotfix was simply to deploy a fresh swarm using the CloudFormation template. But the same thing happened when we tried to update that one. We've now switched to a manual setup.

  • Full output of the diagnostics from "docker-diagnose" run from one of the instances
OK hostname=ip-172-31-9-76-ec2-internal session=1500250718-yhymVrXtQdTg84WRrOWECkVXDmRYwpDx
OK hostname=ip-172-31-29-59-ec2-internal session=1500250718-yhymVrXtQdTg84WRrOWECkVXDmRYwpDx
OK hostname=ip-172-31-17-121-ec2-internal session=1500250718-yhymVrXtQdTg84WRrOWECkVXDmRYwpDx
OK hostname=ip-172-31-42-178-ec2-internal session=1500250718-yhymVrXtQdTg84WRrOWECkVXDmRYwpDx
curl: (7) Failed to connect to 172.31.42.169 port 44554: Connection refused
OK hostname=ip-172-31-41-135-ec2-internal session=1500250718-yhymVrXtQdTg84WRrOWECkVXDmRYwpDx
Done requesting diagnostics.
Your diagnostics session ID is 1500250718-yhymVrXtQdTg84WRrOWECkVXDmRYwpDx

Please provide this session ID to the maintainer debugging your issue.

  • A reproducible case if this is a bug, Dockerfiles FTW: docker-compose.yml
version: "3"
services:
  service1:
    image: user/same_repo
    networks:
      - webnet
    ports:
      - "443:443"
    logging:
      driver: "json-file"
    deploy:
      replicas: 2
      restart_policy:
        condition: on-failure
      update_config:
        parallelism: 1
        delay: 10s
    command: gunicorn -c app/src/wsgi/gunicorn_conf.py app.src.wsgi.wsgi:app

  service2:
    image: user/same_repo
    networks:
      - webnet
    ports:
      - "8001:8001"
    logging:
      driver: "json-file"
    deploy:
      replicas: 1
      restart_policy:
        condition: on-failure
      update_config:
        parallelism: 1
        delay: 10s
    command: gunicorn -c app/jobs/wsgi/gunicorn_conf.py app.jobs.wsgi.wsgi:app

  visualizer:
    image: dockersamples/visualizer:stable
    ports:
      - "8090:8080"
    volumes:
      - "/var/run/docker.sock:/var/run/docker.sock"
    deploy:
      placement:
        constraints: [node.role == manager]
    networks:
      - webnet

  service3:
    image: user/other_repo
    deploy:
      replicas: 1
      restart_policy:
        condition: on-failure
    logging:
      driver: "json-file"
    networks:
      - webnet
    ports:
      - "7001:7001"
      - "6000:6000"
      - "6001:6001"

networks:
  webnet:


grigored avatar Jul 17 '17 03:07 grigored

@grigored sorry for your trouble. A few questions.

  1. what version of docker for aws are you using?
  2. If you log into the aws console and look at the ELB, do you see any configured listeners? If so, what do you see. If not, does the ELB think the nodes attached to it are available or unavailable?

kencochrane avatar Jul 17 '17 13:07 kencochrane

thanks for your help.

  1. I created the CloudFormation stack last Friday by clicking Deploy Docker Community Edition (CE) for AWS (stable) here: https://docs.docker.com/docker-for-aws/#quickstart. The version is Docker CE for AWS 17.06.0-ce (17.06.0-ce-aws1)
  2. Yesterday I deleted the Docker for AWS CloudFormation stack and manually deployed my containers (I don't want two stacks running at the same time, as we have some cron tasks that access the same database). I started a new stack just so I can answer your questions.

Yesterday I was seeing this when running docker node ls

ID                            HOSTNAME                        STATUS              AVAILABILITY        MANAGER STATUS
5fyh6kxsa13be9spn7msgyyjn     ip-172-31-42-178.ec2.internal   Down                Active              
6uwasv5kzcwer42ramyw5cvm2     ip-172-31-12-255.ec2.internal   Down                Active              Unreachable
brmmxk57sit9gci0517vxzgns *   ip-172-31-24-173.ec2.internal   Ready               Active              Reachable
ixrhl4squouvuohat2nt4v7aj     ip-172-31-9-177.ec2.internal    Down                Active              
prajsbt7h3rs11757zdko75bu     ip-172-31-44-195.ec2.internal   Ready               Active              Reachable
vfe0aqxpg3z4twck700hqy2gq     ip-172-31-5-185.ec2.internal    Ready               Active              Leader
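For cleanup, the stale Down entries can be pulled out of that listing with awk and then removed with docker node rm. A sketch over the captured output above (in practice, pipe the live docker node ls instead of the saved text):

```shell
# Sketch: extract the IDs of nodes stuck in the Down state. NODE_LS holds the
# captured `docker node ls` listing from this thread, without the header row.
NODE_LS='5fyh6kxsa13be9spn7msgyyjn     ip-172-31-42-178.ec2.internal   Down                Active
6uwasv5kzcwer42ramyw5cvm2     ip-172-31-12-255.ec2.internal   Down                Active              Unreachable
brmmxk57sit9gci0517vxzgns *   ip-172-31-24-173.ec2.internal   Ready               Active              Reachable
ixrhl4squouvuohat2nt4v7aj     ip-172-31-9-177.ec2.internal    Down                Active
prajsbt7h3rs11757zdko75bu     ip-172-31-44-195.ec2.internal   Ready               Active              Reachable
vfe0aqxpg3z4twck700hqy2gq     ip-172-31-5-185.ec2.internal    Ready               Active              Leader'

# Field 3 is STATUS, unless field 2 is the current-node marker "*", in which
# case STATUS shifts to field 4. Prints the IDs of the three Down nodes.
printf '%s\n' "$NODE_LS" | awk '($2 == "*" ? $4 : $3) == "Down" {print $1}'

# Each printed ID could then be removed with: docker node rm <ID>
```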

The ELB has these listeners configured:

Load Balancer Protocol | Load Balancer Port | Instance Protocol | Instance Port | Cipher | SSL Certificate
-- | -- | -- | -- | -- | --
TCP | 7 | TCP | 7 | N/A | N/A
TCP | 443 | TCP | 443 | N/A | N/A
TCP | 6000 | TCP | 6000 | N/A | N/A
TCP | 6001 | TCP | 6001 | N/A | N/A
TCP | 7001 | TCP | 7001 | N/A | N/A
TCP | 8002 | TCP | 8002 | N/A | N/A
TCP | 8090 | TCP | 8090 | N/A | N/A

The ELB used to think some nodes were unavailable, as I was seeing this in Auto Scaling Groups: Description: Terminating EC2 instance: i-0cf93a5913fbed2ed. Cause: At 2017-07-17T00:24:56Z an instance was taken out of service in response to an ELB system health check failure.

grigored avatar Jul 17 '17 14:07 grigored

@grigored thanks for the info, this might be related to a kernel issue we found with docker for aws 17.06, and we should have an update released shortly that will fix the issue. I'll let you know when it is available.

kencochrane avatar Jul 17 '17 15:07 kencochrane

Ok, thanks for looking into this, please keep me posted.

grigored avatar Jul 17 '17 15:07 grigored

We have the same problem, same version of Docker (17.06.0-ce) - I have had several swarms fail on me in this exact way: a manager becomes unresponsive, the ELB removes it, the auto scaling group creates a replacement, and the replacement can't register with the swarm. The only way to 'fix' it is to recreate the swarm, which is quite an inconvenience for us and our users.

vpetroff avatar Jul 17 '17 17:07 vpetroff

Hello everyone - We've updated our latest release to have the kernel patch that ken mentioned. Please let us know if you experience the same issue.

FrenchBen avatar Jul 17 '17 18:07 FrenchBen

So to use the latest update we should just run Deploy Docker Community Edition (CE) for AWS (stable) here https://docs.docker.com/docker-for-aws/#quickstart ?

grigored avatar Jul 17 '17 19:07 grigored

@grigored yes, that is correct.

kencochrane avatar Jul 17 '17 19:07 kencochrane

Great, will give it a try tonight and let you know how it goes in the following days.

grigored avatar Jul 17 '17 20:07 grigored

I got around this problem by preventing my auto scaling group from bringing down nodes (I added "ReplaceUnhealthy" and "Terminate" to the list of Suspended Processes for the CloudFormation-generated ASG).
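For reference, that same process suspension can be applied from the aws CLI (a sketch: the group name below is a placeholder, and note that suspending Terminate also blocks normal scale-in terminations):

```shell
# Sketch: suspend the ASG processes that terminate nodes on failed ELB checks.
# ASG_NAME is hypothetical; substitute the CloudFormation-generated group name.
ASG_NAME="Docker-NodeAsg-EXAMPLE"
CMD="aws autoscaling suspend-processes --auto-scaling-group-name $ASG_NAME --scaling-processes ReplaceUnhealthy Terminate"

# Printed as a dry run here; execute the command once the aws CLI is configured.
echo "$CMD"
```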

OmpahDev avatar Oct 10 '17 14:10 OmpahDev

I haven't switched back to Docker for AWS yet, but wanted to do that soon. So, @sshorkey, you're saying the problem still happens?

grigored avatar Oct 10 '17 15:10 grigored

@grigored We'll be changing the healthcheck around to be focused on the node rather than the docker service, preventing some of the churning @sshorkey is seeing.

FrenchBen avatar Oct 12 '17 14:10 FrenchBen