
Cluster continually failing health checks and restarting nodes

Open aldencolerain opened this issue 7 years ago • 32 comments

I launched a 1-master, 2-worker cluster on all micro instances (stable and edge had the same behavior), and after about a day both clusters were continually terminating and restarting EC2 instances due to failed ELB health checks. I tried increasing all of the health check thresholds and wait periods. I'm letting CloudFormation create a new VPC, not using an existing one.

I haven't provisioned anything on the cluster yet, so I don't think it's a resource issue. I am able to curl the health check endpoint (curl -I 172.31.18.132:44554/) from a master to a worker, etc. It seems like an issue with the ELB or the VPC.
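
In case it helps with reproduction, this is roughly what I was doing to poke at it; the instance IP is from my VPC, and the load balancer name and health check values below are placeholders, so adjust them for your own stack:

# Hit the diagnostic endpoint the ELB health check probes, from another node
curl -I http://172.31.18.132:44554/

# Relax the classic ELB health check thresholds and intervals (values here are just examples)
aws elb configure-health-check --load-balancer-name <elb-name> \
  --health-check Target=TCP:44554,Interval=30,Timeout=10,UnhealthyThreshold=5,HealthyThreshold=2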

aldencolerain avatar Jun 19 '17 21:06 aldencolerain

I'm getting the same issue; this is in the eu-west-2 region (London).

I've tried SSH'ing into one of the EC2 nodes, and curl -v http://localhost:44554 fails to connect?

This was a new cluster created using an unmodified CloudFormation template for 17.03.x (stable) with 3 managers and 0 workers

The only change was to create an RDS instance and add it to the same VPC.

I've still got the 'broken' cluster hanging around if someone wants to do some debugging

jebw avatar Jun 21 '17 10:06 jebw

Further to this - I just tried binding nginx to 44554, but Docker complains the port is already in use - so something is bound to it but not responding?

Tried finding what was using the port, but without much success (a couple of other things to try are sketched after the output):

~ $ lsof -i :44554
12	/bin/busybox	/dev/pts/0
12	/bin/busybox	/dev/pts/0
12	/bin/busybox	/dev/pts/0
12	/bin/busybox	/dev/tty
~ $ netstat  -a
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       
tcp        0      0 0.0.0.0:ssh             0.0.0.0:*               LISTEN      
tcp        0      0 localhost:52698         0.0.0.0:*               LISTEN      
tcp        0      0 a9764d3cac1a:ssh       XXXX.dyn.plus.net:64901 ESTABLISHED 
tcp        0      0 :::ssh                  :::*                    LISTEN      
tcp        0      0 localhost:52698         :::*                    LISTEN      
Active UNIX domain sockets (servers and established)
Proto RefCnt Flags       Type       State         I-Node Path
unix  3      [ ]         STREAM     CONNECTED      25900 
unix  3      [ ]         STREAM     CONNECTED      25901 
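
A couple of other ways to find the owner of the port that may work better than lsof on this image; -p needs root and isn't compiled into every busybox build, so treat these as a sketch:

# Listening TCP sockets with owning PID/program name
netstat -tlnp | grep 44554

# iproute2 alternative, if ss is available on the host
ss -lntp | grep 44554

# Confirm whether anything actually answers locally
curl -v http://localhost:44554/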

jebw avatar Jun 21 '17 10:06 jebw

So to be honest I gave up for the time being and built the stack manually, but for me it seems broken out of the box. I'll dig in a bit later and see if I can get some traction.

aldencolerain avatar Jun 23 '17 19:06 aldencolerain

@jebw thank you for letting us know.

BTW, 44554 is our diagnostic server; it runs directly on the host. If it stops responding, the host will reboot. So it might be an issue with that server.
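
If you still have a broken node around, a quick local check of that server (the root path is an assumption about what the ELB probes):

# Print the HTTP status from the diagnostic server, or flag it as unresponsive
curl -sf -o /dev/null -w '%{http_code}\n' http://localhost:44554/ || echo "diagnostic server not responding"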

kencochrane avatar Jun 23 '17 19:06 kencochrane

@aldencolerain sorry for the trouble, we have had a similar report from someone else, and we are currently testing it out to see if we can reproduce. We are trying to find a common denominator, can you let us know which region you were using?

kencochrane avatar Jun 23 '17 19:06 kencochrane

@kencochrane Unfortunately I killed the broken cluster a few hours ago, but I'm 95% sure there wasn't any diagnostics container visible from docker ps. I've had 2 clusters break like this though, so it seems fairly reproducible.

Should the diagnostics server be visible from docker ps?

jebw avatar Jun 23 '17 20:06 jebw

The diagnostic server doesn't run as a container, it runs directly on the host, and it monitors docker and other items. Because of this, it isn't visible from docker ps.

kencochrane avatar Jun 23 '17 20:06 kencochrane

@kencochrane Does that imply the diagnostics server is dying/stopping responding then?

The odd bit is that new nodes have a failing health check as well.

The process is:

  1. Build cluster
  2. Everything runs fine initially
  3. Autoscaler health check fails on all nodes for some reason
  4. Nodes get killed and recreated by autoscaler
  5. Health check is broken from the get-go on new EC2 instances
  6. Go to Step 4.

jebw avatar Jun 24 '17 15:06 jebw

Here is an update:

We started 3 clusters on Friday and let them run over the weekend. We had one in us-east-1 and 2 in us-west-2. We were able to reproduce what you described in us-east-1, but it didn't happen in us-west-2.

What we did:

  • Started three 17.06-rc5 CFN stacks with no changes, in a few different regions.

What I noticed:

  • The ELB health check was failing, causing the Auto Scaling group to recycle the node.
  • Manually running a curl against the health check returned a success (HTTP 200), and everything looked fine.
  • I changed the Auto Scaling group's health check type from ELB to EC2. This stopped nodes from being shut down mid-investigation, which stabilized them and made it much easier to see what was going on (a CLI sketch follows this list).
  • It was recycling the nodes so much that the swarm was lost. The primary manager IP address in DynamoDB was no longer valid; I fixed the swarm by deleting the node_type record in DynamoDB, and within a few minutes the swarm was back up and stable.
  • tcpdump was not seeing any traffic from the ELB on the healthcheck port, while on a healthy node we could see the traffic just fine: docker run --rm --net=host marsmensch/tcpdump -vv -i eth0 port 44554 (so for some reason the ELB isn't able to connect to the nodes, or it stopped checking).
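
For reference, these are roughly the commands behind the steps above; the Auto Scaling group name, DynamoDB table name, and key value are placeholders rather than the real names from the stack, so treat this as a sketch:

# Switch the ASG from ELB to EC2 health checks so nodes stop being recycled mid-debug
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name <manager-asg-name> \
  --health-check-type EC2

# Watch for ELB health check traffic on the diagnostic port
docker run --rm --net=host marsmensch/tcpdump -vv -i eth0 port 44554

# Clear the stale primary-manager record so the swarm can recover (table/key are placeholders)
aws dynamodb delete-item \
  --table-name <swarm-dynamodb-table> \
  --key '{"node_type": {"S": "<primary-manager-record>"}}'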

What I did next

Since the ELBs were not connecting to the nodes to ping the healthcheck endpoint, and I knew the healthcheck endpoints were fine, I started trying out some different things.

  1. I changed the security groups to see if making them more open would help, even though the same security group setup works fine in other stacks (and had worked earlier here).
    • This had no effect.
  2. I created a new ELB, attached it to the same VPC/subnets, etc., and configured it the same as the original ELB.
    • The healthchecks started showing up in tcpdump and the nodes were marked as available. This tells me there was nothing wrong with the nodes; the problem was the load balancer.
  3. I noticed that there were no listeners on the ELB, so I created one on port 80 (HTTP).
    • As soon as I did that, the nodes started showing up as "InService" in the ELB (CLI sketch after this list).
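
The listener step in CLI form, in case anyone wants to try it on a broken stack; the load balancer name is a placeholder and port 80 was just an arbitrary choice:

# List the current listeners (the affected ELB showed none)
aws elb describe-load-balancers --load-balancer-names <elb-name> \
  --query 'LoadBalancerDescriptions[].ListenerDescriptions[].Listener'

# Add a listener; this was enough for instances to flip back to "InService"
aws elb create-load-balancer-listeners \
  --load-balancer-name <elb-name> \
  --listeners Protocol=HTTP,LoadBalancerPort=80,InstanceProtocol=HTTP,InstancePort=80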

What does this mean

Taking what I learned above, I think the issue is with the ELB itself, and since it is only happening in some of the regions, it might be a recent change that AWS rolled out to ELBs that is now affecting us. If so, the problem will eventually happen in all regions, once they roll out the change everywhere.

I'm not 100% sure, but I think the issue is related to the fact that we have no listeners in the load balancer. If you create a load balancer in the web dashboard, it doesn't let you continue unless you add at least 1 listener. They must have rolled out a change that stops looking at nodes unless you have a listener configured for your ELB.

Why does it take so long for this to take effect? No idea; maybe there are admin tasks on the ELBs that run every so many hours, reload the config, and then the problem pops up. Or it is just a bug in the ELB that we are now hitting.

What do we do next

With the current ELB config in the CloudFormation template we add a listener on port 7, mostly so that we can create the ELB without getting an error. When the LBController starts up, it resets the configuration to what it gets from swarm. Since port 7 isn't part of swarm, that listener is removed, which leaves us with no listeners.

One of the things we can do is make sure the LBController never removes all of the listeners, so that in the worst case there is at least one listener left (port 7); that should help prevent this from happening in the first place (a rough guard is sketched below).

Since we are not 100% sure this is the cause, it is hard to tell whether this will fix the issue, but it is worth a try.
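
A very rough sketch of the kind of guard I mean, written as a shell check rather than the actual LBController change; the load balancer name is a placeholder and the real fix would live inside the controller:

# If the ELB ever ends up with zero listeners, put the port 7 placeholder back
LISTENERS=$(aws elb describe-load-balancers --load-balancer-names <elb-name> \
  --query 'LoadBalancerDescriptions[0].ListenerDescriptions' --output json)

if [ "$LISTENERS" = "[]" ]; then
  aws elb create-load-balancer-listeners \
    --load-balancer-name <elb-name> \
    --listeners Protocol=TCP,LoadBalancerPort=7,InstanceProtocol=TCP,InstancePort=7
fi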

I'll report back if I notice anything else. Please let me know if you noticed anything, or in case I missed anything.

kencochrane avatar Jun 26 '17 18:06 kencochrane

/cc @cjyclaire Do you know if any ELB config changes were recently rolled out?

FrenchBen avatar Jun 27 '17 14:06 FrenchBen

@FrenchBen as far as I know, the two latest changes for ELB and ELBv2 (ALB) were: ELB on May 11, and ELBv2 26 days ago.

The latest change for CloudFormation was 2 months ago.

I didn't hear about any ELB config changes for CloudFormation templates, but feel free to cut a ticket to AWS support if anything surprising happens : )

cjyclaire avatar Jun 27 '17 16:06 cjyclaire

Thanks for chiming in @cjyclaire - I'll try to recreate the above issue with a simple deployment and will open an issue if needed :)

FrenchBen avatar Jun 28 '17 08:06 FrenchBen

@kencochrane Sorry about the late reply, and thank you for looking into the issue. Mine was breaking in the us-west-2 region. It's weird that it didn't happen for you there. One day it took about 16 hours to break. The next day it only took 1 hour. What size instances did you use? Is anyone having this issue with larger instances?

aldencolerain avatar Jul 11 '17 04:07 aldencolerain

@aldencolerain we were able to reproduce the issue, and it seems to happen for us when there are no listeners listed in the ELB. As soon as you add a listener, the health checks on the nodes work again. We have put in a fix for 17.06, so hopefully it won't happen for you again.

kencochrane avatar Jul 11 '17 18:07 kencochrane

@kencochrane I am running 17.06 in us-east-2 and I am facing this exact same issue. I've collected diagnostic information from the built-in tool, if you'd like it.

obicons avatar Jul 11 '17 19:07 obicons

@madmax88 Really? We haven't been able to reproduce that with 17.06, so yes, any info you can send along is appreciated. Can you check whether you have listeners in your ELB and, if so, which ones are listed? You can find them in the AWS ELB dashboard.
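
If the dashboard is awkward to get to, the same information should be available from the CLI; the load balancer name is a placeholder:

# Listeners configured on the ELB
aws elb describe-load-balancers --load-balancer-names <elb-name> \
  --query 'LoadBalancerDescriptions[].ListenerDescriptions[].Listener'

# Per-instance health as the ELB sees it
aws elb describe-instance-health --load-balancer-name <elb-name>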

kencochrane avatar Jul 11 '17 19:07 kencochrane

@kencochrane Listeners:

Load Balancer Protocol | Load Balancer Port | Instance Protocol | Instance Port | Cipher | SSL Certificate
TCP | 7 | TCP | 7 | N/A | N/A
TCP | 443 | TCP | 443 | N/A | N/A
TCP | 5000 | TCP | 5000 | N/A | N/A
TCP | 5001 | TCP | 5001 | N/A | N/A
TCP | 9000 | TCP | 9000 | N/A | N/A

obicons avatar Jul 11 '17 19:07 obicons

@madmax88 thanks. Under Instances, what statuses does it show for the instances?

kencochrane avatar Jul 11 '17 19:07 kencochrane

@kencochrane A couple of managers have one of the healthchecks failing.

Here is a part of the system logs from one of the ones with a failing healthcheck:

Oops: 0000 [#1] SMP
Modules linked in:
CPU: 0 PID: 7206 Comm: dockerd Not tainted 4.9.31-moby #1
Hardware name: Xen HVM domU, BIOS 4.2.amazon 02/16/2017
task: ffff9fd07b03b180 task.stack: ffffb0024233c000
RIP: 0010:[<ffffffff946453db>]  [<ffffffff946453db>] sk_filter_uncharge+0x5/0x31
RSP: 0018:ffffb0024233fe10  EFLAGS: 00010202
RAX: 0000000000000000 RBX: ffff9fd07d95eea8 RCX: 0000000000000006
RDX: 00000000ffffffff RSI: 00000000ffffffe5 RDI: ffff9fd07d95ec00
RBP: ffff9fd07d95ec00 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: ffff9fd07b03b180 R12: ffff9fd07d95ec00
R13: ffff9fd07b6e6ca8 R14: ffff9fd07d95ef40 R15: 0000000000000000
FS:  00007fce4cff9700(0000) GS:ffff9fd08fa00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000fffffffd CR3: 00000001fc5aa000 CR4: 00000000001406b0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Stack:
 ffffffff9461ba01 ffff9fd07b6e6c00 ffffb0024233fe88 ffff9fd07d95ec00
 ffffffff94760bef 010000087b82d380 ffff9fd0853b53a0 ffff9fd082faf180
 ced5c46e9c728ced ffff9fd07fc7ec80 0000000000000000 ffff9fd07fc7ecb0
Call Trace:
 [<ffffffff9461ba01>] ? __sk_destruct+0x35/0x133
 [<ffffffff94760bef>] ? unix_release_sock+0x180/0x212
 [<ffffffff94760c9a>] ? unix_release+0x19/0x25
 [<ffffffff94616cf9>] ? sock_release+0x1a/0x6c
 [<ffffffff94616d59>] ? sock_close+0xe/0x11
 [<ffffffff941f7425>] ? __fput+0xdd/0x17b
 [<ffffffff940f538a>] ? task_work_run+0x64/0x7a
 [<ffffffff94003285>] ? prepare_exit_to_usermode+0x7d/0x96
 [<ffffffff9482a184>] ? entry_SYSCALL_64_fastpath+0xa7/0xa9
Code: 08 4c 89 e7 e8 fb f8 ff ff 48 3d 00 f0 ff ff 77 06 48 89 45 00 31 c0 48 83 c4 10 5b 5d 41 5c 41 5d 41 5e 41 5f c3 0f 1f 44 00 00 <48> 8b 46 18 8b 40 04 48 8d 04 c5 28 00 00 00 f0 29 87 24 01 00 
RIP  [<ffffffff946453db>] sk_filter_uncharge+0x5/0x31
 RSP <ffffb0024233fe10>
CR2: 00000000fffffffd
---[ end trace 77637a5620196472 ]---
Kernel panic - not syncing: Fatal exception

obicons avatar Jul 11 '17 19:07 obicons

@madmax88 we've identified a similar issue and have implemented a fix for it. A patch will be available soon.

FrenchBen avatar Jul 11 '17 20:07 FrenchBen

Hey all, this morning I had a stroke of good luck and witnessed what's been going on from start -> finish.

Here's the rundown of what I saw:

  1. A manager experiences a kernel panic, like the one I posted earlier.
  2. That instance is not marked as unhealthy in the EC2 console.
  3. A new manager instance is provisioned.
  4. Now there are more managers than desired in the swarm. Rather than terminating the unreachable manager, a different manager (randomly? I'm not sure) is terminated.
  5. Managers continue to scale up/down.
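
In case it helps anyone else catch this in the act, these are roughly the commands I was watching; the Auto Scaling group name is a placeholder:

# From a surviving manager: the panicked node shows up as Down/Unreachable here
docker node ls

# Recent launches/terminations and the reasons the ASG gives for them
aws autoscaling describe-scaling-activities \
  --auto-scaling-group-name <manager-asg-name> --max-items 20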

If you all need any more information about this, please let me know.

obicons avatar Jul 14 '17 12:07 obicons

@madmax88 awesome, thank you for letting us know, that is very helpful, and could explain a few things. Hopefully we will have a new version that fixes the kernel panic out there shortly.

kencochrane avatar Jul 14 '17 12:07 kencochrane

@kencochrane Thanks for the fast replies!

By any chance do you have an approximate date for that fix?

obicons avatar Jul 14 '17 12:07 obicons

@madmax88 trying to find that out now, it should be very soon.

kencochrane avatar Jul 14 '17 13:07 kencochrane

Whilst solving the kernel panic removes the trigger, underlying this is the question of why a panicked manager isn't getting cleaned up. Isn't that also something which needs some kind of fix?

jebw avatar Jul 14 '17 14:07 jebw

@jebw yes, we are looking at that as well. Unfortunately a lot of the decision making is done by the ASG, so we are a little limited in what we can do. For example, when there are too many managers in a pool, we have no control over which manager gets removed.

We need to figure out why the node wasn't marked as down in the EC2 console. I think if we can solve that, it will help with the other issue.

kencochrane avatar Jul 14 '17 14:07 kencochrane

And to be clear, we do have a little control over what happens when an ASG needs to scale down, but none of the available options do what we need. If there were a way to decide programmatically, it would make things much easier for us: http://docs.aws.amazon.com/autoscaling/latest/userguide/as-instance-termination.html
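
For context, these are the kinds of knobs that page describes; the group and instance IDs are placeholders, and none of these let you pick the exact instance at termination time, which is the limitation above:

# Pick one of the built-in termination policies (e.g. OldestInstance, NewestInstance)
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name <manager-asg-name> \
  --termination-policies "OldestInstance"

# Or protect specific instances from scale-in so the ASG won't choose them
aws autoscaling set-instance-protection \
  --auto-scaling-group-name <manager-asg-name> \
  --instance-ids <instance-id> \
  --protected-from-scale-in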

kencochrane avatar Jul 14 '17 14:07 kencochrane

@kencochrane that sounds great, I appreciate you're bounded by what AWS allows. Just wanted to check the underlying issue was going to get considered. Thanks for the great product.

jebw avatar Jul 14 '17 14:07 jebw

Just had this happen to me. What is the best way to recover from this issue, and when is an update to the CloudFormation template going to be released for it? So far I just have a test swarm, but I would like to go to production, which I cannot do with a major bug like this.

RehanSaeed avatar Aug 16 '17 13:08 RehanSaeed

@RehanSaeed Do you mind joining our community slack? https://dockr.ly/community

FrenchBen avatar Aug 17 '17 21:08 FrenchBen