
[ECS] [request]: NLB forwards to "unused" targets

Open sveniu opened this issue 4 years ago • 5 comments

Tell us about your request
After updating an NLB listener to point to another target group, the NLB keeps sending requests (new connection per request, but probably the same flow) to the previous targets for ~75 seconds.

Which service(s) is this request for?
NLB, ECS, Fargate

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
I'm using CodeDeploy + ECS blue/green deploys, where a new task set is registered with the green target group, and the test and production NLB listener rules are updated to point to the new green target group. This leaves the previous target group with all targets in the "unused" state.

At that point, I would expect no more requests to the old targets, but connections are still sent to them for ~75 seconds.

This seems to be related to #469 and a general lag in how the NLB interacts with the target group and health checks.

Issuing a rollback via the CodeDeploy + ECS blue/green deployment also gives me an outage: the NLB works fine for 10-20 seconds, then starts rejecting connections (TCP reset), then starts timing out, and only 30+ seconds later does it start replying correctly. I'm still researching this.

Additional context
I can only speculate that this is related to how the NLB handles traffic flows, and that the flow cache is not flushed upon changes to the listener rule and/or target group targets.
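
For illustration, a minimal boto3 sketch of the kind of flip described above (listener and target group ARNs are placeholders): it switches the listener's default forward action to the green target group, then polls the blue group, whose targets report "unused" even though, per the behaviour reported here, they can still receive traffic for ~75 seconds.

```python
import time
import boto3

elbv2 = boto3.client("elbv2")

# Placeholder ARNs for illustration only.
LISTENER_ARN = "arn:aws:elasticloadbalancing:...:listener/net/my-nlb/..."
GREEN_TG_ARN = "arn:aws:elasticloadbalancing:...:targetgroup/green/..."
BLUE_TG_ARN = "arn:aws:elasticloadbalancing:...:targetgroup/blue/..."

# Point the NLB listener at the green target group (the blue/green flip).
elbv2.modify_listener(
    ListenerArn=LISTENER_ARN,
    DefaultActions=[{"Type": "forward", "TargetGroupArn": GREEN_TG_ARN}],
)

# Watch the blue target group: its targets move to "unused", yet in the
# behaviour reported here they can still receive traffic for a while.
start = time.time()
for _ in range(30):
    health = elbv2.describe_target_health(TargetGroupArn=BLUE_TG_ARN)
    states = {d["Target"]["Id"]: d["TargetHealth"]["State"]
              for d in health["TargetHealthDescriptions"]}
    print(f"t+{time.time() - start:5.1f}s blue targets: {states}")
    time.sleep(5)
```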

sveniu · Sep 06 '19

We're seeing the same behavior when using NLB for EKS.

We've rigorously tested the issues:

  1. The NLB keeps sending traffic to targets that have already been deregistered (they no longer appear as targets). Traffic can still hit those targets up to 180 seconds after they finished deregistering (at least according to the AWS console and AWS CLI listing of targets).

We've managed to compensate for this by setting the preStop hook to at least 360 seconds (!), which only helps during e.g. deployment rollouts. It also makes rollouts take a very long time to complete, which is not desirable.

  2. The NLB's TCP-based health checks cannot be set to lower values, such as the minimums available with the ALB. In addition, and most problematic, if a target is deemed unhealthy (2 checks x 10 s interval = within 20 seconds), the NLB still keeps sending traffic to it for up to 70 seconds after it was deemed unhealthy.

  3. The NLB takes around 180 seconds to register new targets. This is a problem when we use the Cluster Autoscaler and expect the cluster to scale up quickly enough, e.g. to handle extra traffic.

The experience can be improved somewhat by using instance targets. However, that is not desirable for us (extra network activity and load on kube-proxy pods). In addition, the same problems persist: slow registration of new targets, the NLB keeps sending traffic to a failed target long after it was deemed unhealthy, and slow deregistration.
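
To put numbers on the deregistration lag, a sketch along these lines can help (the target group ARN and target are placeholders); it timestamps the DeregisterTargets call and the moment the target vanishes from DescribeTargetHealth, while the last-request timestamp has to come from the target's own logs:

```python
import time
import boto3

elbv2 = boto3.client("elbv2")

TG_ARN = "arn:aws:elasticloadbalancing:...:targetgroup/my-tg/..."  # placeholder
TARGET = {"Id": "10.0.1.23", "Port": 8080}                          # placeholder IP target

t0 = time.time()
elbv2.deregister_targets(TargetGroupArn=TG_ARN, Targets=[TARGET])
print("t+0.0s  DeregisterTargets called")

# Poll until the target no longer shows up in the target group at all.
while True:
    descs = elbv2.describe_target_health(TargetGroupArn=TG_ARN)["TargetHealthDescriptions"]
    present = [d for d in descs if d["Target"]["Id"] == TARGET["Id"]]
    if not present:
        print(f"t+{time.time() - t0:5.1f}s  target gone from DescribeTargetHealth")
        break
    print(f"t+{time.time() - t0:5.1f}s  state={present[0]['TargetHealth']['State']}")
    time.sleep(5)

# The interesting number -- how long after this point traffic still arrives --
# has to be taken from the target's own access logs or connection counters.
```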

nuriel77 · Sep 03 '21

I see this remains open after 3+ years, which seems very odd given that it is clearly causing a lot of grief for people. I wonder if most people who are affected are simply not aware. That the issue is kept open, and even temporarily assigned, I take as a signal that it is a real issue, though. Puzzling!

I made a mostly-sarcastic, slightly-altruistic NLB documentation PR in https://github.com/awsdocs/elb-network-load-balancers-user-guide/pull/11 that sort of backfired (it was merged with little modification) and left me with a big regret: now that this erroneous behaviour is described in the docs, it is kind of excused, and AWS support has something to point to.

sveniu · Sep 03 '21

@sveniu thank you.

I wonder if most people who are affected are simply not aware.

I wondered that too. I guess that many users do not thoroughly test outage scenarios, nor do they use autoscaling. From what I've understood, some users simply switched to ALB, which solved the problems for them. I was forced to use NLB for a certain use case, and I require an environment that responds quickly to changes (scaling, recovery, high availability, minimizing any type of outage for users), which is pretty much what K8S covers. In combination with the NLB, however, it doesn't seem to provide that agility. I clearly see the advantages of using the NLB for EKS/ECS, and hope that AWS will rectify these issues soon.

nuriel77 · Sep 04 '21

Yup, count me in on finding this behaviour both odd and mostly unacceptable.

It's one thing if the NLB health check system cannot be improved (that's just a bummer), but why does the problem persist with CodeDeploy? The current design allows the blue/green targets to be flipped before registration has completed, leading to failed requests until the target is healthy. Isn't the entire point of blue/green deploys to ensure the application is responding properly and available before making the flip?

This problem does not happen with ALBs; I have only ever witnessed it when attempting to use NLBs + CodeDeploy. Unfortunately, I have some use cases where I also have to use an NLB.

travisbell · May 17 '22

Using ALB instead of NLB solved the issue.

roimor · Jul 20 '22

FWIW, it is still an issue in the second half of 2023.

It takes about 30-40 seconds for new connections to stop being sent to a target after it gets marked as Unhealthy as a result of failed health checks. Any decent load balancer, software or hardware, has for decades been able to stop sending new connections immediately after marking a target unhealthy, but the NLB, with all due respect, seems not to be one of them.

It also takes about two minutes for connections to stop arriving at a target being drained after the deregistration call is made, even with the deregistration delay set to the minimum or even zero (the effective minimum delay seems to be 120 seconds). This contradicts the documentation, which states that new connections stop as soon as the target begins draining.

It means (and this can be used as a partial workaround) that it is better to make the target fail its health checks when you need it taken out of service than to use the DeregisterTargets API call, supposedly made for this purpose, because the former stops traffic reaching the target faster.
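
A rough sketch of that workaround, assuming the target exposes a dedicated HTTP health check endpoint (the port, check intervals, and safety margin below are assumptions, not NLB guarantees): flip a flag so the health check starts returning 503, then wait out the window before actually stopping the service.

```python
import http.server
import threading
import time

HEALTH_PORT = 8081   # assumed dedicated health check port
healthy = True       # flipped to False to start failing health checks


class Health(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        # 200 while healthy, 503 once we want the NLB to take us out of service.
        self.send_response(200 if healthy else 503)
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the sketch quiet


server = http.server.ThreadingHTTPServer(("0.0.0.0", HEALTH_PORT), Health)
threading.Thread(target=server.serve_forever, daemon=True).start()


def drain_and_stop():
    global healthy
    healthy = False   # health checks start failing immediately
    # 2 failed checks x 10 s interval marks the target unhealthy in ~20 s,
    # plus the extra window reported above before new connections stop.
    time.sleep(90)    # assumed safety margin; tune to observed behaviour
    # ...now stop the actual service / let the instance terminate.
```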

This is frustrating, especially considering how long this issue has been around, and it forces you to implement ugly workarounds to handle target decommissioning properly in, for example, the spot instance interruption scenario, where you only have 2 minutes from the interruption warning until the instance is terminated to gracefully remove it from the load balancer and stop everything running on it.
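
In that spot scenario the drain has to start as soon as the interruption notice appears; a sketch of the polling side, using the IMDSv2 spot instance-action endpoint (drain_and_stop() refers to the hypothetical helper sketched above):

```python
import time
import urllib.error
import urllib.request

IMDS = "http://169.254.169.254/latest"


def imds_token():
    # IMDSv2: fetch a session token before reading metadata.
    req = urllib.request.Request(
        f"{IMDS}/api/token", method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "300"})
    return urllib.request.urlopen(req, timeout=2).read().decode()


def spot_interruption_pending():
    # /spot/instance-action returns 404 until a stop/terminate notice exists.
    req = urllib.request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": imds_token()})
    try:
        urllib.request.urlopen(req, timeout=2)
        return True
    except urllib.error.HTTPError as err:
        return err.code != 404


while not spot_interruption_pending():
    time.sleep(5)

# Only ~2 minutes left: start failing health checks right away rather than
# calling DeregisterTargets, per the workaround above.
drain_and_stop()  # hypothetical helper from the previous sketch
```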

p.s. it's not limited to ECS. In my case it's plain EC2 instances and instance-type targets.

shapirus · Aug 31 '23

Bumping this up. We've got the same issue with the NLB. We use it as the first layer of load balancing; the second layer (the NLB's targets) is Nginx pods on Kubernetes. We got to the point where the pod termination preStop hook must wait at least 120 seconds before draining connections from Nginx, because the NLB keeps forwarding new connections to the deregistered target during that period.

So to sum up: the NLB keeps forwarding new connections for up to 120 seconds to a target that is deregistered and no longer even visible in the target group (the target disappears after 10-20 seconds; the deregistration delay is set to 5 seconds). If the target is terminated before that, we end up with broken connections.
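
For what it's worth, a sketch of what such a preStop-style wait could look like instead of a fixed sleep (the target group ARN, the pod IP source, and the extra buffer are all assumptions): poll until the pod's IP has disappeared from the target group, then keep waiting out the window during which the NLB has been observed to still forward new connections.

```python
import os
import time
import boto3

elbv2 = boto3.client("elbv2")

TG_ARN = os.environ["TARGET_GROUP_ARN"]   # assumed to be injected into the pod
POD_IP = os.environ["POD_IP"]             # e.g. via the Kubernetes downward API
EXTRA_BUFFER = 120                        # seconds; matches the window observed above


def target_still_listed():
    descs = elbv2.describe_target_health(TargetGroupArn=TG_ARN)["TargetHealthDescriptions"]
    return any(d["Target"]["Id"] == POD_IP for d in descs)


# Wait for the target to vanish from the target group...
while target_still_listed():
    time.sleep(5)

# ...then wait out the period during which the NLB still sends new connections,
# and only afterwards let Nginx shut down.
time.sleep(EXTRA_BUFFER)
```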

krzwiatrzyk-lgd · Jan 05 '24