aws-node-termination-handler
Handle ELB instance deregistration
We've noticed in our production environment that we have a need for something to deregister nodes from load balancers as part of the draining procedure, before the instance is terminated. We're currently using lifecycle-manager for this, but it would be nice if this was handled by the AWS Node Termination Handler instead.
The reason this is needed is that if the instance is terminated before it's deregistered from an ELB, a number of connections will fail until the health check starts failing. This is particularly noticeable on ELBv2 (NLB+ALB), which seem to take several minutes to react, so we need to have fairly high timeout times on the health checks.
The behaviour we're looking for is that the node termination handler finds a list of classic ELBs and target groups that it's a member of, sends a deregistration request and then waits for the deregistration to finish before marking the instance as being ready to terminate.
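As a rough sketch of that behaviour, assuming a boto3 ELBv2-style client and that the target group ARNs have already been discovered (the client is injected, and all names here are illustrative; classic ELBs would need the equivalent `deregister_instances_from_load_balancer` call):

```python
import time

def deregister_and_wait(elbv2, target_group_arns, instance_id,
                        timeout=300, interval=5):
    # Start deregistration (connection draining) in every target group.
    target = [{"Id": instance_id}]
    for arn in target_group_arns:
        elbv2.deregister_targets(TargetGroupArn=arn, Targets=target)

    # Poll until no target group still reports the target as "draining".
    deadline = time.time() + timeout
    pending = set(target_group_arns)
    while pending:
        if time.time() > deadline:
            raise TimeoutError("still draining in: %s" % sorted(pending))
        for arn in list(pending):
            resp = elbv2.describe_target_health(TargetGroupArn=arn,
                                                Targets=target)
            states = [d["TargetHealth"]["State"]
                      for d in resp["TargetHealthDescriptions"]]
            if "draining" not in states:
                pending.discard(arn)
        if pending:
            time.sleep(interval)
    return True
```

Only after this returns would the node be marked as ready to terminate.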
Same issue here. In addition, when the termination handler cordons a node, the node is marked as unschedulable and the service controller removes cordoned nodes from LB pools, which can potentially drop in-flight requests. There should be a better process for node draining:
- taint the node (don't cordon)
- find the ELBs/target groups and safely deregister the node
- cordon
- drain
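The ordering above can be sketched as a small orchestration step. This is a hypothetical sketch, not NTH's actual code; each step is a callable the caller would supply (wrapping `kubectl` and the AWS APIs):

```python
def drain_node(taint, deregister_from_lbs, cordon, drain_pods):
    # The point is the ordering: the node keeps serving traffic
    # until LB deregistration has completed.
    taint()                # 1. taint the node (don't cordon)
    deregister_from_lbs()  # 2. find ELBs/target groups, deregister, wait
    cordon()               # 3. only now mark the node unschedulable
    drain_pods()           # 4. evict the pods
```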
relevant issues: https://github.com/kubernetes/autoscaler/issues/1907 https://github.com/kubernetes/kubernetes/issues/65013 https://github.com/kubernetes/kubernetes/issues/44997
and a partial bug fix in 1.19 https://github.com/kubernetes/kubernetes/pull/90823
I'm definitely interested in looking into this more. I've asked @kishorj who works on the aws-load-balancer-controller his thoughts since there needs to be a careful dance between the LB controller and NTH in the draining process. There might be more we can do in that controller without involving NTH as much. But if we need to add this logic to NTH, then I'm not opposed
Hi Brandon, thank you for the quick response. I think an external tool such as NTH is suitable to handle such logic. Even if Kubernetes contributors solve it internally, it won't cover all cases, such as draining due to spot interruptions, AZ rebalance, or spot recommendations. The bug of removing cordoned nodes immediately from the load balancer is four years old; even if the service controller is enhanced someday, it could take a long time until we can use it. I really hope to see this functionality in NTH.
Linking taint-effect issue, since I think that would mitigate this: https://github.com/aws/aws-node-termination-handler/issues/273
I'm not sure it would really do what we need. The problem is that draining instances from an ELBv2 load balancer is quite slow (usually 4-5 minutes in our experience), and, at least for our nodes, draining the containers is much, much faster.
lifecycle-manager is nice because it polls to make sure the instance is removed from the load balancer before it continues. If I'm reading the taint-effect issue right, it would apply a taint, which could cause an ELB drain to start, but there's not really anything that then waits for the drain to finish before the instances are terminated?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you want this issue to never become stale, please ask a maintainer to apply the "stalebot-ignore" label.
Can we get an update on this? This would be a cool feature!
We are trying to find a solution to the same problem for Cluster Autoscaler. In 1.18 and earlier Kubernetes versions, the cordon command used to remove the node from LBs. We want to retain similar behaviour in 1.19+; one option is to have Cluster Autoscaler add the label below to the worker node (or delete the worker node with kubectl delete node), which will remove the node from all associated Kubernetes LBs:
node.kubernetes.io/exclude-from-external-load-balancers=true
(the value doesn't matter; it can be true/false or anything)
With the custom termination policy supported by EC2 Auto Scaling, you would specify a Lambda function that can drain the node as well as deregister it from an ELB. This can be a solution until ELB deregistration is natively supported.
Refer to the following links for more details:
- https://aws.amazon.com/about-aws/whats-new/2021/07/amazon-ec2-auto-scaling-now-lets-you-control-which-instances-to-terminate-on-scale-in/
- https://docs.aws.amazon.com/autoscaling/ec2/userguide/lambda-custom-termination-policy.html
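A minimal sketch of such a Lambda, based on the custom termination policy contract (the function receives the scale-in candidates and returns the instance IDs that are safe to terminate). The draining logic itself is elided and all names are illustrative:

```python
def handler(event, context):
    # Instances the Auto Scaling group proposes as scale-in candidates.
    candidates = [i["InstanceId"] for i in event.get("Instances", [])]
    # How much capacity the group actually wants to remove.
    needed = sum(c["Capacity"] for c in event.get("CapacityToTerminate", []))
    # A real implementation would drain each candidate here: evict its pods
    # and deregister it from ELBs/target groups (e.g. via the Kubernetes API
    # and boto3), returning only instances whose draining has finished.
    drained = candidates  # placeholder: assume every candidate is drained
    # The contract: return the IDs the ASG is allowed to terminate.
    return {"InstanceIDs": drained[:needed]}
```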
Interested to hear from contributors here whether the solution in https://github.com/aws/aws-node-termination-handler/pull/582, which adds the label node.kubernetes.io/exclude-from-external-load-balancers
to nodes undergoing cordon-and-drain operations, is sufficient for your needs here.
Does that solve your problem, or do we need to do additional work to support your use cases?
It does not, in our case. The problem is that all the pods can be drained off the node faster than the node can be deregistered from the load balancer. So something like this happens:
- Instance drain starts. The node.kubernetes.io/exclude-from-external-load-balancers label is added, and the load balancer controller starts the deregistration process.
- The node termination handler evicts all the pods from the node.
- Once the pods are all evicted, the node is terminated, even though it is not yet deregistered from the ELB.
- The terminated instance still receives requests from the ELB until either the deregistration finishes or the health check trips.
- Finally, the ELB deregistration finishes.
In our experience, and after working with AWS support, the shortest duration we've been able to get load balancer deregistration down to is 2-3 minutes. Meanwhile we can usually evict all pods in less than 1 minute.
@sarahhodne admittedly I haven't done very comprehensive tests, but what I have observed is that if a target in a target group is draining before the associated instance is terminated then there is a much higher chance that the termination will not result in request errors. In fact I was not able to cause any requests errors in my testing this way.
I use the aws-load-balancer-controller to provision my load balancers.
We updated cluster autoscaler to add the node.kubernetes.io/exclude-from-external-load-balancers label to the nodes, which removes the nodes from all LBs.
In addition, we also have an ASG lifecycle hook that waits 300 seconds before terminating the node, and the ELB has 300 seconds of connection draining; this way we avoid 5xx issues.
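For reference, a lifecycle hook like that can be declared in CloudFormation roughly as follows (a sketch with hypothetical resource names; the 300s heartbeat matches the ELB's 300s connection draining):

```yaml
Resources:
  DrainWaitHook:
    Type: AWS::AutoScaling::LifecycleHook
    Properties:
      AutoScalingGroupName: !Ref WorkerAsg   # hypothetical ASG resource
      LifecycleTransition: "autoscaling:EC2_INSTANCE_TERMINATING"
      HeartbeatTimeout: 300                  # hold the instance for up to 300s
      DefaultResult: CONTINUE                # then let termination proceed
```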
Do you run it in IP or Instance mode, @tjs-intel?
@kristofferahl I switched from Instance to IP mode because of general lack of support for node draining by NTH and brupop
Thanks @tjs-intel! We use IP mode as well so I was wondering if you could possibly explain your setup a bit further as it seems you're not having any issues with dropped requests when using aws-load-balancer-controller and NTH? How do you achieve draining before the target/underlying instance is terminated?
@sarahhodne I think kubernetes/kubernetes#105946 ("Remove nodes with Cluster Autoscaler taint from LB backends in service controller") fixes the issue
We found a pretty nice way to handle this with Graceful Node Shutdown and preStop hooks on daemonsets. Essentially you set the kubelet parameters (in our case we use Karpenter, so we specified userData in the EC2NodeClass as follows):
```yaml
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: AL2
  userData: |
    #!/bin/bash -xe
    echo "$(jq '.shutdownGracePeriod="400s"' /etc/kubernetes/kubelet/kubelet-config.json)" > /etc/kubernetes/kubelet/kubelet-config.json
    echo "$(jq '.shutdownGracePeriodCriticalPods="100s"' /etc/kubernetes/kubelet/kubelet-config.json)" > /etc/kubernetes/kubelet/kubelet-config.json
```
and then deploy a daemonset on all karpenter nodes with a high terminationGracePeriodSeconds and preStop hook
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: karpenter-termination-waiter
  namespace: kube-system
  labels:
    k8s-app: karpenter-termination-waiter
spec:
  selector:
    matchLabels:
      name: karpenter-termination-waiter
  template:
    metadata:
      labels:
        name: karpenter-termination-waiter
    spec:
      nodeSelector:
        karpenter.sh/registered: "true"
      containers:
        - name: alpine
          image: alpine:latest
          command: ["sleep", "infinity"]
          # wait for the node to be completely deregistered from the load balancer
          lifecycle:
            preStop:
              exec:
                command: ["sleep", "300"]
          resources:
            limits:
              cpu: 5m
              memory: 10Mi
            requests:
              cpu: 2m
              memory: 5Mi
      priorityClassName: high-priority
      terminationGracePeriodSeconds: 300
```
The node is still running aws-node and kube-proxy behind the scenes, so it can properly direct requests from the load balancer until it's completely drained. It's important that the grace period and the sleep in the preStop hook are longer than the deregistration delay on the ALB, so the node isn't terminated before being fully drained.
@TaylorChristie a similar issue with Karpenter is being discussed at https://github.com/aws/karpenter-provider-aws/issues/4673
In your workaround, Karpenter removes the node from the LB during draining time, then all pods get deleted except the karpenter-termination-waiter daemonset, which keeps waiting for its preStop hook to complete and thereby holds the worker node for some time after it is removed from the LB?
We are waiting for an out-of-the-box solution from Karpenter, but your workaround makes sense to try and use until there is one.
Yep, because of the shutdownGracePeriod set in kubelet, it won't drain daemonsets like kube-proxy or aws-node (since they are higher priority), so the node can still properly forward NodePort traffic to other endpoints. I agree a native Karpenter solution would be much better, but in our testing this eliminated the LB 5XX issues we were experiencing.
@infa-ddeore, is there any official PR/fix to the CA for adding the node.kubernetes.io/exclude-from-external-load-balancers label? And another question for my understanding: assuming this label is added, what makes the ALB remove the node and stop sending it requests? Do we need the aws-load-balancer-controller for that? Currently, we occasionally experience 502 errors when CA scales in a node. Thanks.
There isn't an official PR for this; our devs made these changes and provided us a custom cluster autoscaler image. During the node draining process this label is added, and the EKS control plane removes that node from all associated ELBs, since we use the in-tree controller.
I haven't tested this for ALB or with the aws-load-balancer-controller, but I believe the ALB controller must also honor the label. You can try adding the label manually to see whether the node gets removed from the ALB's target group.
@infa-ddeore I checked, and indeed the ALB removes the node from the target group when I set the node.kubernetes.io/exclude-from-external-load-balancers label.
Any chance you (or your developers) can publish a PR for that? I think a lot of people would need it.
Hi @deepakdeore2004 and all, I'm writing my findings here after I was able to resolve the issue without any code changes.
It might be helpful to other people.
What you need to do (among other operations) is add a 60s AutoScaler delay after taint by setting
--node-delete-delay-after-taint=60s
You can read more about it here.
Explanation:
When AutoScaler reaches the conclusion that a node needs to be drained and eventually removed from K8S, it sets a special taint on the node with the value of "ToBeDeletedByClusterAutoscaler". Then, the aws-load-balancer-controller recognizes that and asks ALB to remove the node from ALB by calling the DeregisterTargets API and causing ALB to drain connections to this node (more about ALB draining process here). Default ALB draining time is 300s.
Five seconds after that (the default AutoScaler delay), the AutoScaler calls the TerminateInstanceInAutoScalingGroup API, causing the ASG to terminate the node by calling the TerminateInstances API.
ALB is not aware of the fact that the node is going to be terminated by ASG.
Even though ALB sends no new requests to a draining target, requests that are still in flight may take more than 5s to finish. When the node is terminated, those requests end with 502 errors because the connection is interrupted.
To avoid this interruption, ALB needs some delay so that all in-flight requests can finish before the node is terminated. This is achieved with the node-delete-delay-after-taint parameter: Cluster AutoScaler waits 60s before it notifies the ASG to terminate the node.
To summarize, the parameter effectively introduces a delay between the DeregisterTargets call and the TerminateInstances call, letting the ALB gracefully drain the connections.
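For anyone applying this, the flag goes on the cluster-autoscaler container's command line. A sketch of the relevant Deployment fragment (image tag and surrounding fields are illustrative):

```yaml
containers:
  - name: cluster-autoscaler
    image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.28.0  # illustrative tag
    command:
      - ./cluster-autoscaler
      - --cloud-provider=aws
      - --node-delete-delay-after-taint=60s  # delay between DeregisterTargets and TerminateInstances
```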
Thanks for the details @oridool. I see the AWS LB controller understands the ToBeDeletedByClusterAutoscaler taint and removes the node from the LB when it is added, and the --node-delete-delay-after-taint option keeps the node alive for the specified duration; this is a perfect solution.
But we use the in-tree controller, which doesn't understand this taint, so the cluster autoscaler customization is still needed on our side.