aws-node-termination-handler
Handle ELB instance deregistration
We've noticed in our production environment that we have a need for something to deregister nodes from load balancers as part of the draining procedure, before the instance is terminated. We're currently using lifecycle-manager for this, but it would be nice if this was handled by the AWS Node Termination Handler instead.
The reason this is needed is that if the instance is terminated before it's deregistered from an ELB, a number of connections will fail until the health check starts failing. This is particularly noticeable on ELBv2 (NLB+ALB), which seem to take several minutes to react, so we need to have fairly high timeout times on the health checks.
The behaviour we're looking for is that the node termination handler finds a list of classic ELBs and target groups that it's a member of, sends a deregistration request and then waits for the deregistration to finish before marking the instance as being ready to terminate.
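As a rough sketch of that behaviour, assuming a boto3 ELBv2-style client and that the target group ARNs have already been discovered (the client is injected, and all names here are illustrative; classic ELBs would need the equivalent `deregister_instances_from_load_balancer` call):

```python
import time

def deregister_and_wait(elbv2, target_group_arns, instance_id,
                        timeout=300, interval=5):
    # Start deregistration (connection draining) in every target group.
    target = [{"Id": instance_id}]
    for arn in target_group_arns:
        elbv2.deregister_targets(TargetGroupArn=arn, Targets=target)

    # Poll until no target group still reports the target as "draining".
    deadline = time.time() + timeout
    pending = set(target_group_arns)
    while pending:
        if time.time() > deadline:
            raise TimeoutError("still draining in: %s" % sorted(pending))
        for arn in list(pending):
            resp = elbv2.describe_target_health(TargetGroupArn=arn,
                                                Targets=target)
            states = [d["TargetHealth"]["State"]
                      for d in resp["TargetHealthDescriptions"]]
            if "draining" not in states:
                pending.discard(arn)
        if pending:
            time.sleep(interval)
    return True
```

Only after this returns would the node be marked as ready to terminate.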
Same issue here. In addition, when the termination handler cordons a node, the node is marked as unschedulable and the service controller removes cordoned nodes from LB pools, which can potentially drop in-flight requests. There should be a better process for node draining:
- taint the node (don't cordon)
- find the ELBs/target groups and safely deregister the node
- cordon
- drain
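The ordering above can be sketched as a small orchestration step. This is a hypothetical sketch, not NTH's actual code; each step is a callable the caller would supply (wrapping `kubectl` and the AWS APIs):

```python
def drain_node(taint, deregister_from_lbs, cordon, drain_pods):
    # The point is the ordering: the node keeps serving traffic
    # until LB deregistration has completed.
    taint()                # 1. taint the node (don't cordon)
    deregister_from_lbs()  # 2. find ELBs/target groups, deregister, wait
    cordon()               # 3. only now mark the node unschedulable
    drain_pods()           # 4. evict the pods
```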
relevant issues: https://github.com/kubernetes/autoscaler/issues/1907 https://github.com/kubernetes/kubernetes/issues/65013 https://github.com/kubernetes/kubernetes/issues/44997
and a partial bug fix in 1.19 https://github.com/kubernetes/kubernetes/pull/90823
I'm definitely interested in looking into this more. I've asked @kishorj who works on the aws-load-balancer-controller his thoughts since there needs to be a careful dance between the LB controller and NTH in the draining process. There might be more we can do in that controller without involving NTH as much. But if we need to add this logic to NTH, then I'm not opposed
Hi Brandon, thank you for the quick response. I think an external tool such as NTH is suitable to handle such logic. Even if Kubernetes contributors solve it internally, it won't cover all cases, such as draining due to spot interruptions, AZ rebalance, or spot recommendations. The bug of removing cordoned nodes immediately from the load balancer is four years old; even if the service controller is enhanced someday, it could take a long time until we can use it. I really hope to see this functionality in NTH.
Linking taint-effect issue, since I think that would mitigate this: https://github.com/aws/aws-node-termination-handler/issues/273
I'm not sure it would really do what we need. The problem is that draining instances from an ELBv2 load balancer is quite slow (usually 4-5 minutes in our experience), and, at least for our nodes, draining the containers is much, much faster.
lifecycle-manager is nice because it polls to make sure the instance is removed from the load balancer before it continues. If I'm reading the taint-effect issue right, it would apply a taint, which could cause an ELB drain to start, but there's not really anything that then waits for the drain to finish before the instances are terminated?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you want this issue to never become stale, please ask a maintainer to apply the "stalebot-ignore" label.
Can we get an update on this? This would be a cool feature!
We are trying to find a solution to the same problem for Cluster Autoscaler. In 1.18 and earlier Kubernetes versions, the cordon command used to remove the node from LBs. We want to retain similar behaviour in 1.19+; one option is to have Cluster Autoscaler add the label below to the worker node (or delete the worker node with kubectl delete node), which will remove the node from all associated Kubernetes LBs:
node.kubernetes.io/exclude-from-external-load-balancers=true
(the value doesn't matter; it can be true/false or anything)
With the custom termination policy supported by EC2 Auto Scaling, you would specify a Lambda function that can drain the node as well as deregister it from an ELB. This can be a solution until ELB deregistration is natively supported.
Refer to the following links for more details:
- https://aws.amazon.com/about-aws/whats-new/2021/07/amazon-ec2-auto-scaling-now-lets-you-control-which-instances-to-terminate-on-scale-in/
- https://docs.aws.amazon.com/autoscaling/ec2/userguide/lambda-custom-termination-policy.html
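A minimal sketch of such a Lambda, based on the custom termination policy contract (the function receives the scale-in candidates and returns the instance IDs that are safe to terminate). The draining logic itself is elided and all names are illustrative:

```python
def handler(event, context):
    # Instances the Auto Scaling group proposes as scale-in candidates.
    candidates = [i["InstanceId"] for i in event.get("Instances", [])]
    # How much capacity the group actually wants to remove.
    needed = sum(c["Capacity"] for c in event.get("CapacityToTerminate", []))
    # A real implementation would drain each candidate here: evict its pods
    # and deregister it from ELBs/target groups (e.g. via the Kubernetes API
    # and boto3), returning only instances whose draining has finished.
    drained = candidates  # placeholder: assume every candidate is drained
    # The contract: return the IDs the ASG is allowed to terminate.
    return {"InstanceIDs": drained[:needed]}
```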
Interested to hear from contributors here whether the solution in https://github.com/aws/aws-node-termination-handler/pull/582, which adds the label node.kubernetes.io/exclude-from-external-load-balancers
to nodes undergoing cordon-and-drain operations, is sufficient for your needs here.
Does that solve your problem, or do we need to do additional work to support your use cases?
It does not, in our case. The problem is that all the pods can be drained off the node faster than the node can be deregistered from the load balancer. So something like this happens:
- Instance drain starts. The node.kubernetes.io/exclude-from-external-load-balancers label is added, and the load balancer controller starts the deregistration process.
- The node termination handler evicts all the pods from the node.
- Once the pods are all evicted, the node is terminated, even though it is not yet deregistered from the ELB.
- The terminated instance still receives requests from the ELB until either the deregistration finishes or the health check trips.
- Finally, the ELB deregistration finishes.
In our experience, and after working with AWS support, the shortest duration we've been able to get load balancer deregistration down to is 2-3 minutes. Meanwhile we can usually evict all pods in less than 1 minute.
@sarahhodne admittedly I haven't done very comprehensive tests, but what I have observed is that if a target in a target group is draining before the associated instance is terminated then there is a much higher chance that the termination will not result in request errors. In fact I was not able to cause any requests errors in my testing this way.
I use the aws-load-balancer-controller to provision my load balancers.
We updated cluster autoscaler to add the node.kubernetes.io/exclude-from-external-load-balancers label to the nodes, which removes the nodes from all LBs.
In addition, we also have an ASG lifecycle hook that waits 300 seconds before terminating the node, and the ELB has 300 seconds of connection draining; this way we avoid 5xx issues.
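For reference, a lifecycle hook like that can be declared in CloudFormation roughly as follows (a sketch with hypothetical resource names; the 300s heartbeat matches the ELB's 300s connection draining):

```yaml
Resources:
  DrainWaitHook:
    Type: AWS::AutoScaling::LifecycleHook
    Properties:
      AutoScalingGroupName: !Ref WorkerAsg   # hypothetical ASG resource
      LifecycleTransition: "autoscaling:EC2_INSTANCE_TERMINATING"
      HeartbeatTimeout: 300                  # hold the instance for up to 300s
      DefaultResult: CONTINUE                # then let termination proceed
```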
Do you run it in IP or Instance mode, @tjs-intel?
@kristofferahl I switched from Instance to IP mode because of general lack of support for node draining by NTH and brupop
Thanks @tjs-intel! We use IP mode as well so I was wondering if you could possibly explain your setup a bit further as it seems you're not having any issues with dropped requests when using aws-load-balancer-controller and NTH? How do you achieve draining before the target/underlying instance is terminated?
@sarahhodne I think kubernetes/kubernetes#105946 ("Remove nodes with Cluster Autoscaler taint from LB backends in service controller") fixes the issue
We found a pretty nice way to handle this with Graceful Node Shutdown and preStop hooks on daemonsets. Essentially you set the kubelet parameters (in our case we use Karpenter, so we specified userData in the EC2NodeClass as follows):
```yaml
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: AL2
  userData: |
    #!/bin/bash -xe
    echo "$(jq '.shutdownGracePeriod="400s"' /etc/kubernetes/kubelet/kubelet-config.json)" > /etc/kubernetes/kubelet/kubelet-config.json
    echo "$(jq '.shutdownGracePeriodCriticalPods="100s"' /etc/kubernetes/kubelet/kubelet-config.json)" > /etc/kubernetes/kubelet/kubelet-config.json
```
and then deploy a daemonset on all karpenter nodes with a high terminationGracePeriodSeconds and preStop hook
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: karpenter-termination-waiter
  namespace: kube-system
  labels:
    k8s-app: karpenter-termination-waiter
spec:
  selector:
    matchLabels:
      name: karpenter-termination-waiter
  template:
    metadata:
      labels:
        name: karpenter-termination-waiter
    spec:
      nodeSelector:
        karpenter.sh/registered: "true"
      containers:
        - name: alpine
          image: alpine:latest
          command: ["sleep", "infinity"]
          # wait for the node to be completely deregistered from the load balancer
          lifecycle:
            preStop:
              exec:
                command: ["sleep", "300"]
          resources:
            limits:
              cpu: 5m
              memory: 10Mi
            requests:
              cpu: 2m
              memory: 5Mi
      priorityClassName: high-priority
      terminationGracePeriodSeconds: 300
```
The node is still running aws-node and kube-proxy behind the scenes, so it can properly direct requests from the load balancer until it's completely drained. It's important that the grace period and the sleep in the preStop hook are longer than the deregistration delay on the ALB, so the node isn't terminated before being fully drained.
@TaylorChristie a similar issue with Karpenter is being discussed at https://github.com/aws/karpenter-provider-aws/issues/4673
In your workaround, Karpenter removes the node from the LB during draining time, then all pods get deleted except the karpenter-termination-waiter daemonset, which keeps waiting for its preStop hook to complete and thereby holds the worker node for some time after it is removed from the LB?
We are waiting for an out-of-the-box solution from Karpenter, but your workaround makes sense to try and use until there is one.
Yep, because of the shutdownGracePeriod set in kubelet, it won't drain daemonsets like kube-proxy or aws-node (since they are higher priority), so the node can still properly forward NodePort traffic to other endpoints. I agree a native Karpenter solution would be much better, but in our testing this eliminated the LB 5XX issues we were experiencing.
@infa-ddeore, is there any official PR/fix to the CA for adding the node.kubernetes.io/exclude-from-external-load-balancers label? And another question for my understanding: assuming this label is added, what makes the ALB remove the node and stop sending it requests? Do we need the aws-load-balancer-controller for that? Currently, we occasionally experience 502 errors when CA scales in a node. Thanks.
There isn't an official PR for this; our devs made these changes and provided us a custom cluster autoscaler image. During the node draining process this label is added, and the EKS control plane removes that node from all associated ELBs, since we use the in-tree controller.
I haven't tested this for ALB or with the aws-load-balancer-controller, but I believe the ALB controller must also honor the label. You can try adding the label manually to see whether the node gets removed from the ALB's target group.
@infa-ddeore I checked, and indeed the ALB removes the node from the target group when I set the node.kubernetes.io/exclude-from-external-load-balancers label.
Any chance you (or your developers) can publish a PR for that? I think a lot of people would need it.
Hi @deepakdeore2004 and all, I'm writing my findings here after I was able to resolve the issue without any code changes.
It might be helpful to other people.
What you need to do (among other operations) is add a 60s AutoScaler delay after taint by setting
--node-delete-delay-after-taint=60s
You can read more about it here.
Explanation:
When AutoScaler reaches the conclusion that a node needs to be drained and eventually removed from K8S, it sets a special taint on the node with the value of "ToBeDeletedByClusterAutoscaler". Then, the aws-load-balancer-controller recognizes that and asks ALB to remove the node from ALB by calling the DeregisterTargets API and causing ALB to drain connections to this node (more about ALB draining process here). Default ALB draining time is 300s.
Five seconds after that (the default AutoScaler delay), the AutoScaler calls the TerminateInstanceInAutoScalingGroup API, causing the ASG to terminate the node by calling the TerminateInstances API.
ALB is not aware of the fact that the node is going to be terminated by ASG.
Even though ALB sends no new requests to a draining target, requests that are still in flight may take more than 5s to finish. When the node is terminated, those requests end with 502 errors because the connection is interrupted.
To avoid this interruption, ALB needs some delay so that all in-flight requests can finish before the node is terminated. This is achieved with the node-delete-delay-after-taint parameter: Cluster AutoScaler waits 60s before it notifies the ASG to terminate the node.
To summarize, the parameter effectively introduces a delay between the DeregisterTargets call and the TerminateInstances call, letting the ALB gracefully drain the connections.
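For anyone applying this, the flag goes on the cluster-autoscaler container's command line. A sketch of the relevant Deployment fragment (image tag and surrounding fields are illustrative):

```yaml
containers:
  - name: cluster-autoscaler
    image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.28.0  # illustrative tag
    command:
      - ./cluster-autoscaler
      - --cloud-provider=aws
      - --node-delete-delay-after-taint=60s  # delay between DeregisterTargets and TerminateInstances
```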
Thanks for the details @oridool. I see the AWS LB controller understands the ToBeDeletedByClusterAutoscaler taint and removes the node from the LB when it is added, and the --node-delete-delay-after-taint option keeps the node alive for the specified duration; this is a perfect solution.
But we use the in-tree controller, which doesn't understand this taint, so the cluster autoscaler customization is still needed on our side.