bottlerocket-update-operator

Does this project deregister nodes from load balancers?

Open tjs-intel opened this issue 2 years ago • 3 comments

What I'd like: I'm considering adding this to my EKS deployment, but I found a flaw in another project that drains nodes before termination (aws-node-termination-handler): nodes weren't properly detached from load balancers before being terminated, which resulted in failed requests. Does this project cover LB deregistration? I can't find the label node.kubernetes.io/exclude-from-external-load-balancers anywhere in this project.

Any alternatives you've considered: I use a CloudFormation stack for my ASGs, with a DaemonSet in the cluster that signals the stack. I might be able to get away with updating the Bottlerocket AMI every week and just letting the aws-node-termination-handler deregister nodes (a recently added feature).

tjs-intel avatar Feb 14 '22 17:02 tjs-intel

I'm curious about this too. I'm investigating whether Bottlerocket would be a good fit for us, and load balancer deregistration is a recurring pain point in our cluster at the moment. We've found that even the node.kubernetes.io/exclude-from-external-load-balancers label isn't sufficient, since the pods drain from our nodes much faster than the instance can deregister from the load balancers (usually a minimum of 3 minutes, sometimes longer depending on configuration).

See also https://github.com/aws/aws-node-termination-handler/issues/316

sarahhodne avatar Feb 14 '22 20:02 sarahhodne

Thanks for raising this. You're right, we don't currently handle this case. But we should! We'll investigate how to best add this to the update operator.

For external load balancers with the slower reaction times mentioned by @sarahhodne, I wonder if it would be appropriate for us to let users configure a check that runs between exclusion from load balancers and draining, so the operator can assert that it is safe to proceed?

cbgbt avatar Feb 15 '22 17:02 cbgbt
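
To make the proposed ordering concrete, here is a rough sketch of "exclude, check, then drain". It is not how the update operator works today. The callables `label_node`, `check`, and `drain`, along with the function names, are hypothetical stand-ins for whatever the operator and the user-configured check would provide; only the label key comes from the thread.

```python
import time
from typing import Callable

# Label key mentioned in this issue. The value "true" is only a convention
# assumed for this sketch.
EXCLUDE_LABEL = "node.kubernetes.io/exclude-from-external-load-balancers"


def exclusion_patch() -> dict:
    """Build a merge-patch style body that adds the exclusion label to a node."""
    return {"metadata": {"labels": {EXCLUDE_LABEL: "true"}}}


def wait_until_deregistered(
    check: Callable[[], bool],
    timeout_s: float,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll the user-supplied check until it passes or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)


def update_node(
    label_node: Callable[[dict], None],
    check: Callable[[], bool],
    drain: Callable[[], None],
    timeout_s: float,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Exclude the node, wait for the check to pass, and only then drain.

    Returns False without draining if the check never passes in time.
    """
    label_node(exclusion_patch())
    if not wait_until_deregistered(check, timeout_s, interval_s, sleep):
        return False
    drain()
    return True
```

Returning `False` instead of draining anyway is one possible design choice: the caller can then retry later or surface the failure, and pods are never evicted while the load balancer may still be sending traffic to the node.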

That should work in our case, I think.

If it's easier, even just having a sleep or some flag that ensures the drain takes at least x minutes should be sufficient. We can adjust how long the load balancers take to deregister with configuration. The problem is that the shortest duration we've been able to configure is 3 minutes, and since our pods can often terminate much faster than that, we need something that keeps the node online for a little while longer while the deregistration completes.

sarahhodne avatar Feb 15 '22 17:02 sarahhodne
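
The simpler "drain takes at least x minutes" idea could look roughly like the sketch below. The names `remaining_hold` and `drain_with_minimum_duration` and the `min_duration_s` parameter are hypothetical and not an existing operator flag. The only assumption is that the operator calls some `drain` function and must not reboot the node until this helper returns.

```python
import time
from typing import Callable


def remaining_hold(started_at: float, min_duration_s: float, now: float) -> float:
    """Seconds still to wait so that the total elapsed time reaches min_duration_s."""
    return max(0.0, min_duration_s - (now - started_at))


def drain_with_minimum_duration(
    drain: Callable[[], None],
    min_duration_s: float,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> float:
    """Run the drain, then keep the node up until min_duration_s has elapsed.

    Returns the number of seconds spent holding after the drain finished.
    """
    started = clock()
    drain()
    hold = remaining_hold(started, min_duration_s, clock())
    if hold > 0:
        sleep(hold)
    return hold
```

With the 3-minute deregistration floor described above, `min_duration_s=180` would mean a drain that finishes in 30 seconds is followed by a 150-second hold, while a drain that already took longer than 3 minutes adds no extra wait.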