
NTH should issue lifecycle heartbeats

Open jrsherry opened this issue 3 years ago • 13 comments

I've been using NTH in queue processor mode. This implementation uses a lifecycle hook on the node's instance to trigger NTH to cordon and drain. Lifecycle hooks support two timeouts: the global timeout (max 48 hours) and the heartbeat timeout (max 7,200 seconds). https://docs.aws.amazon.com/autoscaling/ec2/APIReference/API_LifecycleHook.html
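
For context, a hook like this might be registered along the following lines with aws-sdk-go (a rough sketch only; the ASG and hook names are placeholders, and NTH itself does not create the hook for you):

```go
package main

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

func main() {
	sess := session.Must(session.NewSession())
	client := autoscaling.New(sess)

	// Register a termination lifecycle hook. HeartbeatTimeout bounds how long
	// the ASG waits between heartbeats; DefaultResult CONTINUE means the
	// instance is terminated anyway once the hook times out.
	_, err := client.PutLifecycleHook(&autoscaling.PutLifecycleHookInput{
		AutoScalingGroupName: aws.String("my-asg"),         // placeholder
		LifecycleHookName:    aws.String("nth-drain-hook"), // placeholder
		LifecycleTransition:  aws.String("autoscaling:EC2_INSTANCE_TERMINATING"),
		HeartbeatTimeout:     aws.Int64(7200), // max allowed value, in seconds
		DefaultResult:        aws.String("CONTINUE"),
	})
	if err != nil {
		panic(err)
	}
}
```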

This means that if NTH doesn't issue lifecycle heartbeats during the draining process, the node will be terminated once the hook's heartbeat timeout expires (at most 7,200 seconds, or whatever the hook is configured to), assuming the hook's default result is CONTINUE rather than ABANDON.

This is problematic if you have pod termination grace periods that can exceed 7,200 seconds: the node will be terminated before the pods can safely evict.

If NTH issued lifecycle heartbeats during the node drain, this would effectively support grace periods extending up to the 48-hour global timeout. https://docs.aws.amazon.com/cli/latest/reference/autoscaling/record-lifecycle-action-heartbeat.html
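
What's being requested might look roughly like this inside the drain path: a loop that records heartbeats until the drain completes. This is a sketch only; the helper name, parameters, and interval are illustrative, not NTH's actual code.

```go
package heartbeat

import (
	"context"
	"log"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

// keepAlive records a lifecycle-action heartbeat every interval until ctx is
// cancelled (e.g. once the drain finishes and the lifecycle action is
// completed). Each heartbeat resets the hook's heartbeat timeout, so the
// drain can continue up to the 48-hour global timeout.
func keepAlive(ctx context.Context, client *autoscaling.AutoScaling,
	asgName, hookName, instanceID string, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			_, err := client.RecordLifecycleActionHeartbeat(&autoscaling.RecordLifecycleActionHeartbeatInput{
				AutoScalingGroupName: aws.String(asgName),
				LifecycleHookName:    aws.String(hookName),
				InstanceId:           aws.String(instanceID),
			})
			if err != nil {
				log.Printf("failed to record lifecycle heartbeat: %v", err)
			}
		}
	}
}
```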

jrsherry avatar Oct 06 '21 11:10 jrsherry

Interesting, I hadn't expected nodes taking hours to drain. We can look into adding the heartbeat to cover the long draining case. Thanks for reporting!

bwagner5 avatar Oct 08 '21 02:10 bwagner5

I would ask that this be made configurable so that installs that intentionally use the heartbeat timeout as the limiting factor (which is how NTH works now) can still do so. For cases where NTH can't reliably determine if progress is being made (which would probably be most of the time?), the heartbeat would be counterproductive if you want to timeout before the global timeout.

gabegorelick avatar Oct 28 '21 19:10 gabegorelick

We'd be interested in this as well; we have some applications that we run with a graceful termination threshold of up to 3 hours. In most cases the pods terminate gracefully well before the 3 hours, but there are cases where it can take close to 3 hours for the pods to terminate gracefully.

For cases where NTH can't reliably determine if progress is being made (which would probably be most of the time?), the heartbeat would be counterproductive if you want to timeout before the global timeout.

Yes, a configurable maximum period for sending heartbeats would be helpful. NTH could be configured to send heartbeats for a given period (3 hours in our case), with the lifecycle heartbeat timeout set to 5 minutes or so; that way the node would still be terminated once the heartbeats stop (see the sketch below).
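
A minimal sketch of that idea, reusing the hypothetical keepAlive loop from the earlier comment: cap the heartbeat period with a deadline so the node is reclaimed shortly after heartbeats stop. All names and durations here are illustrative, not existing NTH configuration.

```go
// Hypothetical usage: heartbeat for at most 3 hours. With the hook's
// heartbeat timeout set to ~5 minutes, the ASG terminates the node shortly
// after this context expires, even if the drain is still stuck.
ctx, cancel := context.WithTimeout(context.Background(), 3*time.Hour)
defer cancel()
go keepAlive(ctx, client, "my-asg", "nth-drain-hook", instanceID, time.Minute)
```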

varkey avatar Nov 16 '21 15:11 varkey

It looks like this issue has a good amount of interest. We would absolutely be open to accepting a PR for this, but right now we are focusing on the next version of NTH (V2). In V2 we hope to eliminate the need for adding so many additional configurations and solve a number of other issues on this repository.

jillmon avatar Nov 17 '21 19:11 jillmon

right now we are focusing on the next version of NTH (V2)

Is there any documentation on V2, and specifically for this issue how it would handle heartbeats?

gabegorelick avatar Nov 17 '21 19:11 gabegorelick

@snay2 @cjerad is this implemented in v2?

stevehipwell avatar Nov 03 '22 17:11 stevehipwell

@stevehipwell Currently, v2 does not issue heartbeats.

cjerad avatar Nov 03 '22 18:11 cjerad

@cjerad is that a conscious decision or is it something you'd like to do if you had to resource to implement it?

stevehipwell avatar Nov 03 '22 18:11 stevehipwell

@stevehipwell It just hasn't been investigated yet.

cjerad avatar Nov 16 '22 20:11 cjerad

@cjerad are you looking for contributions?

stevehipwell avatar Nov 16 '22 21:11 stevehipwell

bump? it's been years since there was a mention of V2

0x91 avatar Nov 08 '23 12:11 0x91

Any update on this? 🙏🏽 This is a very important feature for running stateful workloads like Kafka clusters, where graceful rotation of nodes is critical and also quite slow.

riuvshyn avatar Nov 21 '23 10:11 riuvshyn

Interesting, I hadn't expected nodes taking hours to drain. We can look into adding the heartbeat to cover the long draining case. Thanks for reporting!

@bwagner5 This problem affects not only pods that take a long time to terminate. It also affects any workload with many pods where podAntiAffinity plus a PDB limits how many pods can be terminated at once. Under those constraints, every node running such pods is racing the 2-hour timeout, because the PDB won't allow many pods to be evicted at the same time, so progress depends on how quickly each pod is replaced and lets the next node proceed with eviction. If startup or termination takes longer than usual, or AWS capacity issues prevent new nodes from being scheduled, there is a high chance that a significant number of nodes pending termination will hit the 2-hour timeout and be killed without draining, causing downtime for the workloads.

This is a pretty serious limitation that has to be considered before using NTH in large clusters.

riuvshyn avatar Nov 23 '23 18:11 riuvshyn