aws-node-termination-handler Dropping Event Error because of event signal latency

Open ibalat opened this issue 1 year ago • 0 comments

Describe the bug Sometimes many 504 errors are logged because spot nodes shut down ungracefully before NTH can drain them. I inspected the logs, and only about 40s pass between NTH receiving the terminate event and the log line WRN dropping event error="node: 'i-0e39b037e55532xxx' in state 'terminated'".

Steps to reproduce Run NTH in SQS (queue processor) mode with an ASG lifecycle hook that has a heartbeat timeout of 300s and CONTINUE as its result.

Expected outcome NTH should catch spot terminate events within the 2-minute window and drain the node gracefully.

Application Logs

2024/06/11 19:51:13 INF Requesting instance drain event-id=asg-lifecycle-term-xxx instance-id=i-0446be825a53dfxxx kind=ASG_LIFECYCLE node-name=ip-192-168-64-xx.eu-west-1.compute.internal provider-id=aws:///eu-west-1a/i-0446be825a53dfxxx

2024/06/11 19:51:52 WRN dropping event error="node: 'i-0446be825a53dfxxx' in state 'terminated'"

The difference is only 39s. I have many similar log entries related to the 504 errors.
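
For anyone who wants to check the same gap on their own logs, here is a minimal sketch that recomputes it from the two lines quoted above. It assumes every NTH log line starts with a `YYYY/MM/DD HH:MM:SS` timestamp, as in this excerpt. The helper name `log_timestamp` and the variable names are made up for illustration and are not part of NTH.

```python
from datetime import datetime

# Timestamp layout seen at the start of the log lines quoted in this issue
# (assumption: all NTH log lines use this same 19-character prefix).
LOG_TIME_FORMAT = "%Y/%m/%d %H:%M:%S"

def log_timestamp(line: str) -> datetime:
    """Parse the leading 'YYYY/MM/DD HH:MM:SS' timestamp of a log line."""
    return datetime.strptime(line[:19], LOG_TIME_FORMAT)

# The two lines from the issue, shortened after the timestamp and message;
# only the leading timestamp is used below.
drain_requested = "2024/06/11 19:51:13 INF Requesting instance drain"
event_dropped = "2024/06/11 19:51:52 WRN dropping event"

elapsed = log_timestamp(event_dropped) - log_timestamp(drain_requested)
elapsed_seconds = int(elapsed.total_seconds())

# The 2-minute figure is the expectation stated in this issue.
EXPECTED_WINDOW_SECONDS = 120

print(f"elapsed: {elapsed_seconds}s, expected at least {EXPECTED_WINDOW_SECONDS}s")
```

The same function can be pointed at any pair of "Requesting instance drain" and "dropping event" lines for one instance ID to see how often the gap falls short of the expected window.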

Question Does AWS not wait at least 2 minutes before terminating spot nodes? Or is NTH unable to capture terminate events in a timely manner?

Environment

NTH App Version: v1.21.0
NTH Mode (IMDS/Queue processor): Queue
OS/Arch: EC2-amazon linux (EKS)
Kubernetes version: v1.29
Installation method: Helm

Jun 12 '24 06:06 ibalat
