aws-node-termination-handler
Dropping Event Error because of event signal latency
Describe the bug
Sometimes many 504 errors are logged because spot nodes shut down ungracefully before NTH drains them. Inspecting the logs, I found that only about 40s pass between NTH receiving the termination event and the log line WRN dropping event error="node: 'i-0e39b037e55532xxx' in state 'terminated'".
Steps to reproduce
Run NTH in SQS (queue processor) mode with an ASG lifecycle hook that has a heartbeat timeout of 300s and the result set to CONTINUE.
Expected outcome
NTH should catch spot termination events within the 2-minute notice window and drain the node gracefully.
Application Logs
2024/06/11 19:51:13 INF Requesting instance drain event-id=asg-lifecycle-term-xxx instance-id=i-0446be825a53dfxxx kind=ASG_LIFECYCLE node-name=ip-192-168-64-xx.eu-west-1.compute.internal provider-id=aws:///eu-west-1a/i-0446be825a53dfxxx
2024/06/11 19:51:52 WRN dropping event error="node: 'i-0446be825a53dfxxx' in state 'terminated'"
The difference between the two log lines is only 39s. I see many similar log entries related to the 504 errors.
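The gap can be checked directly from the two timestamps. This is a minimal sketch, assuming every log line starts with a `YYYY/MM/DD HH:MM:SS` timestamp as in the excerpt above; `seconds_between` is a hypothetical helper name for illustration, not part of NTH.

```python
from datetime import datetime

# Assumed timestamp layout, taken from the log excerpt in this issue.
LOG_TIME_FORMAT = "%Y/%m/%d %H:%M:%S"
TIMESTAMP_LENGTH = len("2024/06/11 19:51:13")


def seconds_between(first_line: str, second_line: str) -> float:
    """Return the seconds elapsed between two log lines (hypothetical helper)."""
    first = datetime.strptime(first_line[:TIMESTAMP_LENGTH], LOG_TIME_FORMAT)
    second = datetime.strptime(second_line[:TIMESTAMP_LENGTH], LOG_TIME_FORMAT)
    return (second - first).total_seconds()


# Log lines shortened to the timestamp and message prefix.
drain_requested = "2024/06/11 19:51:13 INF Requesting instance drain"
event_dropped = "2024/06/11 19:51:52 WRN dropping event"

elapsed = seconds_between(drain_requested, event_dropped)
print(f"{elapsed:.0f}s between drain request and dropped event")
```

Running the same helper over other INF/WRN pairs for a given instance ID would show whether the gap is consistently shorter than the 2-minute spot notice.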
Question
Does AWS not wait at least 2 minutes before terminating spot nodes? Or is NTH unable to capture termination events in a timely manner?
Environment
- NTH App Version: v1.21.0
- NTH Mode (IMDS/Queue processor): Queue
- OS/Arch: EC2, Amazon Linux (EKS)
- Kubernetes version: v1.29
- Installation method: Helm