aws-node-termination-handler

Dropping Event Error because of event signal latency

Open ibalat opened this issue 1 year ago • 0 comments

Describe the bug: Sometimes there are many 504 errors in the logs because spot nodes shut down ungracefully before NTH drains them. I inspected the logs, and only about 40s pass between NTH receiving the terminate event and this log line: WRN dropping event error="node: 'i-0e39b037e55532xxx' in state 'terminated'"

Steps to reproduce: Run NTH in SQS (queue processor) mode with an ASG lifecycle hook that has a 300s heartbeat timeout and a default result of CONTINUE.

Expected outcome: NTH should catch spot termination events within the 2-minute notice window and drain the node gracefully.

Application Logs:

2024/06/11 19:51:13 INF Requesting instance drain event-id=asg-lifecycle-term-xxx instance-id=i-0446be825a53dfxxx kind=ASG_LIFECYCLE node-name=ip-192-168-64-xx.eu-west-1.compute.internal provider-id=aws:///eu-west-1a/i-0446be825a53dfxxx

2024/06/11 19:51:52 WRN dropping event error="node: 'i-0446be825a53dfxxx' in state 'terminated'"

The difference is only 39s. I see many similar log entries associated with the 504 errors.
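
To check the same gap across many nodes rather than by eye, the two log lines can be paired by instance ID and the timestamps subtracted. The following is only a rough sketch: it assumes the timestamp layout and the "Requesting instance drain" / "dropping event" message text seen in the lines above, and the helper names (`parse_line`, `drain_to_drop_seconds`) are made up for illustration and are not part of NTH.

```python
import re
from datetime import datetime

# The two log lines quoted in this report.
LOG_LINES = [
    "2024/06/11 19:51:13 INF Requesting instance drain "
    "event-id=asg-lifecycle-term-xxx instance-id=i-0446be825a53dfxxx "
    "kind=ASG_LIFECYCLE node-name=ip-192-168-64-xx.eu-west-1.compute.internal "
    "provider-id=aws:///eu-west-1a/i-0446be825a53dfxxx",
    "2024/06/11 19:51:52 WRN dropping event "
    "error=\"node: 'i-0446be825a53dfxxx' in state 'terminated'\"",
]

# Assumption: every line starts with a 19-character timestamp in this layout.
TIMESTAMP_FORMAT = "%Y/%m/%d %H:%M:%S"
# Assumption: instance IDs look like "i-" followed by lowercase letters/digits
# (the trailing "xxx" redaction in the report also matches).
INSTANCE_ID = re.compile(r"\bi-[0-9a-z]+")


def parse_line(line):
    """Return (timestamp, instance_id or None) for one log line."""
    timestamp = datetime.strptime(line[:19], TIMESTAMP_FORMAT)
    match = INSTANCE_ID.search(line)
    return timestamp, match.group(0) if match else None


def drain_to_drop_seconds(lines):
    """Map instance ID -> seconds between drain request and dropped event."""
    drain_started = {}
    gaps = {}
    for line in lines:
        timestamp, instance_id = parse_line(line)
        if instance_id is None:
            continue
        if "Requesting instance drain" in line:
            drain_started[instance_id] = timestamp
        elif "dropping event" in line and instance_id in drain_started:
            delta = timestamp - drain_started[instance_id]
            gaps[instance_id] = delta.total_seconds()
    return gaps


print(drain_to_drop_seconds(LOG_LINES))
```

For the two lines above this reports a 39-second gap for i-0446be825a53dfxxx. Feeding it a full log export would show whether the short gap is typical or limited to a few nodes.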

Question: Does AWS not wait at least 2 minutes before terminating spot nodes? Or is NTH unable to capture termination events in a timely manner?

Environment

  • NTH App Version: v1.21.0
  • NTH Mode (IMDS/Queue processor): Queue
  • OS/Arch: EC2-amazon linux (EKS)
  • Kubernetes version: v1.29
  • Installation method: Helm

ibalat, Jun 12 '24 06:06