aws-node-termination-handler

Increasing memory usage

Open abhijitawachar opened this issue 3 years ago • 6 comments

Describe the bug: We have deployed v1.16.3 of the node termination handler. I noticed that memory usage increases over time and eventually reaches the pod's memory limit, at which point the pod is OOMKilled. Is there a memory leak somewhere?

[screenshot: nth-memory-usage]

Application Logs: Following are the logs for the aws-node-termination-handler pod. The logs do not show anything erroneous:

2022/07/28 07:15:01 INF Starting to serve handler /metrics, port 9092
2022/07/28 07:15:01 INF Starting to serve handler /healthz, port 8080
2022/07/28 07:15:01 INF Startup Metadata Retrieved metadata={"accountId":"xxxx","availabilityZone":"us-west-2b","instanceId":"i-xxxx","instanceLifeCycle":"on-demand","instanceType":"c6i.4xlarge","localHostname":"xxxx.us-west-2.compute.internal","privateIp":"x.x.x.x","publicHostname":"","publicIp":"","region":"us-west-2"}
2022/07/28 07:15:01 INF aws-node-termination-handler arguments: dry-run: false, node-name: xxxx.us-west-2.compute.internal, pod-name: aws-node-termination-handler-b56bf578b-79x5m, metadata-url: http://abcd, kubernetes-service-host: x.x.x.X, kubernetes-service-port: 443, delete-local-data: true, ignore-daemon-sets: true, pod-termination-grace-period: -1, node-termination-grace-period: 120, enable-scheduled-event-draining: false, enable-spot-interruption-draining: false, enable-sqs-termination-draining: true, enable-rebalance-monitoring: false, enable-rebalance-draining: false, metadata-tries: 3, cordon-only: false, taint-node: true, taint-effect: NoSchedule, exclude-from-load-balancers: false, json-logging: false, log-level: info, webhook-proxy: , webhook-headers: , webhook-url: , webhook-template: , uptime-from-file: , enable-prometheus-server: true, prometheus-server-port: 9092, emit-kubernetes-events: true, kubernetes-events-extra-annotations: , aws-region: us-west-2, queue-url: https://xxxx, check-asg-tag-before-draining: false, managed-asg-tag: aws-node-termination-handler/managed, use-provider-id: false, aws-endpoint: ,
2022/07/28 07:15:01 INF Started watching for interruption events
2022/07/28 07:15:01 INF Kubernetes AWS Node Termination Handler has started successfully!
2022/07/28 07:15:01 INF Started watching for event cancellations
2022/07/28 07:15:01 INF Started monitoring for events event_type=SQS_TERMINATE
2022/07/28 07:20:12 INF Adding new event to the event store event={"AutoScalingGroupName":"xxxx","Description":"EC2 State Change event received. Instance i-xxxx went into shutting-down at 2022-07-28 07:20:11 +0000 UTC \n","EndTime":"0001-01-01T00:00:00Z","EventID":"ec2-state-change-event-xxxx","InProgress":false,"InstanceID":"i-xxxx","IsManaged":true,"Kind":"SQS_TERMINATE","NodeLabels":null,"NodeName":"xxxx.us-west-2.compute.internal","NodeProcessed":false,"Pods":null,"ProviderID":"aws:///us-west-2c/xxxx","StartTime":"2022-07-28T07:20:11Z","State":""}

  • Kubernetes version: v1.21
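For reference, one way to confirm that the restarts are in fact OOM kills is to check each container's last termination state via the Kubernetes API. Below is a minimal client-go sketch; the namespace and label selector are assumptions about a typical NTH install, not necessarily our exact setup:

```go
package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load kubeconfig from the default location; adjust as needed.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Namespace and label selector are assumptions; adjust to your deployment.
	pods, err := clientset.CoreV1().Pods("kube-system").List(context.TODO(), metav1.ListOptions{
		LabelSelector: "app.kubernetes.io/name=aws-node-termination-handler",
	})
	if err != nil {
		log.Fatal(err)
	}

	for _, pod := range pods.Items {
		for _, cs := range pod.Status.ContainerStatuses {
			// LastTerminationState.Terminated.Reason is "OOMKilled" when the
			// container was killed for exceeding its memory limit.
			if term := cs.LastTerminationState.Terminated; term != nil {
				fmt.Printf("%s/%s: restarts=%d lastTerminationReason=%s\n",
					pod.Name, cs.Name, cs.RestartCount, term.Reason)
			}
		}
	}
}
```

The same information is visible in `kubectl describe pod` output under "Last State: Terminated".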

abhijitawachar avatar Aug 03 '22 10:08 abhijitawachar

@abhijitawachar thanks for bringing this to our attention. I noticed that your metrics show a similar pattern of memory usage in the period between 5/4/22 and 5/12/22. We released v1.16.3 on 5/11/22, so it looks like this memory behavior may have existed prior to v1.16.3. To help us track down the issue, could you tell us how far back you can see this behavior in your metrics, and which NTH version you had deployed at the time?

AustinSiu avatar Aug 07 '22 22:08 AustinSiu

@AustinSiu thanks for looking into this. We started using NTH on 5/3/22 with version v1.16.1 and updated to v1.16.3 on 5/13/22. From the screenshot below, we can say that we have been seeing this issue since v1.16.1. Since we only recently started using NTH, we are not sure how far back this behaviour goes.

[screenshot: nth-mem-use]

abhijitawachar avatar Aug 08 '22 07:08 abhijitawachar

@snay2 / @AustinSiu did you get a chance to look at this?

abhijitawachar avatar Aug 29 '22 07:08 abhijitawachar

@abhijitawachar Yes, I am actively investigating this week. So far, I have been able to reproduce this behavior in a sandboxed environment, and I am currently deep-diving into possible causes with a memory profiler.
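For context, the standard way to profile memory in a Go service like NTH is to expose `net/http/pprof` and diff heap snapshots taken some time apart. This is an illustrative sketch only, not NTH's actual code:

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// Expose the profiling endpoints on a side port, then capture snapshots:
	//   curl -s http://localhost:6060/debug/pprof/heap > heap1.pb.gz
	//   (wait a few hours)
	//   curl -s http://localhost:6060/debug/pprof/heap > heap2.pb.gz
	//   go tool pprof -base heap1.pb.gz heap2.pb.gz
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```

Comparing the two snapshots with `-base` highlights whichever allocation sites keep growing.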

snay2 avatar Aug 29 '22 15:08 snay2

Hey @snay2, did you get a chance to look into this?

abhijitawachar avatar Sep 08 '22 16:09 abhijitawachar

My initial research from 2022-08-29 was inconclusive, and so far I haven't found a root cause. For now, I'm pausing active work on this issue. Sorry I don't have better news at this time. :(

However, we have long-term plans to build a more robust simulation system in our CI/CD process that will help us monitor for memory leaks. To help with that, we want to write test drivers that simulate real-world workloads. Could you tell me a bit more about how your cluster is set up (Queue Processor or IMDS; how frequently does NTH process messages, and what kinds; etc.)? From your graphs, it looks like every few weeks, the NTH pod gets OOM killed and restarted; is that correct? Your graph is measuring the memory usage of the NTH pod, yes? Is this behavior causing any negative performance impact to your cluster?

That kind of information will help us design a realistic simulation system and hopefully track down this leak.
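As an illustration of the kind of test driver I have in mind for the Queue Processor path, something like the sketch below could push synthetic EC2 state-change messages onto a test queue at a steady rate. The queue URL, instance ID, and event body here are placeholders, and the exact payload shape NTH expects may differ:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/sqs"
)

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	client := sqs.NewFromConfig(cfg)

	// Placeholder values; point these at a dedicated test queue and instance.
	queueURL := "https://sqs.us-west-2.amazonaws.com/123456789012/nth-test-queue"
	instanceID := "i-0123456789abcdef0"

	// Minimal EventBridge-style EC2 state-change notification body.
	body := fmt.Sprintf(`{
	  "source": "aws.ec2",
	  "detail-type": "EC2 Instance State-change Notification",
	  "time": %q,
	  "region": "us-west-2",
	  "resources": ["arn:aws:ec2:us-west-2:123456789012:instance/%s"],
	  "detail": {"instance-id": %q, "state": "shutting-down"}
	}`, time.Now().UTC().Format(time.RFC3339), instanceID, instanceID)

	// Send messages at a fixed interval to simulate a steady stream of events
	// while watching NTH's memory usage.
	for i := 0; i < 100; i++ {
		if _, err := client.SendMessage(ctx, &sqs.SendMessageInput{
			QueueUrl:    aws.String(queueURL),
			MessageBody: aws.String(body),
		}); err != nil {
			log.Fatal(err)
		}
		time.Sleep(30 * time.Second)
	}
}
```

A driver along these lines, run against a sandbox cluster, could help reproduce the growth curve faster than waiting on real interruption events.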

snay2 avatar Sep 09 '22 22:09 snay2

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you want this issue to never become stale, please ask a maintainer to apply the "stalebot-ignore" label.

github-actions[bot] avatar Feb 19 '23 17:02 github-actions[bot]

This issue was closed because it has become stale with no activity.

github-actions[bot] avatar Feb 24 '23 17:02 github-actions[bot]