aws-node-termination-handler
                        Increasing memory usage
Describe the bug: We have deployed v1.16.3 of the node termination handler. I noticed that the memory usage increases over time and eventually reaches the pod memory limit, at which point the pod is OOMKilled. Is there a memory leak somewhere?

Application Logs: Following are the logs for the aws-node-termination-handler pod. The logs do not show anything erroneous:
2022/07/28 07:15:01 INF Starting to serve handler /metrics, port 9092
2022/07/28 07:15:01 INF Starting to serve handler /healthz, port 8080
2022/07/28 07:15:01 INF Startup Metadata Retrieved metadata={"accountId":"xxxx","availabilityZone":"us-west-2b","instanceId":"i-xxxx","instanceLifeCycle":"on-demand","instanceType":"c6i.4xlarge","localHostname":"xxxx.us-west-2.compute.internal","privateIp":"x.x.x.x","publicHostname":"","publicIp":"","region":"us-west-2"}
2022/07/28 07:15:01 INF aws-node-termination-handler arguments:
dry-run: false,
node-name: xxxx.us-west-2.compute.internal,
pod-name: aws-node-termination-handler-b56bf578b-79x5m,
metadata-url: http://abcd,
kubernetes-service-host: x.x.x.X,
kubernetes-service-port: 443,
delete-local-data: true,
ignore-daemon-sets: true,
pod-termination-grace-period: -1,
node-termination-grace-period: 120,
enable-scheduled-event-draining: false,
enable-spot-interruption-draining: false,
enable-sqs-termination-draining: true,
enable-rebalance-monitoring: false,
enable-rebalance-draining: false,
metadata-tries: 3,
cordon-only: false,
taint-node: true,
taint-effect: NoSchedule,
exclude-from-load-balancers: false,
json-logging: false,
log-level: info,
webhook-proxy: ,
webhook-headers: 
- Kubernetes version: v1.21
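
For anyone trying to confirm the growth independently of the cluster's metrics pipeline, one option is to poll the handler's own /metrics endpoint (port 9092 in the logs above) and watch the memory gauges over time. The sketch below is only an illustration; it assumes the standard process_resident_memory_bytes / go_memstats_* series are exposed on that endpoint and that the port has been forwarded locally, neither of which is confirmed in this thread.

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
	"time"
)

// Poll the NTH /metrics endpoint once a minute and print the memory-related
// gauges, so growth shows up as a plain-text time series. The URL and metric
// names are assumptions; adjust them to whatever the deployment exposes.
func main() {
	const metricsURL = "http://localhost:9092/metrics" // e.g. via `kubectl port-forward`

	for {
		resp, err := http.Get(metricsURL)
		if err != nil {
			fmt.Println("scrape failed:", err)
		} else {
			scanner := bufio.NewScanner(resp.Body)
			for scanner.Scan() {
				line := scanner.Text()
				if strings.HasPrefix(line, "process_resident_memory_bytes") ||
					strings.HasPrefix(line, "go_memstats_heap_inuse_bytes") {
					fmt.Println(time.Now().Format(time.RFC3339), line)
				}
			}
			resp.Body.Close()
		}
		time.Sleep(time.Minute)
	}
}
```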
 
@abhijitawachar thanks for bringing this to our attention. I noticed that your metrics show a similar pattern of memory usage in the period between 5/4/22 and 5/12/22. We released v1.16.3 on 5/11/22, so it looks like this memory behavior may have existed prior to v1.16.3. To help us track down the issue, would you be able to tell us how far back you can see this behavior in your metrics, and what NTH version you had deployed?
@AustinSiu thanks for looking into this. We started using NTH on 5/3/22 with version v1.16.1 and updated it to v1.16.3 on 5/13/22. From the screenshot below, we can say that we have been seeing this issue since v1.16.1. Since we only recently started using NTH, we are not sure how far back this behaviour goes.

@snay2 / @AustinSiu did you get a chance to look at this?
@abhijitawachar Yes, I am actively investigating this week. So far, I have been able to reproduce this behavior in a sandboxed environment. I'm currently deep diving into possible causes with a memory profiler.
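
For readers who want to run a similar investigation against their own build, the usual technique for a long-running Go service is to expose net/http/pprof and diff heap profiles taken some time apart. The sketch below only illustrates that general technique; it is not a description of how NTH itself is instrumented, and the side port is an arbitrary choice.

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
	// Serve the profiler on a side port. Heap snapshots can then be pulled with:
	//   go tool pprof http://localhost:6060/debug/pprof/heap
	// Taking two snapshots a few hours apart and diffing them with
	//   go tool pprof -base first.pb.gz second.pb.gz
	// usually points straight at the allocation sites that keep growing.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	select {} // stand-in for the service's real work loop
}
```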
Hey @snay2, did you get a chance to look into this?
My initial research from 2022-08-29 was inconclusive, and so far I haven't found a root cause. For now, I'm pausing active work on this issue. Sorry I don't have better news at this time. :(
However, we have long-term plans to build a more robust simulation system in our CI/CD process that will help us monitor for memory leaks. To help with that, we want to write test drivers that simulate real-world workloads. Could you tell me a bit more about how your cluster is set up (Queue Processor or IMDS; how frequently does NTH process messages, and what kinds; etc.)? From your graphs, it looks like every few weeks, the NTH pod gets OOM killed and restarted; is that correct? Your graph is measuring the memory usage of the NTH pod, yes? Is this behavior causing any negative performance impact to your cluster?
That kind of information will help us design a realistic simulation system and hopefully track down this leak.
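
If it helps with designing that simulation, a Queue Processor test driver can be as small as a loop that publishes synthetic interruption events to the queue NTH polls. Below is a rough sketch using aws-sdk-go; the queue URL, region, and event payload shape are placeholders and would need to match what EventBridge actually delivers in a real cluster.

```go
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/sqs"
)

func main() {
	// Placeholder queue URL; point this at the queue NTH is configured to poll.
	const queueURL = "https://sqs.us-west-2.amazonaws.com/123456789012/nth-test-queue"

	sess := session.Must(session.NewSession(aws.NewConfig().WithRegion("us-west-2")))
	client := sqs.New(sess)

	for i := 0; ; i++ {
		// Simplified stand-in for an EC2 Spot interruption event as forwarded by
		// EventBridge; the real payload carries more fields than shown here.
		body := fmt.Sprintf(`{
  "source": "aws.ec2",
  "detail-type": "EC2 Spot Instance Interruption Warning",
  "detail": {"instance-id": "i-%012d", "instance-action": "terminate"}
}`, i)

		_, err := client.SendMessage(&sqs.SendMessageInput{
			QueueUrl:    aws.String(queueURL),
			MessageBody: aws.String(body),
		})
		if err != nil {
			log.Println("send failed:", err)
		}
		time.Sleep(30 * time.Second) // rough message rate; tune to match the real workload
	}
}
```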
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you want this issue to never become stale, please ask a maintainer to apply the "stalebot-ignore" label.
This issue was closed because it has become stale with no activity.