containers-roadmap
[EKS] [bug]: Memory leak in eks-node-monitoring-agent container
Community Note
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment
eks-node-monitoring-agent (at least v1.0.2-eksbuild.2) has a memory leak which causes it to hit its configured memory limit, and k8s then restarts the container (correctly).
Containers:
eks-node-monitoring-agent:
Image: 602401143452.dkr.ecr.us-east-2.amazonaws.com/eks/eks-node-monitoring-agent:v1.0.2-eksbuild.2
Image ID: 602401143452.dkr.ecr.us-east-2.amazonaws.com/eks/eks-node-monitoring-agent@sha256:f5455808952ec4679e3f1ccf7e743014b76634b0f6358eab7cd288fcc18c73d6
Port: <none>
Host Port: <none>
Args:
--probe-address=:8002
--metrics-address=:8003
State: Running
Started: Wed, 30 Apr 2025 08:37:20 -0500
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
Started: Mon, 14 Apr 2025 11:12:18 -0500
Finished: Wed, 30 Apr 2025 08:37:19 -0500
Ready: True
Restart Count: 1
Limits:
cpu: 250m
memory: 100Mi
Requests:
cpu: 10m
memory: 30Mi
Liveness: http-get http://:8002/healthz delay=0s timeout=1s period=10s #success=1 #failure=3
Grafana timeframe showing memory usage of the pod with the container OOM/restart at the end:
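For a quick check without the dashboard, the live memory usage of the agent container can be sampled with kubectl top (a sketch; it assumes metrics-server is installed and that the add-on runs in kube-system, where EKS add-ons are normally deployed):
# Per-container memory usage of the agent pods (requires metrics-server)
kubectl top pod -n kube-system -l app.kubernetes.io/name=eks-node-monitoring-agent --containers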
A goroutine leak was identified affecting versions up to the latest v1.2.0-eksbuild.1. A fix should be rolled out in the next release.
Looks like a new version (v1.3.0-eksbuild.1) has been released. Is the fix rolled out with this version?
The latest release is v1.3.0-eksbuild.2, and it should address the issue.
We're running v1.3.0-eksbuild.2 currently with the default config and we're still seeing the OOMKilled issues
Just an update that this is still happening across several clusters
Thanks, we'll monitor a repro to get more details. Is the OOM cadence the same as the original (~15 minutes)?
It's slightly more spaced out now and not as frequent, but we're seeing the restarts happen every 19-29 minutes, e.g.:
eks-node-monitoring-agent-dpddm 0/1 OOMKilled 89 (30m ago) 2d10h
eks-node-monitoring-agent-dpddm 1/1 Running 90 (2s ago) 2d10h
eks-node-monitoring-agent-mzwkk 0/1 OOMKilled 186 (15m ago) 2d20h
eks-node-monitoring-agent-mzwkk 1/1 Running 187 (2s ago) 2d20h
Edit: here's the images for the above:
Image: 602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/eks-node-monitoring-agent:v1.4.0-eksbuild.2
Image ID: 602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/eks-node-monitoring-agent@sha256:1e6868782a167ba923e8d7135d44fb24c8ee4029229bfa7d9535c08a9c69db17
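For anyone comparing the cadence, a rough way to pull the last OOM termination reason and timestamp straight from the pod status (a sketch; these are standard container-status fields, kube-system namespace assumed):
# Print pod name, last termination reason, and when it finished
kubectl get pods -n kube-system -l app.kubernetes.io/name=eks-node-monitoring-agent \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].lastState.terminated.reason}{"\t"}{.status.containerStatuses[0].lastState.terminated.finishedAt}{"\n"}{end}'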
@cilindrox are you seeing this only on a specific set of instance types?
@ndbaker1 instance types are c6a.32xlarge and t3a.xlarge (x86) and c6g.8xlarge (ARM), all on-demand. I've got two pods with 7h+ uptime on a c6a.12xlarge and no restarts, which is a bit baffling, but they're both on dedicated ingress-controller nodes, so maybe there isn't enough memory pressure there.
Other pods sitting on the same node type (c6a.12xlarge) do get OOMKilled, but less often, only once in the past 7h+ so far (a way to rank pods by restart count is sketched after the listing below):
k get po -l app.kubernetes.io/name=eks-node-monitoring-agent
NAME READY STATUS RESTARTS AGE
eks-node-monitoring-agent-2k95c 1/1 Running 1 (143m ago) 7h23m
eks-node-monitoring-agent-4b2j6 1/1 Running 261 (8m42s ago) 5d5h
eks-node-monitoring-agent-9rf9f 1/1 Running 13 (14m ago) 11h
eks-node-monitoring-agent-d67q4 1/1 Running 1 (11h ago) 5d5h
eks-node-monitoring-agent-dgmhl 1/1 Running 31 (44m ago) 22h
eks-node-monitoring-agent-dpddm 1/1 Running 194 (24m ago) 4d19h
eks-node-monitoring-agent-f67zj 1/1 Running 1 (11h ago) 5d5h
eks-node-monitoring-agent-fg45c 1/1 Running 183 (24m ago) 4d19h
eks-node-monitoring-agent-hccfl 1/1 Running 1 (11h ago) 5d5h
eks-node-monitoring-agent-hsmsv 1/1 Running 32 (58m ago) 22h
eks-node-monitoring-agent-jqldr 1/1 Running 265 (15m ago) 5d5h
eks-node-monitoring-agent-jxvth 1/1 Running 48 (15m ago) 5d5h
eks-node-monitoring-agent-kf8mw 1/1 Running 0 7h23m
eks-node-monitoring-agent-kprks 1/1 Running 4 (8m8s ago) 128m
eks-node-monitoring-agent-lhbdf 1/1 Running 238 (18m ago) 5d5h
eks-node-monitoring-agent-lhbmh 1/1 Running 22 (12m ago) 11h
eks-node-monitoring-agent-nrj96 1/1 Running 18 (123m ago) 2d9h
eks-node-monitoring-agent-p66z5 1/1 Running 0 7h27m
eks-node-monitoring-agent-pj2vx 1/1 Running 49 (28m ago) 2d9h
eks-node-monitoring-agent-qg7c5 1/1 Running 3 (39m ago) 3h10m
eks-node-monitoring-agent-rwdvb 1/1 Running 60 (28m ago) 4d3h
eks-node-monitoring-agent-t6scf 1/1 Running 252 (14m ago) 5d5h
eks-node-monitoring-agent-tr2rp 1/1 Running 261 (13m ago) 5d5h
eks-node-monitoring-agent-x2bbg 1/1 Running 3 (8m4s ago) 7h23m
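To rank a listing like the one above by restart count, kubectl's --sort-by can be pointed at the container status (a sketch; same label selector, kube-system namespace assumed):
# Pods with the most restarts appear last
kubectl get pods -n kube-system -l app.kubernetes.io/name=eks-node-monitoring-agent \
  --sort-by='.status.containerStatuses[0].restartCount'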
Just another quick update that I've upgraded the node's AMIs to ami-032e05156b62442b1 (arm64) and ami-00848331bb38314ea (x86_64) and I'm still seeing these OOMKills
Tested with v1.4.0-eksbuild.2; this issue still exists. The container goes straight to OOMKilled even though higher memory limits are configured.
But something quite bizarre in my case: we have a few clusters, and only some of them have this issue. Some clusters are OK even with the default configuration.
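For anyone bumping the limit as a stopgap while the leak is investigated, a hedged sketch of doing it through the add-on configuration; whether a resources block is accepted, and where it sits relative to the monitoringAgent key shown later in this thread, is an assumption about the schema, so check the real schema first:
# Inspect the add-on's actual configuration schema
aws eks describe-addon-configuration --addon-name eks-node-monitoring-agent --addon-version v1.4.0-eksbuild.2
# Assumed shape of a higher memory limit; adjust to whatever the schema above actually exposes
aws eks update-addon --cluster-name <cluster> --addon-name eks-node-monitoring-agent \
  --configuration-values '{"monitoringAgent":{"resources":{"limits":{"memory":"256Mi"}}}}'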
Is there support for pprof (https://pkg.go.dev/runtime/pprof), even remote pprof (https://pkg.go.dev/net/http/pprof), or some other means to create heap dumps in the eks-node-monitoring-agent? That would quite quickly allow us to find what is using all that memory ...
We're following a similar trail, but haven't encountered conditions where it seemingly spikes right away as reported here. Is there anything with notable disk I/O running on failing nodes?
You can add the following flag via the EKS Addon configuration to enable the pprof endpoint on the agent; additionalArgs was added recently in v1.4.0-eksbuild.2. A sketch of pulling a profile from that endpoint follows the config.
{
"monitoringAgent": {
"additionalArgs": ["--pprof-address=:8082"]
}
}
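Once that flag is in place, a rough way to pull profiles off a running agent pod (a sketch; it assumes the pprof address serves the standard net/http/pprof paths, and the pod name is just an example taken from this thread):
# Forward the pprof port from one of the agent pods
kubectl -n kube-system port-forward pod/eks-node-monitoring-agent-dpddm 8082:8082
# In another terminal: interactive heap profile
go tool pprof http://localhost:8082/debug/pprof/heap
# Or a full goroutine dump, given the goroutine leak mentioned earlier in the thread
curl -s 'http://localhost:8082/debug/pprof/goroutine?debug=2' > goroutines.txt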
@ndbaker1 I'm guessing the I/O is relative, because we're seeing these on dedicated ingress-nginx nodes in dev environments with barely any traffic (~30 reqs/min) and, like @ljhowie said, no events on nodes that have really hot pods with lots of logging/connections etc.
Edit: the other workload that might be causing I/O pressure is the falcosecurity/falco pods 🤔
I'm having the same issue here. I checked, and the node has no memory pressure and no high disk I/O. The pod just reached the memory limit and got killed.
I'm seeing similar memory usage with the latest 602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/eks-node-monitoring-agent:v1.4.0-eksbuild.2 that suggests a memory leak. Is any maintainer looking at this issue? How can we get in touch with them?
I also created a bug report in https://github.com/aws/eks-node-monitoring-agent/issues/17.
Any update on this? The pods are still failing with OOMKilled status instead of running.
@aravind-jl There appear to be some updates in https://github.com/aws/eks-node-monitoring-agent/issues/17#issuecomment-3483646668