
[EKS] [bug]: Memory leak in eks-node-monitoring-agent container

Open • nathanmcgarvey-modopayments opened this issue 7 months ago

Community Note

  • Please vote on this issue by adding a šŸ‘ reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

eks-node-monitoring-agent (at least v1.0.2-eksbuild.2) has a memory leak: the container eventually hits its configured memory limit and k8s OOM-kills and restarts it (correctly).

Containers:
  eks-node-monitoring-agent:
    Image:         602401143452.dkr.ecr.us-east-2.amazonaws.com/eks/eks-node-monitoring-agent:v1.0.2-eksbuild.2
    Image ID:      602401143452.dkr.ecr.us-east-2.amazonaws.com/eks/eks-node-monitoring-agent@sha256:f5455808952ec4679e3f1ccf7e743014b76634b0f6358eab7cd288fcc18c73d6
    Port:          <none>
    Host Port:     <none>
    Args:
      --probe-address=:8002
      --metrics-address=:8003
    State:          Running
      Started:      Wed, 30 Apr 2025 08:37:20 -0500
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Mon, 14 Apr 2025 11:12:18 -0500
      Finished:     Wed, 30 Apr 2025 08:37:19 -0500
    Ready:          True
    Restart Count:  1
    Limits:
      cpu:     250m
      memory:  100Mi
    Requests:
      cpu:     10m
      memory:  30Mi
    Liveness:  http-get http://:8002/healthz delay=0s timeout=1s period=10s #success=1 #failure=3
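
The status above is the kubectl describe output for the agent pod; a sketch of reproducing it, assuming the default kube-system namespace and the label selector used later in this thread:

# Namespace and label selector are assumptions; adjust to your install.
kubectl -n kube-system describe pod \
  -l app.kubernetes.io/name=eks-node-monitoring-agent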

Grafana graph of the pod's memory usage over the affected timeframe, with the container OOM/restart at the end:

[screenshot]
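
For anyone without Grafana handy, the same growth can be watched with metrics-server (a sketch; namespace and selector assumed):

# Sample the agent's container memory periodically; usage climbs toward the
# 100Mi limit until the container is OOMKilled and restarted.
watch -n 30 kubectl -n kube-system top pod \
  -l app.kubernetes.io/name=eks-node-monitoring-agent --containers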

A goroutine leak was identified affecting versions up to and including the latest, v1.2.0-eksbuild.1. A fix should be rolled out in the next release.

ndbaker1 avatar May 09 '25 18:05 ndbaker1

A goroutine leak was identified affecting versions up to and including the latest, v1.2.0-eksbuild.1. A fix should be rolled out in the next release.

Looks like a new version (v1.3.0-eksbuild.1) has been released. Is the fix rolled out with this version?

arvi3411301 avatar Jun 17 '25 09:06 arvi3411301

The latest release is v1.3.0-eksbuild.2 and should address the issue.

ndbaker1 avatar Jun 18 '25 23:06 ndbaker1

We're running v1.3.0-eksbuild.2 currently with the default config and we're still seeing the OOMKilled issues

cilindrox avatar Jul 10 '25 15:07 cilindrox

Just an update that this is still happening across several clusters

cilindrox avatar Aug 07 '25 17:08 cilindrox

Thanks, we'll monitor a repro to get more details. Is the OOM cadence the same as the original (~15 minutes)?

ndbaker1 avatar Aug 11 '25 06:08 ndbaker1

It's slightly more spaced out now (not as frequent), but we're seeing the restarts happen every 19-29m, e.g.:

eks-node-monitoring-agent-dpddm   0/1     OOMKilled   89 (30m ago)      2d10h
eks-node-monitoring-agent-dpddm   1/1     Running     90 (2s ago)       2d10h
eks-node-monitoring-agent-mzwkk   0/1     OOMKilled   186 (15m ago)     2d20h
eks-node-monitoring-agent-mzwkk   1/1     Running     187 (2s ago)      2d20h

Edit: here are the images for the above:

Image:         602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/eks-node-monitoring-agent:v1.4.0-eksbuild.2
Image ID:      602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/eks-node-monitoring-agent@sha256:1e6868782a167ba923e8d7135d44fb24c8ee4029229bfa7d9535c08a9c69db17

cilindrox avatar Aug 11 '25 11:08 cilindrox
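
A quick way to confirm the cadence from the API rather than eyeballing restarts is to read each pod's last terminated state (a sketch; namespace and selector assumed):

# Prints when the previous (OOMKilled) container instance started and finished,
# i.e. how long each instance survives before hitting the limit.
kubectl -n kube-system get pod \
  -l app.kubernetes.io/name=eks-node-monitoring-agent \
  -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.containerStatuses[0].lastState.terminated.startedAt}{" -> "}{.status.containerStatuses[0].lastState.terminated.finishedAt}{"\n"}{end}'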

@cilindrox are you seeing this only on a specific set of instances types?

ndbaker1 avatar Aug 13 '25 18:08 ndbaker1

@ndbaker1 instance types are c6a.32xlarge and t3a.xlarge (x86) and c6g.8xlarge (ARM), all on-demand. I've got two pods with 7h+ uptime on a c6a.12xlarge and no restarts, which is a bit baffling, but they're both on dedicated ingress-controller nodes, so maybe there isn't enough memory pressure there.

Other pods sitting on the same node type (c6a.12xlarge) do get OOMKilled, but less often: once in the past 7h+ so far.

k get po -l app.kubernetes.io/name=eks-node-monitoring-agent
NAME                              READY   STATUS    RESTARTS          AGE
eks-node-monitoring-agent-2k95c   1/1     Running   1 (143m ago)      7h23m
eks-node-monitoring-agent-4b2j6   1/1     Running   261 (8m42s ago)   5d5h
eks-node-monitoring-agent-9rf9f   1/1     Running   13 (14m ago)      11h
eks-node-monitoring-agent-d67q4   1/1     Running   1 (11h ago)       5d5h
eks-node-monitoring-agent-dgmhl   1/1     Running   31 (44m ago)      22h
eks-node-monitoring-agent-dpddm   1/1     Running   194 (24m ago)     4d19h
eks-node-monitoring-agent-f67zj   1/1     Running   1 (11h ago)       5d5h
eks-node-monitoring-agent-fg45c   1/1     Running   183 (24m ago)     4d19h
eks-node-monitoring-agent-hccfl   1/1     Running   1 (11h ago)       5d5h
eks-node-monitoring-agent-hsmsv   1/1     Running   32 (58m ago)      22h
eks-node-monitoring-agent-jqldr   1/1     Running   265 (15m ago)     5d5h
eks-node-monitoring-agent-jxvth   1/1     Running   48 (15m ago)      5d5h
eks-node-monitoring-agent-kf8mw   1/1     Running   0                 7h23m
eks-node-monitoring-agent-kprks   1/1     Running   4 (8m8s ago)      128m
eks-node-monitoring-agent-lhbdf   1/1     Running   238 (18m ago)     5d5h
eks-node-monitoring-agent-lhbmh   1/1     Running   22 (12m ago)      11h
eks-node-monitoring-agent-nrj96   1/1     Running   18 (123m ago)     2d9h
eks-node-monitoring-agent-p66z5   1/1     Running   0                 7h27m
eks-node-monitoring-agent-pj2vx   1/1     Running   49 (28m ago)      2d9h
eks-node-monitoring-agent-qg7c5   1/1     Running   3 (39m ago)       3h10m
eks-node-monitoring-agent-rwdvb   1/1     Running   60 (28m ago)      4d3h
eks-node-monitoring-agent-t6scf   1/1     Running   252 (14m ago)     5d5h
eks-node-monitoring-agent-tr2rp   1/1     Running   261 (13m ago)     5d5h
eks-node-monitoring-agent-x2bbg   1/1     Running   3 (8m4s ago)      7h23m

cilindrox avatar Aug 13 '25 20:08 cilindrox
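
For correlating restart counts with instance types, a sketch (the node label is the standard Kubernetes one; namespace assumed):

# Map each agent pod to its node, then show each node's instance type.
kubectl -n kube-system get pod \
  -l app.kubernetes.io/name=eks-node-monitoring-agent -o wide
kubectl get nodes -L node.kubernetes.io/instance-type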

Just another quick update: I've upgraded the nodes' AMIs to ami-032e05156b62442b1 (arm64) and ami-00848331bb38314ea (x86_64) and I'm still seeing these OOMKills.

cilindrox avatar Aug 15 '25 14:08 cilindrox

Tested with v1.4.0-eksbuild.2; this issue still exists. It goes straight to OOMKilled even with higher memory limits configured.

ljhowie avatar Aug 21 '25 07:08 ljhowie
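
For anyone raising the limit as a stopgap via the managed add-on, the configuration schema can be inspected first; the keys in the update call below are illustrative and not confirmed for this add-on:

# See which configuration keys the add-on actually exposes.
aws eks describe-addon-configuration \
  --addon-name eks-node-monitoring-agent \
  --addon-version v1.4.0-eksbuild.2 \
  --query configurationSchema --output text

# Then apply values matching that schema (cluster name and JSON keys are placeholders).
aws eks update-addon \
  --cluster-name my-cluster \
  --addon-name eks-node-monitoring-agent \
  --configuration-values '{"resources":{"limits":{"memory":"200Mi"}}}'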

Something quite bizarre in my case, though: we have a few clusters, but only some of them have this issue. Some clusters are OK even with the default configuration.

ljhowie avatar Aug 21 '25 07:08 ljhowie

Is there support for pprof (https://pkg.go.dev/runtime/pprof), even remote (https://pkg.go.dev/net/http/pprof), or some other means to create heap dumps in the eks-node-monitoring-agent? That would quite quickly let us find what is using all that memory...

frittentheke avatar Aug 21 '25 11:08 frittentheke

We're following a similar trail, but haven't encountered conditions where it seemingly spikes right away as reported here. Is there anything with notable disk I/O running on the failing nodes?

You can add the following flag via the EKS add-on configuration to enable the pprof endpoint on the agent. additionalArgs was added recently, in v1.4.0-eksbuild.2:

{
    "monitoringAgent": {
        "additionalArgs": ["--pprof-address=:8082"]
    }
}

ndbaker1 avatar Aug 21 '25 18:08 ndbaker1
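
With that flag set, a heap profile can be pulled from one agent pod, assuming the agent exposes the standard net/http/pprof handlers on the configured address (pod name and namespace below are placeholders):

# Forward the pprof port (8082, per the additionalArgs above) from one agent pod.
kubectl -n kube-system port-forward pod/eks-node-monitoring-agent-xxxxx 8082:8082

# In another shell: fetch and inspect the heap profile interactively...
go tool pprof http://localhost:8082/debug/pprof/heap

# ...or save the raw profile to attach to this issue.
curl -o heap.pb.gz http://localhost:8082/debug/pprof/heap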

@ndbaker1 I'm guessing the I/O is relative, because we're seeing these on dedicated ingress-nginx nodes in dev environments with barely any traffic (~30 reqs/min) and, like @ljhowie said, no events on nodes that have really hot pods with lots of logging/connections, etc.

Edit: the other workload that might be causing I/O pressure is falcosecurity/falco pods 🤔

cilindrox avatar Aug 22 '25 10:08 cilindrox

I’m having the same issue here. I checked, and the node has no memory pressure and no high disk I/O. The pod just reached the memory limit and got killed.

Willis0826 avatar Sep 07 '25 18:09 Willis0826

I'm seeing similar memory usage with the latest 602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/eks-node-monitoring-agent:v1.4.0-eksbuild.2, which suggests a memory leak. Is any maintainer looking at this issue? How can we get in touch with them?

I also created a bug report in https://github.com/aws/eks-node-monitoring-agent/issues/17.

[screenshot]

davidxia avatar Sep 08 '25 14:09 davidxia

Any update on this? The pods are still not running and show OOMKilled status.

aravind-jl avatar Nov 04 '25 10:11 aravind-jl

@aravind-jl There appear to be some updates in https://github.com/aws/eks-node-monitoring-agent/issues/17#issuecomment-3483646668

kunhwiko avatar Nov 04 '25 18:11 kunhwiko