coroot-node-agent

Memory leak in node-agent

[Open] Gakhramanzode opened this issue 7 months ago · 9 comments

I use Coroot in a Kubernetes cluster and noticed a resource utilization problem in one of the coroot-node-agent pods. The pod coroot-node-agent-szq9x, which runs on the monitoring node, shows very high CPU and memory usage and eventually gets OOM-killed; the other agents in the cluster work normally. The screenshots below show the high CPU and RAM utilization and the resulting Out of Memory kill.

[screenshots: CPU and memory usage graphs for coroot-node-agent-szq9x]

kubectl describe po coroot-node-agent-szq9x
...
    State:          Running
      Started:      Wed, 16 Apr 2025 11:37:38 +0300
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Wed, 16 Apr 2025 11:19:23 +0300
      Finished:     Wed, 16 Apr 2025 11:37:37 +0300
    Ready:          True
    Restart Count:  29
    Limits:
      cpu:     1
      memory:  1Gi
    Requests:
      cpu:     1
      memory:  1Gi
...

Pod coroot-node-agent-szq9x runs on the monitoring node:

kubectl describe node kube-main-4a
...
  Namespace                   Name                                                     CPU Requests  CPU Limits   Memory Requests  Memory Limits  Age
  ---------                   ----                                                     ------------  ----------   ---------------  -------------  ---
  calico-system               calico-node-v95hp                                        0 (0%)        0 (0%)       0 (0%)           0 (0%)         178d
  calico-system               csi-node-driver-zcjsv                                    0 (0%)        0 (0%)       0 (0%)           0 (0%)         178d
  coroot                      coroot-cluster-agent-5db66bb4f8-c9hkg                    2 (25%)       2 (25%)      8Gi (26%)        8Gi (26%)      44h
  coroot                      coroot-coroot-0                                          0 (0%)        0 (0%)       10Gi (33%)       10Gi (33%)     4d18h
  coroot                      coroot-node-agent-szq9x                                  1 (12%)       1 (12%)      1Gi (3%)         1Gi (3%)       17h
  coroot                      coroot-operator-6fbc557687-kgwbq                         100m (1%)     500m (6%)    64Mi (0%)        1Gi (3%)       4d22h
  kube-state-metrics          kube-state-metrics-7fc9d8bd9-nblzm                       0 (0%)        0 (0%)       0 (0%)           0 (0%)         44h
  kube-system                 kube-proxy-9mfwr                                         0 (0%)        0 (0%)       0 (0%)           0 (0%)         621d
  kube-system                 node-local-dns-fvxfm                                     100m (1%)     0 (0%)       50Mi (0%)        0 (0%)         607d
  monitoring-system           vmagent-vmagent-5bcddb78f7-xxk87                         2100m (26%)   4100m (51%)  4121Mi (13%)     4121Mi (13%)   64d
  monitoring-system           vmalert-vmalert-5fbf78cb8b-z9sgr                         150m (1%)     300m (3%)    225Mi (0%)       525Mi (1%)     44h
  monitoring-system           vmoperator-victoria-metrics-operator-67ffc558c8-44n7c    0 (0%)        0 (0%)       0 (0%)           0 (0%)         44h
  monitoring-system           x509-certificate-exporter-nodes-65r9f                    10m (0%)      100m (1%)    20Mi (0%)        40Mi (0%)      517d
  observability               jaeger-agent-daemonset-qs8h5                             0 (0%)        0 (0%)       0 (0%)           0 (0%)         608d
  observability               jaeger-collector-858fb8d797-crgvf                        0 (0%)        0 (0%)       0 (0%)           0 (0%)         44h
  observability               jaeger-collector-858fb8d797-g4rgs                        0 (0%)        0 (0%)       0 (0%)           0 (0%)         44h
  observability               jaeger-operator-df86f5749-8tk9j                          0 (0%)        0 (0%)       0 (0%)           0 (0%)         44h
  observability               jaeger-query-665966b4f7-c79wr                            0 (0%)        0 (0%)       0 (0%)           0 (0%)         44h
  rook-ceph                   csi-cephfsplugin-ttnwm                                   300m (3%)     600m (7%)    640Mi (2%)       1280Mi (4%)    514d
  rook-ceph                   csi-rbdplugin-lb684                                      300m (3%)     600m (7%)    640Mi (2%)       1280Mi (4%)    514d
  tigera-operator             tigera-operator-6f488755c5-5spgt                         0 (0%)        0 (0%)       0 (0%)           0 (0%)         178d
...

Attached pprof files: mem_profile_kube-main-4a_new_new_2.tgz mem_profile_kube-main-4a_new_new.tgz mem_profile_kube-main-4a_new.tgz mem_profile_kube-main-4a.tgz

Environment:

  • Coroot version: 1.9.10
  • Coroot Node Agent: 1.23.14
  • Kubernetes version: v1.25.12

Gakhramanzode · Apr 16 '25 09:04

@def @apetruhin hello all, I'm attaching a new example.

[screenshot: agent memory usage]

mem_profile_kube-infra-14a.tgz

Environment:

  • Coroot version: 1.11.4
  • Coroot Node Agent: 1.23.23

Gakhramanzode · May 21 '25 19:05

I have the same issue. I raised the agent's memory limit to 12Gi and it still gets OOMKilled; when I removed the limit entirely, I saw 45Gi of memory usage.

Coroot version: 1.11.4, agent: 1.23.25

[screenshot: node-agent memory usage]

stepanselyuk · May 23 '25 15:05

@stepanselyuk can you share a memory profile?

def · May 23 '25 16:05
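
(For anyone collecting one: a minimal sketch, assuming the agent exposes Go's standard net/http/pprof endpoints on its listen port — the namespace, pod name, and port below are taken from this thread and may differ in your setup.)

# forward the agent's listen port to localhost
kubectl -n coroot port-forward pod/coroot-node-agent-szq9x 10300:10300 &
# grab a heap profile via the standard Go pprof handler (an assumption; the path may differ)
curl -s http://127.0.0.1:10300/debug/pprof/heap -o heap.pb.gz
# show the top allocators
go tool pprof -top heap.pb.gz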

@def Hello!

I have updated coroot-node-agent from 1.23.26 to 1.24.0. So far, your fix looks good.

[screenshot: agent memory usage after upgrading to 1.24.0]

Gakhramanzode · May 28 '25 13:05

@def @apetruhin sadly, we are again seeing CPU throttling and high RAM utilization on coroot-node-agent.

[screenshots: CPU throttling and RAM utilization graphs]

mem_profile_kube-infra-14a.tgz

NAME                                    READY   STATUS    RESTARTS     AGE
coroot-cluster-agent-5549dc6f5d-c5sh5   2/2     Running   0            19h
coroot-coroot-0                         1/1     Running   0            19h
coroot-node-agent-7mcz9                 1/1     Running   0            19h
coroot-node-agent-84t9l                 1/1     Running   0            19h
coroot-node-agent-8n7tl                 1/1     Running   1 (9h ago)   19h
coroot-node-agent-bmzn6                 1/1     Running   0            19h
coroot-node-agent-ckqnt                 1/1     Running   0            19h
coroot-node-agent-cnn9m                 1/1     Running   0            19h
coroot-node-agent-crl2v                 1/1     Running   0            19h
coroot-node-agent-fgg78                 1/1     Running   0            19h
coroot-node-agent-gwtr2                 1/1     Running   0            19h
coroot-node-agent-n8mkj                 1/1     Running   0            19h
coroot-node-agent-nzns9                 1/1     Running   0            19h
coroot-node-agent-sg22z                 1/1     Running   0            19h
coroot-node-agent-sz96l                 1/1     Running   0            19h
coroot-node-agent-ts5ld                 1/1     Running   0            19h
coroot-operator-7c556d688b-t2wq4        1/1     Running   0            19h

kubectl describe node kube-infra-14a
  Namespace                   Name                                     CPU Requests  CPU Limits   Memory Requests  Memory Limits  Age
  ---------                   ----                                     ------------  ----------   ---------------  -------------  ---
  calico-system               calico-node-cff7k                        0 (0%)        0 (0%)       0 (0%)           0 (0%)         91d
  calico-system               csi-node-driver-k7rjl                    0 (0%)        0 (0%)       0 (0%)           0 (0%)         91d
  coroot                      coroot-coroot-0                          7 (43%)       7 (43%)      10Gi (33%)       10Gi (33%)     19h
  coroot                      coroot-node-agent-8n7tl                  100m (0%)     500m (3%)    200Mi (0%)       1Gi (3%)       19h
  kube-system                 kube-proxy-fcs9k                         0 (0%)        0 (0%)       0 (0%)           0 (0%)         91d
  kube-system                 node-local-dns-drc5d                     100m (0%)     0 (0%)       50Mi (0%)        0 (0%)         91d
  monitoring-system           promtail-lqdgw                           250m (1%)     500m (3%)    256Mi (0%)       512Mi (1%)     29d
  monitoring-system           vmagent-vmagent-cbc99886-ztm8n           3100m (19%)   3100m (19%)  4121Mi (13%)     4121Mi (13%)   34d
  monitoring-system           x509-certificate-exporter-nodes-pjpmb    10m (0%)      100m (0%)    20Mi (0%)        40Mi (0%)      91d
  rook-ceph                   csi-cephfsplugin-8zzz2                   300m (1%)     600m (3%)    640Mi (2%)       1280Mi (4%)    91d
  rook-ceph                   csi-rbdplugin-r7xfj                      300m (1%)     600m (3%)    640Mi (2%)       1280Mi (4%)    91d

coroot-node-agent version 1.24.0

Gakhramanzode · Jun 03 '25 05:06

@Gakhramanzode please upgrade the agent to v1.25.1

def · Jun 05 '25 12:06
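
(To verify which agent version is actually running after an upgrade, the DaemonSet's image tag can be checked; a minimal sketch, assuming the DaemonSet is named coroot-node-agent in the coroot namespace, as the pod names in this thread suggest.)

# print the image (and thus the version tag) the DaemonSet is currently running
kubectl -n coroot get daemonset coroot-node-agent -o jsonpath='{.spec.template.spec.containers[0].image}'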

@def v1.25.1 hasn't helped us :(

[screenshots: agent CPU and memory usage on v1.25.1]

mem_profile_kube-infra-14a.tgz

coroot/coroot-node-agent:1.25.1

Gakhramanzode · Jun 10 '25 14:06

It seems that you have encountered a metrics leak. This occurs when metrics are continuously created over time and are not reclaimed, which can be due to a bug. I'd suggest analyzing the metrics reported by this particular agent and trying to find the ones with the highest cardinality.

def · Jun 10 '25 14:06
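
(A minimal sketch of checking for that kind of continuous growth, assuming the agent's metrics port has been forwarded to localhost:10300 as in the port-forward shown in the following comment: count the exposed series at two points in time; a steadily increasing count indicates a leak.)

# total number of series currently exposed (lines that are not comments)
curl -s http://127.0.0.1:10300/metrics | grep -cv '^#'
# wait ~10 minutes, run it again, and compare the counts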

Am I correct in understanding the following approach?

If I perform a port forward to the coroot-node-agent pod that seems to be leaking memory: kubectl port-forward pod/coroot-node-agent-bv4z8 10300:10300

And then run: curl http://127.0.0.1:10300/metrics > metrics.dump.txt

Is analysing the contents of metrics.dump.txt (e.g. looking for the series with the highest label-set cardinality) the right way to pinpoint a potential metrics leak, or is there a better method I should use?

Gakhramanzode · Jun 11 '25 07:06
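
(For reference, a minimal sketch of the analysis described above: strip the labels and value from each series line of the dump, then count how many series each metric name has; metric names with unexpectedly large counts are the leak candidates. Assumes metrics.dump.txt was produced by the curl command above.)

# count series per metric name, highest-cardinality metrics first
grep -v '^#' metrics.dump.txt \
  | sed -e 's/{.*//' -e 's/ .*//' \
  | sort | uniq -c | sort -rn | head -20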