coroot-node-agent
Memory leak in node-agent
I use Coroot in a Kubernetes cluster and noticed a resource utilization problem with one of the coroot-node-agent pods.
The pod coroot-node-agent-szq9x, which runs on the monitoring node, shows very high CPU and memory usage and is eventually OOMKilled. The other agents in the cluster work normally.
In the screenshot below we see high CPU and RAM utilization, followed by an Out of Memory kill.
kubectl describe po coroot-node-agent-szq9x
...
    State:          Running
      Started:      Wed, 16 Apr 2025 11:37:38 +0300
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Wed, 16 Apr 2025 11:19:23 +0300
      Finished:     Wed, 16 Apr 2025 11:37:37 +0300
    Ready:          True
    Restart Count:  29
    Limits:
      cpu:     1
      memory:  1Gi
    Requests:
      cpu:     1
      memory:  1Gi
...
Pod coroot-node-agent-szq9x runs on the monitoring node:
kubectl describe node kube-main-4a
...
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
calico-system calico-node-v95hp 0 (0%) 0 (0%) 0 (0%) 0 (0%) 178d
calico-system csi-node-driver-zcjsv 0 (0%) 0 (0%) 0 (0%) 0 (0%) 178d
coroot coroot-cluster-agent-5db66bb4f8-c9hkg 2 (25%) 2 (25%) 8Gi (26%) 8Gi (26%) 44h
coroot coroot-coroot-0 0 (0%) 0 (0%) 10Gi (33%) 10Gi (33%) 4d18h
coroot coroot-node-agent-szq9x 1 (12%) 1 (12%) 1Gi (3%) 1Gi (3%) 17h
coroot coroot-operator-6fbc557687-kgwbq 100m (1%) 500m (6%) 64Mi (0%) 1Gi (3%) 4d22h
kube-state-metrics kube-state-metrics-7fc9d8bd9-nblzm 0 (0%) 0 (0%) 0 (0%) 0 (0%) 44h
kube-system kube-proxy-9mfwr 0 (0%) 0 (0%) 0 (0%) 0 (0%) 621d
kube-system node-local-dns-fvxfm 100m (1%) 0 (0%) 50Mi (0%) 0 (0%) 607d
monitoring-system vmagent-vmagent-5bcddb78f7-xxk87 2100m (26%) 4100m (51%) 4121Mi (13%) 4121Mi (13%) 64d
monitoring-system vmalert-vmalert-5fbf78cb8b-z9sgr 150m (1%) 300m (3%) 225Mi (0%) 525Mi (1%) 44h
monitoring-system vmoperator-victoria-metrics-operator-67ffc558c8-44n7c 0 (0%) 0 (0%) 0 (0%) 0 (0%) 44h
monitoring-system x509-certificate-exporter-nodes-65r9f 10m (0%) 100m (1%) 20Mi (0%) 40Mi (0%) 517d
observability jaeger-agent-daemonset-qs8h5 0 (0%) 0 (0%) 0 (0%) 0 (0%) 608d
observability jaeger-collector-858fb8d797-crgvf 0 (0%) 0 (0%) 0 (0%) 0 (0%) 44h
observability jaeger-collector-858fb8d797-g4rgs 0 (0%) 0 (0%) 0 (0%) 0 (0%) 44h
observability jaeger-operator-df86f5749-8tk9j 0 (0%) 0 (0%) 0 (0%) 0 (0%) 44h
observability jaeger-query-665966b4f7-c79wr 0 (0%) 0 (0%) 0 (0%) 0 (0%) 44h
rook-ceph csi-cephfsplugin-ttnwm 300m (3%) 600m (7%) 640Mi (2%) 1280Mi (4%) 514d
rook-ceph csi-rbdplugin-lb684 300m (3%) 600m (7%) 640Mi (2%) 1280Mi (4%) 514d
tigera-operator tigera-operator-6f488755c5-5spgt 0 (0%) 0 (0%) 0 (0%) 0 (0%) 178d
...
Attached pprof files: mem_profile_kube-main-4a_new_new_2.tgz mem_profile_kube-main-4a_new_new.tgz mem_profile_kube-main-4a_new.tgz mem_profile_kube-main-4a.tgz
Environment:
- Coroot version: 1.9.10
- Coroot Node Agent: 1.23.14
- Kubernetes version: v1.25.12
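For reference, a minimal sketch of how a heap profile like the ones attached above can be collected. This assumes the agent exposes Go's standard net/http/pprof handlers on the same port that serves /metrics (10300 in this setup), which may not hold for every build; the pod name is an example.
# Forward the agent's listen port to localhost (pod name is an example)
kubectl -n coroot port-forward pod/coroot-node-agent-szq9x 10300:10300
# In another terminal, grab a heap snapshot.
# Assumes the standard Go /debug/pprof endpoints are served on this port.
curl -s http://127.0.0.1:10300/debug/pprof/heap -o heap.pprof
# Inspect the top allocations locally.
go tool pprof -top heap.pprof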
@def @apetruhin hello all, I'm attaching a new example:
mem_profile_kube-infra-14a.tgz
Environment:
- Coroot version: 1.11.4
- Coroot Node Agent: 1.23.23
I have the same issue. I limited the agent to 12Gi and it still gets OOMKilled; after removing the limit, I saw memory usage reach 45Gi.
Coroot version: 1.11.4, Agent: 1.23.25
@stepanselyuk can you share a memory profile?
@def Hello!
I have updated coroot-node-agent from 1.23.26 to 1.24.0. Your fix looks promising so far.
@def @apetruhin sadly, we are again seeing CPU throttling and high RAM utilization on coroot-node-agent.
mem_profile_kube-infra-14a.tgz
NAME READY STATUS RESTARTS AGE
coroot-cluster-agent-5549dc6f5d-c5sh5 2/2 Running 0 19h
coroot-coroot-0 1/1 Running 0 19h
coroot-node-agent-7mcz9 1/1 Running 0 19h
coroot-node-agent-84t9l 1/1 Running 0 19h
coroot-node-agent-8n7tl 1/1 Running 1 (9h ago) 19h
coroot-node-agent-bmzn6 1/1 Running 0 19h
coroot-node-agent-ckqnt 1/1 Running 0 19h
coroot-node-agent-cnn9m 1/1 Running 0 19h
coroot-node-agent-crl2v 1/1 Running 0 19h
coroot-node-agent-fgg78 1/1 Running 0 19h
coroot-node-agent-gwtr2 1/1 Running 0 19h
coroot-node-agent-n8mkj 1/1 Running 0 19h
coroot-node-agent-nzns9 1/1 Running 0 19h
coroot-node-agent-sg22z 1/1 Running 0 19h
coroot-node-agent-sz96l 1/1 Running 0 19h
coroot-node-agent-ts5ld 1/1 Running 0 19h
coroot-operator-7c556d688b-t2wq4 1/1 Running 0 19h
k describe no kube-infra-14a
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
calico-system calico-node-cff7k 0 (0%) 0 (0%) 0 (0%) 0 (0%) 91d
calico-system csi-node-driver-k7rjl 0 (0%) 0 (0%) 0 (0%) 0 (0%) 91d
coroot coroot-coroot-0 7 (43%) 7 (43%) 10Gi (33%) 10Gi (33%) 19h
coroot coroot-node-agent-8n7tl 100m (0%) 500m (3%) 200Mi (0%) 1Gi (3%) 19h
kube-system kube-proxy-fcs9k 0 (0%) 0 (0%) 0 (0%) 0 (0%) 91d
kube-system node-local-dns-drc5d 100m (0%) 0 (0%) 50Mi (0%) 0 (0%) 91d
monitoring-system promtail-lqdgw 250m (1%) 500m (3%) 256Mi (0%) 512Mi (1%) 29d
monitoring-system vmagent-vmagent-cbc99886-ztm8n 3100m (19%) 3100m (19%) 4121Mi (13%) 4121Mi (13%) 34d
monitoring-system x509-certificate-exporter-nodes-pjpmb 10m (0%) 100m (0%) 20Mi (0%) 40Mi (0%) 91d
rook-ceph csi-cephfsplugin-8zzz2 300m (1%) 600m (3%) 640Mi (2%) 1280Mi (4%) 91d
rook-ceph csi-rbdplugin-r7xfj 300m (1%) 600m (3%) 640Mi (2%) 1280Mi (4%) 91d
coroot-node-agent version 1.24.0
@Gakhramanzode please upgrade the agent to v1.25.1
It seems that you have encountered a metrics leak. This occurs when metrics are continuously created over time and are not reclaimed, which can be due to a bug. I'd suggest analyzing the metrics reported by this particular agent and trying to find the ones with the highest cardinality.
Am I correct in understanding the following approach?
If I perform a port forward to the coroot-node-agent pod that seems to be leaking memory:
kubectl port-forward pod/coroot-node-agent-bv4z8 10300:10300
And then run:
curl http://127.0.0.1:10300/metrics > metrics.dump.txt
Is analyzing the contents of metrics.dump.txt (e.g., looking for the series with the highest label-set cardinality) the right way to pinpoint a potential metrics leak, or is there a better method I should use?
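For what it's worth, a rough way to get a per-metric series count from such a dump, assuming the file is in the standard Prometheus exposition format (the file name follows the curl example above):
# Strip comment lines, cut each sample down to its metric name,
# then count how many series each metric has and show the top 20.
grep -v '^#' metrics.dump.txt \
  | sed 's/[{ ].*//' \
  | sort | uniq -c | sort -rn | head -20
Metrics at the top of that list with an unexpectedly large number of series (for example, label values that grow without bound) are the usual suspects for a metrics leak.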