kustomize-controller gets OOMKilled every hour
Background:
The kustomize-controller pod is getting OOMKilled roughly every hour. It reaches around ~7.65G of memory and gets OOMKilled, as the memory limit is 8G.
- Image: ghcr.artifactory.gcp.anz/fluxcd/kustomize-controller:v1.2.2
- There are 184 Kustomizations in total
- Concurrency is set to 20
These are the flags enabled:
containers:
  - args:
      - --events-addr=http://notification-controller.flux-system.svc.cluster.local./
      - --watch-all-namespaces=true
      - --log-level=info
      - --log-encoding=json
      - --enable-leader-election
      - --concurrent=20
      - --kube-api-qps=500
      - --kube-api-burst=1000
      - --requeue-dependency=15s
      - --no-remote-bases=true
      - --feature-gates=DisableStatusPollerCache=true
Requests & Limits:
resources:
  limits:
    memory: 8Gi
  requests:
    cpu: "1"
    memory: 8Gi
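For anyone wanting to reproduce this setup: flags like the ones above are typically applied through a patch in the flux-system kustomization.yaml. A minimal sketch, assuming the standard bootstrap layout (file names are the usual defaults, and only two of the flags are shown as examples):

# Sketch: adding kustomize-controller flags via a JSON6902 patch in
# flux-system/kustomization.yaml (layout and file names assumed, not from this thread).
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - gotk-components.yaml
  - gotk-sync.yaml
patches:
  - patch: |
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --concurrent=20
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --feature-gates=DisableStatusPollerCache=true
    target:
      kind: Deployment
      name: kustomize-controller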
What's been tried so far:
- Added the flag --feature-gates=DisableStatusPollerCache=true to the kustomize-controller deployment, as mentioned in this issue. This didn't make a difference; the pod still gets OOMKilled within an hour.
- Reduced the concurrency to 5. At this setting the pod seems stable and memory consumption is around ~2.5G.
- Did a heap dump; the inuse_space is around ~22.64MB, which is very low. Couldn't find anything useful there, but here's the link to the flamegraph. Also, here's the heap dump: heap.out.zip (a way to re-inspect it with go tool pprof is sketched after the du output below).
- Checked if we have a large repository that's loading unnecessary files, as mentioned in this issue.
This is from the source-controller:
~ $ du -sh /data/*
6.1M    /data/gitrepository
824.0K  /data/helmchart
5.8M    /data/helmrepository
16.0K   /data/lost+found
48.0K   /data/ocirepository
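For reference, the heap dump mentioned above can be re-inspected with the standard Go pprof tooling; a minimal sketch, assuming the file is the heap.out from the attachment:

# Show the top in-use heap allocations from the captured dump (heap.out from the
# attachment above). Note that the Go heap profile does not cover memory the kernel
# charges to the container outside the heap, e.g. a tmpfs-backed /tmp.
go tool pprof -inuse_space -top heap.out

# Or browse the same profile interactively, including a flamegraph view.
go tool pprof -http=:8080 heap.out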
We want to understand what is causing the memory spikes and OOM kills.
Are you using a RAM disk for the /tmp volume, as shown here: https://fluxcd.io/flux/installation/configuration/vertical-scaling/#enable-in-memory-kustomize-builds?
Can you look at /tmp in the kustomize-controller pod and see how large it is?
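For reference, the in-memory build setup from that guide amounts to backing the controller's /tmp volume with a tmpfs emptyDir, roughly like this (a sketch; the volume name temp and the patch target are assumed from the stock gotk-components manifests):

# Sketch of the in-memory /tmp setup from the vertical-scaling guide,
# applied from flux-system/kustomization.yaml.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - gotk-components.yaml
  - gotk-sync.yaml
patches:
  - patch: |
      - op: replace
        path: /spec/template/spec/volumes
        value:
          - name: temp
            emptyDir:
              medium: Memory
    target:
      kind: Deployment
      name: kustomize-controller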
> Are you using a RAM disk?
We used in-memory kustomize builds, but that was a problem: it kept exceeding the memory limits of the nodes. We also tried ephemeral SSDs, but they got corrupted when the kustomize-controller restarted. So currently /tmp is backed by a disk.
The size of /tmp is 12.7G:
$ du -sh tmp
12.7G tmp
OK, so it looks like all these problems are due to FS operations. The /tmp directory should be empty almost all the time. Is there anything inside the repo that could cause this, such as recursive symlinks? Looking at the memory profile, the issue seems related to Go untar and file read operations, all of which come from the Go stdlib.
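To check both of these, something like the following could be used (a sketch; the flux-system namespace and the presence of a shell inside the image are assumptions based on the prompts shown above):

# List what the controller has left behind in /tmp; each entry should normally be a
# short-lived build directory that is cleaned up after reconciliation.
kubectl exec -n flux-system deploy/kustomize-controller -- sh -c 'ls -la /tmp && du -sh /tmp/*'

# From a local clone of the Git repository, list all symlinks that a kustomize
# build would follow; recursive or self-referencing links would be suspect here.
find . -type l -exec ls -l {} +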