
kustomize-controller gets OOMKilled every hour

Open bharathvrajan opened this issue 1 year ago • 3 comments

Background:

The kustomize-controller pod is getting OOMKilled every hour or so. Memory usage climbs to around 7.65Gi and the pod is then killed against the 8Gi limit.

  • Image - ghcr.artifactory.gcp.anz/fluxcd/kustomize-controller:v1.2.2
  • There are 184 kustomizations in total
  • Concurrency is set to 20.

These are the flags enabled:

      containers:
      - args:
        - --events-addr=http://notification-controller.flux-system.svc.cluster.local./
        - --watch-all-namespaces=true
        - --log-level=info
        - --log-encoding=json
        - --enable-leader-election
        - --concurrent=20
        - --kube-api-qps=500
        - --kube-api-burst=1000
        - --requeue-dependency=15s
        - --no-remote-bases=true
        - --feature-gates=DisableStatusPollerCache=true

Requests & Limits:

        resources:
          limits:
            memory: 8Gi
          requests:
            cpu: "1"
            memory: 8Gi

What's been tried so far:

  1. Added the flag --feature-gates=DisableStatusPollerCache=true to the kustomize-controller deployment, as mentioned in this issue - but it made no difference; the pod still gets OOMKilled within an hour.

  2. Reduced the concurrency to 5 - with that setting the pod seems stable and memory consumption stays around 2.5G.

  3. Took a heap dump; inuse_space is only around 22.64MB, which is very low. Couldn't find anything useful there, but here's the link to the flamegraph. Also, here's the heap dump - heap.out.zip (see the pprof sketch below).

  4. Checked whether we have a large repository that's loading unnecessary files, as mentioned in this issue

    This is from the source-controller:

    ~ $ du -sh /data/*
    6.1M       /data/gitrepository
    824.0K     /data/helmchart
    5.8M       /data/helmrepository
    16.0K      /data/lost+found
    48.0K      /data/ocirepository
    

I'd like to understand what is causing the memory spikes and the OOM kills.
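
In case it helps anyone digging into the attached dump, here's a rough sketch of how it can be inspected with Go's pprof tooling (assuming a local Go toolchain and that heap.out is the file from the zip above):

    # Top consumers by live (in-use) memory at the time of the dump
    go tool pprof -top -inuse_space heap.out

    # Top consumers by cumulative allocations, which can show allocation
    # churn that the in-use view hides
    go tool pprof -top -alloc_space heap.out

    # Interactive web UI with graph and flame graph views
    go tool pprof -http=:8081 heap.out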

bharathvrajan avatar Mar 15 '24 01:03 bharathvrajan

Are you using a RAM disk for the /tmp volume as shown here https://fluxcd.io/flux/installation/configuration/vertical-scaling/#enable-in-memory-kustomize-builds?

Can you look at /tmp in the kustomize-controller pod and see how large it is?
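
For example, something along these lines should report it (assuming the controller runs in the flux-system namespace and the container is named manager):

    # Total size of /tmp inside the kustomize-controller pod
    kubectl exec -n flux-system deploy/kustomize-controller -c manager -- du -sh /tmp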

stefanprodan avatar Mar 15 '24 06:03 stefanprodan

Are you using a RAM disk

We used in-memory kustomize builds, but that turned out to be a problem: it kept exceeding the memory limits of the nodes. We also tried ephemeral SSDs, but they got corrupted when the kustomize-controller restarted. So currently /tmp is backed by a disk.

The size of /tmp is 12.7G:

$ du -sh tmp
12.7G	tmp

bharathvrajan avatar Mar 17 '24 23:03 bharathvrajan

OK, so it looks like all these problems are due to FS operations. /tmp should be empty almost all the time. Is there anything inside the repo that could cause this, such as recursive symlinks? Looking at the memory profile, the issue seems related to Go untar and file read operations, which are all from the Go stdlib.
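
A quick way to check is to scan a local clone of the repository for symlinks and to look at what is actually piling up in the controller's /tmp (rough sketch; adjust namespace, container name and paths as needed):

    # List all symlinks in the repository working tree; recursive or
    # broken links can trip up kustomize's file walking
    find . -type l

    # See what is accumulating in the controller's /tmp; under normal
    # operation the per-reconcile build dirs are cleaned up right away
    kubectl exec -n flux-system deploy/kustomize-controller -c manager -- ls -lt /tmp | head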

stefanprodan avatar Mar 18 '24 08:03 stefanprodan