helm-operator
Helm Charts are cached for a lifetime: disk full
Describe the bug
The Helm Operator can fill the node's ephemeral disk, because the /root/.cache directory is never garbage-collected.
The documentation states that Helm Charts:
Are cached for the lifetime duration of the Helm Operator pod.
We just hit a full disk; the /root/.cache directory had grown to 758 GB.
To Reproduce
Steps to reproduce the behavior: install the Helm Operator and let it run for a long time with tens of automated releases.
Our automation cycle is 60s, but the operator has been running for ~76 days, and it has set off disk-usage alarms on our ~50 GB /var/lib/docker partition. As you say, it uses /root/.cache and never cleans it up.
Just checked our sandbox cluster, and found:
[tmcneely@local admin-tools] (⎈ |sea1sbx:flux)$ k get po
NAME READY STATUS RESTARTS AGE
flux-7845ffcf7-hkc6d 1/1 Running 1 9d
helm-operator-779bfdcbb4-2s8lr 0/1 Evicted 0 9d
helm-operator-779bfdcbb4-7f9sq 1/1 Running 0 2d23h
helm-operator-779bfdcbb4-8h7jf 0/1 Evicted 0 22d
helm-operator-779bfdcbb4-cv9kq 0/1 Evicted 0 5d21h
helm-operator-779bfdcbb4-d8n5g 0/1 Evicted 0 9d
helm-operator-779bfdcbb4-dj54f 0/1 Evicted 0 9d
helm-operator-779bfdcbb4-kjkf2 0/1 Evicted 0 9d
helm-operator-779bfdcbb4-pjqq5 0/1 Evicted 0 9d
helm-operator-779bfdcbb4-r75gk 0/1 Evicted 0 15d
helm-operator-779bfdcbb4-wfl9g 0/1 Evicted 0 12d
helm-operator-779bfdcbb4-wk5wt 0/1 Evicted 0 18d
[tmcneely@local admin-tools] (⎈ |sea1sbx:flux)$ k describe pod helm-operator-779bfdcbb4-2s8lr | grep -B2 -A2 -i evicted
checksum/ssh: d6604e7496d03a9b215b2d84173b3a8df89fe3cc1570cc7d69443e3b5016583a
Status: Failed
Reason: Evicted
Message: Pod The node had condition: [DiskPressure].
IP:
... so yeah, I'd say it's been filling disks, then crashing out and moving somewhere else. :(
[tmcneely@local admin-tools] (⎈ |sea1sbx:flux)$ helm ls
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
flux flux 2 2020-06-17 17:18:16.860418 -0600 MDT deployed flux-1.3.0 1.19.0
helm-operator flux 4 2020-09-22 19:48:30.745345631 +0000 UTC deployed helm-operator-1.2.0 1.2.0
[tmcneely@local admin-tools] (⎈ |sea1sbx:flux)$ helm get values helm-operator
USER-SUPPLIED VALUES:
chartsSyncInterval: 2m
git:
  pollInterval: 2m
  ssh:
    known_hosts: |
      # bitbucket.company.com:22 SSH-2.0-SSHD-UNKNOWN
      # bitbucket.company.com:22 SSH-2.0-SSHD-UNKNOWN
      bitbucket.company.com ssh-rsa AAAA(CENSORED)67IHZ
      # bitbucket.company.com:22 SSH-2.0-SSHD-UNKNOWN
    secretName: flux-ssh
helm:
  versions: v3
logReleaseDiffs: true
... so, I stand corrected, our sync interval is 2 mins :)
~tommy
Same here, running helm-operator v1.2.0: /root/.cache/helm/repository is never cleaned up and fills the whole /var partition.
As a workaround, we set a sizeLimit on the emptyDir within the deployment:
- emptyDir:
    sizeLimit: 1Gi
  name: repositories-cache
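For context, the size-limited volume only helps if it is also mounted over the Helm cache path. A minimal sketch of how the pair might look in the Deployment spec, assuming the default cache location /root/.cache (adjust names and paths to match your actual manifest):

spec:
  template:
    spec:
      containers:
        - name: flux-helm-operator
          volumeMounts:
            - name: repositories-cache   # mounted over the Helm cache directory
              mountPath: /root/.cache
      volumes:
        - name: repositories-cache
          emptyDir:
            sizeLimit: 1Gi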
We are encountering full disks every month.
What is the process for getting a fix for this moving? At a minimum the size limit, maybe some sort of cache-cleanup sidecar, or a command-line option that lets it actually use the cache instead of downloading the entire (sometimes 10 MB) index.yaml every time?
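To illustrate the sidecar idea (purely a sketch, not an existing chart option; the image, retention window, and interval are arbitrary), a container sharing the cache emptyDir could periodically prune old repository files:

- name: helm-cache-cleaner
  image: busybox:1.36
  command:
    - sh
    - -c
    - |
      # Prune cached repository files older than a day, then sleep and repeat.
      while true; do
        find /cache/helm/repository -type f -mtime +1 -delete 2>/dev/null
        sleep 3600
      done
  volumeMounts:
    - name: helm-cache   # whichever emptyDir the operator mounts at /root/.cache
      mountPath: /cache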
To clarify what @HaveFun83 said... I added the following to the values: section of the HelmRelease (if you aren't managing the helm-operator with a HelmRelease, you may have to unindent by 4 spaces):
extraVolumes:
  - name: helm-cache
    emptyDir:
      sizeLimit: 2G
extraVolumeMounts:
  - name: helm-cache
    mountPath: /root/.cache
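For reference, this is roughly where those values sit when the operator itself is managed through a HelmRelease (the metadata and chart version here are illustrative placeholders):

apiVersion: helm.fluxcd.io/v1
kind: HelmRelease
metadata:
  name: helm-operator
  namespace: flux
spec:
  chart:
    repository: https://charts.fluxcd.io
    name: helm-operator
    version: 1.2.0
  values:
    extraVolumes:
      - name: helm-cache
        emptyDir:
          sizeLimit: 2G
    extraVolumeMounts:
      - name: helm-cache
        mountPath: /root/.cache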
The limit does not appear in df -h, but it is enforced...
Normal Started 23m kubelet Started container flux-helm-operator
Warning Evicted 45s kubelet Usage of EmptyDir volume "helm-cache" exceeds the limit "2G".
Normal Killing 45s kubelet Stopping container flux-helm-operator
Warning ExceededGracePeriod 35s kubelet Container runtime did not kill the pod within specified grace period.
NOTE: I artificially caused the eviction using dd.
With 1.2.0 I got a full disk from running out of inodes...
$ sudo crictl stats
CONTAINER CPU % MEM DISK INODES
(...)
47e210eb89baf 0.18 351.1MB 41.18GB 10604596
(...)
That's over 10 million of the roughly 11 million available inodes...
Looking at the files on disk, this seems to be due to lots of files in /var/lib/rancher/k3s/agent/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/532/fs/tmp/flux-working...*/*
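One blunt guard against both the growing cache and stale working directories is a per-container ephemeral-storage limit, so the kubelet evicts the pod before the node itself hits DiskPressure (note this limits bytes, not inodes). A sketch, assuming the chart exposes a standard resources: value; the numbers are arbitrary:

resources:
  requests:
    ephemeral-storage: 512Mi
  limits:
    ephemeral-storage: 2Gi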
That's fascinating! I'd like to fix this, but my familiarity with the Helm Operator codebase is limited.
If anyone has greater understanding of this issue and needs someone with write access to help, I'm willing to review a PR.
I think the only way this could happen is if a git repository source is used that is large enough to consume a lot of inodes while also being large enough to occasionally time out. In that case it seems possible that stale, failed clones get left around in a tmp directory, eating up ephemeral storage in an emptyDir volume.
You may also be able to work around this by adjusting some timeout values. I have also seen reports from other users who periodically recycle their helm-operator pods so they don't suffer too long from accumulating problems like this one.
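If you want to automate that recycling, a scheduled rollout restart is usually enough. A rough sketch (the schedule, image, and service account are placeholders, and the service account needs RBAC permission to patch the deployment, which is not shown here):

apiVersion: batch/v1          # use batch/v1beta1 on clusters older than 1.21
kind: CronJob
metadata:
  name: helm-operator-recycle
  namespace: flux
spec:
  schedule: "0 4 * * 0"       # once a week, early Sunday
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: helm-operator-recycler
          restartPolicy: OnFailure
          containers:
            - name: kubectl
              image: bitnami/kubectl:1.21
              command:
                - kubectl
                - --namespace=flux
                - rollout
                - restart
                - deployment/helm-operator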
Generally, we are focused on fixing the scaling issues in the Helm Operator through its successor, the Helm Controller, which offloads git and Helm repository source management to the Source Controller, part of the "GitOps Toolkit" aka Flux v2. If you can upgrade, you will hopefully not see this issue. (It is a total rewrite, so it would be very surprising if the problem remains present.)
If you still run into issues on the new version, there are more developers dedicated to maintaining Flux v2 now than the Helm Operator or Flux v1, so it will be easier to get attention on any problems that you have.
Please feel free to reach out to me on CNCF Slack if you have questions. Flux v1 and the Helm Operator are supported in maintenance mode, and the end-of-support horizon is still at least six months away, so your issues can definitely still be addressed by community support. (There are also paid support options available if your needs dictate greater urgency.)
Sorry if your issue remains unresolved. The Helm Operator is in maintenance mode, and we recommend that everybody upgrade to Flux v2 and the Helm Controller.
A new release of Helm Operator is out this week, 1.4.4.
We will continue to support Helm Operator in maintenance mode for an indefinite period of time, and eventually archive this repository.
Please be aware that Flux v2 has a vibrant and active developer community, which is working through minor releases and delivering new features on the way to General Availability.
In the meantime, this repo will still be monitored, but support is basically limited to migration issues only. I will have to close many issues today without reading them all in detail because of time constraints. If your issue is very important, you are welcome to reopen it, but given how stale all of these issues are at this point, a fresh report is more likely to be in order. If you have unresolved problems that prevent your migration, please open a new issue in the appropriate Flux v2 repo.
Helm Operator releases will continue as possible for a limited time, as a courtesy for those who cannot migrate yet, but they are strongly discouraged for ongoing production use: our strict adherence to semver backward-compatibility guarantees pins many dependencies, and we can only upgrade them so far without breaking compatibility, so there are likely known CVEs that cannot be resolved.
We recommend upgrading to Flux v2, which is actively maintained, as soon as possible.
I am going to go ahead and close every issue at once today. Thanks for participating in the Helm Operator and Flux! 💚 💙