helm-operator
Helm Charts are cached for a lifetime: disk full
Describe the bug
The Helm Operator can fill the node's ephemeral disk, because the /root/.cache directory is never garbage-collected.
The documentation states that Helm Charts:
Are cached for the lifetime duration of the Helm Operator pod.
We just hit a full disk; the /root/.cache directory had grown to 758 GB.
To Reproduce
Steps to reproduce the behavior: install the Helm Operator and let it run for a long time with tens of automated releases.
Our automation cycle is 60s, but the operator has been running for ~76 days, and it has set off disk-usage alarms on our ~50 GB /var/lib/docker partition. As you say, it uses /root/.cache and never cleans it up.
Just checked our sandbox cluster, and found:
[tmcneely@local admin-tools] (⎈ |sea1sbx:flux)$ k get po
NAME READY STATUS RESTARTS AGE
flux-7845ffcf7-hkc6d 1/1 Running 1 9d
helm-operator-779bfdcbb4-2s8lr 0/1 Evicted 0 9d
helm-operator-779bfdcbb4-7f9sq 1/1 Running 0 2d23h
helm-operator-779bfdcbb4-8h7jf 0/1 Evicted 0 22d
helm-operator-779bfdcbb4-cv9kq 0/1 Evicted 0 5d21h
helm-operator-779bfdcbb4-d8n5g 0/1 Evicted 0 9d
helm-operator-779bfdcbb4-dj54f 0/1 Evicted 0 9d
helm-operator-779bfdcbb4-kjkf2 0/1 Evicted 0 9d
helm-operator-779bfdcbb4-pjqq5 0/1 Evicted 0 9d
helm-operator-779bfdcbb4-r75gk 0/1 Evicted 0 15d
helm-operator-779bfdcbb4-wfl9g 0/1 Evicted 0 12d
helm-operator-779bfdcbb4-wk5wt 0/1 Evicted 0 18d
[tmcneely@local admin-tools] (⎈ |sea1sbx:flux)$ k describe pod helm-operator-779bfdcbb4-2s8lr | grep -B2 -A2 -i evicted
checksum/ssh: d6604e7496d03a9b215b2d84173b3a8df89fe3cc1570cc7d69443e3b5016583a
Status: Failed
Reason: Evicted
Message: Pod The node had condition: [DiskPressure].
IP:
... so yeah, I'd say it's been filling disks, then crashing out and moving somewhere else. :(
[tmcneely@local admin-tools] (⎈ |sea1sbx:flux)$ helm ls
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
flux flux 2 2020-06-17 17:18:16.860418 -0600 MDT deployed flux-1.3.0 1.19.0
helm-operator flux 4 2020-09-22 19:48:30.745345631 +0000 UTC deployed helm-operator-1.2.0 1.2.0
[tmcneely@local admin-tools] (⎈ |sea1sbx:flux)$ helm get values helm-operator
USER-SUPPLIED VALUES:
chartsSyncInterval: 2m
git:
  pollInterval: 2m
  ssh:
    known_hosts: |
      # bitbucket.company.com:22 SSH-2.0-SSHD-UNKNOWN
      # bitbucket.company.com:22 SSH-2.0-SSHD-UNKNOWN
      bitbucket.company.com ssh-rsa AAAA(CENSORED)67IHZ
      # bitbucket.company.com:22 SSH-2.0-SSHD-UNKNOWN
    secretName: flux-ssh
helm:
  versions: v3
logReleaseDiffs: true
... so, I stand corrected, our sync interval is 2 mins :)
~tommy
Same here, running helm-operator v1.2.0: /root/.cache/helm/repository is never cleaned up and fills the whole /var partition.
As a workaround, we set a sizeLimit on the emptyDir within the deployment:
- emptyDir:
    sizeLimit: 1Gi
  name: repositories-cache
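For context, the size-limited volume only helps if it is also mounted over the Helm cache path. A minimal sketch of how the pair might look in the Deployment spec, assuming the default cache location /root/.cache (adjust names and paths to match your actual manifest):

spec:
  template:
    spec:
      containers:
        - name: flux-helm-operator
          volumeMounts:
            - name: repositories-cache   # mounted over the Helm cache directory
              mountPath: /root/.cache
      volumes:
        - name: repositories-cache
          emptyDir:
            sizeLimit: 1Gi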
We are encountering full disks every month.
What is the process for getting a fix for this moving? At a minimum the size limit, maybe some sort of cache-cleanup sidecar, or a command-line option that lets it actually use the cache instead of downloading the entire (sometimes 10 MB) index.yaml every time?
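To illustrate the sidecar idea (purely a sketch, not an existing chart option; the image, retention window, and interval are arbitrary), a container sharing the cache emptyDir could periodically prune old repository files:

- name: helm-cache-cleaner
  image: busybox:1.36
  command:
    - sh
    - -c
    - |
      # Prune cached repository files older than a day, then sleep and repeat.
      while true; do
        find /cache/helm/repository -type f -mtime +1 -delete 2>/dev/null
        sleep 3600
      done
  volumeMounts:
    - name: helm-cache   # whichever emptyDir the operator mounts at /root/.cache
      mountPath: /cache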
To clarify what @HaveFun83 said... I added the following to the values: section of the HelmRelease (if you aren't managing the helm-operator with a HelmRelease, you may have to unindent by 4 spaces):
extraVolumes:
  - name: helm-cache
    emptyDir:
      sizeLimit: 2G
extraVolumeMounts:
  - name: helm-cache
    mountPath: /root/.cache
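For reference, this is roughly where those values sit when the operator itself is managed through a HelmRelease (the metadata and chart version here are illustrative placeholders):

apiVersion: helm.fluxcd.io/v1
kind: HelmRelease
metadata:
  name: helm-operator
  namespace: flux
spec:
  chart:
    repository: https://charts.fluxcd.io
    name: helm-operator
    version: 1.2.0
  values:
    extraVolumes:
      - name: helm-cache
        emptyDir:
          sizeLimit: 2G
    extraVolumeMounts:
      - name: helm-cache
        mountPath: /root/.cache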
The limit does not appear in df -h, but it is enforced...
Normal Started 23m kubelet Started container flux-helm-operator
Warning Evicted 45s kubelet Usage of EmptyDir volume "helm-cache" exceeds the limit "2G".
Normal Killing 45s kubelet Stopping container flux-helm-operator
Warning ExceededGracePeriod 35s kubelet Container runtime did not kill the pod within specified grace period.
NOTE: I artificially caused the eviction using dd.
With 1.2.0 I got a full disk from running out of inodes...
$ sudo crictl stats
CONTAINER CPU % MEM DISK INODES
(...)
47e210eb89baf 0.18 351.1MB 41.18GB 10604596
(...)
That's over 10 million of the roughly 11 million available inodes...
Looking at the files on disk, this seems to be due to lots of files in /var/lib/rancher/k3s/agent/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/532/fs/tmp/flux-working...*/*
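One blunt guard against both the growing cache and stale working directories is a per-container ephemeral-storage limit, so the kubelet evicts the pod before the node itself hits DiskPressure (note this limits bytes, not inodes). A sketch, assuming the chart exposes a standard resources: value; the numbers are arbitrary:

resources:
  requests:
    ephemeral-storage: 512Mi
  limits:
    ephemeral-storage: 2Gi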
That's fascinating! I'd like to fix this, but my familiarity with the Helm Operator codebase is limited.
If anyone has greater understanding of this issue and needs someone with write access to help, I'm willing to review a PR.
I think the only way this could happen is if a git repository source is used that is large enough to consume a lot of inodes while also being large enough to occasionally time out. In that case it seems possible that stale, failed clones get left around in a tmp directory, eating up ephemeral storage in an emptyDir volume.
You may also be able to work around this by adjusting some timeout values. I have also seen reports from other users who periodically recycle their helm-operator pods so they don't suffer too long from accumulating problems like this one.
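If you want to automate that recycling, a scheduled rollout restart is usually enough. A rough sketch (the schedule, image, and service account are placeholders, and the service account needs RBAC permission to patch the deployment, which is not shown here):

apiVersion: batch/v1          # use batch/v1beta1 on clusters older than 1.21
kind: CronJob
metadata:
  name: helm-operator-recycle
  namespace: flux
spec:
  schedule: "0 4 * * 0"       # once a week, early Sunday
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: helm-operator-recycler
          restartPolicy: OnFailure
          containers:
            - name: kubectl
              image: bitnami/kubectl:1.21
              command:
                - kubectl
                - --namespace=flux
                - rollout
                - restart
                - deployment/helm-operator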
Generally, we are focused on fixing the scaling issues in the Helm Operator through its successor, the Helm Controller, which offloads git and Helm repository source management to the Source Controller, part of the "GitOps Toolkit" aka Flux v2. If you can upgrade, you will hopefully not see this issue. (It is a total rewrite, so it would be very surprising if the problem remains present.)
If you still run into issues on the new version, there are more developers dedicated to maintaining Flux v2 now than the Helm Operator or Flux v1, so it will be easier to get attention on any problems that you have.
Please feel free to reach out to me on CNCF Slack if you have questions. Flux v1 and the Helm Operator are supported in maintenance mode, and the end-of-support horizon is still at least six months away, so your issues can definitely still be addressed by community support. (There are also paid support options available if your needs dictate greater urgency.)
Sorry if your issue remains unresolved. The Helm Operator is in maintenance mode, and we recommend that everybody upgrade to Flux v2 and the Helm Controller.
A new release of Helm Operator is out this week, 1.4.4.
We will continue to support Helm Operator in maintenance mode for an indefinite period of time, and eventually archive this repository.
Please be aware that Flux v2 has a vibrant and active developer community, which is working through minor releases and delivering new features on the way to General Availability.
In the meantime, this repo will still be monitored, but support is basically limited to migration issues only. I will have to close many issues today without reading them all in detail because of time constraints. If your issue is very important, you are welcome to reopen it, but given how stale all of these issues are at this point, a fresh report is more likely to be in order. If you have unresolved problems that prevent your migration, please open a new issue in the appropriate Flux v2 repo.
Helm Operator releases will continue as possible for a limited time, as a courtesy for those who cannot migrate yet, but they are strongly discouraged for ongoing production use: our strict adherence to semver backward-compatibility guarantees pins many dependencies, and we can only upgrade them so far without breaking compatibility, so there are likely known CVEs that cannot be resolved.
We recommend upgrading to Flux v2, which is actively maintained, as soon as possible.
I am going to go ahead and close every issue at once today. Thanks for participating in the Helm Operator and Flux! 💚 💙