
CPU runaway / memory leak in yaml parser

geNAZt opened this issue on May 24, 2020 · 12 comments

Describe the bug

We started seeing random crashes caused by failing liveness probes in our helm-operator installations. After looking into a profile taken from a running instance, we saw that CPU and memory usage climb until the process itself is no longer responsive.

Another behaviour we saw was that Helm release objects get stuck in the pending-update state, which we have to clean up manually; I guess that's due to the stale "starting sync run".

To Reproduce

Steps to reproduce the behaviour:

  1. Install the Helm Operator with CPU and memory limits of 500m and 256Mi (see the values sketch below)
  2. Install some charts from the Helm stable repository
  3. Watch the Helm Operator get killed by its liveness probe, leaving Helm releases broken
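For step 1, a minimal values sketch might look like the following; the key layout is an assumption based on common chart conventions, not taken from the helm-operator chart's documentation:

# values.yaml (illustrative); the `resources` key layout is assumed
resources:
  limits:
    cpu: 500m      # CPU limit from step 1
    memory: 256Mi  # memory limit from step 1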

Expected behavior

The operator should not crash and should not corrupt Helm releases.

Logs

helm-operator logs:

W0524 08:49:49.261320       6 client_config.go:543] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
ts=2020-05-24T08:49:49.29648543Z caller=operator.go:82 component=operator info="setting up event handlers"
ts=2020-05-24T08:49:49.29653343Z caller=operator.go:98 component=operator info="event handlers set up"
ts=2020-05-24T08:49:49.296856635Z caller=main.go:287 component=helm-operator info="waiting for informer caches to sync"
ts=2020-05-24T08:49:49.397055732Z caller=main.go:292 component=helm-operator info="informer caches synced"
ts=2020-05-24T08:49:49.397139534Z caller=git.go:104 component=gitchartsync info="starting sync of git chart sources"
ts=2020-05-24T08:49:49.397141434Z caller=operator.go:110 component=operator info="starting operator"
ts=2020-05-24T08:49:49.397198135Z caller=operator.go:112 component=operator info="starting workers"
ts=2020-05-24T08:49:49.398192549Z caller=server.go:42 component=daemonhttp info="starting HTTP server on :3030"
ts=2020-05-24T08:49:49.398508054Z caller=release.go:75 component=release release=minio targetNamespace=gitlab resource=gitlab:helmrelease/minio helmVersion=v3 info="starting sync run"
ts=2020-05-24T08:49:49.398857159Z caller=release.go:75 component=release release=svcat targetNamespace=svcat resource=svcat:helmrelease/svcat helmVersion=v3 info="starting sync run"
ts=2020-05-24T08:49:50.189199369Z caller=checkpoint.go:24 component=checkpoint msg="up to date" latest=0.10.1
ts=2020-05-24T08:49:50.989017121Z caller=release.go:249 component=release release=svcat targetNamespace=svcat resource=svcat:helmrelease/svcat helmVersion=v3 info="running dry-run upgrade to compare with release version '5'" action=dry-run-compare
ts=2020-05-24T08:49:50.991875863Z caller=helm.go:69 component=helm version=v3 info="preparing upgrade for svcat" targetNamespace=svcat release=svcat
ts=2020-05-24T08:49:51.00369044Z caller=helm.go:69 component=helm version=v3 info="resetting values to the chart's original version" targetNamespace=svcat release=svcat
ts=2020-05-24T08:49:53.915386944Z caller=helm.go:69 component=helm version=v3 info="performing update for svcat" targetNamespace=svcat release=svcat
ts=2020-05-24T08:49:54.100451308Z caller=helm.go:69 component=helm version=v3 info="dry run for svcat" targetNamespace=svcat release=svcat
ts=2020-05-24T08:49:54.392106666Z caller=release.go:268 component=release release=svcat targetNamespace=svcat resource=svcat:helmrelease/svcat helmVersion=v3 info="no changes" phase=dry-run-compare
ts=2020-05-24T08:49:54.392299469Z caller=release.go:75 component=release release=elasticsearch targetNamespace=graylog resource=graylog:helmrelease/elasticsearch helmVersion=v3 info="starting sync run"
ts=2020-05-24T08:49:58.510576689Z caller=release.go:105 component=release release=elasticsearch targetNamespace=graylog resource=graylog:helmrelease/elasticsearch helmVersion=v3 error="failed to determine sync action for release: status 'pending-upgrade' of release does not allow a safe upgrade"
ts=2020-05-24T08:49:58.511182698Z caller=release.go:75 component=release release=gitlab-runner targetNamespace=gitlab resource=gitlab:helmrelease/gitlab-runner helmVersion=v3 info="starting sync run"
ts=2020-05-24T08:50:04.356173295Z caller=release.go:249 component=release release=minio targetNamespace=gitlab resource=gitlab:helmrelease/minio helmVersion=v3 info="running dry-run upgrade to compare with release version '1'" action=dry-run-compare
ts=2020-05-24T08:50:04.356167495Z caller=release.go:249 component=release release=gitlab-runner targetNamespace=gitlab resource=gitlab:helmrelease/gitlab-runner helmVersion=v3 info="running dry-run upgrade to compare with release version '1'" action=dry-run-compare
ts=2020-05-24T08:50:04.499243431Z caller=helm.go:69 component=helm version=v3 info="preparing upgrade for gitlab-runner" targetNamespace=gitlab release=gitlab-runner
ts=2020-05-24T08:50:04.49980734Z caller=helm.go:69 component=helm version=v3 info="preparing upgrade for minio" targetNamespace=gitlab release=minio
ts=2020-05-24T08:50:04.516823894Z caller=helm.go:69 component=helm version=v3 info="resetting values to the chart's original version" targetNamespace=gitlab release=gitlab-runner
ts=2020-05-24T08:50:04.550429796Z caller=helm.go:69 component=helm version=v3 info="resetting values to the chart's original version" targetNamespace=gitlab release=minio
ts=2020-05-24T08:50:05.900198651Z caller=helm.go:69 component=helm version=v3 info="performing update for minio" targetNamespace=gitlab release=minio
ts=2020-05-24T08:50:05.906840951Z caller=helm.go:69 component=helm version=v3 info="performing update for gitlab-runner" targetNamespace=gitlab release=gitlab-runner
ts=2020-05-24T08:50:05.909939397Z caller=helm.go:69 component=helm version=v3 info="dry run for minio" targetNamespace=gitlab release=minio
ts=2020-05-24T08:50:05.916438994Z caller=helm.go:69 component=helm version=v3 info="dry run for gitlab-runner" targetNamespace=gitlab release=gitlab-runner
ts=2020-05-24T08:50:06.29659257Z caller=release.go:268 component=release release=gitlab-runner targetNamespace=gitlab resource=gitlab:helmrelease/gitlab-runner helmVersion=v3 info="no changes" phase=dry-run-compare
ts=2020-05-24T08:50:06.303724677Z caller=release.go:75 component=release release=osba targetNamespace=osba resource=osba:helmrelease/osba helmVersion=v3 info="starting sync run"
ts=2020-05-24T08:50:06.503599862Z caller=release.go:268 component=release release=minio targetNamespace=gitlab resource=gitlab:helmrelease/minio helmVersion=v3 info="no changes" phase=dry-run-compare
ts=2020-05-24T08:50:06.503798764Z caller=release.go:75 component=release release=chartmuseum targetNamespace=chartmuseum resource=chartmuseum:helmrelease/chartmuseum helmVersion=v3 info="starting sync run"
ts=2020-05-24T08:50:13.194666158Z caller=release.go:249 component=release release=osba targetNamespace=osba resource=osba:helmrelease/osba helmVersion=v3 info="running dry-run upgrade to compare with release version '3'" action=dry-run-compare
ts=2020-05-24T08:50:14.195036991Z caller=helm.go:69 component=helm version=v3 info="preparing upgrade for osba" targetNamespace=osba release=osba
ts=2020-05-24T08:50:14.488930578Z caller=helm.go:69 component=helm version=v3 info="resetting values to the chart's original version" targetNamespace=osba release=osba
ts=2020-05-24T08:50:17.392153912Z caller=helm.go:69 component=helm version=v3 info="performing update for osba" targetNamespace=osba release=osba
ts=2020-05-24T08:50:17.691649982Z caller=helm.go:69 component=helm version=v3 info="dry run for osba" targetNamespace=osba release=osba
ts=2020-05-24T08:50:17.991267853Z caller=release.go:268 component=release release=osba targetNamespace=osba resource=osba:helmrelease/osba helmVersion=v3 info="no changes" phase=dry-run-compare
ts=2020-05-24T08:50:17.993421085Z caller=release.go:75 component=release release=graylog targetNamespace=graylog resource=graylog:helmrelease/graylog helmVersion=v3 info="starting sync run"

pprof top 10:

Showing top 10 nodes out of 371
      flat  flat%   sum%        cum   cum%
    1610ms 15.32% 15.32%     1610ms 15.32%  runtime.memclrNoHeapPointers
     850ms  8.09% 23.41%      850ms  8.09%  math/big.addMulVVW
     420ms  4.00% 27.40%      800ms  7.61%  runtime.scanobject
     290ms  2.76% 30.16%     1160ms 11.04%  math/big.nat.montgomery
     280ms  2.66% 32.83%      850ms  8.09%  gopkg.in/yaml%2ev2.yaml_parser_scan_plain_scalar
     270ms  2.57% 35.39%      270ms  2.57%  runtime.futex
     270ms  2.57% 37.96%      270ms  2.57%  syscall.Syscall
     250ms  2.38% 40.34%     2950ms 28.07%  runtime.mallocgc
     220ms  2.09% 42.44%      220ms  2.09%  runtime.memmove
     170ms  1.62% 44.05%      230ms  2.19%  encoding/json.checkValid

Additional context

  • Helm Operator version: 1.1.0 (1.0.1 showed the same issue; downgraded for testing)
  • Kubernetes version: 1.15.10
  • Git provider: custom GitLab instance
  • Helm repository provider: various (it seems to happen with https://kubernetes-charts.storage.googleapis.com [graylog, nginx-ingress, etc.] and https://charts.gitlab.io [gitlab, gitlab-runner])

Possibly related: after some searching I found these:

https://github.com/yaml/libyaml/issues/111
https://github.com/yaml/libyaml/issues/115

These lead me to believe there is a serious issue in the YAML parsing code that can bring the whole application down without any warning.

Current index.yaml from helm stable: index.yaml.zip

GitLab index.yaml: gitlab_index.yaml.zip

geNAZt · May 24 '20 09:05

Can you try running the Helm Operator with the following environment variable set: GODEBUG=madvdontneed=1? I am wondering if we are bitten by the new GC/memory-release behaviour of the Go runtime since 1.12.
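For reference, setting that would look roughly like the following on the operator's Deployment; this is a minimal sketch, and the container name is an assumption, not taken from the chart:

# Illustrative Deployment fragment; container name is assumed
spec:
  template:
    spec:
      containers:
        - name: flux-helm-operator
          env:
            - name: GODEBUG
              value: madvdontneed=1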

hiddeco · May 25 '20 15:05

The GODEBUG change did alter one thing (when going back to 256Mi): the operator gets OOMKilled instead of cycling forever trying to free up memory. But the final result is the same: it crashes or gets killed.

Setting it to a fixed 512 MB and 200m CPU runs fine for me. Are there any guidelines on what limits are acceptable for the operator? For my taste, 512 MB for a system which applies Helm releases is a bit much :/

// EDIT: After some further investigation, it seems that 200m CPU is the minimum needed. If I give it less CPU it crashes within about 30 seconds; giving it 500m allows me to reduce the memory. This leads me to believe that the GC doesn't get enough time to clean up.
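One possible way to check that GC hypothesis (just a sketch, not something tried in this thread): GODEBUG accepts a comma-separated list, so GC tracing can be enabled alongside the madvise setting; gctrace=1 makes the Go runtime print a summary line per collection, including pause times, to stderr.

# Illustrative env setting combining both GODEBUG options
env:
  - name: GODEBUG
    value: madvdontneed=1,gctrace=1  # gctrace=1 logs one summary line per GC to stderr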

// EDIT 2: To clear things up, here is a table of which configurations crash, and how:

With GODEBUG=madvdontneed=1:

              256MB memory            512MB memory
  200m CPU    crashes (OOMKilled)     crashes (liveness probe failures after 5-10 minutes)
  500m CPU    crashes (OOMKilled)     works

Without any env:

              256MB memory                                             512MB memory
  200m CPU    crashes (liveness probe failures after 5-10 minutes)     crashes (liveness probe failures after 5-10 minutes)
  500m CPU    crashes (liveness probe failures after 5-10 minutes)     works

When the Helm Operator runs without any limits, it seems to allocate ~300MB from the host (kubectl top pods reports that) and use around 100-120m CPU, which seems odd to me.

geNAZt · May 25 '20 17:05

@geNAZt, are you by any chance active on either the CNCF or Weaveworks Slack community? I would like to provide some insights (and have a chat about it). I am @hidde in both communities.

hiddeco · May 25 '20 18:05

No; if you have a link, I would be glad to join.

geNAZt · May 25 '20 18:05

https://slack.cncf.io for an invite; you can find me in #flux (and #flux-dev).

hiddeco · May 25 '20 18:05

I am also experiencing this problem, with and without the GODEBUG=madvdontneed=1 flag: periodic OOMKills with version 1.1.0 and the resources below. Has there been any progress on this issue?

Limits:
  cpu:     1500m
  memory:  2Gi
Requests:
  cpu:      1
  memory:   1512M

ewarwick · Jul 17 '20 14:07

When the Helm Operator runs without any limits, it seems to allocate ~300MB from the host (kubectl top pods reports that) and use around 100-120m CPU, which seems odd to me.

We experienced the same situation. The issue appeared a few days ago, and when we removed the limits configuration for helm-operator, the periodic OOMKills stopped.

kuwabarahiroshi · Oct 15 '20 04:10

Having the same issue, without the GODEBUG flag. These are my resource settings:

requests:
  cpu: 50m
  memory: 256Mi
limits:
  cpu: 1
  memory: 1Gi

I have raised the limit to 2G memory, but even that gets OOMKilled. Running 1.2.0 at the moment.

jonkerj · Oct 19 '20 11:10

We are seeing this as well. I just noticed that over the last week one of our Helm Operator instances has grown to use 3 GB of RAM using the default requests/limits (see image below), although it continues to function fine. I've set CPU requests to 500m and memory requests and limits to 512Mi for this deployment to see what difference it makes.

[screenshot: memory usage graph, Oct 26 2020]

Rathgore · Oct 26 '20 21:10

In another environment it rises well above 3 GB, then fails the health check and restarts, as described in the first comment.

[screenshot: memory usage graph from the other environment, Oct 26 2020]

Rathgore · Oct 26 '20 22:10

As others have implied, the best workaround seems to be setting memory limits such that memory growth is bounded at a low enough level that a stop-the-world GC can complete before triggering a liveness probe failure and a restart of the pod. The memory growth needs to be low enough, and the container needs enough CPU allocated, for the collection to complete in time. Attempting to work around the issue by setting a really high memory limit will probably just make the problem worse, since a collection of that size has even less chance of finishing in time.

Here's what I've done:

  • Set memory requests and limits to 512Mi
  • Set CPU requests to 500m with 1000m limit to provide enough CPU for the collections to complete
  • Increased the liveness probe failure threshold to 10, so that a few failures while the process is blocked don't cause a restart (the combined settings are sketched below).
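Put together, a sketch of those settings as a container spec; the container name, probe path, and port are assumptions (based on the :3030 HTTP server in the logs), not values taken from the chart:

# Illustrative container spec for the workaround above; name, probe path and port are assumed
containers:
  - name: flux-helm-operator
    resources:
      requests:
        cpu: 500m
        memory: 512Mi
      limits:
        cpu: 1000m
        memory: 512Mi
    livenessProbe:
      httpGet:
        path: /healthz
        port: 3030
      failureThreshold: 10   # tolerate a few failed probes while a long GC blocks the process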

This combination seems to be working so far. Memory usage grows to about 200Mi over the course of 45 minutes or so, then a collection happens. A few container unhealthy alerts are fired because of the blocked process, but because of the failure threshold increase it's not enough to trigger a restart. I could probably eliminate the unhealthy events by tuning some of the other probe thresholds.

I will report back after running this way for a few weeks to see if this workaround is still solid.

Rathgore · Oct 27 '20 02:10

After additional testing, we've figured out the main trigger for the memory leak, at least in our case: some inexperienced chart developers were pushing broken charts that stayed in a failing deploy loop for multiple days. After getting rid of the failing deployments, memory usage has been flat for over 24 hours. It only took 4 failing charts to balloon memory beyond 3 GB over the course of a week or so.

With the seemingly imminent (and exciting) arrival of GOTK/Flux 2 I'm going to guess this won't be fixed in the classic Helm Operator any time soon, but we have a workable solution for now until we can move to next-gen Flux.

Rathgore · Oct 29 '20 00:10

Sorry if your issue remains unresolved. The Helm Operator is in maintenance mode; we recommend that everybody upgrade to Flux v2 and the Helm Controller.

A new release of Helm Operator is out this week, 1.4.4.

We will continue to support Helm Operator in maintenance mode for an indefinite period of time, and eventually archive this repository.

Please be aware that Flux v2 has a vibrant and active developer community working through minor releases and delivering new features on the way to General Availability.

In the meantime, this repo will still be monitored, but support is basically limited to migration issues. I will have to close many issues today without reading them all in detail because of time constraints. If your issue is very important, you are welcome to reopen it, but given how stale all issues are at this point, a new report is more likely to be in order. If you have unresolved problems that prevent your migration, please open a new issue in the appropriate Flux v2 repo.

Helm Operator releases will continue where possible for a limited time, as a courtesy for those who cannot migrate yet, but they are strongly discouraged for ongoing production use: our strict adherence to semver backward-compatibility guarantees pins many dependencies, and we can only upgrade them so far without breaking compatibility. So there are likely known CVEs that cannot be resolved.

We recommend upgrading as soon as possible to Flux v2, which is actively maintained.

I am going to go ahead and close every issue at once today. Thanks for participating in Helm Operator and Flux! 💚 💙

kingdonb · Sep 02 '22 19:09