
source-controller OOM events

Open robparrott opened this issue 3 years ago • 17 comments

Describe the bug

When registering FluxCD against a repository in GitLab Enterprise, I am seeing OOM activity on the source-controller pod. Removing the 1Gi memory limit fixes the issue.

To Reproduce

Register FluxCD on a repository with some level of complexity, I believe.

Expected behavior

The source-controller pod should not be killed and restarted repeatedly.

Additional context

  • Kubernetes version: 1.19
  • Git provider: gitlab self-hosted
  • Container registry provider: gitlab/ECR

Below please provide the output of the following commands:

flux --version : flux version 0.8.0
flux check
► checking prerequisites
✔ kubectl 1.19.3 >=1.18.0
✔ Kubernetes 1.19.6-eks-49a6c0 >=1.16.0
► checking controllers

✔ source-controller: healthy
► ghcr.io/fluxcd/source-controller:v0.8.1
✔ kustomize-controller: healthy
► ghcr.io/fluxcd/kustomize-controller:v0.8.1
✔ helm-controller: healthy
► ghcr.io/fluxcd/helm-controller:v0.7.0
✔ notification-controller: healthy
► ghcr.io/fluxcd/notification-controller:v0.8.0
✔ all checks passed
kubectl -n <namespace> get all
kubectl -n flux-system get all
NAME                                           READY   STATUS             RESTARTS   AGE
pod/helm-controller-6946b6dc7f-5nr8q           1/1     Running            0          9m34s
pod/kustomize-controller-55dfcdfd58-xj25c      1/1     Running            0          10h
pod/notification-controller-649754966b-2677x   1/1     Running            0          10h
pod/source-controller-597cc769b-lp6w4          0/1     CrashLoopBackOff   5          6m23s

NAME                              TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)   AGE
service/notification-controller   ClusterIP   10.100.114.245   <none>        80/TCP    10h
service/source-controller         ClusterIP   10.100.185.20    <none>        80/TCP    10h
service/webhook-receiver          ClusterIP   10.100.198.200   <none>        80/TCP    10h

NAME                                      READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/helm-controller           1/1     1            1           10h
deployment.apps/kustomize-controller      1/1     1            1           10h
deployment.apps/notification-controller   1/1     1            1           10h
deployment.apps/source-controller         0/1     1            0           10h

NAME                                                 DESIRED   CURRENT   READY   AGE
replicaset.apps/helm-controller-6779d46d69           0         0         0       10h
replicaset.apps/helm-controller-6946b6dc7f           1         1         1       9m34s
replicaset.apps/kustomize-controller-55dfcdfd58      1         1         1       10h
replicaset.apps/notification-controller-649754966b   1         1         1       10h
replicaset.apps/source-controller-555d4f9d6          0         0         0       10h
replicaset.apps/source-controller-597cc769b          1         1         0       10h




kubectl -n <namespace> logs deploy/source-controller

-- various without errors until killed ---

kubectl -n <namespace> logs deploy/kustomize-controller

-- various ---

{"level":"info","ts":"2021-02-24T00:06:40.724Z","logger":"controller.kustomization","msg":"Source is not ready, artifact not found","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"istio-system","namespace":"flux-system"}
{"level":"info","ts":"2021-02-24T00:06:41.811Z","logger":"controller.kustomization","msg":"Source is not ready, artifact not found","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"bookinfo","namespace":"flux-system"}
{"level":"info","ts":"2021-02-24T00:06:41.815Z","logger":"controller.kustomization","msg":"Source is not ready, artifact not found","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"calico","namespace":"flux-system"}
{"level":"error","ts":"2021-02-24T00:06:41.825Z","logger":"controller.kustomization","msg":"Reconciliation failed after 1.059192016s, next try in 5m0s","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"podinfo","namespace":"flux-system","revision":"master/e43ebfa5bf4b87c46f2e1db495eb571cd398e2f7","error":"failed to download artifact from http://source-controller.flux-system.svc.cluster.local./gitrepository/flux-system/podinfo/e43ebfa5bf4b87c46f2e1db495eb571cd398e2f7.tar.gz, error: Get \"http://source-controller.flux-system.svc.cluster.local./gitrepository/flux-system/podinfo/e43ebfa5bf4b87c46f2e1db495eb571cd398e2f7.tar.gz\": dial tcp 10.100.185.20:80: connect: connection refused"}
{"level":"info","ts":"2021-02-24T00:06:41.843Z","logger":"controller.kustomization","msg":"Source is not ready, artifact not found","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"kafka","namespace":"flux-system"}
{"level":"info","ts":"2021-02-24T00:07:41.833Z","logger":"controller.kustomization","msg":"Source is not ready, artifact not found","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"bookinfo","namespace":"flux-system"}
{"level":"info","ts":"2021-02-24T00:07:41.834Z","logger":"controller.kustomization","msg":"Source is not ready, artifact not found","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"calico","namespace":"flux-system"}
{"level":"info","ts":"2021-02-24T00:07:41.853Z","logger":"controller.kustomization","msg":"Source is not ready, artifact not found","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"kafka","namespace":"flux-system"}
{"level":"info","ts":"2021-02-24T00:08:41.853Z","logger":"controller.kustomization","msg":"Source is not ready, artifact not found","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"calico","namespace":"flux-system"}
{"level":"info","ts":"2021-02-24T00:08:41.855Z","logger":"controller.kustomization","msg":"Source is not ready, artifact not found","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"bookinfo","namespace":"flux-system"}
{"level":"info","ts":"2021-02-24T00:08:41.863Z","logger":"controller.kustomization","msg":"Source is not ready, artifact not found","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"kafka","namespace":"flux-system"}
{"level":"info","ts":"2021-02-24T00:09:41.872Z","logger":"controller.kustomization","msg":"Source is not ready, artifact not found","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"calico","namespace":"flux-system"}
{"level":"info","ts":"2021-02-24T00:09:41.874Z","logger":"controller.kustomization","msg":"Source is not ready, artifact not found","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"bookinfo","namespace":"flux-system"}
{"level":"info","ts":"2021-02-24T00:09:41.875Z","logger":"controller.kustomization","msg":"Source is not ready, artifact not found","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"kafka","namespace":"flux-system"}
{"level":"info","ts":"2021-02-24T00:10:41.893Z","logger":"controller.kustomization","msg":"Source is not ready, artifact not found","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"calico","namespace":"flux-system"}
{"level":"info","ts":"2021-02-24T00:10:41.895Z","logger":"controller.kustomization","msg":"Source is not ready, artifact not found","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"kafka","namespace":"flux-system"}
{"level":"info","ts":"2021-02-24T00:10:41.895Z","logger":"controller.kustomization","msg":"Source is not ready, artifact not found","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"bookinfo","namespace":"flux-system"}


robparrott avatar Feb 24 '21 00:02 robparrott

Changing the source-controller deployment resources stanza as follows:

        resources:
          limits:
            cpu: 1000m
            #memory: 1Gi
          requests:
            cpu: 50m
            #memory: 64Mi

addresses the issue.
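Rather than removing the limit entirely, the same effect can be achieved by raising it through the bootstrap kustomization, so the change survives upgrades. A sketch, assuming the standard flux-system layout generated by flux bootstrap (the 2Gi value is an example, not a recommendation; tune it to your repositories):

```yaml
# flux-system/kustomization.yaml (sketch; resource file names assume
# the default flux bootstrap layout)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - gotk-components.yaml
  - gotk-sync.yaml
patches:
  - patch: |
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: source-controller
        namespace: flux-system
      spec:
        template:
          spec:
            containers:
              - name: manager
                resources:
                  limits:
                    memory: 2Gi  # example value, raised from the default 1Gi
    target:
      kind: Deployment
      name: source-controller
```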

robparrott avatar Feb 24 '21 00:02 robparrott

I had the same issue, but increasing the memory limits to 2Gi mitigated it.

mahmoud-abdelhafez avatar May 05 '21 14:05 mahmoud-abdelhafez

I am seeing OOMs with 2Gi and I am on v0.14.1.

hihellobolke avatar Aug 24 '21 04:08 hihellobolke

Same here on flux2 version 0.16.2. Increasing the memory limits to 2Gi mitigated the issue.

thomasroot avatar Aug 25 '21 08:08 thomasroot

This issue seems to be linked to https://github.com/fluxcd/source-controller/issues/192. Our clusters also suffer from it; we see memory usage of 1-2 GB.

Generally speaking, it is strange that a service which just downloads some files from other repositories consumes so much memory.

runningman84 avatar Aug 26 '21 11:08 runningman84

I was able to trigger this issue by putting interval: 1d in my HelmRepository spec. Happy to file separately if needed, but I'm trying to limit the issue count on source-controller OOMs.

kav avatar Sep 02 '21 20:09 kav

As with any workload on Kubernetes, the right resource limit configuration highly depends on what you are making the source-controller do (and you may thus have to increase it).

Helm-related operations, for example, are resource intensive because at present we haven't found the right optimization path to work with repository index files without loading them into memory in full (due to certain constraints around the unmarshalling of YAML).

Combined with the popularity of solutions like Artifactory, which like to stuff as much as possible into a single index (in some cases resulting in a file of >100 MB), and the fact that the reconciliation of resources is isolated, resource usage exceeding the defaults can be expected.

Another task that can be resource intensive is the packaging of a Helm chart from a Git source, because Helm first loads all the chart data into an object in memory (including all files, and the files of the dependencies), before writing it to disk.

For a fun experiment: check the current resources your CI worker nodes have (or ask around), or monitor the resource usage of various helm commands on your local machine, and then take into account that the controller does this in parallel with multiple workers, for multiple resources.


Generally speaking it is strange that a service which just downloads some files from other repos consumes so much memory.

The controller does much more than just download files; I think you are oversimplifying or underestimating its inner workings, and ignoring the fact that it has several features that perform composition tasks, etc. In addition, to ensure proper isolation of e.g. credentials, most Git operations are done in memory as well.

I was able to trigger this issue by putting interval: 1d in my helm repository spec. Happy to file separately if needed but trying to limit the issue count on source controller OOM

Your Helm index likely is simply too big, or your resource limit settings are too low, see explanation above.


Lastly, we are continuously looking into ways to reduce the footprint of our controllers, and I can already tell you some paths have been identified (and are actively worked on) to help reduce it.

Do however always keep in mind that while the YAML creates simple looking and composable abstractions, there will always be processes behind it that actually execute the task, and that the hardware of your local development machine often outperforms most containers.

hiddeco avatar Sep 02 '21 21:09 hiddeco

Your Helm index likely is simply too big, or your resource limit settings are too low, see explanation above.

No, it appears 1d is simply not valid, per the log. Sorry, I should have included that.

E0902 19:20:30.626842       1 reflector.go:138] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch *v1beta1.HelmRepository: failed to list *v1beta1.HelmRepository: v1beta1.HelmRepositoryList.Items: []v1beta1.HelmRepository: v1beta1.HelmRepository.Spec: v1beta1.HelmRepositorySpec.Timeout: Interval: unmarshalerDecoder: time: unknown unit "d" in duration "1d", error found in #10 byte of ...|rval":"1d","timeout"|..., bigger context ...|0-4596-8543-9d6d4b573433"},"spec":{"interval":"1d","timeout":"60s","url":"https://raw.githubusercont|...

kav avatar Sep 02 '21 21:09 kav

That is expected, as 1d is simply invalid.

There is no definition for units of a day or larger, to avoid confusion across daylight saving time transitions.

  • https://pkg.go.dev/time#pkg-constants

A duration string is a possibly signed sequence of decimal numbers, each with optional fraction and a unit suffix, such as "300ms", "-1.5h" or "2h45m". Valid time units are "ns", "us" (or "µs"), "ms", "s", "m", "h".

  • https://pkg.go.dev/time#ParseDuration

hiddeco avatar Sep 02 '21 21:09 hiddeco

Yes, sure, but it synchronized that change from the repository into the HelmRepository resource and then OOMed the source-controller when it tried to read it. I backed out the change in Git, but then had to manually edit the HelmRepository object since the source-controller was hung. I'm not saying it should support days, just that this is a footgun. If the unit isn't supported, I would have expected the HelmRepository to fail validation on sync.

kav avatar Sep 02 '21 21:09 kav

@kav can you please move this into a separate issue? I did a small test yesterday evening and was indeed able to apply a resource with an invalid interval format, but the cluster I was testing on wasn't running any controllers at the time so I wasn't able to validate the crash.

hiddeco avatar Sep 03 '21 13:09 hiddeco

Having the same OOMKilled issue here; with the information from #192 I pinned it down to the large Bitnami Helm repo, whose index file alone is 13.4 MB.

[screenshot: source-controller memory usage graph]

mkoertgen avatar May 12 '22 10:05 mkoertgen

For large Helm repository index files, you can enable caching to reduce the memory footprint of source-controller, docs here: https://fluxcd.io/docs/cheatsheets/bootstrap/#enable-helm-repositories-caching

stefanprodan avatar May 12 '22 10:05 stefanprodan
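For reference, the cheatsheet linked above enables the cache through extra flags on the source-controller Deployment. A sketch of the patch in the flux-system kustomization (the flag values here are examples; check the docs for the flags supported by your source-controller version, as older releases don't have them):

```yaml
# Sketch of a flux-system kustomization patch enabling the Helm index cache.
# Flag names/values follow the bootstrap cheatsheet; tune them to your setup.
patches:
  - patch: |
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --helm-cache-max-size=10
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --helm-cache-ttl=60m
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --helm-cache-purge-interval=5m
    target:
      kind: Deployment
      name: source-controller
```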

Thanks for the documentation link @stefanprodan. That was helpful.

Removing redundant Bitnami HelmRepository resources in duplicate namespaces brought the memory footprint down to 190 MB, though it still peaks every 10 minutes (the Helm repo update interval).

[screenshot: source-controller memory usage graph]

I will check on enabling helm-caching. Thanks again, much appreciated.

mkoertgen avatar May 12 '22 11:05 mkoertgen

Needed to update 0.28 -> 0.30 so the Helm cache arguments were available.

gotk_cache_events_total looks good so far. I will keep observing the memory footprint, but for now this seems to solve the issue, at least for me.

Thanks again.

mkoertgen avatar May 12 '22 11:05 mkoertgen

Looks much better with helm-caching enabled

[screenshot: source-controller memory usage graph with caching enabled]

mkoertgen avatar May 12 '22 12:05 mkoertgen

Yep, that's consistent with what I'm seeing on my test clusters; using the source-controller cache brought memory from 2 GB down to 200 MB.

stefanprodan avatar May 12 '22 12:05 stefanprodan