source-controller OOM events
Describe the bug
When registering FluxCD to a repository in GitLab Enterprise, I am seeing OOM activity on the source-controller pod. Removing the 1Gi memory limit fixes the issue.
To Reproduce
Register Flux on a repository with some level of complexity (I have not pinned down the exact trigger).
Expected behavior
The source-controller pod should not be killed and restarted repeatedly.
Additional context
- Kubernetes version: 1.19
- Git provider: gitlab self-hosted
- Container registry provider: gitlab/ECR
Below please provide the output of the following commands:
flux --version : flux version 0.8.0
flux check
► checking prerequisites
✔ kubectl 1.19.3 >=1.18.0
✔ Kubernetes 1.19.6-eks-49a6c0 >=1.16.0
► checking controllers
✔ source-controller: healthy
► ghcr.io/fluxcd/source-controller:v0.8.1
✔ kustomize-controller: healthy
► ghcr.io/fluxcd/kustomize-controller:v0.8.1
✔ helm-controller: healthy
► ghcr.io/fluxcd/helm-controller:v0.7.0
✔ notification-controller: healthy
► ghcr.io/fluxcd/notification-controller:v0.8.0
✔ all checks passed
kubectl -n <namespace> get all
kubectl -n flux-system get all
NAME READY STATUS RESTARTS AGE
pod/helm-controller-6946b6dc7f-5nr8q 1/1 Running 0 9m34s
pod/kustomize-controller-55dfcdfd58-xj25c 1/1 Running 0 10h
pod/notification-controller-649754966b-2677x 1/1 Running 0 10h
pod/source-controller-597cc769b-lp6w4 0/1 CrashLoopBackOff 5 6m23s
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/notification-controller ClusterIP 10.100.114.245 <none> 80/TCP 10h
service/source-controller ClusterIP 10.100.185.20 <none> 80/TCP 10h
service/webhook-receiver ClusterIP 10.100.198.200 <none> 80/TCP 10h
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/helm-controller 1/1 1 1 10h
deployment.apps/kustomize-controller 1/1 1 1 10h
deployment.apps/notification-controller 1/1 1 1 10h
deployment.apps/source-controller 0/1 1 0 10h
NAME DESIRED CURRENT READY AGE
replicaset.apps/helm-controller-6779d46d69 0 0 0 10h
replicaset.apps/helm-controller-6946b6dc7f 1 1 1 9m34s
replicaset.apps/kustomize-controller-55dfcdfd58 1 1 1 10h
replicaset.apps/notification-controller-649754966b 1 1 1 10h
replicaset.apps/source-controller-555d4f9d6 0 0 0 10h
replicaset.apps/source-controller-597cc769b 1 1 0 10h
kubectl -n <namespace> logs deploy/source-controller
--- various, without errors, until killed ---
kubectl -n <namespace> logs deploy/kustomize-controller
--- various ---
{"level":"info","ts":"2021-02-24T00:06:40.724Z","logger":"controller.kustomization","msg":"Source is not ready, artifact not found","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"istio-system","namespace":"flux-system"}
{"level":"info","ts":"2021-02-24T00:06:41.811Z","logger":"controller.kustomization","msg":"Source is not ready, artifact not found","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"bookinfo","namespace":"flux-system"}
{"level":"info","ts":"2021-02-24T00:06:41.815Z","logger":"controller.kustomization","msg":"Source is not ready, artifact not found","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"calico","namespace":"flux-system"}
{"level":"error","ts":"2021-02-24T00:06:41.825Z","logger":"controller.kustomization","msg":"Reconciliation failed after 1.059192016s, next try in 5m0s","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"podinfo","namespace":"flux-system","revision":"master/e43ebfa5bf4b87c46f2e1db495eb571cd398e2f7","error":"failed to download artifact from http://source-controller.flux-system.svc.cluster.local./gitrepository/flux-system/podinfo/e43ebfa5bf4b87c46f2e1db495eb571cd398e2f7.tar.gz, error: Get \"http://source-controller.flux-system.svc.cluster.local./gitrepository/flux-system/podinfo/e43ebfa5bf4b87c46f2e1db495eb571cd398e2f7.tar.gz\": dial tcp 10.100.185.20:80: connect: connection refused"}
{"level":"info","ts":"2021-02-24T00:06:41.843Z","logger":"controller.kustomization","msg":"Source is not ready, artifact not found","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"kafka","namespace":"flux-system"}
{"level":"info","ts":"2021-02-24T00:07:41.833Z","logger":"controller.kustomization","msg":"Source is not ready, artifact not found","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"bookinfo","namespace":"flux-system"}
{"level":"info","ts":"2021-02-24T00:07:41.834Z","logger":"controller.kustomization","msg":"Source is not ready, artifact not found","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"calico","namespace":"flux-system"}
{"level":"info","ts":"2021-02-24T00:07:41.853Z","logger":"controller.kustomization","msg":"Source is not ready, artifact not found","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"kafka","namespace":"flux-system"}
{"level":"info","ts":"2021-02-24T00:08:41.853Z","logger":"controller.kustomization","msg":"Source is not ready, artifact not found","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"calico","namespace":"flux-system"}
{"level":"info","ts":"2021-02-24T00:08:41.855Z","logger":"controller.kustomization","msg":"Source is not ready, artifact not found","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"bookinfo","namespace":"flux-system"}
{"level":"info","ts":"2021-02-24T00:08:41.863Z","logger":"controller.kustomization","msg":"Source is not ready, artifact not found","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"kafka","namespace":"flux-system"}
{"level":"info","ts":"2021-02-24T00:09:41.872Z","logger":"controller.kustomization","msg":"Source is not ready, artifact not found","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"calico","namespace":"flux-system"}
{"level":"info","ts":"2021-02-24T00:09:41.874Z","logger":"controller.kustomization","msg":"Source is not ready, artifact not found","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"bookinfo","namespace":"flux-system"}
{"level":"info","ts":"2021-02-24T00:09:41.875Z","logger":"controller.kustomization","msg":"Source is not ready, artifact not found","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"kafka","namespace":"flux-system"}
{"level":"info","ts":"2021-02-24T00:10:41.893Z","logger":"controller.kustomization","msg":"Source is not ready, artifact not found","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"calico","namespace":"flux-system"}
{"level":"info","ts":"2021-02-24T00:10:41.895Z","logger":"controller.kustomization","msg":"Source is not ready, artifact not found","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"kafka","namespace":"flux-system"}
{"level":"info","ts":"2021-02-24T00:10:41.895Z","logger":"controller.kustomization","msg":"Source is not ready, artifact not found","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"bookinfo","namespace":"flux-system"}
Changing the source-controller deployment resources stanza as follows addresses the issue:

resources:
  limits:
    cpu: 1000m
    # memory: 1Gi
  requests:
    cpu: 50m
    # memory: 64Mi
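For a bootstrapped Flux installation, the same change can be kept declarative instead of editing the Deployment by hand. A minimal sketch of a kustomize patch in flux-system/kustomization.yaml, assuming the standard gotk-components layout and illustrative limit values (pick values that fit your cluster):

```yaml
# flux-system/kustomization.yaml (sketch; limit values are illustrative)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - gotk-components.yaml
  - gotk-sync.yaml
patches:
  - patch: |
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: source-controller
        namespace: flux-system
      spec:
        template:
          spec:
            containers:
              - name: manager
                resources:
                  limits:
                    cpu: 1000m
                    memory: 2Gi
    target:
      kind: Deployment
      name: source-controller
```

This survives Flux upgrades, since the patch is re-applied every time the flux-system Kustomization reconciles, whereas a manual kubectl edit would be reverted.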
I had the same issue, but increasing the memory limit to 2Gi mitigated it.
I am seeing OOMs with 2Gi, and I am on v0.14.1.
Same here on flux2 version 0.16.2. Increasing the memory limits to 2Gi mitigated the issue.
This issue seems to be linked to https://github.com/fluxcd/source-controller/issues/192. Our clusters also suffer from this issue; we see memory usage of 1-2 GB.
Generally speaking it is strange that a service which just downloads some files from other repos consumes so much memory.
I was able to trigger this issue by putting interval: 1d in my HelmRepository spec. Happy to file separately if needed, but trying to limit the issue count on source-controller OOM.
As with any workload on Kubernetes, the right resource limit configuration highly depends on what you are making the source-controller do (and you may thus have to increase it).
Helm-related operations, for example, are resource intensive because at present we haven't found the right optimization path to work with repository index files without loading them into memory in full (due to certain constraints around the unmarshalling of YAML).
Combined with the popularity of some solutions like Artifactory, which likes to stuff as much as possible in a single index (in some cases resulting in a file of >100MB), and the fact that the reconciliation of resources is isolated, resource usage exceeding the defaults can be expected.
Another task that can be resource intensive is the packaging of a Helm chart from a Git source, because Helm first loads all the chart data into an object in memory (including all files, and the files of the dependencies), before writing it to disk.
For a fun experiment: check the current resources your CI worker nodes have (or ask around), or monitor the resource usage of various helm commands on your local machine, and then take into account that the controller does this in parallel with multiple workers, for multiple resources.
Generally speaking it is strange that a service which just downloads some files from other repos consumes so much memory.
The controller does much more than just downloading files, and I think you are oversimplifying or underestimating the inner workings of the controller, and ignoring the fact that it has several features that perform composition tasks, etc. In addition, to ensure proper isolation of e.g. credentials, most Git things are done in memory as well.
I was able to trigger this issue by putting interval: 1d in my helm repository spec. Happy to file separately if needed but trying to limit the issue count on source controller OOM
Your Helm index likely is simply too big, or your resource limit settings are too low, see explanation above.
Lastly, we are continuously looking into ways to reduce the footprint of our controllers, and I can already tell you some paths have been identified (and are actively worked on) to help reduce it.
Do however always keep in mind that while the YAML creates simple looking and composable abstractions, there will always be processes behind it that actually execute the task, and that the hardware of your local development machine often outperforms most containers.
Your Helm index likely is simply too big, or your resource limit settings are too low, see explanation above.
No, it appears 1d is simply not valid per the log. Sorry, I should have included that:
E0902 19:20:30.626842 1 reflector.go:138] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch *v1beta1.HelmRepository: failed to list *v1beta1.HelmRepository: v1beta1.HelmRepositoryList.Items: []v1beta1.HelmRepository: v1beta1.HelmRepository.Spec: v1beta1.HelmRepositorySpec.Timeout: Interval: unmarshalerDecoder: time: unknown unit "d" in duration "1d", error found in #10 byte of ...|rval":"1d","timeout"|..., bigger context ...|0-4596-8543-9d6d4b573433"},"spec":{"interval":"1d","timeout":"60s","url":"https://raw.githubusercont|...
That is expected, as 1d is simply invalid.
There is no definition for units of Day or larger to avoid confusion across daylight savings time zone transitions.
- https://pkg.go.dev/time#pkg-constants
A duration string is a possibly signed sequence of decimal numbers, each with optional fraction and a unit suffix, such as "300ms", "-1.5h" or "2h45m". Valid time units are "ns", "us" (or "µs"), "ms", "s", "m", "h".
- https://pkg.go.dev/time#ParseDuration
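Since days are not a valid Go duration unit, the same interval has to be expressed in hours. A minimal sketch of a valid HelmRepository spec (the name, namespace, and URL below are illustrative):

```yaml
# Equivalent of the invalid "interval: 1d", written as a valid Go duration
apiVersion: source.toolkit.fluxcd.io/v1beta1
kind: HelmRepository
metadata:
  name: example          # illustrative name
  namespace: flux-system
spec:
  interval: 24h          # valid: "ns", "us", "ms", "s", "m", "h" only
  timeout: 60s
  url: https://charts.example.com  # illustrative URL
```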
Yes, sure, but it synchronized that change from the repository into the HelmRepository resource and then OOMed the source-controller trying to read it. I backed out the change in Git, but then had to manually edit the HelmRepository object since the source-controller was hung. I'm not saying it should support days, just that this is a footgun. If the unit is not supported, I would have expected the HelmRepository to fail validation on the sync.
@kav can you please move this into a separate issue? I did a small test yesterday evening and was indeed able to apply a resource with an invalid interval format, but the cluster I was testing on wasn't running any controllers at the time, so I wasn't able to validate the crash.
Having the same issue with OOMKilled; with the information from #192, I pinned it down to the large Bitnami Helm repo, whose index file alone is 13.4 MB.
For large Helm repository index files, you can enable caching to reduce the memory footprint of source-controller, docs here: https://fluxcd.io/docs/cheatsheets/bootstrap/#enable-helm-repositories-caching
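Per the linked cheatsheet, caching is enabled through source-controller container arguments. A sketch of the corresponding kustomize patch for flux-system/kustomization.yaml (flag values are the cheatsheet's examples; verify the flags against your controller version, as they were only added in later releases):

```yaml
# Patch fragment enabling the Helm index cache on source-controller
patches:
  - patch: |
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --helm-cache-max-size=10
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --helm-cache-ttl=60m
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --helm-cache-purge-interval=5m
    target:
      kind: Deployment
      name: source-controller
```

With the cache enabled, repeated reconciliations of HelmCharts that reference the same repository reuse the in-memory index instead of re-loading it per reconciliation.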
Thanks for the documentation link @stefanprodan. That was helpful.
Removing the Bitnami Helm repos in redundant namespaces brought the memory footprint down to 190 MB, yet it still peaks every 10 minutes (the Helm repo update interval).
I will check on enabling Helm caching. Thanks again, much appreciated.
I needed to update 0.28 -> 0.30 so the Helm cache arguments were available.
gotk_cache_events_total looks good so far. I will observe the memory footprint, but for now this seems to solve the issue, at least for me.
Thanks again.
Looks much better with helm-caching enabled
Yep, that's consistent with what I'm seeing on my test clusters; using the source-controller cache brought memory usage from 2 GB down to 200 MB.