source-controller
OOMKill for period of time after which it magically resolves without changes
source-controller just got OOMKilled out of the blue. It started about an hour ago and I can't see any prior activity that triggered it. No recently added sources, nothing. It started with rc3. I upgraded it to rc4 but it's the same behaviour. The memory usage goes through the roof and the cluster kills the pod.
❯❯❯ flux stats
RECONCILERS            RUNNING  FAILING  SUSPENDED  STORAGE
GitRepository          7        0        0          1.9 MiB
OCIRepository          0        0        0          -
HelmRepository         0        0        0          -
HelmChart              0        0        0          -
Bucket                 0        0        0          -
Kustomization          3        0        0          -
HelmRelease            0        0        0          -
Alert                  0        0        0          -
Provider               0        0        0          -
Receiver               0        0        0          -
ImageUpdateAutomation  6        0        0          -
ImagePolicy            20       0        0          -
ImageRepository        20       0        0          -
❯❯❯ flux check
► checking prerequisites
✔ Kubernetes 1.22.17-eks-0a21954 >=1.20.6-0
► checking controllers
✔ helm-controller: deployment ready
► ghcr.io/fluxcd/helm-controller:v0.34.0
✔ image-automation-controller: deployment ready
► ghcr.io/fluxcd/image-automation-controller:v0.34.0
✔ image-reflector-controller: deployment ready
► ghcr.io/fluxcd/image-reflector-controller:v0.28.0
✔ kustomize-controller: deployment ready
► ghcr.io/fluxcd/kustomize-controller:v1.0.0-rc.4
✔ notification-controller: deployment ready
► ghcr.io/fluxcd/notification-controller:v1.0.0-rc.4
flux check gets stuck at this point as the source controller is not responding.
Does the time it takes to end up in an OOM kill allow you to collect a HEAP profile? Instructions for this can be found here: https://fluxcd.io/flux/gitops-toolkit/debugging/#collecting-a-profile
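For reference, a rough sketch of what collecting one can look like. This assumes the pprof handlers are served on the controller's metrics port (8080 in the default manifests); check the linked docs for the exact endpoint in your version.
# forward the controller's metrics port to localhost
kubectl -n flux-system port-forward deploy/source-controller 8080:8080 &
# save a heap profile from the standard net/http/pprof endpoint
curl -Sk http://localhost:8080/debug/pprof/heap > heap.out
# quick look at the top allocations
go tool pprof -top heap.out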
Same thing is happening with some of our clusters as well
@cwrau did this start with RC.3 as well?
In addition to this, did you run RC.2 or RC.1 before without issues?
Based on the HEAP profile shared, I can't tell what is happening, as it appears to have been taken before the actual issue occurs. What may help is temporarily increasing the limits, to be able to take a proper snapshot while the thing happens without the Pod getting killed.
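If it helps, a hedged sketch of bumping the limit temporarily. This assumes the default flux-system layout; note that the flux-system Kustomization will revert ad-hoc changes on its next reconciliation unless it is suspended (or the patch is added to the bootstrap kustomization instead).
# pause the Kustomization that manages the controller manifests
flux suspend kustomization flux-system
# raise the memory limit on source-controller (4Gi is an arbitrary example value)
kubectl -n flux-system patch deployment source-controller --type=json \
  -p='[{"op":"replace","path":"/spec/template/spec/containers/0/resources/limits/memory","value":"4Gi"}]'
# resume once the profile has been captured
flux resume kustomization flux-system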
No, this seems to be kinda version-universal; we have old clusters with source-controller 0.18.0 which also got OOM'd.
Is there any chance you are all making use of BitBucket and things have now returned to normal?
No BitBucket for us that I can see;
HelmRepositories;
https://charts.jetstack.io
https://kubernetes.github.io/ingress-nginx
https://kyverno.github.io/kyverno/
https://charts.bitnami.com/bitnami
https://helm-charts.bitpoke.io
https://opensource.zalando.com/postgres-operator/charts/postgres-operator/
https://prometheus-community.github.io/helm-charts
https://vmware-tanzu.github.io/helm-charts/
https://grafana.github.io/helm-charts
https://aquasecurity.github.io/helm-charts/
GitRepositories;
ssh://[email protected]
ssh://[email protected]
(self hosted gitlab)
But yes, it seems to have stopped
Any chance your crashing instances are hosted on AWS EKS (AMD64, EC2, managed node groups)? I am trying to find a correlation here between your setups, as the issue seems to have started for a group of users at the same time, with a range of controller versions, and then magically stopped.
Hello @hiddeco,
I'm with the same organization as @cwrau. No, we don't use AWS; we use our own infrastructure based on OpenStack. We are also trying to find correlations, especially since this came and went out of nowhere.
Interestingly, a bunch of clusters using source-controller 1.0.0-rc3 didn't have the issue.
I see some bump in the DNS traffic for this period.
Without this happening again, and a proper HEAP snapshot when this happens, I fear this will be very much like looking for a needle in a haystack.
The clusters I know of just don't have a memory limit 😉
Oh, you're right. They all don't. Oops!
Same thing happened on at least one of our clusters;
https://snapshots.raintank.io/dashboard/snapshot/6QEyFh33cs2tH6FYj8enOeY0rlbY7jvc
Wouldn't this just be explained by the rapid pod restarts themselves, causing a burst in terms of queueing objects?
Yeah, I was thinking the same thing, but in my snapshot you can see that the responses take longer and the request and response sizes also got bigger; maybe that's got something to do with this?
It's a signal, but hard to tell how that would result in such spurious resource usage. Not to mention that the correlation in time (while running in entirely different environments) continues to be a mystery.
Any chance you collect logs somewhere that survive pod restarts?
Of course; https://snapshots.raintank.io/dashboard/snapshot/o6vLv6rjgym3qTdZs2AH6c0m3CjQrIyl
Better dashboard; https://snapshots.raintank.io/dashboard/snapshot/WinIZkonEwYwQaYoh8fQ1NtW2NK2yWxp
Based on the following log lines (and alike), it seems that the whole cluster network in general was unstable at the time.
{"level":"error","ts":"2023-05-31T11:57:14.737Z","logger":"runtime","msg":"Failed to release lock: Unauthorized\n"}
{"level":"info","ts":"2023-05-31T11:57:14.706Z","logger":"runtime","msg":"failed to renew lease flux-system/source-controller-leader-election: timed out waiting for the condition\n"}
{"level":"info","ts":"2023-05-31T11:57:14.703Z","msg":"Wait completed, proceeding to shutdown the manager"}
{"level":"error","ts":"2023-05-31T11:57:12.809Z","logger":"runtime","msg":"k8s.io/[email protected]/tools/cache/reflector.go:169: Failed to watch *v1beta2.OCIRepository: failed to list *v1beta2.OCIRepository: Unauthorized\n"}
{"level":"info","ts":"2023-05-31T11:57:12.809Z","logger":"runtime","msg":"k8s.io/[email protected]/tools/cache/reflector.go:169: failed to list *v1beta2.OCIRepository: Unauthorized\n"}
{"level":"error","ts":"2023-05-31T11:57:12.788Z","logger":"runtime","msg":"k8s.io/[email protected]/tools/cache/reflector.go:169: Failed to watch *v1beta2.Bucket: failed to list *v1beta2.Bucket: Unauthorized\n"}
{"level":"error","ts":"2023-05-31T11:57:10.671Z","logger":"runtime","msg":"Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:\"\", APIVersion:\"\"}, ObjectMeta:v1.ObjectMeta{Name:\"grafana.17643873d2342f01\", GenerateName:\"\", Namespace:\"flux-system\", SelfLink:\"\", UID:\"\", ResourceVersion:\"\", Generation:0, CreationTimestamp:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), DeletionTimestamp:<nil>, DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:\"HelmRepository\", Namespace:\"flux-system\", Name:\"grafana\", UID:\"8d83e45a-9757-4b0d-8452-53a4d9404aa1\", APIVersion:\"[source.toolkit.fluxcd.io/v1beta2](http://source.toolkit.fluxcd.io/v1beta2)\", ResourceVersion:\"62460920\", FieldPath:\"\"}, Reason:\"GarbageCollectionSucceeded\", Message:\"garbage collected artifacts for deleted resource\", Source:v1.EventSource{Component:\"source-controller\", Host:\"\"}, FirstTimestamp:time.Date(2023, time.May, 31, 11, 57, 10, 642437889, time.Local), LastTimestamp:time.Date(2023, time.May, 31, 11, 57, 10, 642437889, time.Local), Count:1, Type:\"Normal\", EventTime:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), Series:(*v1.EventSeries)(nil), Action:\"\", Related:(*v1.ObjectReference)(nil), ReportingController:\"\", ReportingInstance:\"\"}': 'events \"grafana.17643873d2342f01\" is forbidden: unable to create new content in namespace flux-system because it is being terminated' (will not retry!)\n"}
This by itself does not explain the OOMKill, but may lead to pointers elsewhere (CoreDNS, Kubernetes API server?) which explain how we can replicate this.
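If anyone wants to dig in that direction, a rough sketch of what could be checked (names assume a standard CoreDNS deployment in kube-system; adjust for your distribution):
# DNS errors or timeouts around the incident window
kubectl -n kube-system logs deploy/coredns --since=24h | grep -iE 'timeout|error' | tail -n 50
# API server health from the cluster's point of view
kubectl get --raw '/readyz?verbose'
# recent events in flux-system that survived the pod restarts
kubectl -n flux-system get events --sort-by=.lastTimestamp | tail -n 30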
Mh, it just occurred to me that that cluster might not be the best example, so let's instead take a look at this one;
CoreDNS; https://snapshots.raintank.io/dashboard/snapshot/v7J2X7eC7xHsSXGRdRjJ3pXiTe6kja4r
Logs; https://snapshots.raintank.io/dashboard/snapshot/eTRez5Xo65uZEaMwtoJ1Qh7MqwvSBUyv
Resources; https://snapshots.raintank.io/dashboard/snapshot/ZacWHodjq5khruJB8V4kVxxJA7lnll3w
Sorry 😅
We see the same on our 15+ clusters in AWS EKS (Karpenter nodes, OOMKilled with exit code 137). We are running the latest version before the RC versions.
What helps to find the problem? A heap dump?
Here are two heap profiles of the same pod, taken about 3h30min apart: heap.zip
The memory consumption increased from about 290MB to 2240MB in this time
Here is a heap profile of the fiftieth incarnation of the pod in a crash loop, where it consumes about 12.4 GB of memory: files.zip
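For anyone reading along, a quick sketch of comparing such snapshots locally with standard Go tooling (heap1.out and heap2.out are placeholder names for the profiles in the zip archives):
# top allocations in a single snapshot
go tool pprof -top heap1.out
# diff two snapshots to see what grew between them
go tool pprof -top -base=heap1.out heap2.out
# or browse interactively (flame graph, source view) in a browser
go tool pprof -http=:8081 heap2.out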
Please share the output from flux stats and flux version.
Generally speaking, it is recommended to update to the newest versions even if they are release candidates, as we also documented in our release notes.
In addition, I can see a lot of your memory usage @akesser is coming from the parsing of Helm repository indexes. It may be worth looking at https://fluxcd.io/flux/cheatsheets/bootstrap/#enable-helm-repositories-caching to see if this prevents the issue from happening.
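A quick way to confirm whether the caching flags are set on the running controller (a sketch; assumes the default flux-system namespace and that source-controller is the first container in the pod):
kubectl -n flux-system get deployment source-controller \
  -o jsonpath='{.spec.template.spec.containers[0].args}' | tr ',' '\n' | grep -- '--helm-cache'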
We updated to the newest version of flux. With the old version, the error occurred every 24 hours; now it occurs every 5 to 10 minutes.
And we use helm repository caching:
--helm-cache-max-size=2000
--helm-cache-ttl=60m
--helm-cache-purge-interval=5m
--helm-index-max-size=64428800
This is the overview of flux stats
RECONCILERS            RUNNING  FAILING  SUSPENDED  STORAGE
GitRepository          4        0        0          908.9 KiB
OCIRepository          1        0        0          498 B
HelmRepository         30       0        0          81.0 MiB
HelmChart              221      2        0          5.3 MiB
Bucket                 0        0        0          -
Kustomization          53       7        0          -
HelmRelease            87       5        0          -
Alert                  0        0        0          -
Provider               0        0        0          -
Receiver               0        0        0          -
ImageUpdateAutomation  0        0        0          -
ImagePolicy            0        0        0          -
ImageRepository        0        0        0          -
The old version consumed about 5 GB of RAM and 0.4 cores; the following image shows CPU and RAM of the newer version of flux:
It consumes 83GB of RAM and 17.8 cores.
And here is a heap dump: heap2.out.zip
Here is another screenshot showing the pod consuming 44.6 CPU cores and 144 GB of RAM:
One entry for gotk_cache_events_total that catches the eye is this:
All other entries are below 500
This is the graph of the counter:
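For completeness, a sketch of pulling that counter straight from the controller's Prometheus endpoint, assuming the default metrics address :8080:
kubectl -n flux-system port-forward deploy/source-controller 8080:8080 &
curl -s http://localhost:8080/metrics | grep gotk_cache_events_total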