source-controller
OOMKill for period of time after which it magically resolves without changes
source-controller just got OOMKilled out of the blue. It started about an hour ago and I can't see any prior activity that triggered it. No recently added sources, nothing. It started with rc3. I upgraded it to rc4 but it's the same behaviour. The memory usage goes through the roof and the cluster kills the pod.
❯❯❯ flux stats
RECONCILERS            RUNNING  FAILING  SUSPENDED  STORAGE
GitRepository          7        0        0          1.9 MiB
OCIRepository          0        0        0          -
HelmRepository         0        0        0          -
HelmChart              0        0        0          -
Bucket                 0        0        0          -
Kustomization          3        0        0          -
HelmRelease            0        0        0          -
Alert                  0        0        0          -
Provider               0        0        0          -
Receiver               0        0        0          -
ImageUpdateAutomation  6        0        0          -
ImagePolicy            20       0        0          -
ImageRepository        20       0        0          -
❯❯❯ flux check
► checking prerequisites
✔ Kubernetes 1.22.17-eks-0a21954 >=1.20.6-0
► checking controllers
✔ helm-controller: deployment ready
► ghcr.io/fluxcd/helm-controller:v0.34.0
✔ image-automation-controller: deployment ready
► ghcr.io/fluxcd/image-automation-controller:v0.34.0
✔ image-reflector-controller: deployment ready
► ghcr.io/fluxcd/image-reflector-controller:v0.28.0
✔ kustomize-controller: deployment ready
► ghcr.io/fluxcd/kustomize-controller:v1.0.0-rc.4
✔ notification-controller: deployment ready
► ghcr.io/fluxcd/notification-controller:v1.0.0-rc.4
flux check gets stuck at this point as the source controller is not responding.
Does the time it takes to end up in an OOM kill allow you to collect a HEAP profile? Instructions for this can be found here: https://fluxcd.io/flux/gitops-toolkit/debugging/#collecting-a-profile
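For reference, a rough sketch of what collecting one can look like. This assumes the pprof handlers are served on the controller's metrics port (8080 in the default manifests); check the linked docs for the exact endpoint in your version.
# forward the controller's metrics port to localhost
kubectl -n flux-system port-forward deploy/source-controller 8080:8080 &
# save a heap profile from the standard net/http/pprof endpoint
curl -Sk http://localhost:8080/debug/pprof/heap > heap.out
# quick look at the top allocations
go tool pprof -top heap.out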
Same thing is happening with some of our clusters as well
@cwrau did this start with RC.3 as well?
In addition to this, did you run RC.2 or RC.1 before without issues?
Based on the HEAP profile shared, I can't tell what is happening, as it appears to have been taken before the actual issue occurs. What may help is temporarily increasing the limits, to be able to take a proper snapshot while the thing happens without the Pod getting killed.
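If it helps, a hedged sketch of bumping the limit temporarily. This assumes the default flux-system layout; note that the flux-system Kustomization will revert ad-hoc changes on its next reconciliation unless it is suspended (or the patch is added to the bootstrap kustomization instead).
# pause the Kustomization that manages the controller manifests
flux suspend kustomization flux-system
# raise the memory limit on source-controller (4Gi is an arbitrary example value)
kubectl -n flux-system patch deployment source-controller --type=json \
  -p='[{"op":"replace","path":"/spec/template/spec/containers/0/resources/limits/memory","value":"4Gi"}]'
# resume once the profile has been captured
flux resume kustomization flux-system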
No, this seems to be kinda version-universal; we have old clusters with source-controller 0.18.0 which also got OOM'd.
Is there any chance you are all making use of BitBucket and things have now returned to normal?
No BitBucket for us that I can see;
HelmRepositories;
https://charts.jetstack.io
https://kubernetes.github.io/ingress-nginx
https://kyverno.github.io/kyverno/
https://charts.bitnami.com/bitnami
https://helm-charts.bitpoke.io
https://opensource.zalando.com/postgres-operator/charts/postgres-operator/
https://prometheus-community.github.io/helm-charts
https://vmware-tanzu.github.io/helm-charts/
https://grafana.github.io/helm-charts
https://aquasecurity.github.io/helm-charts/
GitRepositories;
ssh://[email protected]
ssh://[email protected]
(self hosted gitlab)
But yes, it seems to have stopped
Any chance your crashing instances are hosted on AWS EKS (AMD64, EC2, managed node groups)? I am trying to find a correlation here between your setups, as the issue seems to have started for a group of users at the same time, with a range of controller versions, and then magically stopped.
Hello @hiddeco,
I'm with the same organization as @cwrau. No, we don't use AWS; we use our own infrastructure based on OpenStack. We are also trying to find correlations, especially since this came and went out of nowhere.
Interestingly, a bunch of clusters using source-controller 1.0.0-rc3 didn't have the issue.
I see some bump in the DNS traffic for this period.
Without this happening again, and a proper HEAP snapshot when this happens, I fear this will be very much like looking for a needle in a haystack.
The clusters I know of just don't have a memory limit 😉
Oh, you're right. They all don't. Oops!
Same thing happened on at least one of our clusters;
https://snapshots.raintank.io/dashboard/snapshot/6QEyFh33cs2tH6FYj8enOeY0rlbY7jvc
Wouldn't this just be explained by the rapid pod restarts themselves, causing a burst in terms of queueing objects?
Yeah, I was thinking the same thing, but in my snapshot you can see that the responses take longer and the request and response sizes also got bigger; maybe that's got something to do with this?
It's a signal, but hard to tell how that would result in such spurious resource usage. Not to mention that the correlation in time (while running in entirely different environments) continues to be a mystery.
Any chance you collect logs somewhere that survive pod restarts?
Of course; https://snapshots.raintank.io/dashboard/snapshot/o6vLv6rjgym3qTdZs2AH6c0m3CjQrIyl
Better dashboard; https://snapshots.raintank.io/dashboard/snapshot/WinIZkonEwYwQaYoh8fQ1NtW2NK2yWxp
Based on the following log lines (and alike), it seems that the whole cluster network in general was unstable at the time.
{"level":"error","ts":"2023-05-31T11:57:14.737Z","logger":"runtime","msg":"Failed to release lock: Unauthorized\n"}
{"level":"info","ts":"2023-05-31T11:57:14.706Z","logger":"runtime","msg":"failed to renew lease flux-system/source-controller-leader-election: timed out waiting for the condition\n"}
{"level":"info","ts":"2023-05-31T11:57:14.703Z","msg":"Wait completed, proceeding to shutdown the manager"}
{"level":"error","ts":"2023-05-31T11:57:12.809Z","logger":"runtime","msg":"k8s.io/[email protected]/tools/cache/reflector.go:169: Failed to watch *v1beta2.OCIRepository: failed to list *v1beta2.OCIRepository: Unauthorized\n"}
{"level":"info","ts":"2023-05-31T11:57:12.809Z","logger":"runtime","msg":"k8s.io/[email protected]/tools/cache/reflector.go:169: failed to list *v1beta2.OCIRepository: Unauthorized\n"}
{"level":"error","ts":"2023-05-31T11:57:12.788Z","logger":"runtime","msg":"k8s.io/[email protected]/tools/cache/reflector.go:169: Failed to watch *v1beta2.Bucket: failed to list *v1beta2.Bucket: Unauthorized\n"}
{"level":"error","ts":"2023-05-31T11:57:10.671Z","logger":"runtime","msg":"Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:\"\", APIVersion:\"\"}, ObjectMeta:v1.ObjectMeta{Name:\"grafana.17643873d2342f01\", GenerateName:\"\", Namespace:\"flux-system\", SelfLink:\"\", UID:\"\", ResourceVersion:\"\", Generation:0, CreationTimestamp:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), DeletionTimestamp:<nil>, DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:\"HelmRepository\", Namespace:\"flux-system\", Name:\"grafana\", UID:\"8d83e45a-9757-4b0d-8452-53a4d9404aa1\", APIVersion:\"[source.toolkit.fluxcd.io/v1beta2](http://source.toolkit.fluxcd.io/v1beta2)\", ResourceVersion:\"62460920\", FieldPath:\"\"}, Reason:\"GarbageCollectionSucceeded\", Message:\"garbage collected artifacts for deleted resource\", Source:v1.EventSource{Component:\"source-controller\", Host:\"\"}, FirstTimestamp:time.Date(2023, time.May, 31, 11, 57, 10, 642437889, time.Local), LastTimestamp:time.Date(2023, time.May, 31, 11, 57, 10, 642437889, time.Local), Count:1, Type:\"Normal\", EventTime:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), Series:(*v1.EventSeries)(nil), Action:\"\", Related:(*v1.ObjectReference)(nil), ReportingController:\"\", ReportingInstance:\"\"}': 'events \"grafana.17643873d2342f01\" is forbidden: unable to create new content in namespace flux-system because it is being terminated' (will not retry!)\n"}
This by itself does not explain the OOMKill, but may lead to pointers elsewhere (CoreDNS, Kubernetes API server?) which explain how we can replicate this.
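If anyone wants to dig in that direction, a rough sketch of what could be checked (names assume a standard CoreDNS deployment in kube-system; adjust for your distribution):
# DNS errors or timeouts around the incident window
kubectl -n kube-system logs deploy/coredns --since=24h | grep -iE 'timeout|error' | tail -n 50
# API server health from the cluster's point of view
kubectl get --raw '/readyz?verbose'
# recent events in flux-system that survived the pod restarts
kubectl -n flux-system get events --sort-by=.lastTimestamp | tail -n 30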
Mh, it just occurred to me that that cluster might not be the best example, so let's instead take a look at this one;
CoreDNS; https://snapshots.raintank.io/dashboard/snapshot/v7J2X7eC7xHsSXGRdRjJ3pXiTe6kja4r
Logs; https://snapshots.raintank.io/dashboard/snapshot/eTRez5Xo65uZEaMwtoJ1Qh7MqwvSBUyv
Resources; https://snapshots.raintank.io/dashboard/snapshot/ZacWHodjq5khruJB8V4kVxxJA7lnll3w
Sorry 😅
We see the same on our 15+ clusters in AWS EKS (Karpenter nodes, OOMKilled with exit code 137). We are running the latest version before the RC versions.
What helps to find the problem? A heap dump?
Here are two heap profiles of the same pod, taken about 3h30min apart: heap.zip
The memory consumption increased from about 290MB to 2240MB in this time
Here is a heap profile of the fiftieth incarnation of the pod in a crash loop, where it consumes about 12.4 GB of memory: files.zip
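For anyone reading along, a quick sketch of comparing such snapshots locally with standard Go tooling (heap1.out and heap2.out are placeholder names for the profiles in the zip archives):
# top allocations in a single snapshot
go tool pprof -top heap1.out
# diff two snapshots to see what grew between them
go tool pprof -top -base=heap1.out heap2.out
# or browse interactively (flame graph, source view) in a browser
go tool pprof -http=:8081 heap2.out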
Please share the output from flux stats and flux version.
Generally speaking, it is recommended to update to the newest versions even if they are release candidates, as we also documented in our release notes.
In addition, I can see a lot of your memory usage @akesser is coming from the parsing of Helm repository indexes. It may be worth looking at https://fluxcd.io/flux/cheatsheets/bootstrap/#enable-helm-repositories-caching to see if this prevents the issue from happening.
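A quick way to confirm whether the caching flags are set on the running controller (a sketch; assumes the default flux-system namespace and that source-controller is the first container in the pod):
kubectl -n flux-system get deployment source-controller \
  -o jsonpath='{.spec.template.spec.containers[0].args}' | tr ',' '\n' | grep -- '--helm-cache'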
We updated to the newest version of flux. With the old version, the error occurred every 24 hours; now it occurs every 5 to 10 minutes.
And we use helm repository caching:
--helm-cache-max-size=2000
--helm-cache-ttl=60m
--helm-cache-purge-interval=5m
--helm-index-max-size=64428800
This is the overview of flux stats
RECONCILERS            RUNNING  FAILING  SUSPENDED  STORAGE
GitRepository          4        0        0          908.9 KiB
OCIRepository          1        0        0          498 B
HelmRepository         30       0        0          81.0 MiB
HelmChart              221      2        0          5.3 MiB
Bucket                 0        0        0          -
Kustomization          53       7        0          -
HelmRelease            87       5        0          -
Alert                  0        0        0          -
Provider               0        0        0          -
Receiver               0        0        0          -
ImageUpdateAutomation  0        0        0          -
ImagePolicy            0        0        0          -
ImageRepository        0        0        0          -
The old version consumed about 5 GB of RAM and 0.4 cores; the following image shows CPU and RAM of the newer version of flux:
It consumes 83GB of RAM and 17.8 cores.
And here is a heap dump: heap2.out.zip
Here is another screenshot showing the pod consuming 44.6 CPU cores and 144 GB of RAM:
One entry for gotk_cache_events_total that catches the eye is this:
All other entries are below 500
This is the graph of the counter:
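For completeness, a sketch of pulling that counter straight from the controller's Prometheus endpoint, assuming the default metrics address :8080:
kubectl -n flux-system port-forward deploy/source-controller 8080:8080 &
curl -s http://localhost:8080/metrics | grep gotk_cache_events_total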