
bug: Creating and then deleting workspaces causes leaked goroutines

Open · julian-hj opened this issue 8 months ago · 13 comments

Describe the bug

baseline pprof from kcp before workspace creation/deletion: pre-ws.pprof.txt

pprof after creating and then deleting an org workspace 100 times: post-ws.pprof.txt

go tool rendering of pprof:

(image attached)

Steps To Reproduce

  1. Patch kcp to enable pprof: addpprof.patch (a minimal sketch of what such a patch boils down to is shown after this list)
  2. Start kcp in tilt and set the kubeconfig to the kind cluster
  3. port forward pprof port: kubectl port-forward -n kcp-alpha deployment/alpha 6060:6060 &
  4. get a baseline pprof, e.g. curl http://localhost:6060/debug/pprof/goroutine > pre-ws.pprof
  5. create a simple yaml file for an organization workspace: org-ws.yaml.txt
  6. target kcp, and then create and delete the same org 100 times in a loop: for i in {1..100}; do k apply -f org-ws.yaml; sleep 3; k delete -f org-ws.yaml; sleep 3; done
  7. collect pprof again and observe the large number of goroutines.
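
For reference, a minimal sketch of what a pprof-enabling patch typically boils down to (the actual addpprof.patch may differ):

```go
// Hypothetical minimal version of enabling pprof in a Go server binary.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// Serve the pprof endpoints on :6060 in the background so they can be
	// reached via kubectl port-forward as in step 3.
	go func() {
		log.Println(http.ListenAndServe(":6060", nil))
	}()

	// ... start the rest of the server as usual ...
	select {}
}
```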

Expected Behaviour

Fewer goroutines

Additional Context

This looks related to #3016, although that issue has long since been fixed, and it didn't involve creating new workspaces.

As far as I can tell, creating a new workspace causes the creation of listeners that won't get cancelled until the sharedIndexInformer is terminated.

julian-hj avatar Mar 25 '25 21:03 julian-hj

Hi @julian-hj, thanks for reporting this issue in such detail! We'll take a look at fixing this.

embik avatar Mar 26 '25 08:03 embik

@embik Could you please assign the issue to me? Thanks!

ntnn avatar Mar 27 '25 10:03 ntnn

/assign @ntnn

Thanks!

embik avatar Mar 27 '25 11:03 embik

Repro was straightforward with the steps mentioned by @julian-hj. Below are some of my observations from the collected pprofs.

There is a significant spike in goroutine count, roughly 4.6x:

  • Pre-workspace creation/deletion: 3,392 goroutines
  • Post-workspace creation/deletion: 15,592 goroutines

Major Contributors to Goroutine Growth, deduced from the pprof:

  1. The informer system shows significant growth, increasing from 285 (8.40%) to 6,385 (40.95%):
github.com/kcp-dev/apimachinery/v2/third_party/informers.(*processorListener).pop
github.com/kcp-dev/apimachinery/v2/third_party/informers.(*processorListener).run
github.com/kcp-dev/apimachinery/v2/third_party/informers.(*processorListener).run.func1
  2. The wait group & backoff utilities show substantial growth:
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1 increased from 726 (21.40%) to 12,926 (82.90%)
k8s.io/apimachinery/pkg/util/wait.BackoffUntil increased from 756 (22.29%) to 6,856 (43.97%)
k8s.io/apimachinery/pkg/util/wait.JitterUntil increased from 680 (20.05%) to 6,780 (43.48%)
  3. Channel operations show significant growth:
runtime.chanrecv increased from 715 (21.08%) to 6,815 (43.71%)
runtime.selectgo increased from 2,074 (61.14%) to 8,176 (52.44%)

sanjimoh avatar Mar 27 '25 12:03 sanjimoh

The core issue is that the controllers started for workspaces register with the shared informers. The goroutines created by these registrations are not context-aware, so they are not cancelled when the workspace is deleted and its controllers are stopped via context cancellation.

The controllers started for workspaces would have to de-register themselves from the informers for the goroutines to be closed correctly.
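
For illustration, here is a minimal sketch of the kind of cleanup that is missing, using client-go's handler registration API (AddEventHandler returning a registration handle plus RemoveEventHandler, available since client-go v0.26; the kcp third_party informer fork may expose a slightly different interface):

```go
package example

import (
	"context"

	"k8s.io/client-go/tools/cache"
)

// runWithInformer registers an event handler on a long-lived shared informer
// and de-registers it when the controller's context is cancelled. Without the
// removal, the processorListener goroutines backing the handler live as long
// as the informer itself, which is exactly the leak described above.
func runWithInformer(ctx context.Context, informer cache.SharedIndexInformer, handler cache.ResourceEventHandler) error {
	reg, err := informer.AddEventHandler(handler)
	if err != nil {
		return err
	}
	defer func() {
		// Clean up the registration when the workspace controller stops.
		_ = informer.RemoveEventHandler(reg)
	}()

	<-ctx.Done()
	return nil
}
```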

I confirmed this by labeling the goroutines of the different controllers, e.g. the kcp-kubequota controller starts a resource controller here: https://github.com/kcp-dev/kcp/blob/e4f61b1f56905973b14596a02da35c4f755e7504/pkg/reconciler/kubequota/kubequota_controller.go#L280

That call leads here: https://github.com/kubernetes/kubernetes/blob/ef7d51e5ba77004cb95dc2127ed90f8bbfdabc72/pkg/controller/resourcequota/resource_quota_controller.go#L106

which then registers event handlers on the provided informer here: https://github.com/kubernetes/kubernetes/blob/ef7d51e5ba77004cb95dc2127ed90f8bbfdabc72/pkg/controller/resourcequota/resource_quota_controller.go#L128-L157

However, this registration is never cleaned up; the removal should be part of the defer stack in the .Run method: https://github.com/kubernetes/kubernetes/blob/ef7d51e5ba77004cb95dc2127ed90f8bbfdabc72/pkg/controller/resourcequota/resource_quota_controller.go#L293-L296

Compare: https://github.com/ntnn/kubernetes/commit/416e466ee01c46cb27fdea5fdb21f95a1f96ad0d
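
For reference, the goroutine labeling mentioned above can be done with runtime/pprof; a minimal sketch (not the actual debugging patch, and the label keys are just examples) might look like this:

```go
package example

import (
	"context"
	"runtime/pprof"
)

// startLabeled launches fn in a goroutine carrying pprof labels, so that any
// goroutines it leaks show up with these labels in the goroutine profile, as
// in the dumps below.
func startLabeled(ctx context.Context, controller, workspace string, fn func(context.Context)) {
	labels := pprof.Labels(
		"controller", controller,
		"workspace", workspace,
	)
	go pprof.Do(ctx, labels, fn)
}
```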

And this is just one leak - the kcp-kubequota is actually leaking three goroutines; the other two come from the initial sync:

29 @ 0x102437fb8 0x1023cc274 0x1023cbe44 0x1038f6b04 0x1029d4460 0x1029d4330 0x1029d4260 0x1038f6aa0 0x1038f6a6d 0x1029d7818 0x102440c94
# labels: {"comment":"resourceQuotaController initial monitor sync", "controller":"kcp-kube-quota", "subcontroller":"kubequota", "workspace":"2j5lsf4ydtnftsi3"}
#	0x1038f6b03	github.com/kcp-dev/apimachinery/v2/third_party/informers.(*processorListener).run.func1+0x43	/Users/I567861/sap/kcp/playground/apimachinery/third_party/informers/shared_informer.go:716
#	0x1029d445f	k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1+0x3f					/Users/I567861/sap/kcp/playground/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:226
#	0x1029d432f	k8s.io/apimachinery/pkg/util/wait.BackoffUntil+0x8f						/Users/I567861/sap/kcp/playground/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:227
#	0x1029d425f	k8s.io/apimachinery/pkg/util/wait.JitterUntil+0x7f						/Users/I567861/sap/kcp/playground/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:204
#	0x1038f6a9f	k8s.io/apimachinery/pkg/util/wait.Until+0x5f							/Users/I567861/sap/kcp/playground/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:161
#	0x1038f6a6c	github.com/kcp-dev/apimachinery/v2/third_party/informers.(*processorListener).run+0x2c		/Users/I567861/sap/kcp/playground/apimachinery/third_party/informers/shared_informer.go:715
#	0x1029d7817	k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1+0x57					/Users/I567861/sap/kcp/playground/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:72


29 @ 0x102437fb8 0x102413b5c 0x1038f68a4 0x1029d7818 0x102440c94
# labels: {"comment":"resourceQuotaController initial monitor sync", "controller":"kcp-kube-quota", "subcontroller":"kubequota", "workspace":"2j5lsf4ydtnftsi3"}
#	0x1038f68a3	github.com/kcp-dev/apimachinery/v2/third_party/informers.(*processorListener).pop+0xd3	/Users/I567861/sap/kcp/playground/apimachinery/third_party/informers/shared_informer.go:686
#	0x1029d7817	k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1+0x57				/Users/I567861/sap/kcp/playground/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:72

It's always 61 leaked goroutines per workspace, so I'd hazard a guess and say that this affects every kubernetes controller started by kcp controllers.

ntnn avatar Apr 02 '25 08:04 ntnn

The leftover leaks in kcp-kubequota come from basically the same thing - controllers being attached to informers without deregistering when they end: https://github.com/kubernetes/kubernetes/blob/ef7d51e5ba77004cb95dc2127ed90f8bbfdabc72/pkg/controller/resourcequota/resource_quota_monitor.go#L174

Compare: https://github.com/ntnn/kubernetes/commit/8adb6cf17ff8a1f0740b69cec17e8f5caee8adae

The best approach would probably be to fix all leftover event handler registrations.

ntnn avatar Apr 02 '25 12:04 ntnn

It seems that this actually was (almost) all the leaking goroutines:

(image attached)

ntnn avatar Apr 04 '25 14:04 ntnn

To reproduce, use the branch kcp3350 of my kcp fork: https://github.com/ntnn/kcp/tree/kcp3350 and rewrite k8s.io/kubernetes to the branch kcp3350-kcp of my kubernetes fork: https://github.com/ntnn/kubernetes/tree/kcp3350-kcp

I'm gonna take a closer look at the pprof diff next week.

ntnn avatar Apr 04 '25 14:04 ntnn

Found another bug in client-go's workqueue. When a workqueue is shut down (without draining), the .Get method actually keeps returning elements to drain the queue:

https://github.com/ntnn/kubernetes/commit/0a59de42c9680b53dfe778fec15546b913242514

I noticed that while working on the tests for the controller runtimes, because goroutines kept popping up there even though the queue and workers should have been stopped.

It's not directly related to this issue, but it could explain some of the goroutines still popping up.
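
For context, a small standalone snippet illustrating the behaviour described above (behaviour as described here; it may differ between client-go versions):

```go
package main

import (
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	q := workqueue.New()
	q.Add("a")
	q.Add("b")

	// Shut down without draining. Get still hands out the queued items and
	// only reports shutdown once the queue is empty.
	q.ShutDown()

	for {
		item, shutdown := q.Get()
		if shutdown {
			fmt.Println("queue reports shutdown")
			return
		}
		fmt.Println("got item after ShutDown:", item)
		q.Done(item)
	}
}
```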

ntnn avatar Apr 07 '25 09:04 ntnn

In addition to the fixes for upstream, I'm thinking of maybe adding a WorkspacedSharedInformer that registers its own handlers with the SharedInformer and maintains the handlers of the workspace controllers separately. If this informer's context is cancelled, the goroutines would be eliminated as well. Just an idea though; I'm gonna look into implementing that tomorrow.
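
A rough sketch of one possible shape for that idea, using client-go's registration handles (type and function names here are hypothetical, not an actual implementation, and the kcp third_party informer API may differ):

```go
package example

import (
	"context"
	"sync"

	"k8s.io/client-go/tools/cache"
)

// workspacedInformer wraps a long-lived shared informer and tracks every
// handler registered by per-workspace controllers so it can remove them all
// when the workspace context is cancelled.
type workspacedInformer struct {
	cache.SharedIndexInformer

	mu            sync.Mutex
	registrations []cache.ResourceEventHandlerRegistration
}

func newWorkspacedInformer(ctx context.Context, inf cache.SharedIndexInformer) *workspacedInformer {
	w := &workspacedInformer{SharedIndexInformer: inf}
	go func() {
		<-ctx.Done()
		w.mu.Lock()
		defer w.mu.Unlock()
		// Drop every handler this workspace registered on the shared informer.
		for _, reg := range w.registrations {
			_ = inf.RemoveEventHandler(reg)
		}
		w.registrations = nil
	}()
	return w
}

// AddEventHandler delegates to the shared informer but remembers the
// registration handle so it can be cleaned up later.
func (w *workspacedInformer) AddEventHandler(h cache.ResourceEventHandler) (cache.ResourceEventHandlerRegistration, error) {
	reg, err := w.SharedIndexInformer.AddEventHandler(h)
	if err != nil {
		return nil, err
	}
	w.mu.Lock()
	w.registrations = append(w.registrations, reg)
	w.mu.Unlock()
	return reg, nil
}
```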

Another thing is a general test for leaking goroutines, to catch this whenever it happens in the future. In combination with the WorkspacedSharedInformer, this could check both for any leaking goroutines and for how many leftover handlers the workspace informer cleaned up.

ntnn avatar Apr 08 '25 16:04 ntnn

It is interesting, so the controllers in kcp have a lifecycle where they can be started, stopped, and restarted again? That is what leads to the goroutine leak, whereas in kube the controllers start once and, when they stop, they stop for good.

aojea avatar Apr 09 '25 08:04 aojea

Exactly - and it doesn't show up there because the leaked goroutines we are experiencing come from the informer, which for us lives longer than those controllers, so the handlers keep running. In kube the informer is shut down at the same time as everything else, so the leaks don't show up.

This leakage would also occur if kube started restarting controllers (and it partially does so for QuotaMonitors, although that only happens when resources are removed).

ntnn avatar Apr 09 '25 08:04 ntnn

Amazing work @ntnn!

stevekuznetsov avatar Apr 10 '25 00:04 stevekuznetsov

Blocked by kcp-dev/client-go#47. After that update is done, ClusterWithContext can be implemented to prevent the leaks.

ntnn avatar Jun 30 '25 14:06 ntnn