Context cancellation causes a goroutine leak (possibly intermittent)
Problem statement
When running our integration tests suite with https://github.com/uber-go/goleak, it sometimes reports a goroutine leak in:
manager_envtest_test.go:84: found unexpected goroutines:
[Goroutine 12886 in state sync.WaitGroup.Wait, with sync.runtime_SemacquireWaitGroup on top of the stack:
sync.runtime_SemacquireWaitGroup(0xc000cdc830?)
/home/runner/go/pkg/mod/golang.org/[email protected]/src/runtime/sema.go:110 +0x25
sync.(*WaitGroup).Wait(0xc000cdc830)
/home/runner/go/pkg/mod/golang.org/[email protected]/src/sync/waitgroup.go:118 +0x89
sigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).engageStopProcedure.func3.(*runnableGroup).StopAndWait.3.2()
/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/runnable_group.go:307 +0x99
created by sigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).engageStopProcedure.func3.(*runnableGroup).StopAndWait.3 in goroutine 12666
/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/runnable_group.go:304 +0x213
Goroutine 12791 in state chan send, with sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1 on top of the stack:
sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1(0xc00115d940)
/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/runnable_group.go:227 +0x1f8
created by sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile in goroutine 12615
/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/runnable_group.go:210 +0x245
Goroutine 12838 in state chan send, with sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1 on top of the stack:
sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1(0xc00115dd00)
/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/runnable_group.go:227 +0x1f8
created by sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile in goroutine 12615
/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/runnable_group.go:210 +0x245
]
After quick glance at the code, it would seem that controllerManager passes its errChan to its runnableGroups.
When runnables fail, they propagate the error through that error channel in https://github.com/kubernetes-sigs/controller-runtime/blob/dcb8a6a8d7ed700d4590f11fac171f766f54ea3c/pkg/manager/runnable_group.go#L227.
If I'm looking at this correctly, when context is done, the errChan is not drained in https://github.com/kubernetes-sigs/controller-runtime/blob/5af1f3ebd472b62a4a708ad3aa2d252489b91b27/pkg/manager/internal.go#L467-L469 and thus runnable groups cannot send their errors in https://github.com/kubernetes-sigs/controller-runtime/blob/dcb8a6a8d7ed700d4590f11fac171f766f54ea3c/pkg/manager/runnable_group.go#L224.
Proposed solution
Drain the cm.errChan in case <-ctx.Done() branch at https://github.com/kubernetes-sigs/controller-runtime/blob/5af1f3ebd472b62a4a708ad3aa2d252489b91b27/pkg/manager/internal.go#L466-L473
Hi @pmalek, are you still seeing this issue? Which controller-runtime version are you using? I did some digging into the codebase. And it looks to me that the errChan is properly drained here: https://github.com/kubernetes-sigs/controller-runtime/blob/b9a9ca01fd37a66225eb4793544349ee2ed1f196/pkg/manager/internal.go#L510
I've moved on from this problem and added an exclusion for controller-runtime goroutines in the goroutine leak test: https://github.com/Kong/kubernetes-ingress-controller/blob/a32efd26867e255df611aa156ed9fcda291449f2/test/envtest/manager_envtest_test.go#L84-L87.
As you can see in the snippet that I've added in the description, at that time I was using v0.20.4.
Just a thought:
Would it be possible to receive errors on cm.errChan after stopComplete gets closed and returns in https://github.com/kubernetes-sigs/controller-runtime/blob/b9a9ca01fd37a66225eb4793544349ee2ed1f196/pkg/manager/internal.go#L514?
good question
I've moved on from this problem and added an exclusion for controller-runtime goroutines in the goroutine leak test: https://github.com/Kong/kubernetes-ingress-controller/blob/a32efd26867e255df611aa156ed9fcda291449f2/test/envtest/manager_envtest_test.go#L84-L87.
As you can see in the snippet that I've added in the description, at that time I was using v0.20.4.
Just a thought:
Would it be possible to receive errors on
cm.errChanafterstopCompletegets closed and returns incontroller-runtime/pkg/manager/internal.go
Line 514 in b9a9ca0
case <-stopComplete: ?
Good pointer. I think it is possible that stopComplete closes before all runnables stop. After taking a closer look, the race depends on the GracefulShutdownTimeout—which defaults to 30 seconds—and how long the runnable takes to return an error after being signaled to stop. Did you set a custom value for GracefulShutdownTimeout?
https://github.com/kubernetes-sigs/controller-runtime/blob/cacd627d0b9bc46b592cf5f96f57f1308313618e/pkg/manager/manager.go#L270
We do set it to 0 for the tests where we observed the error: https://github.com/Kong/kubernetes-ingress-controller/blob/dd6538bf66f2e27805786fd6a21a7a93475f0323/test/envtest/run.go#L56-L59