controller-runtime

Context cancellation causes a goroutine leak (possibly intermittent)

Open pmalek opened this issue 7 months ago • 3 comments

Problem statement

When running our integration test suite with https://github.com/uber-go/goleak, it sometimes reports a goroutine leak:

 manager_envtest_test.go:84: found unexpected goroutines:
          [Goroutine 12886 in state sync.WaitGroup.Wait, with sync.runtime_SemacquireWaitGroup on top of the stack:
          sync.runtime_SemacquireWaitGroup(0xc000cdc830?)
          	/home/runner/go/pkg/mod/golang.org/[email protected]/src/runtime/sema.go:110 +0x25
          sync.(*WaitGroup).Wait(0xc000cdc830)
          	/home/runner/go/pkg/mod/golang.org/[email protected]/src/sync/waitgroup.go:118 +0x89
          sigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).engageStopProcedure.func3.(*runnableGroup).StopAndWait.3.2()
          	/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/runnable_group.go:307 +0x99
          created by sigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).engageStopProcedure.func3.(*runnableGroup).StopAndWait.3 in goroutine 12666
          	/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/runnable_group.go:304 +0x213
           Goroutine 12791 in state chan send, with sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1 on top of the stack:
          sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1(0xc00115d940)
          	/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/runnable_group.go:227 +0x1f8
          created by sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile in goroutine 12615
          	/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/runnable_group.go:210 +0x245
           Goroutine 12838 in state chan send, with sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1 on top of the stack:
          sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1(0xc00115dd00)
          	/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/runnable_group.go:227 +0x1f8
          created by sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile in goroutine 12615
          	/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/runnable_group.go:210 +0x245
          ]

After a quick glance at the code, it seems that controllerManager passes its errChan to its runnableGroups.

When runnables fail, they propagate the error through that error channel in https://github.com/kubernetes-sigs/controller-runtime/blob/dcb8a6a8d7ed700d4590f11fac171f766f54ea3c/pkg/manager/runnable_group.go#L227.

If I'm reading this correctly, when the context is done, the errChan is not drained in https://github.com/kubernetes-sigs/controller-runtime/blob/5af1f3ebd472b62a4a708ad3aa2d252489b91b27/pkg/manager/internal.go#L467-L469, and thus the runnable groups cannot send their errors in https://github.com/kubernetes-sigs/controller-runtime/blob/dcb8a6a8d7ed700d4590f11fac171f766f54ea3c/pkg/manager/runnable_group.go#L224.

Proposed solution

Drain cm.errChan in the `<-ctx.Done()` branch at https://github.com/kubernetes-sigs/controller-runtime/blob/5af1f3ebd472b62a4a708ad3aa2d252489b91b27/pkg/manager/internal.go#L466-L473

pmalek avatar May 15 '25 10:05 pmalek

Hi @pmalek, are you still seeing this issue? Which controller-runtime version are you using? I did some digging into the codebase, and it looks to me like the errChan is properly drained here: https://github.com/kubernetes-sigs/controller-runtime/blob/b9a9ca01fd37a66225eb4793544349ee2ed1f196/pkg/manager/internal.go#L510

jingyih avatar Jun 26 '25 01:06 jingyih

I've moved on from this problem and added an exclusion for controller-runtime goroutines in the goroutine leak test: https://github.com/Kong/kubernetes-ingress-controller/blob/a32efd26867e255df611aa156ed9fcda291449f2/test/envtest/manager_envtest_test.go#L84-L87.

As you can see in the snippet that I've added in the description, at that time I was using v0.20.4.

Just a thought:

Would it be possible for errors to be sent on cm.errChan after stopComplete gets closed and the select at https://github.com/kubernetes-sigs/controller-runtime/blob/b9a9ca01fd37a66225eb4793544349ee2ed1f196/pkg/manager/internal.go#L514 returns?

pmalek avatar Jun 26 '25 14:06 pmalek

> Would it be possible to receive errors on cm.errChan after stopComplete gets closed and returns in `case <-stopComplete:` (controller-runtime/pkg/manager/internal.go, line 514 at b9a9ca0)?

Good pointer. I think it is possible that stopComplete closes before all runnables stop. After taking a closer look, the race depends on the GracefulShutdownTimeout (which defaults to 30 seconds) and on how long the runnable takes to return an error after being signaled to stop. Did you set a custom value for GracefulShutdownTimeout?

https://github.com/kubernetes-sigs/controller-runtime/blob/cacd627d0b9bc46b592cf5f96f57f1308313618e/pkg/manager/manager.go#L270

jingyih avatar Jun 26 '25 21:06 jingyih

We do set it to 0 for the tests where we observed the error: https://github.com/Kong/kubernetes-ingress-controller/blob/dd6538bf66f2e27805786fd6a21a7a93475f0323/test/envtest/run.go#L56-L59

pmalek avatar Jul 25 '25 17:07 pmalek