Goroutine leak from controller.(*groupCostController).handleTokenBucketUpdateEvent
Bug Report
We noticed an issue where RG goroutines don't get cleaned up properly when the traffic ramps down:
goroutine 5042535512 [select, 1540 minutes]:
github.com/tikv/pd/client/resource_group/controller.(*groupCostController).handleTokenBucketUpdateEvent(0xc023ad4000?, {0x679abb0, 0xc002803040}, {0xc1d66f62a0?, 0xc0e3235560?, 0x9662f80?})
.../resource_group/controller/controller.go:720 +0x1ea
created by github.com/tikv/pd/client/resource_group/controller.(*ResourceGroupsController).Start.func1 in goroutine 1064
.../resource_group/controller/controller.go:251 +0x906
We saw thousands of tikv:8252 errors during that window, and every time this happens the TiDB server's goroutine count goes up by ~100K, with all of those goroutines blocked on the code above. This leads to high CPU usage on the TiDB node.
What did you do?
- Create an RG group with minimal quota (`RU_PER_SEC=1`)
- Send 10K QPS using this RG group
What did you expect to see?
The TiDB server's goroutine count should be stable. When the traffic ramps down, it should recover to a lower level.
What did you see instead?
A lot of goroutines from the RG package stuck around and didn't get cleaned up.
What version of PD are you using (pd-server -V)?
The trace above is from v7.1.0, but the same issue can be reproduced on our v8.5.2 cluster.
The issue seems to be that:
- Whenever a new response comes in for an RU request, https://github.com/tikv/pd/blob/53b1fbf76a14527471295b8af84c85222a1be65c/client/resource_group/controller/controller.go#L1190 is executed, which inits/resets the timer and channel for that request and its subsequent GC goroutine in https://github.com/tikv/pd/blob/53b1fbf76a14527471295b8af84c85222a1be65c/client/resource_group/controller/controller.go#L1240 and https://github.com/tikv/pd/blob/53b1fbf76a14527471295b8af84c85222a1be65c/client/resource_group/controller/controller.go#L1270
- Later the `groupCostController` itself gets pushed into a channel to create GC goroutines: https://github.com/tikv/pd/blob/53b1fbf76a14527471295b8af84c85222a1be65c/client/resource_group/controller/controller.go#L1277
- Because the RG is short of quota, TiDB keeps asking PD for RUs and sends many more RU requests. Subsequent RU responses go through `modifyTokenCounter` and `initCounterNotify` again, where `initCounterNotify` stops the timer held by the GC goroutines created in step 2: https://github.com/tikv/pd/blob/53b1fbf76a14527471295b8af84c85222a1be65c/client/resource_group/controller/controller.go#L1289. `Timer.Stop` does not close or signal the timer's channel: https://pkg.go.dev/time#Timer.Stop
- As a result, the GC goroutines are blocked on https://github.com/tikv/pd/blob/53b1fbf76a14527471295b8af84c85222a1be65c/client/resource_group/controller/controller.go#L1070, which will never be signaled. A minimal standalone demo of this pattern follows the list.
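To make the mechanism concrete, here is a minimal, self-contained Go sketch (not PD code; `gcLoop` and the loop count are made up for illustration) showing that a goroutine blocked on a timer's channel is never released once another goroutine calls `Stop()`:

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// gcLoop mimics the leaked GC goroutine: it blocks until the timer fires.
// Once the timer has been stopped, this receive can never complete.
func gcLoop(t *time.Timer) {
	<-t.C
	fmt.Println("gc fired")
}

func main() {
	before := runtime.NumGoroutine()

	// Simulate repeated RU responses: each one arms a timer, spawns a GC
	// goroutine waiting on it, and a later response stops the timer again.
	for i := 0; i < 1000; i++ {
		t := time.NewTimer(time.Hour)
		go gcLoop(t)
		t.Stop() // Stop neither closes nor sends on t.C, so gcLoop stays parked
	}

	time.Sleep(100 * time.Millisecond)
	fmt.Printf("goroutines: before=%d after=%d\n", before, runtime.NumGoroutine())
	// Prints roughly 1000 extra goroutines, all parked on <-t.C, analogous to
	// the blocked GC goroutines described above.
}
```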
Given that the timer's channel is receive-only, we could add a separate channel to signal the GC goroutines when things are closed/stopped, as sketched below.
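A rough sketch of that idea, using hypothetical names (`counterNotify`, `stopCh`, `gcLoop`) rather than the actual controller fields: pair each timer with a dedicated stop channel and have the GC goroutine select on both, so whoever stops the timer can also release the waiter.

```go
package main

import (
	"fmt"
	"time"
)

// counterNotify is a hypothetical stand-in for the per-counter timer state;
// the real fields and names in groupCostController differ.
type counterNotify struct {
	timer  *time.Timer
	stopCh chan struct{} // closed exactly once when the timer is abandoned
}

// gcLoop waits for either the timer to fire or an explicit stop signal,
// so it can always exit instead of blocking on timer.C forever.
func gcLoop(n *counterNotify) {
	select {
	case <-n.timer.C:
		fmt.Println("gc fired, cleaning up")
	case <-n.stopCh:
		fmt.Println("timer stopped, gc goroutine exits")
	}
}

// stop replaces a bare timer.Stop(): it also closes stopCh so any goroutine
// parked in gcLoop is released.
func (n *counterNotify) stop() {
	n.timer.Stop()
	close(n.stopCh)
}

func main() {
	n := &counterNotify{
		timer:  time.NewTimer(time.Hour),
		stopCh: make(chan struct{}),
	}
	go gcLoop(n)

	// A later RU response stops the old timer; the GC goroutine now exits
	// instead of leaking.
	n.stop()
	time.Sleep(50 * time.Millisecond)
}
```

In the real controller the stop channel would presumably need to be reset alongside the timer in `initCounterNotify`, and closing it must happen at most once (e.g. guarded by `sync.Once`) if the stop path can run more than once.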
@nolouch @rleungx WDYT? Feel free to forward to other reviewers/owners of the module. Thanks!
/cc @JmPotato
@JmPotato PTAL https://github.com/tikv/pd/pull/9749 Thanks!