
Goroutine leak from controller.(*groupCostController).handleTokenBucketUpdateEvent

yzhan1 opened this issue 3 months ago · 4 comments

Bug Report

We noticed an issue where resource group (RG) goroutines don't get cleaned up properly when the traffic ramps down:

goroutine 5042535512 [select, 1540 minutes]:
github.com/tikv/pd/client/resource_group/controller.(*groupCostController).handleTokenBucketUpdateEvent(0xc023ad4000?, {0x679abb0, 0xc002803040}, {0xc1d66f62a0?, 0xc0e3235560?, 0x9662f80?})
      .../resource_group/controller/controller.go:720 +0x1ea
created by github.com/tikv/pd/client/resource_group/controller.(*ResourceGroupsController).Start.func1 in goroutine 1064
      .../resource_group/controller/controller.go:251 +0x906

We saw thousands of tikv:8252 errors during that window. Every time this happens, the TiDB server's goroutine count goes up by ~100K, and all of those goroutines are blocked on the code above. This leads to high CPU usage on the TiDB node.

What did you do?

  1. Create a resource group with a minimal quota (RU_PER_SEC=1)
  2. Send 10K QPS using this resource group

What did you expect to see?

The TiDB server's goroutine count should be stable. When the traffic ramps down, it should recover to a lower level.

What did you see instead?

A lot of goroutines from the resource group controller package stuck around and didn't get cleaned up.

What version of PD are you using (pd-server -V)?

The trace above is from v7.1.0, but the same issue can be reproduced on our v8.5.2 cluster.

yzhan1 · Sep 16 '25 19:09

The issue seems to be that:

  1. Whenever a new response comes in for an RU request, https://github.com/tikv/pd/blob/53b1fbf76a14527471295b8af84c85222a1be65c/client/resource_group/controller/controller.go#L1190 is executed, which initializes/resets the timer and channel for that request and its subsequent GC goroutine in https://github.com/tikv/pd/blob/53b1fbf76a14527471295b8af84c85222a1be65c/client/resource_group/controller/controller.go#L1240 and https://github.com/tikv/pd/blob/53b1fbf76a14527471295b8af84c85222a1be65c/client/resource_group/controller/controller.go#L1270
  2. Later, the groupCostController itself gets pushed into a channel so that the GC goroutines are created: https://github.com/tikv/pd/blob/53b1fbf76a14527471295b8af84c85222a1be65c/client/resource_group/controller/controller.go#L1277
  3. Because the resource group is short of quota, TiDB keeps asking PD for RUs and sends many more RU requests. Subsequent RU responses come in and go through modifyTokenCounter and initCounterNotify again, where initCounterNotify stops the timer held by the GC goroutines created in step 2: https://github.com/tikv/pd/blob/53b1fbf76a14527471295b8af84c85222a1be65c/client/resource_group/controller/controller.go#L1289
  4. Timer.Stop does not close or signal the timer's channel; it only prevents the timer from firing: https://pkg.go.dev/time#Timer.Stop
  5. As a result, the GC goroutines are blocked receiving from a channel that will never be signaled (see the minimal sketch after this list): https://github.com/tikv/pd/blob/53b1fbf76a14527471295b8af84c85222a1be65c/client/resource_group/controller/controller.go#L1070
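
To make steps 2-5 concrete, here is a minimal, self-contained sketch of the leak pattern (not the actual PD code): a goroutine blocks receiving from a timer's channel, and a later Timer.Stop leaves it blocked forever, because Stop only prevents the timer from firing and never closes or signals the channel.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	timer := time.NewTimer(time.Hour)

	// The "GC goroutine" from step 2: it blocks until the timer fires.
	go func() {
		<-timer.C
		fmt.Println("timer fired") // never reached once the timer is stopped
	}()

	// A later RU response (step 3) resets the counter and stops the old
	// timer, as initCounterNotify does. Stop neither closes nor signals
	// timer.C, so the goroutine above stays blocked on the receive forever.
	time.Sleep(100 * time.Millisecond)
	timer.Stop()

	time.Sleep(100 * time.Millisecond)
	fmt.Println("goroutines:", runtime.NumGoroutine()) // the blocked goroutine is still counted
}
```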

yzhan1 · Sep 16 '25 20:09

Since a timer's channel is receive-only and cannot be closed by the caller, we could add a separate channel to signal the GC goroutines when the counter is being closed/stopped.
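
A minimal sketch of that idea, assuming a dedicated per-counter stop channel (the names tokenCounterGC, stopCh, and gcLoop are hypothetical, not PD identifiers or the actual patch in the PR below):

```go
package controller // illustrative package name only

import "time"

// tokenCounterGC pairs the GC timer with a stop channel so the GC goroutine
// can always be woken up, even when the timer is stopped instead of fired.
type tokenCounterGC struct {
	timer  *time.Timer
	stopCh chan struct{} // closed when the counter is reset or torn down
}

func newTokenCounterGC(d time.Duration) *tokenCounterGC {
	return &tokenCounterGC{
		timer:  time.NewTimer(d),
		stopCh: make(chan struct{}),
	}
}

// gcLoop waits for either the timer to fire or an explicit stop signal,
// so it can no longer block forever on timer.C after Timer.Stop.
func (g *tokenCounterGC) gcLoop(cleanup func()) {
	select {
	case <-g.timer.C:
		cleanup()
	case <-g.stopCh:
		// The timer was stopped/reset elsewhere; exit instead of leaking.
	}
}

// stop replaces a bare timer.Stop(): it also wakes up any waiting gcLoop.
func (g *tokenCounterGC) stop() {
	g.timer.Stop()
	close(g.stopCh)
}
```

Note that close(stopCh) can only happen once, so if the counter is re-initialized rather than torn down, the reset path would need to recreate the channel (or guard the close with sync.Once).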

@nolouch @rleungx WDYT? Feel free to forward to other reviewers/owners of the module. Thanks!

yzhan1 · Sep 16 '25 20:09

/cc @JmPotato

rleungx · Sep 17 '25 01:09

@JmPotato PTAL https://github.com/tikv/pd/pull/9749 Thanks!

yzhan1 · Sep 17 '25 04:09