Adaptive read pool threads

Open mittalrishabh opened this issue 10 months ago • 2 comments

Feature Request

Our unified read pool is configured with a very large thread count (5x the CPU count) because our average EBS latency is in the range of 1 ms. However, this large read pool sometimes leads to CPU bottlenecks in TiKV when there are hot regions. As a result, the blast radius increases and the cluster takes longer to recover, because other threads, such as the raft and resolved-ts threads, slow down. Given the high EBS latency and the use of RocksDB in synchronous mode, we recommend making the system more adaptive by implementing the following changes:

  1. Limit the total CPU utilization of the unified read pool threads through a configuration parameter.
  2. Currently, the unified read pool scales up or down based solely on the CPU utilization of each thread. It should also consider the wait time of tasks to determine when to adjust its size.

This solution would also help enable resource control groups (RCG). RCG fair scheduling in TiKV does not work when TiKV CPU becomes a bottleneck, which impacts all tenants.

Is your feature request related to a problem? Please describe:

Describe the feature you'd like:

Describe alternatives you've considered:

Teachability, Documentation, Adoption, Migration Strategy:

mittalrishabh • May 19 '25 20:05

@mittalrishabh Have you considered async IO for coprocessor reads?

zhangjinpeng87 • May 20 '25 22:05

Are you talking about RocksDB async IO?

mittalrishabh • May 20 '25 23:05