
Support data affinity placement scheduling for performance optimization

lhy1024 opened this issue 3 months ago · 2 comments

Development Task

Background

As a distributed database, TiDB inherently incurs network overhead that is not present in single-node databases. For many workloads, data from the same table or partition is frequently accessed together. The distribution of this data across different TiKV nodes can lead to significant cross-node communication for query execution, increasing latency and consuming more resources.

In our real-world load testing, we have observed that by ensuring data and indexes for the same table or partition are consolidated within a single Region, we can achieve substantial performance gains. Specifically, this co-location leads to higher QPS, lower query latency, and a notable reduction in CPU and network I/O resource consumption.

This feature will unlock key optimization opportunities, such as:

  • More Read and Write Push Down
  • More 1PC txn requests

Demo Test Result

Metric               | index lookup pushdown | data affinity 1PC | Both
read queries         | +7.59%                | +7.12%            | +18.41%
write queries        | +7.59%                | +7.12%            | +18.41%
avg latency (ms)     | -7.05%                | -6.64%            | -15.54%
95th percentile (ms) | -5.26%                | -5.26%            | -14.96%


Plan

We would like to introduce a mechanism to enforce region affinity for specific tables or partitions. This feature would instruct PD to schedule and merge all regions within a specified key range (e.g., a single table's data and its indexes), and to prevent them from being split, merged with, or scheduled alongside regions outside that range under normal circumstances.

This can be controlled via a dedicated label, for example:

  • A generic affinity label: affinity = true

  • Or a set of more granular labels for fine-grained control:

    • auto_split = deny
    • schedule = deny
    • merge = deny
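
As a rough illustration of how such labels could be attached to a key range, the sketch below is loosely inspired by PD's region label rules; the struct shape, field names, and key boundaries are placeholders rather than a finalized API:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// AffinityLabelRule is a hypothetical rule shape: a set of labels attached to
// one or more key ranges that PD would consult when making split/schedule/merge
// decisions. The field names are illustrative only.
type AffinityLabelRule struct {
	ID        string            `json:"id"`
	Labels    map[string]string `json:"labels"`
	KeyRanges []KeyRange        `json:"key_ranges"`
}

// KeyRange covers a table's (or partition's) data and index key space.
// Keys are hex-encoded here, as PD usually expects for key ranges.
type KeyRange struct {
	StartKey string `json:"start_key"`
	EndKey   string `json:"end_key"`
}

func main() {
	// Example: keep a hypothetical table's records and indexes together.
	// The key boundaries below are placeholders, not real TiDB-encoded keys.
	rule := AffinityLabelRule{
		ID: "affinity/table-100",
		Labels: map[string]string{
			"auto_split": "deny",
			"schedule":   "deny",
			"merge":      "deny",
		},
		KeyRanges: []KeyRange{
			{StartKey: "7480000000000000ff64", EndKey: "7480000000000000ff65"},
		},
	}
	out, _ := json.MarshalIndent(rule, "", "  ")
	fmt.Println(string(out))
}
```

Whether this ends up as a single affinity = true label or the granular trio above, the essential point is that PD can map a key range to deny/allow decisions for split, schedule, and merge.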

TiKV

We need to disable split-by-size and load-based split for regions within the target key range.

  1. Reason for Split Request: Add a reason field to the AskBatchSplitRequest. This allows PD to identify the source of the split request. If a request originates from a region where splitting is disabled, PD can reject it.

  2. Region ID Cache: Implement a cache for region IDs that have splitting disabled. This cache can have a capacity (e.g., 10,000 entries) and a TTL (e.g., 10 minutes) and would be populated and updated via PD heartbeats. We need to add a ChangeAutoSplit directive to the RegionHeartbeatResponse.

  3. Prevent "Super-Huge" Regions: To avoid uncontrolled growth, introduce a configurable coefficient. If a region grows beyond a certain threshold (e.g., 10 times the configured region-max-size), it will be unconditionally split to maintain stability. A sketch covering this safeguard together with the cache from item 2 follows this list.
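
TiKV itself is written in Rust; the Go sketch below only illustrates the intended logic of items 2 and 3 under stated assumptions, with all type and function names invented for this example: a size- and TTL-bounded cache of regions whose auto-split is denied (refreshed from PD heartbeats), plus an unconditional split once a region exceeds a coefficient of region-max-size.

```go
package affinity

import (
	"sync"
	"time"
)

// splitDenyCache remembers which regions currently have auto-split disabled.
// It is bounded in size and its entries expire after a TTL, so stale state left
// behind by removed affinity rules eventually disappears. In the real feature it
// would be refreshed from PD heartbeats (e.g. a ChangeAutoSplit directive in
// RegionHeartbeatResponse).
type splitDenyCache struct {
	mu       sync.Mutex
	ttl      time.Duration
	capacity int
	entries  map[uint64]time.Time // region ID -> expiry time
}

func newSplitDenyCache(capacity int, ttl time.Duration) *splitDenyCache {
	return &splitDenyCache{
		ttl:      ttl,
		capacity: capacity,
		entries:  make(map[uint64]time.Time),
	}
}

// MarkDenied records (or refreshes) a region whose auto-split is disabled.
func (c *splitDenyCache) MarkDenied(regionID uint64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if len(c.entries) >= c.capacity {
		// Best-effort eviction of expired entries; in this simplified sketch the
		// new entry is accepted even if the cache is still full afterwards.
		now := time.Now()
		for id, exp := range c.entries {
			if now.After(exp) {
				delete(c.entries, id)
			}
		}
	}
	c.entries[regionID] = time.Now().Add(c.ttl)
}

// SplitDenied reports whether auto-split is currently disabled for the region.
func (c *splitDenyCache) SplitDenied(regionID uint64) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	exp, ok := c.entries[regionID]
	if !ok || time.Now().After(exp) {
		delete(c.entries, regionID)
		return false
	}
	return true
}

// shouldForceSplit implements the "super-huge region" safeguard: even when
// auto-split is denied, a region that has grown past coefficient*regionMaxSize
// is split unconditionally to keep the cluster stable.
func shouldForceSplit(regionSize, regionMaxSize, coefficient uint64, denied bool) bool {
	if !denied {
		return regionSize >= regionMaxSize
	}
	return regionSize >= coefficient*regionMaxSize
}
```

Relying on a TTL rather than explicit invalidation keeps TiKV's view eventually consistent with PD even if a heartbeat carrying the directive is lost or an affinity rule is removed.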

PD

  1. One-Time Data Migration: For existing tables or partitions where affinity is newly enabled, PD must initiate a one-time task. This task will be responsible for splitting, scheduling, and merging the relevant regions to consolidate them. This could be implemented as a new command in pd-ctl or by reusing existing tools.

  2. Exclude from Standard Scheduling: We need to add a scheduling filter for these key ranges. The key ranges managed by the affinity feature must be exempted from certain standard scheduling and balancing operations (a sketch of such a filter follows this list), including:

  • balance-region-scheduler
  • hot-region-scheduler
  • Crucially, the merge-checker must be prevented from merging these affinity-managed regions with regions from other tables or key ranges. Otherwise, once a region that does not belong to an affinity group is merged into an affinity-managed range, it can no longer be split back out or scheduled independently, because splitting and scheduling are disabled for that range.
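
A minimal sketch of what this exemption could look like, with illustrative names rather than PD's actual filter and checker interfaces: schedulers skip any region whose key range falls inside an affinity range, and the merge checker only merges two regions when they belong to the same affinity group (or neither belongs to any).

```go
package affinity

import "bytes"

// Range is a half-open [StartKey, EndKey) key range managed by the affinity feature.
type Range struct {
	GroupID  string
	StartKey []byte
	EndKey   []byte // empty means "up to the end of the key space"
}

// contains reports whether the region [start, end) lies entirely inside r.
func (r Range) contains(start, end []byte) bool {
	if bytes.Compare(start, r.StartKey) < 0 {
		return false
	}
	if len(r.EndKey) == 0 {
		return true
	}
	return len(end) > 0 && bytes.Compare(end, r.EndKey) <= 0
}

// Filter holds all affinity-managed ranges and answers the two questions the
// schedulers and the merge checker need to ask before acting on a region.
type Filter struct {
	ranges []Range
}

// groupOf returns the affinity group a region belongs to, or "" if none.
func (f *Filter) groupOf(start, end []byte) string {
	for _, r := range f.ranges {
		if r.contains(start, end) {
			return r.GroupID
		}
	}
	return ""
}

// SkipScheduling tells balance-region-scheduler and hot-region-scheduler to
// leave affinity-managed regions alone.
func (f *Filter) SkipScheduling(start, end []byte) bool {
	return f.groupOf(start, end) != ""
}

// AllowMerge permits a merge only when both regions belong to the same affinity
// group, or neither belongs to any; this keeps foreign regions from being
// absorbed into a range they could never be split back out of.
func (f *Filter) AllowMerge(aStart, aEnd, bStart, bEnd []byte) bool {
	return f.groupOf(aStart, aEnd) == f.groupOf(bStart, bEnd)
}
```

Expressing the merge rule as "same group or no group" also covers the normal case, so regions outside any affinity range keep merging exactly as before.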

Observability  

To ensure the feature is manageable and its impact is understood, some metrics should be introduced in a future phase.

lhy1024 · Sep 22 '25 05:09

This will work only for tables which fit in a single region along with the index, right?

Tema · Sep 22 '25 16:09

> This will work only for tables which fit in a single region along with the index, right?

The goal is not only to force a table (or partition) into a single, massive region. Additionally, the goal is to ensure that all regions belonging to a specific table (or partition) are scheduled onto the same store.

While the feature prevents auto splitting to keep the regions from being unnecessarily fragmented, it also includes a safeguard to split a "super-huge" region if it grows beyond a safe threshold.

lhy1024 · Sep 23 '25 03:09