
Enhance the region balance of the 1M tables imported by lightning

Open HuSharp opened this issue 1 year ago • 3 comments

Development Task

Background

  • PD's scatter ensures balance within a single table, but it does not care about table-to-table balance, which relies on the balance scheduler.
  • Balance-region will not schedule empty regions; this is hardcoded in https://github.com/tikv/pd/blob/7e18a69734f2ae391516625ec0230c05b363f5c5/pkg/schedule/filter/region_filters.go#L150-L153
  • When lightning imports tables, each table's region key is encoded from its table ID, and table IDs are allocated consecutively (basically next to each other); see the sketch after this list. https://github.com/tikv/client-go/blob/6ba909c4ad2de65b5b36d0e5036d0a85f3154cc0/tikv/split_region.go#L241-L247
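For illustration, a minimal self-contained Go sketch of why consecutively allocated table IDs turn into adjacent split keys. This is not the real tablecodec/client-go encoding, only an approximation of its byte layout (a prefix byte plus a byte-comparable big-endian integer):

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// encodeTablePrefix builds a key of the form "t{tableID}" with the ID encoded
// as a byte-comparable big-endian integer, similar in spirit to TiDB's
// tablecodec.EncodeTablePrefix (layout approximated here).
func encodeTablePrefix(tableID int64) []byte {
	key := make([]byte, 0, 9)
	key = append(key, 't')
	var buf [8]byte
	// Flip the sign bit so that byte-wise order matches numeric order.
	binary.BigEndian.PutUint64(buf[:], uint64(tableID)^(1<<63))
	return append(key, buf[:]...)
}

func main() {
	// Lightning creates tables back to back, so their IDs are consecutive.
	for id := int64(100); id < 105; id++ {
		fmt.Printf("table %d -> split key %x\n", id, encodeTablePrefix(id))
	}
	// The printed keys differ only in the last byte, i.e. they are adjacent in
	// the key space, so the newly split regions sit next to each other and
	// initially land on the same few stores.
}
```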

Problems faced

When lightning imports 1 million tables (one table corresponds to one region), even though there are more than 3 stores, the consecutive region keys cause a large number of regions to pile up on the first 3 stores. And since these regions are never scheduled away, those three stores have a high probability of OOM.
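For reference, here is a simplified Go approximation of the empty-region check linked in the Background section (pkg/schedule/filter/region_filters.go). Names and thresholds are illustrative rather than PD's exact code; the point is that regions whose approximate size is below a small threshold are skipped by balance-region, so the 1M freshly split, still-empty table regions stay where they were created:

```go
package main

import "fmt"

const (
	// Roughly corresponds to the "empty region" size threshold (in MB).
	emptyRegionApproximateSize = 1
	// Very small clusters are still allowed to balance their regions.
	initClusterRegionThreshold = 100
)

type regionInfo struct {
	approximateSizeMB int64
}

// allowBalance reports whether balance-region may pick this region.
func allowBalance(region regionInfo, totalRegions int) bool {
	return region.approximateSizeMB > emptyRegionApproximateSize ||
		totalRegions < initClusterRegionThreshold
}

func main() {
	empty := regionInfo{approximateSizeMB: 0}
	// false: an empty region in a large cluster is never picked by balance-region.
	fmt.Println(allowBalance(empty, 1_000_000))
}
```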


HuSharp avatar Jul 22 '24 02:07 HuSharp

I'm interested in this issue.

JQWong7 avatar Jul 26 '24 09:07 JQWong7

/assign @River2000i

JQWong7 avatar Jul 26 '24 09:07 JQWong7

When lightning imports tables, each table's region key is encoded from its table ID, and table IDs are allocated consecutively (basically next to each other). https://github.com/tikv/client-go/blob/6ba909c4ad2de65b5b36d0e5036d0a85f3154cc0/tikv/split_region.go#L241-L247

PD scatters regions based on the region_count of all stores, and schedules a new peer to the store with the fewest regions. https://github.com/tikv/pd/blob/13174b5d4cab4d958f0b4ea2718ea7bb74992bc7/pkg/schedule/scatter/region_scatterer.go#L346 PD compares region_count within a group; for now, a group is defined by the table ID (every table belongs to its own group). https://github.com/tikv/pd/blob/13174b5d4cab4d958f0b4ea2718ea7bb74992bc7/pkg/schedule/scatter/region_scatterer.go#L367 Summary:

  1. If there are many newly split empty regions (with adjacent key ranges), they all get split on the same store. At the group level, scattering is based on the region_count within a group, so the scatterer will not move regions to a store that does not yet contain any region of that group (see the sketch after this list).
  2. At the cluster level, scattering is based on the region_count of the whole cluster, so regions end up balanced across the whole cluster but not at the group (table) level.
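The following self-contained Go sketch (not PD's actual scatterer code) illustrates the per-group bookkeeping described above: selected-peer counts are tracked per (group, store), and a group currently maps to a single table, so a store that already holds many regions of other tables still reads as 0 for a new table's group:

```go
package main

import "fmt"

// selectedStores counts how many peers the scatterer has already placed on
// each store within a group, in the spirit of the scatterer's selectedStores
// structure inside engineContext.
type selectedStores struct {
	counts map[string]map[uint64]uint64 // group -> storeID -> count
}

func newSelectedStores() *selectedStores {
	return &selectedStores{counts: map[string]map[uint64]uint64{}}
}

func (s *selectedStores) Get(storeID uint64, group string) uint64 {
	return s.counts[group][storeID] // reading a missing entry yields 0
}

func (s *selectedStores) Put(storeID uint64, group string) {
	if s.counts[group] == nil {
		s.counts[group] = map[uint64]uint64{}
	}
	s.counts[group][storeID]++
}

func main() {
	ctx := newSelectedStores()
	// Table 100 already has scattered peers on stores 1 and 2.
	ctx.Put(1, "table-100")
	ctx.Put(2, "table-100")

	// For a brand-new table (a new group), every store reads as 0, including
	// the ones that are already heavily loaded by other tables.
	for _, store := range []uint64{1, 2, 3} {
		fmt.Printf("store %d: table-100=%d table-101=%d\n",
			store, ctx.Get(store, "table-100"), ctx.Get(store, "table-101"))
	}
}
```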

Root cause: selectNewPeer() will pick the original store the first time scatter is triggered for a region. It compares every store's region count in engineContext, but on the first call context.selectedPeer.Get() returns 0 for all stores, so the region will not be scattered to the other stores. https://github.com/tikv/pd/blob/13174b5d4cab4d958f0b4ea2718ea7bb74992bc7/pkg/schedule/scatter/region_scatterer.go#L444 If we want to scatter regions at the cluster level, we can call ScatterRegion with the same group for all of them; this could be an option for the caller (a sketch follows).
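A rough sketch of that direction, assuming the pd client's ScatterRegions API and its WithGroup option (the group name "lightning-import" is just a placeholder): the caller passes one shared group for all regions, so the scatterer's per-group counters accumulate across the whole import instead of resetting per table:

```go
package scatter

import (
	"context"

	pd "github.com/tikv/pd/client"
)

// scatterWithSharedGroup asks PD to scatter all of the given regions under a
// single group, so that their selected-peer counts are compared together
// rather than per table.
func scatterWithSharedGroup(ctx context.Context, cli pd.Client, regionIDs []uint64) error {
	// Every region shares the same group instead of one implicit group per table.
	_, err := cli.ScatterRegions(ctx, regionIDs, pd.WithGroup("lightning-import"))
	return err
}
```

Whether this stays a convention for callers or becomes an explicit cluster-level option exposed by PD is the open question for this task.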

JQWong7 avatar Jul 30 '24 04:07 JQWong7