[HUDI-4824] Add new index RANGE_BUCKET, when primary key is auto-increment like most MySQL tables

Change Logs

Usually, in MySQL... tables, there is an auto-increment primary key. Based on this fact and the Bucket index, we propose a RANGE_BUCKET index, which has performed well in my practice.

The basic concept is like the Bucket index; the most important part is:

bucketId = primaryKey / stepSize

For example, with stepSize = 2 the bucketId mapping looks like this:

pKey, bucketId
1, 0
2, 0
3, 1
4, 1
5, 2
6, 2
7, 3
...
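As a rough illustration of the mapping listed above, here is a minimal Java sketch. The names `RangeBucketSketch` and `bucketId` are hypothetical and not taken from the PR. One caveat: with integer division, the listed pairs are reproduced by `(pKey - 1) / stepSize`, whereas the plain quotient `pKey / stepSize` would put key 2 in bucket 1. The sketch follows the listed pairs, on the assumption that keys start at 1; which variant the PR's code uses is not stated here.

```java
class RangeBucketSketch {
    // Hypothetical helper: maps an auto-increment key (assumed to start at 1)
    // to a bucket so that each bucket covers stepSize consecutive keys.
    static long bucketId(long primaryKey, long stepSize) {
        if (primaryKey < 1 || stepSize < 1) {
            throw new IllegalArgumentException("primaryKey and stepSize must be >= 1");
        }
        return (primaryKey - 1) / stepSize;
    }

    public static void main(String[] args) {
        long stepSize = 2;
        // Prints "pKey, bucketId" pairs for keys 1..7.
        for (long key = 1; key <= 7; key++) {
            System.out.println(key + ", " + bucketId(key, stepSize));
        }
    }
}
```

Because the bucket is a pure function of the key, a record can be routed to its bucket without any index lookup.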

In my practice, I sync 1000+ MySQL tables to Hudi with stepSize = 1,500,000, and the base file size ends up at about 50 MB to 350 MB.
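The relationship between stepSize and base file size can be sketched with some back-of-the-envelope arithmetic. This is only an illustration under stated assumptions: the key space has no large gaps (so each bucket holds roughly stepSize rows), file size is roughly rows times average stored row size, and the helper names are hypothetical. Under those assumptions, the figures above imply somewhere between a few tens and a couple of hundred bytes per stored row.

```java
class StepSizeSketch {
    // Average stored bytes per row implied by a file size and a step size.
    static long impliedRowBytes(long fileBytes, long stepSize) {
        if (fileBytes < 0 || stepSize < 1) {
            throw new IllegalArgumentException("invalid arguments");
        }
        return fileBytes / stepSize;
    }

    // Step size that would give roughly the target file size for a given
    // average stored row size (both are inputs the user has to estimate).
    static long stepSizeFor(long targetFileBytes, long avgRowBytes) {
        if (targetFileBytes < 1 || avgRowBytes < 1) {
            throw new IllegalArgumentException("invalid arguments");
        }
        return Math.max(1, targetFileBytes / avgRowBytes);
    }

    public static void main(String[] args) {
        long stepSize = 1_500_000L;
        System.out.println(impliedRowBytes(50L * 1024 * 1024, stepSize));
        System.out.println(impliedRowBytes(350L * 1024 * 1024, stepSize));
        System.out.println(stepSizeFor(150_000_000L, 100));
    }
}
```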

A test with stepSize = 2 and pKey = {1, 4, 9} produces three base files, one per bucket.

Impact

Introduces a new index, RANGE_BUCKET, which can be used as follows:

  option(HoodieIndexConfig.INDEX_TYPE.key, IndexType.BUCKET.name()).
  option(HoodieIndexConfig.BUCKET_INDEX_ENGINE_TYPE.key, "RANGE_BUCKET").
  option(HoodieIndexConfig.RANGE_BUCKET_STEP_SIZE.key, 2).
  option(HoodieLayoutConfig.LAYOUT_TYPE.key, "BUCKET").
  option(HoodieLayoutConfig.LAYOUT_PARTITIONER_CLASS_NAME.key, classOf[SparkRangeBucketIndexPartitioner[_]].getName).

Risk level: none | low | medium | high

low

Sep 08 '22 12:09 wqwl611

Nice feature, can we log a JIRA issue and change the commit title to "[HUDI-${JIRA_ID}] ${your title}"?

Sep 09 '22 01:09 danny0405

@danny0405 yes, thanks.

Sep 09 '22 02:09 wqwl611

https://issues.apache.org/jira/browse/HUDI-4824 @danny0405

Sep 09 '22 17:09 wqwl611

Hey, thanks for the contribution. It is a great enhancement for bucket index.

At a high level, could we use the current BucketIndex abstraction to unify the implementation of the different BucketIndexEngines? Also, the dedicated Partitioner (i.e., SparkRangeBucketIndexPartitioner) may not be necessary, as long as we tag the file id during indexing (check out consistent hashing, which uses the default Partitioner).

Right now, rangeBucketIndex generates files like "00000009-0_2-12-29_20220924180225595.parquet", which don't contain any UUID element. I think that's OK, am I right?
By this logic, if simpleBucketIndex also acted like this, SparkBucketIndexPartitioner may not be necessary either? And using the default partitioner can eliminate a lot of empty Spark tasks.

@YuweiXiao
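To make the file name in the comment above easier to read, here is a small Java sketch that splits it into its parts. It assumes the base file layout `<fileId>_<writeToken>_<instantTime>.<extension>` and that the first eight characters of the file id are the zero-padded bucket id, as the example suggests. The class and method names are hypothetical, and nothing here calls Hudi's own utilities.

```java
class BaseFileNameSketch {
    // Splits "<fileId>_<writeToken>_<instantTime>.<ext>" into its three parts.
    static String[] split(String fileName) {
        int dot = fileName.lastIndexOf('.');
        String stem = dot >= 0 ? fileName.substring(0, dot) : fileName;
        String[] parts = stem.split("_");
        if (parts.length != 3) {
            throw new IllegalArgumentException("unexpected file name: " + fileName);
        }
        return parts;
    }

    // Reads the bucket id from the assumed 8-digit prefix of the file id.
    static int bucketIdOf(String fileId) {
        return Integer.parseInt(fileId.substring(0, 8));
    }

    // Builds a UUID-free file id of the shape seen in the example.
    static String fileIdFor(int bucketId) {
        return String.format("%08d-0", bucketId);
    }

    public static void main(String[] args) {
        String[] p = split("00000009-0_2-12-29_20220924180225595.parquet");
        System.out.println("fileId=" + p[0] + " writeToken=" + p[1] + " instant=" + p[2]);
        System.out.println("bucketId=" + bucketIdOf(p[0]));
    }
}
```

With a file id derived purely from the bucket id, the target file group can be computed from the record key alone, which is the property the discussion turns on.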

Sep 24 '22 10:09 wqwl611

Yeah, I was thinking the same thing: use the id as the name rather than concatenating the uuid. But I think the benefit is saving the metadata loading overhead (i.e., listing to get the filename) rather than the one you mentioned. With the default partitioner (UpsertPartitioner), there should not be any empty partitions. Please correct me if I am wrong.

Also, we had better follow the naming convention of the file group, in case of potential compatibility problems.

Sep 25 '22 07:09 YuweiXiao

Yes, every empty bucket accesses metadata in 'getBucketInfo'; when partitionNum * bucketNum is very large, this is a heavy overhead for metadata (and the Spark driver scheduler doesn't like it either). More importantly, I'm afraid we can't follow the 'uuid naming convention', because the name is generated in the RDD task record by record, not bucket by bucket as simpleBucketIndex does right now. @YuweiXiao

@danny0405 please help look at this.
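The overhead being debated can be illustrated with a small counting sketch in Java. It is not Hudi's partitioner code: the class and method names are hypothetical, and it only contrasts pre-allocating one slot per (partition, bucket) pair with scheduling work only for the slots that actually received records.

```java
import java.util.HashSet;
import java.util.Set;

class BucketTaskCountSketch {
    // Slots when every (partition, bucket) pair is pre-allocated.
    static long preallocatedSlots(long partitionNum, long bucketNum) {
        return Math.multiplyExact(partitionNum, bucketNum);
    }

    // Slots that actually hold data; each record is {partitionIndex, bucketId}.
    static long nonEmptySlots(long[][] records, long bucketNum) {
        Set<Long> slots = new HashSet<>();
        for (long[] r : records) {
            // Flatten the pair into a single key; bucketId must be < bucketNum.
            slots.add(r[0] * bucketNum + r[1]);
        }
        return slots.size();
    }

    public static void main(String[] args) {
        long[][] batch = {{0, 1}, {0, 1}, {3, 7}};
        System.out.println(preallocatedSlots(100, 1000));
        System.out.println(nonEmptySlots(batch, 1000));
    }
}
```

For a small incremental batch, the second number is bounded by the batch size, while the first grows with the table layout regardless of how much data arrives.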

Sep 25 '22 08:09 wqwl611

@wqwl611 would really love to understand your use case a little better and how RANGE_BUCKET can address it better than the existing implementations (for example, the Bucket Index SIMPLE engine type). Can you please elaborate on this in the PR description?

Sep 26 '22 19:09 alexeykudinkin

@alexeykudinkin I updated this PR's Change Logs, please check.

Sep 27 '22 03:09 wqwl611

@wqwl611 a bunch of flakiness fixes went into master. You might want to rebase.

Sep 29 '22 11:09 xushiyan

@xushiyan I rebased my repo incorrectly, so I deleted it and re-forked it, then applied all the changes. How can I keep using the old PR?

Sep 30 '22 08:09 wqwl611

CI report:

47864db3234b5dd7115006db07019481c91d65f8 UNKNOWN
5bcf55eb76b9e7abf7a18ffcedfe638f4d11c643 Azure: FAILURE

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure: re-run the last Azure build

Sep 30 '22 11:09 hudi-bot

I accidentally deleted my previous fork, so I opened a new PR and modified the code according to the previous review comments. Please help me review it, thanks. @danny0405 @YuweiXiao @xushiyan @alexeykudinkin New PR: https://github.com/apache/hudi/pull/6858

Oct 03 '22 16:10 wqwl611

@wqwl611 thanks for the work, but it looks like this PR is identical to #6858. Should we continue using this one, given we already have reviews here, and close #6858?

Oct 13 '22 13:10 xushiyan

I deleted my fork by mistake earlier, and I have now made the changes in a new branch (HUDI-4824) in my new fork, so I created a new PR (#6858). I'm not sure whether the current PR is still usable. @xushiyan

Oct 15 '22 18:10 wqwl611

The work continues in https://github.com/apache/hudi/pull/6858

Oct 31 '22 10:10 xushiyan
