hudi
[HUDI-4824] Add new index RANGE_BUCKET for tables whose primary key is auto-increment, like most MySQL tables
Change Logs
Most MySQL tables have an auto-increment primary key. Based on this fact and on the existing Bucket index, we propose a RANGE_BUCKET index, which has performed well in my practice.
The basic concept is the same as the Bucket index. The most important point is: bucketId = primaryKey / stepSize. For example, with stepSize = 2 the bucketId mapping looks like this:
pKey, bucketId
1, 0
2, 0
3, 1
4, 1
5, 2
6, 2
7, 3
...
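The mapping can be sketched in a few lines of Scala. This is an illustration only, not the code from this PR; the object and method names are hypothetical. One thing the sketch makes visible: the listed pairs (1 to 0, 2 to 0, 3 to 1, ...) correspond to (pKey - 1) / stepSize with integer division, whereas a literal pKey / stepSize would place key 2 in bucket 1. Both variants are shown, since the description does not say which one the implementation uses.

```scala
object RangeBucketSketch {
  // Literal reading of the formula in the description: integer division.
  def bucketIdLiteral(pKey: Long, stepSize: Long): Long = {
    require(stepSize > 0, "stepSize must be positive")
    pKey / stepSize
  }

  // Variant that reproduces the listed pairs, assuming keys start at 1.
  def bucketIdOneBased(pKey: Long, stepSize: Long): Long = {
    require(pKey >= 1 && stepSize > 0, "pKey must be >= 1 and stepSize positive")
    (pKey - 1) / stepSize
  }
}
```

Either way, consecutive keys land in the same bucket until a multiple of stepSize is crossed, so each bucket holds at most stepSize rows.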
In my practice, I sync 1000+ MySQL tables to Hudi with stepSize = 1,500,000, and the base file size is about 50 MB to 350 MB.
In a test with stepSize = 2 and pKey = {1, 4, 9}, three base files are produced.
Impact
This introduces a new index, RANGE_BUCKET. It can be used as follows:
option(HoodieIndexConfig.INDEX_TYPE.key, IndexType.BUCKET.name()).
option(HoodieIndexConfig.BUCKET_INDEX_ENGINE_TYPE.key, "RANGE_BUCKET").
option(HoodieIndexConfig.RANGE_BUCKET_STEP_SIZE.key, 2).
option(HoodieLayoutConfig.LAYOUT_TYPE.key, "BUCKET").
option(HoodieLayoutConfig.LAYOUT_PARTITIONER_CLASS_NAME.key, classOf[SparkRangeBucketIndexPartitioner[_]].getName).
Risk level: low
Nice feature. Can we log a JIRA issue and change the commit title to "[HUDI-${JIRA_ID}] ${your title}"?
@danny0405 Yes, thanks.
https://issues.apache.org/jira/browse/HUDI-4824 @danny0405
Hey, thanks for the contribution. It is a great enhancement for the bucket index.
At a high level, could we use the current BucketIndex abstraction to unify the implementation of the different BucketIndexEngines? Also, the dedicated Partitioner (i.e., SparkRangeBucketIndexPartitioner) may not be necessary, as long as we tag the file ID during indexing (check out consistent hashing, which uses the default Partitioner).
Right now, the range bucket index generates files like "00000009-0_2-12-29_20220924180225595.parquet", which do not contain any UUID element. I think that's OK, am I right?
Following this idea, if simpleBucketIndex also behaved like this, SparkBucketIndexPartitioner might not be necessary either. And with the default partitioner, we could avoid a lot of empty Spark tasks.
@YuweiXiao
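To make the naming question concrete, the sketch below splits the file name quoted above into its parts. It assumes the usual Hudi base file layout of fileId, write token and instant time joined by underscores, followed by the extension, and it assumes that the first eight digits of the file ID carry the zero-padded bucket ID. The case class and helper names are hypothetical and are not Hudi APIs.

```scala
object BaseFileNameSketch {
  final case class BaseFileParts(
      fileId: String,
      writeToken: String,
      instantTime: String,
      extension: String)

  // Split "<fileId>_<writeToken>_<instantTime>.<extension>" (assumed layout).
  def parse(fileName: String): Option[BaseFileParts] = {
    val dot = fileName.lastIndexOf('.')
    if (dot < 0) None
    else {
      val ext = fileName.substring(dot + 1)
      fileName.substring(0, dot).split("_") match {
        case Array(fileId, token, instant) =>
          Some(BaseFileParts(fileId, token, instant, ext))
        case _ => None
      }
    }
  }

  // Assumption: the leading eight digits of the file ID are the bucket ID.
  def bucketId(fileId: String): Option[Int] = {
    val prefix = fileId.take(8)
    if (prefix.length == 8 && prefix.forall(_.isDigit)) Some(prefix.toInt)
    else None
  }

  // Mirrors the quoted name: zero-padded bucket ID plus a fixed "-0" suffix
  // in place of a UUID fragment (the suffix is taken from the example only).
  def rangeFileId(bucketId: Int): String = f"$bucketId%08d-0"
}
```

Because such a file ID is fully determined by the bucket ID, a writer could compute it without listing existing files, which is the point under discussion in this thread.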
Yeah, I was thinking the same thing: use the ID as the name rather than concatenating the UUID. But I think the benefit is saving the metadata loading overhead (i.e., listing to get the file name) rather than the one you mentioned. With the default partitioner (UpsertPartitioner), there should be no empty partitions. Please correct me if I am wrong.
Also, we had better follow the naming convention of the file group, to avoid potential compatibility problems.
Yes. Every empty bucket accesses metadata in 'getBucketInfo', so when partitionNum * bucketNum is very large, this is a heavy overhead on metadata (and the Spark driver scheduler doesn't like it either). More importantly, I'm afraid we can't follow the UUID naming convention, because the name is generated record by record inside the RDD task, not bucket by bucket as simpleBucketIndex does right now. @YuweiXiao
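A rough sketch of the overhead argument, with hypothetical names and under the assumption that a dedicated bucket partitioner schedules one slot per (table partition, bucket) pair while only the pairs that receive records need any work. The bucket ID here uses the literal pKey / stepSize reading.

```scala
object BucketTaskCountSketch {
  // Assumed behaviour of a dedicated bucket partitioner:
  // one slot per (table partition, bucket) pair, empty or not.
  def slotsWithDedicatedPartitioner(partitionNum: Long, bucketNum: Long): Long =
    partitionNum * bucketNum

  // Number of (partition path, bucket ID) pairs a batch actually touches.
  def touchedBuckets(records: Seq[(String, Long)], stepSize: Long): Int = {
    require(stepSize > 0, "stepSize must be positive")
    records
      .map { case (partitionPath, pKey) => (partitionPath, pKey / stepSize) }
      .distinct
      .size
  }
}
```

For example, 100 table partitions with 256 buckets each give 25,600 slots, while a small incremental batch may touch only a handful of them.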
@danny0405 please help look at this.
@wqwl611 would really love to understand your use-case a little better and how RANGE_BUCKET can address it better than existing implementations (for ex Bucket Index SIMPLE engine-type). Can you please elaborate on this in the PR description?
@alexeykudinkin I have updated the Change Logs in this PR; please check.
@wqwl611 a bunch of flakiness fixes went into master; you might want to rebase.
@xushiyan I rebased my repo incorrectly, so I deleted it and re-forked, then applied all the changes again. How can I keep using the old PR?
CI report:
- 47864db3234b5dd7115006db07019481c91d65f8 UNKNOWN
- 5bcf55eb76b9e7abf7a18ffcedfe638f4d11c643 Azure: FAILURE
I accidentally deleted my previous fork, so I opened a new PR and modified the code according to the earlier review comments. Please help me review it, thanks. @danny0405 @YuweiXiao @xushiyan @alexeykudinkin New PR: https://github.com/apache/hudi/pull/6858
@wqwl611 thanks for the work, but it looks like this PR is identical to #6858. Should we continue using this one, given that we already have reviews here, and close #6858?
I deleted my fork by mistake earlier and have now made the changes on the new branch (HUDI-4824) in my new fork, so I created a new PR (#6858). I am not sure whether the current PR is still usable. @xushiyan
The work continues in https://github.com/apache/hudi/pull/6858