
[HUDI-6150] Support bucketing for each hive client

parisni opened this pull request 1 year ago • 16 comments

Change Logs

This PR:

  • introduces a new Hive bucketing spec that is propagated to each sync client
  • implements it for HMS and Glue
  • changes the HiveQL implementation
  • TODO? support sorting

BTW I am still not sure the simple bucket index can be considered as Hive bucketing, because according to https://issues.apache.org/jira/browse/SPARK-19256:

  • Hive 3.0.0 and after: supports writing bucketed tables with Hive murmur3hash
  • Hive 1.x.y and 2.x.y: supports writing bucketed tables with Hive hivehash

but so far I am not sure what the current status of Hudi's hashing is.

Impact

Describe any public API or user-facing feature change or any performance impact.

Risk level (write none, low, medium or high below)

If medium or high, explain what verification was done to mitigate the risks.

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

  • The config description must be updated if new configs are added or the default value of a config is changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here, and follow the instructions to make changes to the website.

Contributor's checklist

  • [x] Read through contributor's guide
  • [ ] Change Logs and Impact were stated clearly
  • [ ] Adequate tests were added if applicable
  • [ ] CI passed

parisni avatar May 06 '23 20:05 parisni

CI report:

  • 1b0c06abe95feedd2f03f3507edce1cc4d7c3008 UNKNOWN
  • d486fba35f93c250625eeaaefbbfe4c076f5cb0d Azure: FAILURE
Bot commands

@hudi-bot supports the following commands:

  • @hudi-bot run azure: re-run the last Azure build

hudi-bot avatar May 06 '23 22:05 hudi-bot

but so far I am not sure what the current status of hudi hashing

It uses only the plain Java hashCode: https://github.com/apache/hudi/blob/20938c30b168d63cf4e520c6b4e1d7b930bed1ab/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/bucket/BucketIdentifier.java#L52
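For illustration, a condensed sketch of that computation (hypothetical class name; logic simplified from the linked BucketIdentifier):

```java
import java.util.Arrays;
import java.util.List;

public class BucketIdSketch {
  // Simplified from Hudi's BucketIdentifier: the bucket id is the plain
  // Java hashCode of the record-key hash fields, masked to be non-negative,
  // modulo the configured bucket count.
  static int getBucketId(List<String> hashKeyFields, int numBuckets) {
    return (hashKeyFields.hashCode() & Integer.MAX_VALUE) % numBuckets;
  }

  public static void main(String[] args) {
    System.out.println(getBucketId(Arrays.asList("user_42"), 8));
  }
}
```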

Can you elaborate a little more on what the specific functionality of the hashing algorithm for Hive BUCKET is? Could a different algorithm cause incorrect query outputs? Or maybe Hive requires the hashing algorithm to come from a very limited set of choices?

danny0405 avatar May 08 '23 03:05 danny0405

Can you elaborate a little more on what the specific functionality of the hashing algorithm for Hive BUCKET is? Could a different algorithm cause incorrect query outputs? Or maybe Hive requires the hashing algorithm to come from a very limited set of choices?

According to https://issues.apache.org/jira/browse/SPARK-19256, Hive itself (and also Presto/Trino) is not able to use the Spark hashing algorithm (nor the Spark file-naming spec, number of files, and sorting). Moreover, Spark itself is not able to exploit Hive bucketing.

So I assume the Hudi way of doing things (which is compliant with neither Hive nor Spark) cannot be used to improve query-engine operations such as joins and filters. That would mean all of the below are wrong:

  • the current config https://hudi.apache.org/docs/configurations/#hoodiedatasourcehive_syncbucket_sync
  • this current PR
  • the rfc statement about support of hive bucketing https://cwiki.apache.org/confluence/display/HUDI/RFC+-+29%3A+Hash+Index

parisni avatar May 08 '23 10:05 parisni

  • the rfc statement about support of hive bucketing https://cwiki.apache.org/confluence/display/HUDI/RFC+-+29%3A+Hash+Index

Thanks for the detailed analysis. So what actions can we take to make the Hive bucket table take effect on Hive/Presto? Is it as easy as switching to a different hashing algorithm?

danny0405 avatar May 08 '23 10:05 danny0405

If compatible with Hudi bucketing, we could provide multiple configurations for the bucketing, up to the user to select the one they'd like. I can see several aspects that vary, such as:

  • hashing
  • file naming
  • file numbering
  • file sorting

As for file numbering, I guess the simple bucket index could support any engine, but consistent hashing would only be supported by Hive 3/Spark 3, since they allow more than one file per bucket. BTW Athena v3 supports both Spark and Hive bucketing: https://docs.aws.amazon.com/athena/latest/ug/engine-versions-reference-0003.html#engine-versions-reference-0003-improvements-and-new-features


parisni avatar May 08 '23 14:05 parisni

  • hashing
  • file naming
  • file numbering
  • file sorting

Can you elaborate a little more about these items?

danny0405 avatar May 09 '23 02:05 danny0405

Bucket file pattern:

  • bucketId_xxx

Hashing (see the comparison sketch below):

  • Hudi: Java hashCode
  • Spark: Murmur3 hash
  • Hive 3.0.0 and after: writes bucketed tables with Hive murmur3hash
  • Hive 1.x.y and 2.x.y: writes bucketed tables with Hive hivehash

File numbering:

File sorting:
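To make the divergence concrete, a minimal sketch (assumes Guava 31+ on the classpath; standard murmur3 stands in for the engine variants listed above):

```java
import com.google.common.hash.Hashing;
import java.nio.charset.StandardCharsets;

public class BucketDivergenceSketch {
  public static void main(String[] args) {
    String key = "user_42";
    int numBuckets = 8;
    // Hudi-style: plain Java hashCode.
    int javaBucket = (key.hashCode() & Integer.MAX_VALUE) % numBuckets;
    // Murmur3-style (standard seed-0 murmur3, stand-in for Spark/Hive variants).
    int murmurHash = Hashing.murmur3_32_fixed()
        .hashString(key, StandardCharsets.UTF_8).asInt();
    int murmurBucket = (murmurHash & Integer.MAX_VALUE) % numBuckets;
    // The two schemes generally disagree, so bucket metadata written under
    // one scheme misleads an engine that assumes another.
    System.out.println(javaBucket + " vs " + murmurBucket);
  }
}
```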

parisni avatar May 09 '23 21:05 parisni

bucketId_xxx

So it seems the naming convention used by Hudi is compatible with Hive in general (not Spark or Trino); the only concern is the hashing algorithm. I'm afraid the algorithms need to be consistent too in order for the bucket pruning optimization to work. Can you double-check that part?
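For context on why consistency matters: with bucket pruning, the engine hashes the predicate literal with the table's declared algorithm and scans only the matching bucket file, so a writer that hashed differently makes the engine scan the wrong file and silently miss rows. A hypothetical sketch (not Hive's actual code):

```java
import java.util.List;
import java.util.function.ToIntFunction;
import java.util.stream.Collectors;

public class BucketPruningSketch {
  // Hypothetical: for a predicate `bucketCol = literal`, keep only the files
  // whose name starts with the bucket id the engine computes for the literal.
  static List<String> prune(List<String> bucketFiles, Object literal,
                            ToIntFunction<Object> engineHash, int numBuckets) {
    int bucket = (engineHash.applyAsInt(literal) & Integer.MAX_VALUE) % numBuckets;
    String prefix = String.format("%08d", bucket); // illustrative prefix format
    return bucketFiles.stream()
        .filter(f -> f.startsWith(prefix))
        .collect(Collectors.toList());
  }
}
```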

danny0405 avatar May 10 '23 03:05 danny0405

I'm afraid the algorithms need to be consistent too in order for the bucket pruning optimization to work

Not sure I understand. Do you mean the hashing algorithm must be the same as the target engine's? The answer is definitely yes.

parisni avatar May 10 '23 08:05 parisni

I'm afraid the algorithms need to be consistent too in order for the bucket pruning optimization to work

Not sure I understand. Do you mean the hashing algorithm must be the same as the target engine's? The answer is definitely yes.

Yes, I guess so, because that is how bucket pruning works. I'm wondering whether we should make the bucketing algorithm configurable; it should be feasible if we use the Hive murmur3hash algorithm.
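If it were made configurable, one hypothetical shape (all names invented for illustration; nothing like this exists in Hudi today) could be an enum-dispatched hash choice:

```java
// Hypothetical sketch of a configurable bucket-hash option.
enum BucketHashAlgorithm {
  JAVA_HASHCODE,  // Hudi's current behavior
  HIVE_MURMUR3,   // what Hive 3.x writers use
  HIVE_HASH;      // what Hive 1.x/2.x writers use

  int hash(String key) {
    switch (this) {
      case JAVA_HASHCODE:
        return key.hashCode();
      case HIVE_MURMUR3:
      case HIVE_HASH:
        // Placeholders: real support would require porting the engines' algorithms.
        throw new UnsupportedOperationException("not yet ported: " + this);
      default:
        throw new IllegalStateException();
    }
  }
}
```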

danny0405 avatar May 10 '23 08:05 danny0405

Hardcoding Murmur3 is likely a good idea, but it would break existing bucketed tables. Also it wouldn't support Hive 2 users.

As for file naming, I suspect that adding the bucket id also before the file extension (while keeping the prefix), i.e. ${bucketId}_${filegroupId}_${UUID}_${timestamp}_${bucketId}.parquet/log, would allow supporting both Spark 2 and all Spark 3 releases.
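A small sketch of that naming proposal (the template comes from the comment above; the helper itself is hypothetical):

```java
public class BucketFileNameSketch {
  // Hypothetical: keep the Hudi-style prefix and repeat the bucket id just
  // before the extension, per ${bucketId}_${filegroupId}_${UUID}_${timestamp}_${bucketId}.
  static String fileName(int bucketId, String fileGroupId, String uuid,
                         String timestamp, String ext) {
    String b = String.format("%08d", bucketId);
    return String.join("_", b, fileGroupId, uuid, timestamp, b) + "." + ext;
  }

  public static void main(String[] args) {
    System.out.println(fileName(3, "fg-0001", "a1b2c3d4", "20230510123000000", "parquet"));
    // -> 00000003_fg-0001_a1b2c3d4_20230510123000000_00000003.parquet
  }
}
```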


parisni avatar May 10 '23 12:05 parisni

Hardcoding Murmur is likely a good idea

Not hardcoding; I mean to make it configurable, so the user chooses the algorithm they desire to use.

it would allow to support both spark 2 and all spark 3 releases.

We can dig into that further, but for now I would rather keep it as simple as before.

danny0405 avatar May 11 '23 03:05 danny0405

I dug a bit into the Spark Murmur3 implementation. It is not standard, for at least two reasons:

  1. it uses a hardcoded seed = 42 (which likely is not the same as Hive's)
  2. they acknowledge their way of handling Murmur is not standard: there is an issue about this, and another implementation (hashUnsafeBytes2) exists, but it is not used so far.

So I am not sure we could use Guava's Murmur3 as-is.
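To illustrate point 1 (assumes Guava 31+; note that per point 2 this only shows the seed effect, since Spark's byte handling differs from standard murmur3, so Guava with seed 42 still would not reproduce Spark's string hashes):

```java
import com.google.common.hash.Hashing;
import java.nio.charset.StandardCharsets;

public class SeedEffectDemo {
  public static void main(String[] args) {
    String key = "user_42";
    int numBuckets = 8;
    // Spark hard-codes seed 42; a default murmur3 uses seed 0.
    int seed42 = Hashing.murmur3_32_fixed(42)
        .hashString(key, StandardCharsets.UTF_8).asInt();
    int seed0 = Hashing.murmur3_32_fixed(0)
        .hashString(key, StandardCharsets.UTF_8).asInt();
    System.out.println((seed42 & Integer.MAX_VALUE) % numBuckets);
    System.out.println((seed0 & Integer.MAX_VALUE) % numBuckets);
  }
}
```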

The Spark implementation is based on Catalyst expressions, while in Hudi we work with plain Java types. If we want to use their implementation, we would have to import spark-unsafe as a dependency in hudi-client-common. We could also copy their implementation into Hudi and maintain it. However, in both cases we would have to convert basic Java types into Catalyst types to be able to re-use the Spark implementation (see https://github.com/apache/spark/blob/v3.4.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala#L523-L596). I am not sure it is a good design to introduce Spark concepts within hudi-client-common.

parisni avatar May 14 '23 14:05 parisni

I am not sure it is a good design to introduce spark concepts within hudi-client-common

Obviously that is a bad design we should avoid. Can we just implement the whole Spark Murmur3 within Hudi? I mean, the data types are not that big a deal: we could use the Avro data types instead, or just use Spark data types for the Spark impl and Flink data types for the Flink impl.
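For what it's worth, the int case of Spark's Murmur3_x86_32 is small enough to port as a whole; a self-contained sketch (standard Murmur3 x86_32 mixing for a single 4-byte block, seed supplied by the caller):

```java
public final class Murmur3IntSketch {
  // Standard Murmur3 x86_32 over one 4-byte int block, mirroring the
  // structure of Spark's Murmur3_x86_32.hashInt.
  static int hashInt(int input, int seed) {
    int k1 = input * 0xcc9e2d51;
    k1 = Integer.rotateLeft(k1, 15);
    k1 *= 0x1b873593;
    int h1 = seed ^ k1;
    h1 = Integer.rotateLeft(h1, 13);
    h1 = h1 * 5 + 0xe6546b64;
    // Finalization: mix in the length (4 bytes) and avalanche.
    h1 ^= 4;
    h1 ^= h1 >>> 16;
    h1 *= 0x85ebca6b;
    h1 ^= h1 >>> 13;
    h1 *= 0xc2b2ae35;
    h1 ^= h1 >>> 16;
    return h1;
  }

  public static void main(String[] args) {
    System.out.println(hashInt(7, 42)); // seed 42, as Spark hard-codes
  }
}
```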

danny0405 avatar May 15 '23 02:05 danny0405

Hello, any news on this? I agree with the following points:

So I assume the Hudi way of doing things (which is compliant with neither Hive nor Spark) cannot be used to improve query-engine operations such as joins and filters. That would mean all of the below are wrong:

  • the current config https://hudi.apache.org/docs/configurations/#hoodiedatasourcehive_syncbucket_sync
  • this current PR
  • the rfc statement about support of hive bucketing https://cwiki.apache.org/confluence/display/HUDI/RFC+-+29%3A+Hash+Index

ziudu avatar May 13 '24 17:05 ziudu

cc @parisni Are you still on this?

danny0405 avatar May 14 '24 01:05 danny0405