
[HUDI-6150] Support bucketing for each hive client

parisni opened this pull request 1 year ago • 16 comments

Change Logs

This PR:

  • introduces a new Hive bucketing spec that is propagated to each sync client
  • implements it for HMS and Glue
  • changes the HiveQL implementation
  • TODO? support sorting

BTW I am still not sure the simple bucket index can be considered as Hive bucketing, because according to https://issues.apache.org/jira/browse/SPARK-19256:

  • Hive 3.0.0 and after: supports writing bucketed tables with Hive murmur3hash
  • Hive 1.x.y and 2.x.y: supports writing bucketed tables with Hive hivehash

but so far I am not sure what the current status of Hudi's hashing is.

Impact

Describe any public API or user-facing feature change or any performance impact.

Risk level (write none, low, medium or high below)

If medium or high, explain what verification was done to mitigate the risks.

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

  • The config description must be updated if new configs are added or the default value of a config is changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here, and follow the instructions to make changes to the website.

Contributor's checklist

  • [x] Read through contributor's guide
  • [ ] Change Logs and Impact were stated clearly
  • [ ] Adequate tests were added if applicable
  • [ ] CI passed

parisni avatar May 06 '23 20:05 parisni

CI report:

  • 1b0c06abe95feedd2f03f3507edce1cc4d7c3008 UNKNOWN
  • d486fba35f93c250625eeaaefbbfe4c076f5cb0d Azure: FAILURE
Bot commands

@hudi-bot supports the following commands:

  • @hudi-bot run azure: re-run the last Azure build

hudi-bot avatar May 06 '23 22:05 hudi-bot

but so far I am not sure what the current status of hudi hashing

It uses only the plain Java hashCode: https://github.com/apache/hudi/blob/20938c30b168d63cf4e520c6b4e1d7b930bed1ab/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/bucket/BucketIdentifier.java#L52
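For illustration, a condensed sketch of that computation (hypothetical class name; logic simplified from the linked BucketIdentifier):

```java
import java.util.Arrays;
import java.util.List;

public class BucketIdSketch {
  // Simplified from Hudi's BucketIdentifier: the bucket id is the plain
  // Java hashCode of the record-key hash fields, masked to be non-negative,
  // modulo the configured bucket count.
  static int getBucketId(List<String> hashKeyFields, int numBuckets) {
    return (hashKeyFields.hashCode() & Integer.MAX_VALUE) % numBuckets;
  }

  public static void main(String[] args) {
    System.out.println(getBucketId(Arrays.asList("user_42"), 8));
  }
}
```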

Can you elaborate a little more on what the specific functionality of the hashing algorithm for Hive BUCKET is? Could a different algorithm cause incorrect query outputs? Or maybe Hive requires the hashing algorithm to come from a very limited set of choices?

danny0405 avatar May 08 '23 03:05 danny0405

Can you elaborate a little more on what the specific functionality of the hashing algorithm for Hive BUCKET is? Could a different algorithm cause incorrect query outputs? Or maybe Hive requires the hashing algorithm to come from a very limited set of choices?

According to https://issues.apache.org/jira/browse/SPARK-19256, Hive itself (and also Presto/Trino) is not able to use the Spark hashing algorithm (nor the Spark file-naming spec, number of files, and sorting). Moreover, Spark itself is not able to exploit Hive bucketing.

So I assume the Hudi way of doing things (which is compliant with neither Hive nor Spark) cannot be used to improve query-engine operations such as joins and filters. That would mean all of the below are wrong:

  • the current config https://hudi.apache.org/docs/configurations/#hoodiedatasourcehive_syncbucket_sync
  • this current PR
  • the rfc statement about support of hive bucketing https://cwiki.apache.org/confluence/display/HUDI/RFC+-+29%3A+Hash+Index

parisni avatar May 08 '23 10:05 parisni

  • the rfc statement about support of hive bucketing https://cwiki.apache.org/confluence/display/HUDI/RFC+-+29%3A+Hash+Index

Thanks for the detailed analysis. So what actions can we take to make the Hive bucket table take effect on Hive/Presto? Is it as easy as switching to a different hashing algorithm?

danny0405 avatar May 08 '23 10:05 danny0405

If compatible with Hudi bucketing, we could provide multiple configurations for the bucketing, up to the user to select the one they'd like. I can see several aspects that vary, such as:

  • hashing
  • file naming
  • file numbering
  • file sorting

As for file numbering, I guess the simple bucket index could support any engine, but consistent hashing would only be supported by Hive 3/Spark 3, since they allow more than one file per bucket. BTW Athena v3 supports both Spark and Hive bucketing: https://docs.aws.amazon.com/athena/latest/ug/engine-versions-reference-0003.html#engine-versions-reference-0003-improvements-and-new-features


parisni avatar May 08 '23 14:05 parisni

  • hashing
  • file naming
  • file numbering
  • file sorting

Can you elaborate a little more about these items?

danny0405 avatar May 09 '23 02:05 danny0405

Bucket file pattern:

  • bucketId_xxx

Hashing (see the comparison sketch below):

  • Hudi: Java hashCode
  • Spark: Murmur3 hash
  • Hive 3.0.0 and after: writes bucketed tables with Hive murmur3hash
  • Hive 1.x.y and 2.x.y: writes bucketed tables with Hive hivehash

File numbering:

File sorting:
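To make the divergence concrete, a minimal sketch (assumes Guava 31+ on the classpath; standard murmur3 stands in for the engine variants listed above):

```java
import com.google.common.hash.Hashing;
import java.nio.charset.StandardCharsets;

public class BucketDivergenceSketch {
  public static void main(String[] args) {
    String key = "user_42";
    int numBuckets = 8;
    // Hudi-style: plain Java hashCode.
    int javaBucket = (key.hashCode() & Integer.MAX_VALUE) % numBuckets;
    // Murmur3-style (standard seed-0 murmur3, stand-in for Spark/Hive variants).
    int murmurHash = Hashing.murmur3_32_fixed()
        .hashString(key, StandardCharsets.UTF_8).asInt();
    int murmurBucket = (murmurHash & Integer.MAX_VALUE) % numBuckets;
    // The two schemes generally disagree, so bucket metadata written under
    // one scheme misleads an engine that assumes another.
    System.out.println(javaBucket + " vs " + murmurBucket);
  }
}
```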

parisni avatar May 09 '23 21:05 parisni

bucketId_xxx

So it seems the naming convention used by Hudi is compatible with Hive in general (not Spark or Trino); the only concern is the hashing algorithm. I'm afraid the algorithms need to be consistent too in order for the bucket pruning optimization to work. Can you double-check that part?
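For context on why consistency matters: with bucket pruning, the engine hashes the predicate literal with the table's declared algorithm and scans only the matching bucket file, so a writer that hashed differently makes the engine scan the wrong file and silently miss rows. A hypothetical sketch (not Hive's actual code):

```java
import java.util.List;
import java.util.function.ToIntFunction;
import java.util.stream.Collectors;

public class BucketPruningSketch {
  // Hypothetical: for a predicate `bucketCol = literal`, keep only the files
  // whose name starts with the bucket id the engine computes for the literal.
  static List<String> prune(List<String> bucketFiles, Object literal,
                            ToIntFunction<Object> engineHash, int numBuckets) {
    int bucket = (engineHash.applyAsInt(literal) & Integer.MAX_VALUE) % numBuckets;
    String prefix = String.format("%08d", bucket); // illustrative prefix format
    return bucketFiles.stream()
        .filter(f -> f.startsWith(prefix))
        .collect(Collectors.toList());
  }
}
```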

danny0405 avatar May 10 '23 03:05 danny0405

I'm afraid the algorithms need to be consistent too in order for the bucket pruning optimization to work

Not sure I understand. Do you mean the hashing algorithm must be the same as the target engine's? The answer is definitely yes.

parisni avatar May 10 '23 08:05 parisni

I'm afraid the algorithms need to be consistent too in order for the bucket pruning optimization to work

Not sure I understand. Do you mean the hashing algorithm must be the same as the target engine's? The answer is definitely yes.

Yes, I guess so, because that is how bucket pruning works. I'm wondering whether we should make the bucketing algorithm configurable; it should be feasible if we use the Hive murmur3hash algorithm.
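If it were made configurable, one hypothetical shape (all names invented for illustration; nothing like this exists in Hudi today) could be an enum-dispatched hash choice:

```java
// Hypothetical sketch of a configurable bucket-hash option.
enum BucketHashAlgorithm {
  JAVA_HASHCODE,  // Hudi's current behavior
  HIVE_MURMUR3,   // what Hive 3.x writers use
  HIVE_HASH;      // what Hive 1.x/2.x writers use

  int hash(String key) {
    switch (this) {
      case JAVA_HASHCODE:
        return key.hashCode();
      case HIVE_MURMUR3:
      case HIVE_HASH:
        // Placeholders: real support would require porting the engines' algorithms.
        throw new UnsupportedOperationException("not yet ported: " + this);
      default:
        throw new IllegalStateException();
    }
  }
}
```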

danny0405 avatar May 10 '23 08:05 danny0405

Hardcoding Murmur3 is likely a good idea, but it would break existing bucketed tables. Also it wouldn't support Hive 2 users.

As for file naming, I suspect that adding the bucket id also before the file extension (while keeping the prefix), i.e. ${bucketId}_${filegroupId}_${UUID}_${timestamp}_${bucketId}.parquet/log, would allow supporting both Spark 2 and all Spark 3 releases.
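A small sketch of that naming proposal (the template comes from the comment above; the helper itself is hypothetical):

```java
public class BucketFileNameSketch {
  // Hypothetical: keep the Hudi-style prefix and repeat the bucket id just
  // before the extension, per ${bucketId}_${filegroupId}_${UUID}_${timestamp}_${bucketId}.
  static String fileName(int bucketId, String fileGroupId, String uuid,
                         String timestamp, String ext) {
    String b = String.format("%08d", bucketId);
    return String.join("_", b, fileGroupId, uuid, timestamp, b) + "." + ext;
  }

  public static void main(String[] args) {
    System.out.println(fileName(3, "fg-0001", "a1b2c3d4", "20230510123000000", "parquet"));
    // -> 00000003_fg-0001_a1b2c3d4_20230510123000000_00000003.parquet
  }
}
```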


parisni avatar May 10 '23 12:05 parisni

Hardcoding Murmur is likely a good idea

Not hardcoding; I mean to make it configurable, so the user chooses the algorithm they desire to use.

it would allow to support both spark 2 and all spark 3 releases.

We can dig into that further, but for now I would rather keep it as simple as before.

danny0405 avatar May 11 '23 03:05 danny0405

I dug a bit into the Spark Murmur3 implementation. It is not standard, for at least two reasons:

  1. it uses a hardcoded seed = 42 (which likely is not the same as Hive's)
  2. they acknowledge their way of handling Murmur is not standard: there is an issue about this, and another implementation (hashUnsafeBytes2) exists, but it is not used so far.

So I am not sure we could use Guava's Murmur3 as-is.
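To illustrate point 1 (assumes Guava 31+; note that per point 2 this only shows the seed effect, since Spark's byte handling differs from standard murmur3, so Guava with seed 42 still would not reproduce Spark's string hashes):

```java
import com.google.common.hash.Hashing;
import java.nio.charset.StandardCharsets;

public class SeedEffectDemo {
  public static void main(String[] args) {
    String key = "user_42";
    int numBuckets = 8;
    // Spark hard-codes seed 42; a default murmur3 uses seed 0.
    int seed42 = Hashing.murmur3_32_fixed(42)
        .hashString(key, StandardCharsets.UTF_8).asInt();
    int seed0 = Hashing.murmur3_32_fixed(0)
        .hashString(key, StandardCharsets.UTF_8).asInt();
    System.out.println((seed42 & Integer.MAX_VALUE) % numBuckets);
    System.out.println((seed0 & Integer.MAX_VALUE) % numBuckets);
  }
}
```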

The Spark implementation is based on Catalyst expressions, while in Hudi we work with plain Java types. If we want to use their implementation, we would have to import spark-unsafe as a dependency in hudi-client-common. We could also copy their implementation into Hudi and maintain it. However, in both cases we would have to convert basic Java types into Catalyst types to be able to re-use the Spark implementation (see https://github.com/apache/spark/blob/v3.4.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala#L523-L596). I am not sure it is a good design to introduce Spark concepts within hudi-client-common.

parisni avatar May 14 '23 14:05 parisni

I am not sure it is a good design to introduce spark concepts within hudi-client-common

Obviously that is a bad design we should avoid. Can we just implement the whole Spark Murmur3 within Hudi? I mean, the data types are not that big a deal: we could use the Avro data types instead, or just use Spark data types for the Spark impl and Flink data types for the Flink impl.
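For what it's worth, the int case of Spark's Murmur3_x86_32 is small enough to port as a whole; a self-contained sketch (standard Murmur3 x86_32 mixing for a single 4-byte block, seed supplied by the caller):

```java
public final class Murmur3IntSketch {
  // Standard Murmur3 x86_32 over one 4-byte int block, mirroring the
  // structure of Spark's Murmur3_x86_32.hashInt.
  static int hashInt(int input, int seed) {
    int k1 = input * 0xcc9e2d51;
    k1 = Integer.rotateLeft(k1, 15);
    k1 *= 0x1b873593;
    int h1 = seed ^ k1;
    h1 = Integer.rotateLeft(h1, 13);
    h1 = h1 * 5 + 0xe6546b64;
    // Finalization: mix in the length (4 bytes) and avalanche.
    h1 ^= 4;
    h1 ^= h1 >>> 16;
    h1 *= 0x85ebca6b;
    h1 ^= h1 >>> 13;
    h1 *= 0xc2b2ae35;
    h1 ^= h1 >>> 16;
    return h1;
  }

  public static void main(String[] args) {
    System.out.println(hashInt(7, 42)); // seed 42, as Spark hard-codes
  }
}
```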

danny0405 avatar May 15 '23 02:05 danny0405

Hello, any news on this? I agree with the following points:

So I assume the Hudi way of doing things (which is compliant with neither Hive nor Spark) cannot be used to improve query-engine operations such as joins and filters. That would mean all of the below are wrong:

  • the current config https://hudi.apache.org/docs/configurations/#hoodiedatasourcehive_syncbucket_sync
  • this current PR
  • the rfc statement about support of hive bucketing https://cwiki.apache.org/confluence/display/HUDI/RFC+-+29%3A+Hash+Index

ziudu avatar May 13 '24 17:05 ziudu

cc @parisni Are you still on this?

danny0405 avatar May 14 '24 01:05 danny0405