[HUDI-6150] Support bucketing for each hive client
Change Logs
This PR:
- introduces a new Hive bucketing spec to be propagated to each client (see the sketch below)
- implements it for HMS and Glue
- changes the implementation of the HiveQL client
- TODO? support sorting
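For reference, Hive declares bucketing in DDL via `CLUSTERED BY (cols) [SORTED BY (cols)] INTO n BUCKETS`. A minimal sketch of what a bucketing spec carried to each sync client could look like; the class and method names are hypothetical, not the PR's actual code:

```java
import java.util.List;

// Hypothetical value class: each sync client (HMS, Glue, HiveQL) would
// translate this spec into its own API calls or DDL.
public final class HiveBucketSpec {
    private final List<String> bucketColumns;
    private final int numBuckets;

    public HiveBucketSpec(List<String> bucketColumns, int numBuckets) {
        this.bucketColumns = bucketColumns;
        this.numBuckets = numBuckets;
    }

    // Renders the Hive DDL clause, e.g. "CLUSTERED BY (uuid) INTO 8 BUCKETS".
    public String toDdlClause() {
        return "CLUSTERED BY (" + String.join(", ", bucketColumns)
                + ") INTO " + numBuckets + " BUCKETS";
    }
}
```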
BTW, I am still not sure the simple bucket index can be considered Hive bucketing, because according to https://issues.apache.org/jira/browse/SPARK-19256:
- Hive 3.0.0 and after: support writing bucketed table with Hive murmur3hash
- Hive 1.x.y and 2.x.y: support writing bucketed table with Hive hivehash.
but so far I am not sure what the current status of Hudi hashing is.
Impact
Describe any public API or user-facing feature change or any performance impact.
Risk level (write none, low, medium or high below)
If medium or high, explain what verification was done to mitigate the risks.
Documentation Update
Describe any necessary documentation update if there is any new feature, config, or user-facing change
- The config description must be updated if new configs are added or the default values of the configs are changed
- Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the instructions to make changes to the website.
Contributor's checklist
- [x] Read through contributor's guide
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
CI report:
- 1b0c06abe95feedd2f03f3507edce1cc4d7c3008 UNKNOWN
- d486fba35f93c250625eeaaefbbfe4c076f5cb0d Azure: FAILURE
Bot commands
@hudi-bot supports the following commands:
@hudi-bot run azure
re-run the last Azure build
> but so far I am not sure what the current status of Hudi hashing is.
It uses only the plain Java hashCode: https://github.com/apache/hudi/blob/20938c30b168d63cf4e520c6b4e1d7b930bed1ab/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/bucket/BucketIdentifier.java#L52
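A rough equivalent of that line, simplified for illustration (the real method hashes the list of record key field values rather than a single string):

```java
// Simplified sketch of how Hudi's simple bucket index derives a bucket id:
// plain Java hashCode, sign bit masked so the modulo stays non-negative.
public static int bucketIdFromKey(String recordKey, int numBuckets) {
    return (recordKey.hashCode() & Integer.MAX_VALUE) % numBuckets;
}
```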
Can you elaborate a little more on the specific functionality of the hashing algorithm for Hive BUCKET? Could a different algorithm cause incorrect query outputs? Or maybe Hive requires the hashing algorithm to come from a very limited set of choices?
> Can you elaborate a little more on the specific functionality of the hashing algorithm for Hive BUCKET? Could a different algorithm cause incorrect query outputs? Or maybe Hive requires the hashing algorithm to come from a very limited set of choices?
According to https://issues.apache.org/jira/browse/SPARK-19256, Hive itself (and also Presto/Trino) is not able to use the Spark hashing algorithm (nor Spark's file name spec, number of files, and sorting). Moreover, Spark itself is not able to exploit Hive bucketing.
So I assume the Hudi way of doing things (which is compliant with neither Hive nor Spark) cannot be used to speed up query-engine operations such as joins and filters. This means all of the below are wrong:
- the current config https://hudi.apache.org/docs/configurations/#hoodiedatasourcehive_syncbucket_sync
- this current PR
- the rfc statement about support of hive bucketing https://cwiki.apache.org/confluence/display/HUDI/RFC+-+29%3A+Hash+Index
> - the rfc statement about support of hive bucketing https://cwiki.apache.org/confluence/display/HUDI/RFC+-+29%3A+Hash+Index

Thanks for the detailed analysis. So what are the actions we can take to make the Hive bucket table take effect on Hive/Presto? Is it as easy as switching to a different hashing algorithm?

If compatible with Hudi bucketing, we could provide multiple configurations for the bucketing, up to the user to select the one they'd like. I can see several aspects that vary, such as:
- hashing
- file naming
- file numbering
- file sorting
As for file numbering, I guess the simple bucket index could support any engine, but consistent hashing would only be supported by Hive 3/Spark 3, since they allow more than one file per bucket. BTW, Athena v3 supports both Spark and Hive bucketing: https://docs.aws.amazon.com/athena/latest/ug/engine-versions-reference-0003.html#engine-versions-reference-0003-improvements-and-new-features
> - hashing
> - file naming
> - file numbering
> - file sorting

Can you elaborate a little more about these items?
Bucket file pattern:
- hudi: bucketId_xxx

Hashing:
- hudi: Java hashCode
- spark: murmur3 hash
- Hive 3.0.0 and after: support writing bucketed table with Hive murmur3hash
- Hive 1.x.y and 2.x.y: support writing bucketed table with Hive hivehash.

File numbering:
File sorting:
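To make the mismatch concrete, here is a small comparison, assuming Guava >= 31 on the classpath. Guava's murmur3 is only an approximation of what Spark and Hive do internally (see the seed discussion below), so this merely illustrates that the hash families disagree on the same key:

```java
import java.nio.charset.StandardCharsets;
import com.google.common.hash.Hashing;

public class HashComparison {
    public static void main(String[] args) {
        String key = "uuid-0001";
        byte[] utf8 = key.getBytes(StandardCharsets.UTF_8);

        // hudi: plain Java hashCode
        int javaHash = key.hashCode();
        // murmur3 x86 32-bit with the default seed 0
        int murmur0 = Hashing.murmur3_32_fixed().hashBytes(utf8).asInt();
        // murmur3 with seed 42, the seed Spark hardcodes (discussed below)
        int murmur42 = Hashing.murmur3_32_fixed(42).hashBytes(utf8).asInt();

        // Three different values, hence three different bucket assignments.
        System.out.printf("java=%d murmur3(0)=%d murmur3(42)=%d%n",
                javaHash, murmur0, murmur42);
    }
}
```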
So it seems the naming convention used by Hudi is compatible with Hive in general (not Spark or Trino); the only concern is the hashing algorithm. I'm afraid the algorithm needs to be consistent too in order for the bucket pruning optimization to work. Can you double check that part?
> I'm afraid the algorithm needs to be consistent too in order for the bucket pruning optimization to work

Not sure I understand. Do you mean the hashing algorithm must be the same as the target engine's? The answer is definitely yes.
> Do you mean the hashing algorithm must be the same as the target engine's? The answer is definitely yes

Yes, I guess so, because that is how bucket pruning works. I'm wondering whether we should make the bucketing algorithm configurable; it should be feasible if we use the Hive murmur3hash algorithm.
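A minimal sketch of what such a configurable algorithm could look like; the enum, its constants, and the Guava-based murmur3 approximation are hypothetical, not existing Hudi API:

```java
import java.nio.charset.StandardCharsets;
import com.google.common.hash.Hashing;

// Hypothetical knob: which hash family drives bucket assignment.
public enum BucketHashFunction {
    JAVA_HASH,      // current Hudi behavior: plain Object#hashCode
    MURMUR3_HIVE3;  // what Hive 3.x expects for bucketed writes

    public int bucketId(String recordKey, int numBuckets) {
        final int hash;
        switch (this) {
            case MURMUR3_HIVE3:
                // Approximated with Guava here; the exact byte-level behavior
                // of Hive's murmur3 would still need to be verified.
                hash = Hashing.murmur3_32_fixed()
                        .hashBytes(recordKey.getBytes(StandardCharsets.UTF_8))
                        .asInt();
                break;
            default:
                hash = recordKey.hashCode();
        }
        return (hash & Integer.MAX_VALUE) % numBuckets;
    }
}
```

Such an enum could be wired to a config key, e.g. something like hoodie.index.bucket.hash.function (a hypothetical name), defaulting to JAVA_HASH so existing tables keep working.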
Hardcoding Murmur is likely a good idea, but it would break existing bucketed tables. Also it wouldn't support Hive 2 users.
As for file naming, I suspect that adding the bucket id also before the file extension (and keeping the prefix), i.e. ${bucketId}_${fileGroupId}_${UUID}_${timestamp}_${bucketId}.parquet/log, would allow supporting both Spark 2 and all Spark 3 releases.
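A sketch of that proposed naming; the helper name and the zero-padding width are illustrative assumptions, not an agreed format:

```java
import java.util.UUID;

public final class BucketFileNaming {
    // Builds e.g. "00000003_<fileGroupId>_<uuid>_<timestamp>_00000003.parquet":
    // bucket id as a prefix (the Hive-style convention) and repeated before
    // the extension (roughly where Spark encodes its bucket suffix).
    public static String makeFileName(int bucketId, String fileGroupId,
                                      String commitTimestamp, String extension) {
        String bucket = String.format("%08d", bucketId);
        return String.format("%s_%s_%s_%s_%s.%s",
                bucket, fileGroupId, UUID.randomUUID(), commitTimestamp, bucket, extension);
    }
}
```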
> Hardcoding Murmur is likely a good idea
Not hardcoding, I mean to make it configurable, the user chooses the algorithm they desire to use.
> would allow supporting both Spark 2 and all Spark 3 releases

We can dig into that further, but for now I would rather keep it simple, as before.
I dug a bit into the Spark murmur3 implementation. It is not standard, for at least two reasons:
- they use a hardcoded seed = 42 (which likely would not be the same as Hive's)
- they acknowledge their way of dealing with murmur is not standard; there is an issue about this, and another implementation (hashUnsafeBytes2) exists, but it is not used so far.
So I am not sure we could use Guava murmur3 as is.
The Spark implementation is based on Catalyst expressions, while in Hudi we work with plain Java types. If we want to use their implementation, we would have to import spark-unsafe as a dependency of hudi-client-common. We could also copy their implementation into Hudi and maintain it. However, in both cases we would have to convert basic Java types into Catalyst types to be able to reuse the Spark implementation (see https://github.com/apache/spark/blob/v3.4.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala#L523-L596). I am not sure it is a good design to introduce Spark concepts within hudi-client-common.
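For reference, reusing Spark's code from Java would look roughly like the sketch below, assuming a spark-unsafe dependency (exactly the coupling being questioned). Murmur3_x86_32 and UTF8String are the Spark classes behind the hash expression linked above, and 42 is the seed Spark hardcodes; treat this as an illustration, not a drop-in implementation:

```java
import org.apache.spark.unsafe.hash.Murmur3_x86_32;
import org.apache.spark.unsafe.types.UTF8String;

public final class SparkCompatibleHash {
    private static final int SPARK_SEED = 42; // hardcoded in Spark's hash expressions

    // Hashes a string roughly the way Spark's murmur3 hash expression
    // handles UTF8String inputs: murmur3 over the raw UTF-8 bytes.
    public static int hashString(String value) {
        UTF8String s = UTF8String.fromString(value);
        return Murmur3_x86_32.hashUnsafeBytes(
                s.getBaseObject(), s.getBaseOffset(), s.numBytes(), SPARK_SEED);
    }
}
```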
> I am not sure it is a good design to introduce Spark concepts within hudi-client-common
Obviously that is a bad design we should avoid. Can we just implement the whole Spark murmur3 inside Hudi? I mean, the data types are not that big a deal: we can use the Avro data types instead, or just use the Spark data types for the Spark impl and the Flink data types for the Flink impl.
Hello, any news on this? I agree with the following points:
> So I assume the Hudi way of doing things (which is compliant with neither Hive nor Spark) cannot be used to speed up query-engine operations such as joins and filters. This means all of the below are wrong:
> - the current config https://hudi.apache.org/docs/configurations/#hoodiedatasourcehive_syncbucket_sync
> - this current PR
> - the rfc statement about support of hive bucketing https://cwiki.apache.org/confluence/display/HUDI/RFC+-+29%3A+Hash+Index
cc @parisni Are you still on this?