[Feature] Support a pluggable hash function for Paimon buckets
Search before asking
- [x] I searched in the issues and found nothing similar.
Motivation
Currently, Paimon computes the bucket hash value from BinaryRow#hashCode. This has two drawbacks:
- The hash logic is bound to the BinaryRow structure. If we want to join a Paimon table with a Hive table, one side of the join needs a shuffle, because the two tables are distributed by different hash functions.
- To avoid that shuffle by treating the Paimon table as a bucketed table, we would need to reshuffle the Hive table according to Paimon's hash rule, but the BinaryRow-based hash logic is hard to port to other engines.
So I propose making the hash function pluggable. This way, we could introduce HiveHash or other hash functions to work with other compute engines.
Solution
- Introduce a hash function interface:
public interface HashFunction {
    int hash(BinaryRow row);
}
- Adapt the reader and writer to the new hash function interface
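To make the proposal concrete, here is a minimal, self-contained sketch of how bucket assignment could delegate to a pluggable hash function. Everything in it beyond the `HashFunction` shape is an assumption for illustration: `KeyRow` is a hypothetical stand-in for BinaryRow so the example runs without Paimon, `ARRAY_HASH` is a placeholder for a layout-dependent hash, `HIVE_STYLE_HASH` is only loosely modeled on a `31 * h + columnHash` combiner and is not Hive's actual implementation, and the `bucket` helper (including the sign-bit mask) is not Paimon's real bucket computation.

```java
import java.util.Arrays;

final class PluggableBucketHashSketch {

    /** Hypothetical stand-in for BinaryRow: holds only int bucket-key columns. */
    static final class KeyRow {
        final int[] keys;

        KeyRow(int... keys) {
            this.keys = keys;
        }
    }

    /** Mirrors the proposed interface, with KeyRow in place of BinaryRow. */
    interface HashFunction {
        int hash(KeyRow row);
    }

    /** Placeholder for a hash tied to the row's in-memory representation (assumption). */
    static final HashFunction ARRAY_HASH = row -> Arrays.hashCode(row.keys);

    /** Simplified Hive-style combiner (assumption): 31 * h + column value. */
    static final HashFunction HIVE_STYLE_HASH = row -> {
        int h = 0;
        for (int k : row.keys) {
            h = 31 * h + k;
        }
        return h;
    };

    /** Hypothetical helper: clears the sign bit, then takes the modulo. */
    static int bucket(HashFunction fn, KeyRow row, int numBuckets) {
        return (fn.hash(row) & Integer.MAX_VALUE) % numBuckets;
    }

    public static void main(String[] args) {
        KeyRow row = new KeyRow(7, 42);
        // The same key lands in different buckets under different hash functions.
        System.out.println(bucket(ARRAY_HASH, row, 4));      // prints 0
        System.out.println(bucket(HIVE_STYLE_HASH, row, 4)); // prints 3
    }
}
```

The same key maps to bucket 0 under one function and bucket 3 under the other, which is why a join between tables bucketed by different hash functions needs a shuffle. If both sides can be configured with the same `HashFunction`, their bucket layouts line up.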
Anything else?
No response
Are you willing to submit a PR?
- [x] I'm willing to submit a PR!
CC @JingsongLi