[Feature] Add support of pluggable Hash function for paimon bucket

Open Aitozi opened this issue 9 months ago • 1 comments

Search before asking

  • [x] I searched in the issues and found nothing similar.

Motivation

Currently, Paimon computes the bucket hash from BinaryRow#hashCode. This has two drawbacks.

  1. The hash logic is bound to the BinaryRow structure. If we want to join a Paimon table with a Hive table, one side of the join has to be shuffled, because the two tables use different distribution hash functions.
  2. To avoid that shuffle by using the Paimon table as a bucketed table, we would have to reshuffle the Hive table according to Paimon's hash rule, but the BinaryRow-based hash logic is hard to port to other engines.

So I propose making the hash function pluggable. This way, we could introduce a HiveHash or other implementations to work with other compute engines.

Solution

  1. Introduce the hash function interface
public interface HashFunction {

    int hash(BinaryRow row);

}
  2. Adapt the reader and writer to the new hash function interface
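
To make the two steps above concrete, here is a rough, self-contained sketch of how a pluggable hash could drive bucket assignment. It is only an illustration under several assumptions: the bucket key is modeled as a `List<Object>` instead of Paimon's BinaryRow so the snippet compiles on its own, and the names `PluggableBucketHashSketch`, `DEFAULT`, `FIELD_WISE`, and `bucket` are hypothetical. The field-wise hash is a stand-in for an engine-neutral function and is not Hive's actual implementation; the `floorMod` mapping is likewise an assumption, not necessarily what Paimon does today.

```java
import java.util.List;
import java.util.Objects;

final class PluggableBucketHashSketch {

    /** Stand-in for the proposed interface; the real one would take a BinaryRow. */
    interface HashFunction {
        int hash(List<Object> bucketKey);
    }

    /** Hypothetical default: delegates to the key's own hashCode (structure-bound). */
    static final HashFunction DEFAULT = key -> key.hashCode();

    /** Hypothetical engine-neutral variant: combines per-field hashes, starting from 0. */
    static final HashFunction FIELD_WISE = key -> {
        int result = 0;
        for (Object field : key) {
            result = 31 * result + Objects.hashCode(field);
        }
        return result;
    };

    /** Maps a key to a bucket; floorMod keeps the result in [0, numBuckets). */
    static int bucket(HashFunction fn, List<Object> bucketKey, int numBuckets) {
        return Math.floorMod(fn.hash(bucketKey), numBuckets);
    }

    public static void main(String[] args) {
        List<Object> key = List.of(1, 2);
        // The same key can land in different buckets depending on the plugged-in function.
        System.out.println("default bucket:    " + bucket(DEFAULT, key, 4));
        System.out.println("field-wise bucket: " + bucket(FIELD_WISE, key, 4));
    }
}
```

The point of the sketch is that both the writer (when assigning a bucket) and the reader (when pruning or co-locating buckets) would go through the same `HashFunction` instance, so whichever function is configured for the table has to be used consistently on both sides.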

Anything else?

No response

Are you willing to submit a PR?

  • [x] I'm willing to submit a PR!

Aitozi avatar Apr 09 '25 12:04 Aitozi

CC @JingsongLi

Aitozi avatar Apr 09 '25 12:04 Aitozi