[Feature] Reduce redundant shuffle for spark dynamic bucket writes
Search before asking
- [X] I searched in the issues and found nothing similar.
Motivation
Dynamic bucket writing currently does two shuffles. The first one, repartitionByKeyPartitionHash, seems unnecessary: it appears to be used only to determine the assignId. However, the assignId can be calculated from partitionHash/keyHash/numParallelism/numAssigners, so the extra shuffle should not be needed. Can we remove it?
https://github.com/apache/paimon/blob/e27ceb464244f5a0c2bfa2a7c6db649ca945212b/paimon-spark/paimon-spark-common/src/main/scala/org/apache/paimon/spark/commands/PaimonSparkWriter.scala#L143
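For illustration only, a minimal sketch of the idea: derive the assign id purely from the partition hash, key hash, writer parallelism, and number of assigners, with no prior repartition. The object and method names below (AssignIdSketch, computeAssignId) are hypothetical and are not the actual Paimon API; the formula is an assumed placement scheme, not the one Paimon uses.

```scala
// Hypothetical sketch: compute an assigner task id directly from hashes,
// instead of shuffling records to assigners first. Names and formula are
// illustrative only, not taken from Paimon.
object AssignIdSketch {

  /**
   * Map a record to one of the writer tasks using only its partition hash
   * and key hash. Each partition is handled by `numAssigners` tasks spread
   * over `numParallelism` writer tasks.
   */
  def computeAssignId(
      partitionHash: Int,
      keyHash: Int,
      numParallelism: Int,
      numAssigners: Int): Int = {
    // Starting task for this partition, so partitions spread across writers.
    val startChannel = Math.floorMod(partitionHash, numParallelism)
    // Offset within the assigners responsible for this partition.
    val offset = Math.floorMod(keyHash, numAssigners)
    (startChannel + offset) % numParallelism
  }

  def main(args: Array[String]): Unit = {
    // Example: 8 writer tasks, 4 assigners per partition.
    println(computeAssignId(partitionHash = 42, keyHash = 7, numParallelism = 8, numAssigners = 4))
  }
}
```

If something like this holds, the assign id would be a pure function of values already available on each input row, which is why the first repartition looks removable.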
Solution
No response
Anything else?
No response
Are you willing to submit a PR?
- [x] I'm willing to submit a PR!
@YannByron could you please take a look?
It is hard. Perhaps different assigners will end up with data for the same bucket.