
[Feature] Reduce redundant shuffle for spark dynamic bucket writes

Open · wForget opened this issue 1 year ago · 2 comments

Search before asking

  • [X] I searched in the issues and found nothing similar.

Motivation

Dynamic bucket writing does two shuffles, and the first one (repartitionByKeyPartitionHash) seems unnecessary: it appears to be used only to determine assignId. Since assignId can be calculated from partitionHash, keyHash, numParallelism, and numAssigners, the extra shuffle should not be needed. Can we remove it?

https://github.com/apache/paimon/blob/e27ceb464244f5a0c2bfa2a7c6db649ca945212b/paimon-spark/paimon-spark-common/src/main/scala/org/apache/paimon/spark/commands/PaimonSparkWriter.scala#L143
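
For illustration, here is a minimal sketch of how assignId could be derived as a pure function of the hashes, without any prior shuffle. The helper name computeAssignId and the exact formula are assumptions for the sake of the example, not Paimon's actual API:

```scala
// Hypothetical sketch: derive assignId per record without a prior shuffle.
// The formula below is an assumption about how an assigner task could be
// chosen from the hashes; it is not taken from Paimon's source.
object AssignIdSketch {

  /**
   * Pick an assigner task directly from the record's hashes.
   *
   * @param partitionHash  hash of the record's partition values
   * @param keyHash        hash of the record's primary key
   * @param numParallelism total number of write tasks
   * @param numAssigners   number of assigner tasks
   */
  def computeAssignId(
      partitionHash: Int,
      keyHash: Int,
      numParallelism: Int,
      numAssigners: Int): Int = {
    // Spread partitions across the task slots, then offset by the key hash
    // so that the keys of one partition fan out over the assigners.
    val startChannel = math.abs(partitionHash) % numParallelism
    val offset = math.abs(keyHash) % numAssigners
    (startChannel + offset) % numParallelism
  }

  def main(args: Array[String]): Unit = {
    // The same (partitionHash, keyHash) pair always yields the same
    // assignId, which is why no shuffle would be needed just to compute it.
    println(computeAssignId(partitionHash = 42, keyHash = 7,
      numParallelism = 8, numAssigners = 4))
  }
}
```

Because the result depends only on values already available on each record, every task could compute it locally; the open question raised in the comments below is whether routing records this way still keeps each bucket's data on a single assigner.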

Solution

No response

Anything else?

No response

Are you willing to submit a PR?

  • [x] I'm willing to submit a PR!

wForget · Apr 17 '24 03:04

@YannByron could you please take a look?

wForget · Apr 17 '24 03:04

It is hard. Perhaps different assigners would end up with data for the same bucket.

JingsongLi · Apr 30 '24 11:04