paimon
[Feature] Support including partition keys in the output partitioning
Search before asking
- [x] I searched in the issues and found nothing similar.
Motivation
Spark's storage-partitioned join (SPJ) allows a data source to report its partition keys as part of the output partitioning. We could leverage this to increase parallelism for joins that span multiple partitions.
For example:
t1:
CREATE TABLE t1 (id INT, dt STRING, app STRING) PARTITIONED BY (dt, app) TBLPROPERTIES ('bucket'='10', 'bucket-key' = 'id')
t2:
CREATE TABLE t2 (id INT, dt STRING, app STRING) PARTITIONED BY (dt, app) TBLPROPERTIES ('bucket'='10', 'bucket-key' = 'id')
SELECT * FROM t1 JOIN t2 ON t1.id = t2.id AND t1.app = t2.app WHERE t1.dt = '20250316' AND t2.dt = '20250316'
If only the bucket is taken into account, the join can run with a parallelism of at most 10, the number of buckets. If app were also part of the reported output partitioning, the join could run with higher parallelism, which should improve performance.
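
To make the parallelism argument concrete, here is a minimal Python sketch that counts the join tasks under each grouping. It is only a toy model, not Paimon's or Spark's implementation: the three app values, the `make_splits` and `group_splits` helpers, and the assumption that each (app, bucket) pair yields exactly one split are all hypothetical and chosen for illustration.

```python
from collections import defaultdict
from itertools import product

NUM_BUCKETS = 10                     # matches 'bucket'='10' in the example tables
APPS = ["app_a", "app_b", "app_c"]   # hypothetical app values within one dt


def make_splits(apps, num_buckets):
    """Build one split per (app, bucket) pair (a simplifying assumption)."""
    return [{"app": app, "bucket": b} for app, b in product(apps, range(num_buckets))]


def group_splits(splits, keys):
    """Group splits by the reported partitioning keys; one group ~ one join task."""
    groups = defaultdict(list)
    for split in splits:
        groups[tuple(split[k] for k in keys)].append(split)
    return groups


splits = make_splits(APPS, NUM_BUCKETS)

# Only the bucket is reported: all app values of a bucket land in the same task.
bucket_only = group_splits(splits, ["bucket"])

# app is reported together with the bucket: each (app, bucket) pair is its own task.
app_and_bucket = group_splits(splits, ["app", "bucket"])

print(len(bucket_only))     # 10
print(len(app_and_bucket))  # 30
```

In this model, the number of tasks grows from the bucket count to the bucket count multiplied by the number of distinct app values selected by the query. The grouping stays valid for the example join because app is part of the join condition.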
Solution
No response
Anything else?
No response
Are you willing to submit a PR?
- [x] I'm willing to submit a PR!