[Feature] Support to include partition in the outputPartition

Open Aitozi opened this issue 9 months ago • 0 comments

Search before asking

[x] I searched in the issues and found nothing similar.

Motivation

In spark SPJ, it supports to report the partition key as the output partition. We could leverage this to increase the parallelism when working with cross partition join.

For example:

t1:

CREATE TABLE t1 (id INT, dt STRING, app STRING) PARTITIONED BY (dt, app) TBLPROPERTIES ('bucket'='10', 'bucket-key' = 'id')

t2:

CREATE TABLE t2 (id INT, dt STRING, app STRING) PARTITIONED BY (dt, app) TBLPROPERTIES ('bucket'='10', 'bucket-key' = 'id')

SELECT * FROM t1 JOIN t2 on t1.id = t2.id and t1.app = t2.app where dt = '20250316'

If we only take the bucket into account, it only could run with the 10 parallelism limited by the bucket. We could let the app also make up the output partition, which will increase the parallelism and bring better performance

Solution

No response

Anything else?

No response

Are you willing to submit a PR?

[x] I'm willing to submit a PR!

Mar 16 '25 15:03 Aitozi