spark icon indicating copy to clipboard operation
spark copied to clipboard

[SPARK-48881][SQL] Some dynamic partitions can be compensated to specific partition values

Open fusheng9399 opened this issue 1 year ago • 9 comments

What changes were proposed in this pull request?

When writing dynamic partitions, some dynamic partitions in InsertIntoHadoopFsRelationCommand can be compensated to specific partition values.

Why are the changes needed?

When executing the following SQL : INSERT OVERWRITE TABLE A PARTITION(event_day, event_type) SELECT id, event_day, event_type from B where event_day = '20240712'

before this pr staticPartitions would be empty in InsertIntoHadoopFsRelationCommand image

after this pr staticPartitions would be specific partition values image

The benefit is that when executing image

in InsertIntoHadoopFsRelationCommand, it can greatly reduce the pressure on hive and improve the efficiency of task execution.

Does this PR introduce any user-facing change?

No

How was this patch tested?

manually test

Was this patch authored or co-authored using generative AI tooling?

No

fusheng9399 avatar Jul 19 '24 08:07 fusheng9399

I haven't reviewed the code changes in the pr yet, but:

  1. The PR title should reflect the work done in the current PR as much as possible, the current title looks more like a Jira title
  2. Please ensure the completeness of the PR description, and the following parts are also required to be filled in:
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

### Was this patch authored or co-authored using generative AI tooling?

  1. In the PR description, it mentions in InsertIntoHadoopFsRelationCommand, it can greatly reduce the pressure on hive and improve the efficiency of task execution. Is this quantifiable?

LuciferYang avatar Jul 22 '24 12:07 LuciferYang

@fusheng-rd Could you add some new test cases for this?

LuciferYang avatar Jul 23 '24 12:07 LuciferYang

I haven't reviewed the code changes in the pr yet, but:

  1. The PR title should reflect the work done in the current PR as much as possible, the current title looks more like a Jira title
  2. Please ensure the completeness of the PR description, and the following parts are also required to be filled in:
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

### Was this patch authored or co-authored using generative AI tooling?
  1. In the PR description, it mentions in InsertIntoHadoopFsRelationCommand, it can greatly reduce the pressure on hive and improve the efficiency of task execution. Is this quantifiable?

When a table named A with millions of partitions executes the following sql:

INSERT OVERWRITE TABLE A PARTITION(event_day, event_type) SELECT id, event_day, event_type from B where event_day = '20240712' 

It will appear in hive that all partitions of table A are fetched at once, which is very likely to cause slow queries in hive metastore and even drag down the overall performance of Hive's metadata queries. At the same time, the job execution is very slow.

After this PR, only the specified partition and its sub-partitions are fetched, which takes milliseconds to seconds.

fusheng9399 avatar Jul 23 '24 14:07 fusheng9399

@fusheng-rd hmm... It seems that the way you merge the code is not quite right.

LuciferYang avatar Aug 07 '24 03:08 LuciferYang

@fusheng-rd hmm... It seems that the way you merge the code is not quite right. Yes, I am fixing that

fusheng9399 avatar Aug 07 '24 03:08 fusheng9399

@fusheng-rd Could you add some new test cases for this?

ok,done~

fusheng9399 avatar Aug 07 '24 08:08 fusheng9399

The idea LGTM. I think it's cleaner if the implementation is an optimizer rule that rewrites InsertIntoHadoopFsRelation command. We can look at the constraints of the input query, if we can infer an equality predicate against a partition column, we can fill the static partitions of the InsertIntoHadoopFsRelation command.

cloud-fan avatar Aug 14 '24 17:08 cloud-fan

The idea LGTM. I think it's cleaner if the implementation is an optimizer rule that rewrites InsertIntoHadoopFsRelation command. We can look at the constraints of the input query, if we can infer an equality predicate against a partition column, we can fill the static partitions of the InsertIntoHadoopFsRelation command.

ok,I will implement it in the optimization rule

fusheng9399 avatar Aug 30 '24 09:08 fusheng9399

Please help review it when you have free time, thanks! @ulysses-you cc @cloud-fan

fusheng9399 avatar Oct 11 '24 02:10 fusheng9399