[SPARK-48881][SQL] Some dynamic partitions can be compensated to specific partition values
What changes were proposed in this pull request?
When writing dynamic partitions, some dynamic partitions in InsertIntoHadoopFsRelationCommand can be compensated to specific partition values.
Why are the changes needed?
When executing the following SQL : INSERT OVERWRITE TABLE A PARTITION(event_day, event_type) SELECT id, event_day, event_type from B where event_day = '20240712'
before this pr staticPartitions would be empty in InsertIntoHadoopFsRelationCommand
after this pr staticPartitions would be specific partition values
The benefit is that when executing
in InsertIntoHadoopFsRelationCommand, it can greatly reduce the pressure on hive and improve the efficiency of task execution.
Does this PR introduce any user-facing change?
No
How was this patch tested?
manually test
Was this patch authored or co-authored using generative AI tooling?
No
I haven't reviewed the code changes in the pr yet, but:
- The PR title should reflect the work done in the current PR as much as possible, the current title looks more like a Jira title
- Please ensure the completeness of the PR description, and the following parts are also required to be filled in:
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
### Was this patch authored or co-authored using generative AI tooling?
- In the PR description, it mentions
in InsertIntoHadoopFsRelationCommand, it can greatly reduce the pressure on hive and improve the efficiency of task execution.Is this quantifiable?
@fusheng-rd Could you add some new test cases for this?
I haven't reviewed the code changes in the pr yet, but:
- The PR title should reflect the work done in the current PR as much as possible, the current title looks more like a Jira title
- Please ensure the completeness of the PR description, and the following parts are also required to be filled in:
### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ### Was this patch authored or co-authored using generative AI tooling?
- In the PR description, it mentions
in InsertIntoHadoopFsRelationCommand, it can greatly reduce the pressure on hive and improve the efficiency of task execution.Is this quantifiable?
When a table named A with millions of partitions executes the following sql:
INSERT OVERWRITE TABLE A PARTITION(event_day, event_type) SELECT id, event_day, event_type from B where event_day = '20240712'
It will appear in hive that all partitions of table A are fetched at once, which is very likely to cause slow queries in hive metastore and even drag down the overall performance of Hive's metadata queries. At the same time, the job execution is very slow.
After this PR, only the specified partition and its sub-partitions are fetched, which takes milliseconds to seconds.
@fusheng-rd hmm... It seems that the way you merge the code is not quite right.
@fusheng-rd hmm... It seems that the way you merge the code is not quite right. Yes, I am fixing that
@fusheng-rd Could you add some new test cases for this?
ok,done~
The idea LGTM. I think it's cleaner if the implementation is an optimizer rule that rewrites InsertIntoHadoopFsRelation command. We can look at the constraints of the input query, if we can infer an equality predicate against a partition column, we can fill the static partitions of the InsertIntoHadoopFsRelation command.
The idea LGTM. I think it's cleaner if the implementation is an optimizer rule that rewrites
InsertIntoHadoopFsRelationcommand. We can look at the constraints of the input query, if we can infer an equality predicate against a partition column, we can fill the static partitions of theInsertIntoHadoopFsRelationcommand.
ok,I will implement it in the optimization rule
Please help review it when you have free time, thanks! @ulysses-you cc @cloud-fan