[SPARK-50005][SQL] Enhance method verifyNotReadPath to identify subqueries hidden in the filter conditions.
What changes were proposed in this pull request?
Enhance method verifyNotReadPath to identify subqueries hidden in the filter conditions.
Why are the changes needed?
Spark SQL throws an exception when an INSERT OVERWRITE writes to a path that the same query also reads from. The check is done by the method 'verifyNotReadPath()' in ddl.scala. Spark SQL already detects simple cases such as:
insert overwrite table output_t select * from output_t;
However, Spark SQL does not detect more complex cases where the output table is read by a subquery hidden inside a filter condition, such as:
insert overwrite table output_t select * from input_t ta where not exists(select tb.id from output_t tb where tb.id = ta.id);
insert overwrite table output_t select * from input_t ta where ta.id in (select id from output_t );
insert overwrite table output_t select * from input_t ta where ta.id < (select max(tb.id) from output_t tb where tb.id=ta.id);
In the scenarios above, the check is bypassed and Spark SQL instead fails with 'java.io.FileNotFoundException: File does not exist', which is confusing because it does not point to the real cause.
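To illustrate why the simple case is caught and the subquery cases are not, here is a minimal sketch on a toy plan model. All types, the object ReadPaths, and the paths are hypothetical stand-ins made up for illustration; they are not Spark's Catalyst classes and not the code in ddl.scala. The sketch only shows the general idea: a traversal that follows plan children misses relations that live inside filter expressions, while a traversal that also descends into subquery expressions finds them.

```scala
// Toy model only: hypothetical stand-ins, not Spark's Catalyst classes.
sealed trait Expr
case class Col(name: String) extends Expr
case class Not(child: Expr) extends Expr
case class Exists(subquery: Plan) extends Expr                    // EXISTS (subquery)
case class InSubquery(value: Expr, subquery: Plan) extends Expr   // value IN (subquery)

sealed trait Plan
case class Relation(path: String) extends Plan
case class Filter(condition: Expr, child: Plan) extends Plan
case class Project(child: Plan) extends Plan

object ReadPaths {
  // Follows plan children only, so subqueries inside filter conditions are missed.
  def shallow(plan: Plan): Set[String] = plan match {
    case Relation(p)      => Set(p)
    case Filter(_, child) => shallow(child)
    case Project(child)   => shallow(child)
  }

  // Also descends into subquery plans embedded in filter conditions.
  def deep(plan: Plan): Set[String] = plan match {
    case Relation(p)         => Set(p)
    case Filter(cond, child) => inExpr(cond) ++ deep(child)
    case Project(child)      => deep(child)
  }

  private def inExpr(e: Expr): Set[String] = e match {
    case Col(_)             => Set.empty
    case Not(c)             => inExpr(c)
    case Exists(sub)        => deep(sub)
    case InSubquery(v, sub) => inExpr(v) ++ deep(sub)
  }

  // Toy counterpart of the check: reject a write whose target is also read.
  def verifyNotReadPath(plan: Plan, outputPath: String): Unit =
    if (deep(plan).contains(outputPath))
      throw new IllegalArgumentException(
        "Can't overwrite a path that is also being read from")
}
```

For a plan shaped like the NOT EXISTS example, Project(Filter(Not(Exists(Relation("/warehouse/output_t"))), Relation("/warehouse/input_t"))), ReadPaths.shallow returns only the input_t path, whereas ReadPaths.deep returns both paths, so the toy verifyNotReadPath rejects a write to "/warehouse/output_t".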
Does this PR introduce any user-facing change?
Yes. The SQL statements above now fail with a more explicit exception, with the message 'Can't overwrite a path that is also being read from'.
How was this patch tested?
Was this patch authored or co-authored using generative AI tooling?
No.