[SPARK-49881][SQL] Skipping DeduplicateRelations rule conditionally to improve analyzer perf
What changes were proposed in this pull request?
This PR improves analyzer performance by skipping the DeduplicateRelations rule when it is guaranteed upfront that the query plan contains no duplicate relations.
The code changes involved are:
- The MultiInstanceRelations present in a query plan are stored as a field in QueryExecution.
- If it is possible to know beforehand which relations will be present in a query, a non-empty relations set is passed to the QueryExecution constructor.
- If it is known upfront that there will be duplicate relations, an empty set is passed to the QueryExecution constructor.
- If the relations set is non-empty, a marker plan is placed on top of the root node before analysis.
- In the DeduplicateRelations rule, the presence of the marker plan tells the code to skip applying the rule. The rule removes the marker node and stores the information in AnalysisContext for subsequent iterations of the batch.
- The information is stored in AnalysisContext because some rules (for example, the windowing-related rules) may strip the top node, which in this case is the marker node.
- For DataFrame APIs involving only one dataset, duplicate relations cannot occur as long as no subqueries are present in the projection or filter clause (which is easily determined using the bitmap present in the node). The existing relations set can therefore be passed to the QueryExecution constructor when creating new DataFrames.
- For operations such as join or union, where two datasets are involved, an empty intersection of the two relations sets should mean that duplicate relations are not possible, so set1 ++ set2 can be passed to the resulting QueryExecution.
- If set1 intersect set2 is non-empty, duplicate relations are implied, and QueryExecution is passed an empty relations set.
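
The set-propagation rules in the list above can be sketched roughly as follows. This is a minimal, Spark-free illustration, not the code in this PR: `RelationId`, `RelationTracking`, `unary`, and `binary` are hypothetical names. It also assumes that an empty input set means "no guarantee", so an empty input always yields an empty result.

```scala
// Hypothetical stand-in for a tracked relation; not a Spark type.
final case class RelationId(name: String)

object RelationTracking {
  // Operation on a single dataset: keep the existing set unless a subquery
  // appears in the projection or filter, in which case the guarantee is lost.
  def unary(relations: Set[RelationId], hasSubquery: Boolean): Set[RelationId] =
    if (hasSubquery) Set.empty else relations

  // Operation on two datasets (join, union, ...).
  // Assumption: an empty input means "unknown", so it propagates as empty.
  def binary(left: Set[RelationId], right: Set[RelationId]): Set[RelationId] =
    if (left.isEmpty || right.isEmpty) Set.empty
    else if ((left intersect right).nonEmpty) Set.empty // duplicates implied
    else left ++ right                                  // disjoint: safe to skip the rule
}
```

Under these assumptions, joining two datasets that read different tables yields the union of their sets, while a self-join yields an empty set and falls back to running the rule.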
Why are the changes needed?
The deduplication rule is expensive because it has to detect duplicate relations and then alter the exprIDs etc. In some situations it can be determined in advance whether a given plan will contain duplicate relations. When it is known that it will not, the rule can be skipped.
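
As a rough illustration of how a marker node could let the rule be skipped, here is a toy sketch that does not use Spark's classes. `Plan`, `SkipDedupMarker`, `Context`, and `DedupSketch` are hypothetical names invented for this example; `Context` only plays the role that AnalysisContext has in the description above, and `expensiveDedup` is a placeholder for the real deduplication work.

```scala
// Toy plan tree; these are not Spark's LogicalPlan classes.
sealed trait Plan
final case class Relation(name: String) extends Plan
final case class Project(child: Plan) extends Plan
// Hypothetical marker placed on top of the plan before analysis.
final case class SkipDedupMarker(child: Plan) extends Plan

// Hypothetical per-analysis state that survives across batch iterations.
final class Context { var skipDedup: Boolean = false }

object DedupSketch {
  // Placeholder for the costly duplicate detection and exprId rewriting.
  private def expensiveDedup(plan: Plan): Plan = plan

  def apply(plan: Plan, ctx: Context): Plan = plan match {
    case SkipDedupMarker(child) =>
      ctx.skipDedup = true // remember the decision for later iterations
      child                // strip the marker and skip the rule
    case other if ctx.skipDedup =>
      other                // marker already consumed: keep skipping
    case other =>
      expensiveDedup(other)
  }
}
```

Recording the decision in the context, rather than relying on the marker alone, keeps the skip in effect even if another rule strips the top node between iterations.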
Does this PR introduce any user-facing change?
No
How was this patch tested?
Existing tests should pass. New tests will be added.
Was this patch authored or co-authored using generative AI tooling?
No