[SPARK-47672][SQL] Avoid double eval from filter pushDown
What changes were proposed in this pull request?
Changes the filter pushdown optimizer to not push a filter down past a projection of the same element if we reasonably expect that computing that element is expensive.
This introduces an "expectedCost" mechanism, which we may or may not want. Previous filter ordering work used filter pushdown as an approximation of expression cost, but here we need more granularity. As an alternative, we could introduce a boolean "expensive" flag rather than a numeric cost.
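To make the idea concrete, here is a minimal sketch of what such a heuristic could look like; the name isLikelyExpensive and the particular set of expressions it flags are illustrative only, not this PR's actual implementation.

```scala
import org.apache.spark.sql.catalyst.expressions._

// Hypothetical "expectedCost"-style check: treat an expression as expensive if
// any node in its tree is user code, a regex, or JSON parsing.
def isLikelyExpensive(e: Expression): Boolean = e.exists {
  case _: ScalaUDF      => true // user code: assume expensive
  case _: RegExpExtract => true // regex evaluation
  case _: JsonToStructs => true // JSON parsing (from_json)
  case _                => false
}
```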
Future Work / What else remains to do?
Right now, if a condition is expensive and it references something in the projection, we don't push down. We could probably do better and gate this on whether the expression being referenced is expensive rather than the condition itself. We could do this as a follow-up item or as part of this PR.
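A rough sketch of that refinement, reusing the isLikelyExpensive check sketched above; the helper name and its exact shape are hypothetical, not code from this PR.

```scala
import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.catalyst.plans.logical.Project

// Hypothetical refinement: resolve the attributes the condition references
// against the Project's aliases and only block pushdown when the *referenced*
// expression is expensive, rather than inspecting the condition itself.
def referencesExpensiveAlias(cond: Expression, project: Project): Boolean = {
  val aliasMap = AttributeMap(project.projectList.collect {
    case a: Alias => a.toAttribute -> a.child
  })
  cond.references.exists { attr =>
    aliasMap.get(attr).exists(isLikelyExpensive) // isLikelyExpensive: the sketch above
  }
}
```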
Why are the changes needed?
Currently, Spark may double-compute expensive operations (like JSON parsing, UDF evaluation, etc.) as a result of filter pushdown past projections.
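A small reproduction sketch of the issue, assuming an active SparkSession named spark; slowParse is a stand-in for an expensive computation such as JSON parsing or a heavy UDF.

```scala
import org.apache.spark.sql.functions._

// Stand-in for an expensive, deterministic computation.
val slowParse = udf((s: String) => s.length)

val df = spark.range(10)
  .withColumn("parsed", slowParse(col("id").cast("string")))
  .filter(col("parsed") > 0)

// Pushing the filter through the projection substitutes the alias's expression
// into the condition, so the optimized plan should show slowParse twice: once in
// the pushed-down Filter and once in the Project above it.
df.explain(true)
```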
Does this PR introduce any user-facing change?
This SQL optimizer change may impact some user query plans; results should be the same and hopefully a little faster.
How was this patch tested?
New tests were added to FilterPushdownSuite, and the initial problem of double evaluation was confirmed with a GitHub gist.
Was this patch authored or co-authored using generative AI tooling?
No
Oh, this is a hard one. The cost of predicates is hard to estimate, and so is the benefit, as we need to estimate the selectivity and the input data volume.
cc @kelvinjian-db @jchen5
It is. In general, since we still apply the filter post-projection: if a user has created a projection with a named field and then filtered on that field, they are probably doing that intentionally because they don't want to double-evaluate the named field. That, plus some basic cost heuristics (simple math is cheap; UDFs can be expensive, and so can regexes), should be a net win.
+CC @shardulm94
Another possible solution would be to break up the projection and move the part of the projection that is used in the filter down with the filter, unless the only thing the projection adds is the filter field, in which case we'd leave it as is.
This logic starts to get more complex, but I think in that case it's probably more of a "pure" win (i.e. no downsides). WDYT @cloud-fan?
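A very rough sketch of what that split could look like as a plan rewrite; the helper name and the handling of pass-through attributes are hypothetical, not code from this PR.

```scala
import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan, Project}

// Hypothetical split: keep the part of the projection the filter needs below the
// Filter, and restore the original output with an outer Project, so any expensive
// expression the condition relies on is still evaluated only once.
def splitProjectionForFilter(filter: Filter): LogicalPlan = filter.child match {
  case Project(projectList, grandChild) =>
    val condRefs = filter.condition.references
    // Entries the condition needs stay below the Filter.
    val (needed, rest) = projectList.partition(ne => condRefs.contains(ne.toAttribute))
    // Attributes the remaining entries read from the child must be passed through.
    val passThrough = rest.flatMap(_.references.toSeq)
    val lower = Project((needed ++ passThrough).distinct, grandChild)
    // The outer Project re-emits the needed entries by attribute and computes the rest.
    val upper = projectList.map {
      case ne if condRefs.contains(ne.toAttribute) => ne.toAttribute
      case other => other
    }
    Project(upper, Filter(filter.condition, lower))
  case _ => filter
}
```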
Do folks have a preference between this approach & the one in https://github.com/apache/spark/pull/46143 ?
CC @cloud-fan do you have thoughts / cycles?
I've been thinking hard about it. Filter pushdown should always be beneficial if we don't duplicate expressions, and the new With expression can avoid expression duplication.
So my proposal is: when we push down a filter and are about to duplicate some expressions, let's use With to avoid it. At the end of the optimizer, we run the rule RewriteWithExpression to rewrite With and pull out common expressions into a Project below. The data source pushdown rule doesn't require the scan node to be the direct child of Filter, so everything should work as before.
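A rough sketch of that proposal, assuming the Catalyst With(...) { refs => ... } helper that backs common-expression elimination; the helper name pushThroughAlias and the exact rewrite are hypothetical, not code from this PR or the existing rules.

```scala
import org.apache.spark.sql.catalyst.expressions._

// Hypothetical pushdown step: instead of inlining the alias's child expression
// into the condition (and duplicating it), wrap it in With so that
// RewriteWithExpression can later pull the shared expression into a Project
// below the Filter rather than evaluating it repeatedly.
def pushThroughAlias(cond: Expression, alias: Alias): Expression =
  With(alias.child) { case Seq(ref) =>
    cond.transform {
      case a: Attribute if a.exprId == alias.exprId => ref
    }
  }
```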
Let me take a look at the With functionality, but that sounds potentially reasonable.
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!