[SPARK-47672][SQL] Avoid double eval from filter pushDown
What changes were proposed in this pull request?
Changes the filter pushdown optimizer to not push a filter down past a projection of the same element if we reasonably expect that computing that element is expensive.
This introduces an "expectedCost" mechanism, which we may or may not want. Previous filter ordering work used filter pushdown as an approximation of expression cost, but here we need more granularity. As an alternative, we could introduce a boolean "expensive" flag rather than a numeric cost.
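To make the idea concrete, here is a minimal sketch of what such a heuristic could look like; the name isLikelyExpensive and the particular set of expressions it flags are illustrative only, not this PR's actual implementation.

```scala
import org.apache.spark.sql.catalyst.expressions._

// Hypothetical "expectedCost"-style check: treat an expression as expensive if
// any node in its tree is user code, a regex, or JSON parsing.
def isLikelyExpensive(e: Expression): Boolean = e.exists {
  case _: ScalaUDF      => true // user code: assume expensive
  case _: RegExpExtract => true // regex evaluation
  case _: JsonToStructs => true // JSON parsing (from_json)
  case _                => false
}
```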
Future Work / What else remains to do?
Right now, if a condition is expensive and it references something in the projection, we don't push down. We could probably do better and gate this on whether the expression being referenced is expensive rather than the condition itself. We could do this as a follow-up item or as part of this PR.
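A rough sketch of that refinement, reusing the isLikelyExpensive check sketched above; the helper name and its exact shape are hypothetical, not code from this PR.

```scala
import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.catalyst.plans.logical.Project

// Hypothetical refinement: resolve the attributes the condition references
// against the Project's aliases and only block pushdown when the *referenced*
// expression is expensive, rather than inspecting the condition itself.
def referencesExpensiveAlias(cond: Expression, project: Project): Boolean = {
  val aliasMap = AttributeMap(project.projectList.collect {
    case a: Alias => a.toAttribute -> a.child
  })
  cond.references.exists { attr =>
    aliasMap.get(attr).exists(isLikelyExpensive) // isLikelyExpensive: the sketch above
  }
}
```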
Why are the changes needed?
Currently, Spark may double-compute expensive operations (like JSON parsing, UDF evaluation, etc.) as a result of filter pushdown past projections.
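A small reproduction sketch of the issue, assuming an active SparkSession named spark; slowParse is a stand-in for an expensive computation such as JSON parsing or a heavy UDF.

```scala
import org.apache.spark.sql.functions._

// Stand-in for an expensive, deterministic computation.
val slowParse = udf((s: String) => s.length)

val df = spark.range(10)
  .withColumn("parsed", slowParse(col("id").cast("string")))
  .filter(col("parsed") > 0)

// Pushing the filter through the projection substitutes the alias's expression
// into the condition, so the optimized plan should show slowParse twice: once in
// the pushed-down Filter and once in the Project above it.
df.explain(true)
```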
Does this PR introduce any user-facing change?
This SQL optimizer change may impact some user query plans; results should be the same and hopefully a little faster.
How was this patch tested?
New tests were added to FilterPushdownSuite, and the initial problem of double evaluation was confirmed with a GitHub gist.
Was this patch authored or co-authored using generative AI tooling?
No
Oh, this is a hard one. The cost of predicates is hard to estimate, and so is the benefit, as we need to estimate the selectivity and the input data volume.
cc @kelvinjian-db @jchen5
It is. In general, since we still apply the filter post-projection: if a user has created a projection with a named field and then filtered on that field, they are probably doing that intentionally because they don't want to double-evaluate the named field. That, plus some basic cost heuristics (simple math is cheap; UDFs can be expensive, and so can regexes), should be a net win.
+CC @shardulm94
Another possible solution would be to break up the projection and move the part of the projection that is used in the filter down with the filter, unless the only thing the projection adds is the filter field, in which case we'd leave it as is.
This logic starts to get more complex, but I think in that case it's probably more of a "pure" win (i.e. no downsides). WDYT @cloud-fan?
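A very rough sketch of what that split could look like as a plan rewrite; the helper name and the handling of pass-through attributes are hypothetical, not code from this PR.

```scala
import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan, Project}

// Hypothetical split: keep the part of the projection the filter needs below the
// Filter, and restore the original output with an outer Project, so any expensive
// expression the condition relies on is still evaluated only once.
def splitProjectionForFilter(filter: Filter): LogicalPlan = filter.child match {
  case Project(projectList, grandChild) =>
    val condRefs = filter.condition.references
    // Entries the condition needs stay below the Filter.
    val (needed, rest) = projectList.partition(ne => condRefs.contains(ne.toAttribute))
    // Attributes the remaining entries read from the child must be passed through.
    val passThrough = rest.flatMap(_.references.toSeq)
    val lower = Project((needed ++ passThrough).distinct, grandChild)
    // The outer Project re-emits the needed entries by attribute and computes the rest.
    val upper = projectList.map {
      case ne if condRefs.contains(ne.toAttribute) => ne.toAttribute
      case other => other
    }
    Project(upper, Filter(filter.condition, lower))
  case _ => filter
}
```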
Do folks have a preference between this approach & the one in https://github.com/apache/spark/pull/46143 ?
CC @cloud-fan do you have thoughts / cycles?
I've been thinking hard about it. Filter pushdown should always be beneficial if we don't duplicate expressions, and the new With expression can avoid expression duplication.
So my proposal is: when we push down a filter and are about to duplicate some expressions, let's use With to avoid it. At the end of the optimizer, we run the rule RewriteWithExpression to rewrite With and pull out common expressions into a Project below. The data source pushdown rule doesn't require the scan node to be the direct child of Filter, so everything should work as before.
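A rough sketch of that proposal, assuming the Catalyst With(...) { refs => ... } helper that backs common-expression elimination; the helper name pushThroughAlias and the exact rewrite are hypothetical, not code from this PR or the existing rules.

```scala
import org.apache.spark.sql.catalyst.expressions._

// Hypothetical pushdown step: instead of inlining the alias's child expression
// into the condition (and duplicating it), wrap it in With so that
// RewriteWithExpression can later pull the shared expression into a Project
// below the Filter rather than evaluating it repeatedly.
def pushThroughAlias(cond: Expression, alias: Alias): Expression =
  With(alias.child) { case Seq(ref) =>
    cond.transform {
      case a: Attribute if a.exprId == alias.exprId => ref
    }
  }
```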
Let me take a look at the With functionality, but that sounds potentially reasonable.
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!