Support datafusion dynamic filter expressions
Starting with DataFusion's 49 release, they support dynamic filter expressions, which give a significant performance boost in cases like top-k queries.
A lot of performance wins in DataFusion are driven by this change, so we should push it to Vortex competitive.
While they do support them internally they never push them down through optimizer so we have to wait for more datafusion features to land before we do anything here. Let's reopen this once there's a release we can test the feature with
I have forgotten that we filter expression, it's there
Just to expand on the context - I would love to support datafusion dynamic expressions, but they are extremely flexible. They probably don't fit with our current dynamic expression implementation (that was built for DuckDB). I've made an attempt that I didn't push all the way to the end to add a new expression just to wrap datafusion dynamic expressions, but I'm not sure how to handle cases where the inner expression is actually something we don't know how to translate, because once we say we accept it, I'm not sure upstream physical operators will run it again.
I'm not sure how to handle cases where the inner expression is actually something we don't know how to translate, because once we say we accept it, I'm not sure upstream physical operators will run it again
I think in general you should push down the expression and do a best-effort to evaluate it at runtime but say No/Unsupported for dynamic expressions. This is always safe. Most likely it won't be "re evaluated" upstream e.g. HashJoinExec produces a dynamic physical expression but then doesn't actually use it. TopK I think does use it to pre-filter rows before comparing them in the heap, in theory it could make use of pushdown information to skip this but that seems like a complex API to get right, probably not worth the effort.
In terms of what to do with / how to use those expressions once they've been pushed down: Parquet is able to evaluate arbitrary expressions because it does it once it has Arrow arrays in memory. Since Vortex needs to create a Vortex predicate out of a DataFusion one I think you should call PhysicalExpr::simplify() as late as possible (i.e. in the FileOpener)to get back a static expression. This does reconstruct the entire tree, it's not the cheapest. Might be worth calling is_dynamic_physical_expr or something beforehand to check, but that seems like a detail. I'd guess a good place to do it would be TryFromDataFusion<dyn PhysicalExpr> for ExprRef. You'll also have to update can_be_pushed_down or its usage to allow these filters to end up in the Vortex predicate.
The good news is that all of the current filters simplify into very basic binary comparisons (e.g. col > 5). We have plans to add some more complex filters e.g. pushing down join hash tables, I guess Vortex just won't be able to use those. Would you be able to push down a bloom filter if that was what DataFusion produced?
Just to expand on the context - I would love to support datafusion dynamic expressions, but they are extremely flexible. They probably don't fit with our current dynamic expression implementation (that was built for DuckDB).
It's not too late to make changes to dynamic filter expressions in DataFusion. Do you have any insights on how they work in DuckDB? Any lessons we can learn?
The general difficulty we have as a framework that wants to do its own filtering we have our own expression representation and as such we can't just take a datafusion expression since they assume arrow data input. We could wrap it up in such a way that we fallback to arrow if we can't translate the expression but it's unclear if that's better for overall execution or not, it probably is but it's unnecessary complexity. For comparison DuckDB gives us a structure of the expression (here's the translation logic in rust from their representation to vortex https://github.com/vortex-data/vortex/blob/3a78b90553d3bb419bdd8167d90fc2f7955cf034/vortex-duckdb/src/convert/table_filter.rs#L76-L106) which is col <comparison> <dynamic value>, it's a lot more restrictive api that's easier to adapt, most importantly though, duckdb doesn't assume the type of the data input to the expression, just gives us an expression
I see that makes sense. I considered a structure like that for DataFusion but decided against it because:
- It wasn't always obvious what to initialize
<dynamic value>with when it was unknown, e.g. for strings. - We wanted to support arbitrary expressions being pushed down like
IN LIST, a reference to a hash join's hash table, etc.
The second could be solved by having more than one type of dynamic expression (maybe a completely dynamic expression and one in terms of a dynamic literal expression).
In any case I think if you use the approach suggested in https://github.com/vortex-data/vortex/issues/4034#issuecomment-3404464535 calling simplify() and such it should work out.