Blog post about TopK filter pushdown
Is your feature request related to a problem or challenge?
@adriangb has a great PR to add dynamic filtering with TopK in DataFusion here:
- https://github.com/apache/datafusion/pull/15301
Describe the solution you'd like
It would be really nice to write a blog post about this optimization in DataFusion (and maybe call out how it works)
Here are some related work
- Snowflake blog: Reimagining Top-K Aggregation Optimization at Snowflake
- Dynamic Filters in Starburst,
High level points
- Study DuckDB plan https://github.com/apache/datafusion/issues/15177 (hos much faster)
- Went back and forth to design something that could be general purpose
- Did benchmarking,
- Results are amazing, etc
Describe alternatives you've considered
No response
Additional context
No response
take
Thanks @aaryyya -- note we'll need to actually finish the work before we can publish a blog
Of course, blog driven development does work pretty well -- it is basically we @clflushopt @scsmithr and I did with another project (wrote the outline of the blog we wanted to see and then wrote the code to make it happen 😆 )
- https://github.com/apache/datafusion-site/pull/67
I think we're ready to publish a blogpost!
Now all we need to do is find time to write one 😆
I think a great visual is to have is some chart showing the performance improvement with/without topk pushdown
Basically like this https://arrow.apache.org/blog/2023/08/05/datafusion_fast_grouping/
Here is my suggestion for a blog / outline:
The goal is a technical evangelism piece. The reader should come away having learned something about columnar query engines (not just that DataFusion is great, which it is!)
Title: Using Dynamic Filterers and to make TopK / LIMIT queries much faster
Structure:
Intro
w/ some sort of summary performance chart
Running example:
A simple example query -- I think the clickbench Q23 SELECT * FROM hits ORDER BY time DESC LIMIT 10 is a pretty good one as it is so simple but illustrates the point. More details can be summarized from https://github.com/apache/datafusion/issues/15177
Background
Show the plan for Q23
Explain the existing topk optimization (that there is a heap)
Explain that the query does much more work than necessary because it decodes all rows just to throw all but 10 of them away
Introduce the notion of filter pushdown and point out that DataFusion does it at multiple phases
- Listing table (prune files)
- During opening (prune files again)
- During row group / data page filtering
- During the scan (if
pushdown_filtersis on)
Dynamic Filters
Explain that the topk operator knows the minimum time that could be emitted after the plan started -- basically like WHERE time > (current min in top k).
However the current min isn't know at plan time
Then describe the summary technical approach, highlighting that you made it general purpose to aslo support SIPs and other user defined dynamic filters; Also highlight you worked with the community to do this
- Add an API for pushing down filter and introducing dynamic filters to ExecutionPlan trait
- Add appropriate APIs for updating those filters at runtime and adding new points to prune (e..g on file open)
Results
Show some sort of results if possible
Conclusion / Call to action
This will be released in DataFusion 49
Come help / join us / use DataFusion 🎣
I'd also like to list the "next steps" to 🎣 contributors:
- Support more join types: https://github.com/apache/datafusion/pull/17090
- Better synchronization of dynamic filter updates across partitions: https://github.com/apache/datafusion/pull/16433
- Push down more hash join information: https://github.com/apache/datafusion/issues/17171
- Implement dynamic filters for more types of joins (CrossJoin, NestedLoopJoin, etc.): ??
- Order files in each partition by the ORDER BY clause using statistics (better early termination): https://github.com/apache/datafusion/issues/17271
Initial draft is up! https://github.com/apache/datafusion-site/pull/102
as this feature is complete is this issue still open for documentation?
If there are updates or improvements you'd like to make this blog post or DataFusion's documentation it's very welcome, just make a new issue / PR!
The final URL is here: https://datafusion.apache.org/blog/2025/09/10/dynamic-filters/