Is your feature request related to a problem or challenge?

@adriangb has a great PR to add dynamic filtering with TopK in DataFusion here:

https://github.com/apache/datafusion/pull/15301

Describe the solution you'd like

It would be really nice to write a blog post about this optimization in DataFusion (and maybe call out how it works)

Here are some related work

High level points

Study DuckDB plan https://github.com/apache/datafusion/issues/15177 (hos much faster)
Went back and forth to design something that could be general purpose
Did benchmarking,
Results are amazing, etc

Describe alternatives you've considered

No response

Additional context

No response

Mar 31 '25 20:03 alamb

take

Apr 06 '25 06:04 aaryyya

Thanks @aaryyya -- note we'll need to actually finish the work before we can publish a blog

Of course, blog driven development does work pretty well -- it is basically we @clflushopt @scsmithr and I did with another project (wrote the outline of the blog we wanted to see and then wrote the code to make it happen 😆 )

https://github.com/apache/datafusion-site/pull/67

Apr 06 '25 11:04 alamb

I think we're ready to publish a blogpost!

Jun 18 '25 01:06 adriangb

Now all we need to do is find time to write one 😆

I think a great visual is to have is some chart showing the performance improvement with/without topk pushdown

Basically like this https://arrow.apache.org/blog/2023/08/05/datafusion_fast_grouping/

Jun 18 '25 20:06 alamb

Here is my suggestion for a blog / outline:

The goal is a technical evangelism piece. The reader should come away having learned something about columnar query engines (not just that DataFusion is great, which it is!)

Title: Using Dynamic Filterers and to make TopK / LIMIT queries much faster

Structure:

Intro

w/ some sort of summary performance chart

Running example: A simple example query -- I think the clickbench Q23 SELECT * FROM hits ORDER BY time DESC LIMIT 10 is a pretty good one as it is so simple but illustrates the point. More details can be summarized from https://github.com/apache/datafusion/issues/15177

Background

Show the plan for Q23

Explain the existing topk optimization (that there is a heap)

Explain that the query does much more work than necessary because it decodes all rows just to throw all but 10 of them away

Introduce the notion of filter pushdown and point out that DataFusion does it at multiple phases

Listing table (prune files)
During opening (prune files again)
During row group / data page filtering
During the scan (if pushdown_filters is on)

Dynamic Filters

Explain that the topk operator knows the minimum time that could be emitted after the plan started -- basically like WHERE time > (current min in top k).

However the current min isn't know at plan time

Then describe the summary technical approach, highlighting that you made it general purpose to aslo support SIPs and other user defined dynamic filters; Also highlight you worked with the community to do this

Add an API for pushing down filter and introducing dynamic filters to ExecutionPlan trait
Add appropriate APIs for updating those filters at runtime and adding new points to prune (e..g on file open)

Results

Show some sort of results if possible

Conclusion / Call to action

This will be released in DataFusion 49

Come help / join us / use DataFusion 🎣

Jul 02 '25 09:07 alamb

I'd also like to list the "next steps" to 🎣 contributors:

Support more join types: https://github.com/apache/datafusion/pull/17090
Better synchronization of dynamic filter updates across partitions: https://github.com/apache/datafusion/pull/16433
Push down more hash join information: https://github.com/apache/datafusion/issues/17171
Implement dynamic filters for more types of joins (CrossJoin, NestedLoopJoin, etc.): ??
Order files in each partition by the ORDER BY clause using statistics (better early termination): https://github.com/apache/datafusion/issues/17271

Aug 17 '25 15:08 adriangb

Initial draft is up! https://github.com/apache/datafusion-site/pull/102

Aug 19 '25 23:08 adriangb

as this feature is complete is this issue still open for documentation?

Oct 16 '25 16:10 aaryyya

If there are updates or improvements you'd like to make this blog post or DataFusion's documentation it's very welcome, just make a new issue / PR!

Oct 16 '25 16:10 adriangb

The final URL is here: https://datafusion.apache.org/blog/2025/09/10/dynamic-filters/

Oct 16 '25 19:10 alamb

Blog post about TopK filter pushdown

Is your feature request related to a problem or challenge?

Describe the solution you'd like

Describe alternatives you've considered

Additional context

Title: Using Dynamic Filterers and to make TopK / LIMIT queries much faster

Structure:

Intro

Background

Dynamic Filters

Results

Conclusion / Call to action