datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

Blog post about TopK filter pushdown

Open alamb opened this issue 8 months ago • 4 comments

Is your feature request related to a problem or challenge?

@adriangb has a great PR to add dynamic filtering with TopK in DataFusion here:

  • https://github.com/apache/datafusion/pull/15301

Describe the solution you'd like

It would be really nice to write a blog post about this optimization in DataFusion (and maybe call out how it works)

Here are some related work

High level points

  • Study DuckDB plan https://github.com/apache/datafusion/issues/15177 (hos much faster)
  • Went back and forth to design something that could be general purpose
  • Did benchmarking,
  • Results are amazing, etc

Describe alternatives you've considered

No response

Additional context

No response

alamb avatar Mar 31 '25 20:03 alamb

take

aaryyya avatar Apr 06 '25 06:04 aaryyya

Thanks @aaryyya -- note we'll need to actually finish the work before we can publish a blog

Of course, blog driven development does work pretty well -- it is basically we @clflushopt @scsmithr and I did with another project (wrote the outline of the blog we wanted to see and then wrote the code to make it happen 😆 )

  • https://github.com/apache/datafusion-site/pull/67

alamb avatar Apr 06 '25 11:04 alamb

I think we're ready to publish a blogpost!

adriangb avatar Jun 18 '25 01:06 adriangb

Now all we need to do is find time to write one 😆

I think a great visual is to have is some chart showing the performance improvement with/without topk pushdown

Basically like this https://arrow.apache.org/blog/2023/08/05/datafusion_fast_grouping/

Image

alamb avatar Jun 18 '25 20:06 alamb

Here is my suggestion for a blog / outline:

The goal is a technical evangelism piece. The reader should come away having learned something about columnar query engines (not just that DataFusion is great, which it is!)

Title: Using Dynamic Filterers and to make TopK / LIMIT queries much faster

Structure:

Intro

w/ some sort of summary performance chart

Running example: A simple example query -- I think the clickbench Q23 SELECT * FROM hits ORDER BY time DESC LIMIT 10 is a pretty good one as it is so simple but illustrates the point. More details can be summarized from https://github.com/apache/datafusion/issues/15177

Background

Show the plan for Q23

Explain the existing topk optimization (that there is a heap)

Explain that the query does much more work than necessary because it decodes all rows just to throw all but 10 of them away

Introduce the notion of filter pushdown and point out that DataFusion does it at multiple phases

  • Listing table (prune files)
  • During opening (prune files again)
  • During row group / data page filtering
  • During the scan (if pushdown_filters is on)

Dynamic Filters

Explain that the topk operator knows the minimum time that could be emitted after the plan started -- basically like WHERE time > (current min in top k).

However the current min isn't know at plan time

Then describe the summary technical approach, highlighting that you made it general purpose to aslo support SIPs and other user defined dynamic filters; Also highlight you worked with the community to do this

  • Add an API for pushing down filter and introducing dynamic filters to ExecutionPlan trait
  • Add appropriate APIs for updating those filters at runtime and adding new points to prune (e..g on file open)

Results

Show some sort of results if possible

Conclusion / Call to action

This will be released in DataFusion 49

Come help / join us / use DataFusion 🎣

alamb avatar Jul 02 '25 09:07 alamb

I'd also like to list the "next steps" to 🎣 contributors:

  • Support more join types: https://github.com/apache/datafusion/pull/17090
  • Better synchronization of dynamic filter updates across partitions: https://github.com/apache/datafusion/pull/16433
  • Push down more hash join information: https://github.com/apache/datafusion/issues/17171
  • Implement dynamic filters for more types of joins (CrossJoin, NestedLoopJoin, etc.): ??
  • Order files in each partition by the ORDER BY clause using statistics (better early termination): https://github.com/apache/datafusion/issues/17271

adriangb avatar Aug 17 '25 15:08 adriangb

Initial draft is up! https://github.com/apache/datafusion-site/pull/102

adriangb avatar Aug 19 '25 23:08 adriangb

as this feature is complete is this issue still open for documentation?

aaryyya avatar Oct 16 '25 16:10 aaryyya

If there are updates or improvements you'd like to make this blog post or DataFusion's documentation it's very welcome, just make a new issue / PR!

adriangb avatar Oct 16 '25 16:10 adriangb

The final URL is here: https://datafusion.apache.org/blog/2025/09/10/dynamic-filters/

alamb avatar Oct 16 '25 19:10 alamb