datafusion Do not push down filter through distinct on

Which issue does this PR close?

Closes #12942.

Rationale for this change

Written in issue.

What changes are included in this PR?

Fix for PushDownFilter with DistinctOn.

Are these changes tested?

Yes

Are there any user-facing changes?

No

Oct 15 '24 14:10 epsio-banay

Marking as draft as I think this PR is no longer waiting on feedback. Please mark it as ready for review when it is ready for another look

Oct 18 '24 19:10 alamb

Thank you @epsio-banay and @findepi and @eejbyfeldt for your help

Oct 18 '24 19:10 alamb

Are we certain that the filter should not be pushed down, and moreover that it can be pushed down in some cases but not in other cases?

This is unexpected to me, as my intuition assumes that the WHERE clause always filters the pre-aggregated data (i.e. it is always pushed down)—double checking on Postgres this seems to be confirmed

postgres@localhost:postgres> explain select distinct on (a) a, b from foo where b = 1 order by a, b desc;
+-----------------------------------------------------------------+
| QUERY PLAN                                                      |
|-----------------------------------------------------------------|
| Unique  (cost=38.44..38.50 rows=11 width=8)                     |
|   ->  Sort  (cost=38.44..38.47 rows=11 width=8)                 |
|         Sort Key: a                                             |
|         ->  Seq Scan on foo  (cost=0.00..38.25 rows=11 width=8) |
|               Filter: (b = 1)                                   |
+-----------------------------------------------------------------+

Granted with DISTINCT ON you can't use HAVING (as there's no explicit GROUP BY), but I guess one should be using a subquery with an additional filter on top of the nested DISTINCT ON in such cases.

Oct 29 '24 08:10 gruuya

This is unexpected to me, as my intuition assumes that the WHERE clause always filters the pre-aggregated data (i.e. it is always pushed down)

I don't think "always pushed down" is the correct description here. My understanding is that it would be planned as a filter before the aggregation and therefore would not needed to be pushed down. In DataFusion this can be seen if we do

> EXPLAIN VERBOSE select distinct on (a) a, b from foo where b = 1 order by a, b desc;

+------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type                                                  | plan                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |
+------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| initial_logical_plan                                       | DistinctOn: on_expr=[[foo.a]], select_expr=[[foo.a, foo.b]], sort_expr=[[foo.a ASC NULLS FIRST, foo.b DESC NULLS LAST]]                                                                                                                                                                                                                                                                                                                                                                                                              |
|                                                            |   Filter: foo.b = Int64(1)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |
|                                                            |     TableScan: foo                              
...

To get the filter above the aggregation (in postgres) we can do

> explain select * from (select distinct on (a) a, b from foo) t where b = 1 order by a, b desc;
                               QUERY PLAN                                
-------------------------------------------------------------------------
 Subquery Scan on t  (cost=158.51..172.31 rows=1 width=8)
   Filter: (t.b = 1)
   ->  Unique  (cost=158.51..169.81 rows=200 width=8)
         ->  Sort  (cost=158.51..164.16 rows=2260 width=8)
               Sort Key: foo.a
               ->  Seq Scan on foo  (cost=0.00..32.60 rows=2260 width=8)

e.g not pushed down and if we change the filter to a = 1 we get

> explain select * from (select distinct on (a) a, b from foo) t where a = 1 order by a, b desc;
                           QUERY PLAN                            
-----------------------------------------------------------------
 Sort  (cost=38.55..38.58 rows=11 width=8)
   Sort Key: foo.b DESC
   ->  Unique  (cost=0.00..38.25 rows=11 width=8)
         ->  Seq Scan on foo  (cost=0.00..38.25 rows=11 width=8)
               Filter: (a = 1)

so this filter gets pushed down.

So postgres seems to agree with us that the filter can be pushed down in some case and not in others.

Oct 29 '24 12:10 eejbyfeldt

To get the filter above the aggregation (in postgres) we can do

Ahh good point. I think my misunderstanding here is that I was concentrating on the WHERE clause within the DISTINCT ON (sub)query.

So yes, I agree we should push down only compatible expressions from outside of the DISTINCT ON subquery, but the filter from within that (sub)query should always be pushed down (and I think there's a test missing for that).

Oct 29 '24 13:10 gruuya

Marking as a draft as I think this PR is no longer waiting on feedback (and I am trying to work down the review queue).

Please mark it as ready for review when it is ready for another look

Nov 03 '24 11:11 alamb

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

Jan 03 '25 01:01 github-actions[bot]