Do not push down filter through distinct on
Which issue does this PR close?
Closes #12942.
Rationale for this change
Written in issue.
What changes are included in this PR?
Fix for PushDownFilter with DistinctOn.
Are these changes tested?
Yes
Are there any user-facing changes?
No
Marking as draft as I think this PR is no longer waiting on feedback. Please mark it as ready for review when it is ready for another look
Thank you @epsio-banay and @findepi and @eejbyfeldt for your help
Are we certain that the filter should not be pushed down, and moreover that it can be pushed down in some cases but not in other cases?
This is unexpected to me, as my intuition assumes that the WHERE clause always filters the pre-aggregated data (i.e. it is always pushed down)—double checking on Postgres this seems to be confirmed
postgres@localhost:postgres> explain select distinct on (a) a, b from foo where b = 1 order by a, b desc;
+-----------------------------------------------------------------+
| QUERY PLAN |
|-----------------------------------------------------------------|
| Unique (cost=38.44..38.50 rows=11 width=8) |
| -> Sort (cost=38.44..38.47 rows=11 width=8) |
| Sort Key: a |
| -> Seq Scan on foo (cost=0.00..38.25 rows=11 width=8) |
| Filter: (b = 1) |
+-----------------------------------------------------------------+
Granted with DISTINCT ON you can't use HAVING (as there's no explicit GROUP BY), but I guess one should be using a subquery with an additional filter on top of the nested DISTINCT ON in such cases.
This is unexpected to me, as my intuition assumes that the WHERE clause always filters the pre-aggregated data (i.e. it is always pushed down)
I don't think "always pushed down" is the correct description here. My understanding is that it would be planned as a filter before the aggregation and therefore would not needed to be pushed down. In DataFusion this can be seen if we do
> EXPLAIN VERBOSE select distinct on (a) a, b from foo where b = 1 order by a, b desc;
+------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type | plan |
+------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| initial_logical_plan | DistinctOn: on_expr=[[foo.a]], select_expr=[[foo.a, foo.b]], sort_expr=[[foo.a ASC NULLS FIRST, foo.b DESC NULLS LAST]] |
| | Filter: foo.b = Int64(1) |
| | TableScan: foo
...
To get the filter above the aggregation (in postgres) we can do
> explain select * from (select distinct on (a) a, b from foo) t where b = 1 order by a, b desc;
QUERY PLAN
-------------------------------------------------------------------------
Subquery Scan on t (cost=158.51..172.31 rows=1 width=8)
Filter: (t.b = 1)
-> Unique (cost=158.51..169.81 rows=200 width=8)
-> Sort (cost=158.51..164.16 rows=2260 width=8)
Sort Key: foo.a
-> Seq Scan on foo (cost=0.00..32.60 rows=2260 width=8)
e.g not pushed down and if we change the filter to a = 1 we get
> explain select * from (select distinct on (a) a, b from foo) t where a = 1 order by a, b desc;
QUERY PLAN
-----------------------------------------------------------------
Sort (cost=38.55..38.58 rows=11 width=8)
Sort Key: foo.b DESC
-> Unique (cost=0.00..38.25 rows=11 width=8)
-> Seq Scan on foo (cost=0.00..38.25 rows=11 width=8)
Filter: (a = 1)
so this filter gets pushed down.
So postgres seems to agree with us that the filter can be pushed down in some case and not in others.
To get the filter above the aggregation (in postgres) we can do
Ahh good point. I think my misunderstanding here is that I was concentrating on the WHERE clause within the DISTINCT ON (sub)query.
So yes, I agree we should push down only compatible expressions from outside of the DISTINCT ON subquery, but the filter from within that (sub)query should always be pushed down (and I think there's a test missing for that).
Marking as a draft as I think this PR is no longer waiting on feedback (and I am trying to work down the review queue).
Please mark it as ready for review when it is ready for another look
Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.