dask-sql
dask-sql copied to clipboard
[ENH] Enable `test_agg_count`
Is your feature request related to a problem? Please describe.
The eliminate_agg_filter rule fails with a duplicate field error when running test_agg_count.
This is the query:
SELECT
a, b, a+1 AS x,
COUNT(c) AS c_c,
COUNT(DISTINCT c) AS cd_c,
COUNT(d) AS c_d,
COUNT(DISTINCT d) AS cd_d,
COUNT(e) AS c_e,
COUNT(DISTINCT e) AS cd_e
FROM a GROUP BY a, b ORDER BY
a NULLS FIRST,
b NULLS FIRST
The optimizer rule creates these two plans, which it then tries to join, resulting in duplicate field "a.a".
=== JOIN INPUT ===
Projection: #a.a, #a.b, #a.a AS COUNT(a.c), #a.b AS COUNT(DISTINCT a.c)
Aggregate: groupBy=[[#a.a, #a.b]], aggr=[[SUM(#__dask_sql_count__1), COUNT(#a.c)]]
Aggregate: groupBy=[[#a.a, #a.b, #a.c]], aggr=[[COUNT(#a.c) AS __dask_sql_count__1]]
TableScan: a projection=[a, b, c, d, e]
=== JOIN INPUT ===
Projection: #a.a, #a.b, #a.a AS COUNT(a.d), #a.b AS COUNT(DISTINCT a.d)
Aggregate: groupBy=[[#a.a, #a.b]], aggr=[[SUM(#__dask_sql_count__2), COUNT(#a.d)]]
Aggregate: groupBy=[[#a.a, #a.b, #a.d]], aggr=[[COUNT(#a.d) AS __dask_sql_count__2]]
TableScan: a projection=[a, b, c, d, e]
Describe the solution you'd like Rather than performing a cross-join as we do when there is no grouping expression, we need to join these individual queries on the grouping keys and only use the grouping keys once in the final projection.
Describe alternatives you've considered None
Additional context None
Issue #532 is also about test_agg_count, so both can be closed with this fix.