dask-sql icon indicating copy to clipboard operation
dask-sql copied to clipboard

[ENH] Enable `test_agg_count`

Open andygrove opened this issue 3 years ago • 1 comments

Is your feature request related to a problem? Please describe.

The eliminate_agg_filter rule fails with a duplicate field error when running test_agg_count.

This is the query:

            SELECT
                a, b, a+1 AS x,
                COUNT(c) AS c_c,
                COUNT(DISTINCT c) AS cd_c,
                COUNT(d) AS c_d,
                COUNT(DISTINCT d) AS cd_d,
                COUNT(e) AS c_e,
                COUNT(DISTINCT e) AS cd_e
            FROM a GROUP BY a, b ORDER BY
                a NULLS FIRST,
                b NULLS FIRST

The optimizer rule creates these two plans, which it then tries to join, resulting in duplicate field "a.a".

=== JOIN INPUT ===
Projection: #a.a, #a.b, #a.a AS COUNT(a.c), #a.b AS COUNT(DISTINCT a.c)
  Aggregate: groupBy=[[#a.a, #a.b]], aggr=[[SUM(#__dask_sql_count__1), COUNT(#a.c)]]
    Aggregate: groupBy=[[#a.a, #a.b, #a.c]], aggr=[[COUNT(#a.c) AS __dask_sql_count__1]]
      TableScan: a projection=[a, b, c, d, e]
=== JOIN INPUT ===
Projection: #a.a, #a.b, #a.a AS COUNT(a.d), #a.b AS COUNT(DISTINCT a.d)
  Aggregate: groupBy=[[#a.a, #a.b]], aggr=[[SUM(#__dask_sql_count__2), COUNT(#a.d)]]
    Aggregate: groupBy=[[#a.a, #a.b, #a.d]], aggr=[[COUNT(#a.d) AS __dask_sql_count__2]]
      TableScan: a projection=[a, b, c, d, e]

Describe the solution you'd like Rather than performing a cross-join as we do when there is no grouping expression, we need to join these individual queries on the grouping keys and only use the grouping keys once in the final projection.

Describe alternatives you've considered None

Additional context None

andygrove avatar Sep 21 '22 22:09 andygrove

Issue #532 is also about test_agg_count, so both can be closed with this fix.

sarahyurick avatar Sep 22 '22 00:09 sarahyurick