embucket-labs icon indicating copy to clipboard operation
embucket-labs copied to clipboard

Implement DISTINCT in aggregate functions in window

Open Vedin opened this issue 6 months ago • 0 comments

There is an error in datafusion: Distinct aggregate functions not supported in window expressions. In fact it never shown up because we call window functions with is_distinct=false; This way, such queries will produce wrong results:

    CREATE OR REPLACE TABLE sales AS
SELECT * FROM (
  VALUES
    (1, 'Alice', 'East', 100),

    (2, 'Bob',   'East', 200),
    (3, 'Alice', 'East', 100),
    (4, 'Bob',   'East', 300),
    (5, 'Carol', 'West', 400),
    (6, 'Carol', 'West', 100)
) AS t(id, employee, region, amount);

SELECT
  id,
  employee,
  amount,
  SUM(DISTINCT amount) OVER (PARTITION BY employee) AS sum_distinct_by_employee
FROM sales;

Datafusion result:

+----+----------+--------+--------------------------+
| id | employee | amount | sum_distinct_by_employee |
+----+----------+--------+--------------------------+
| 1  | Alice    | 100    | 200                      |
| 3  | Alice    | 100    | 200                      |
| 2  | Bob      | 200    | 500                      |
| 4  | Bob      | 300    | 500                      |
| 5  | Carol    | 400    | 500                      |
| 6  | Carol    | 100    | 500                      |
+----+----------+--------+--------------------------+

Snowflake result:

ID | EMPLOYEE | AMOUNT | SUM_DISTINCT_BY_EMPLOYEE
-- | -- | -- | --
1 | Alice | 100 | 100
2 | Bob | 200 | 500
3 | Alice | 100 | 100
4 | Bob | 300 | 500
5 | Carol | 400 | 500
6 | Carol | 100 | 500

Vedin avatar Jun 05 '25 19:06 Vedin