embucket-labs
embucket-labs copied to clipboard
Implement DISTINCT in aggregate functions in window
There is an error in datafusion: Distinct aggregate functions not supported in window expressions.
In fact it never shown up because we call window functions with is_distinct=false;
This way, such queries will produce wrong results:
CREATE OR REPLACE TABLE sales AS
SELECT * FROM (
VALUES
(1, 'Alice', 'East', 100),
(2, 'Bob', 'East', 200),
(3, 'Alice', 'East', 100),
(4, 'Bob', 'East', 300),
(5, 'Carol', 'West', 400),
(6, 'Carol', 'West', 100)
) AS t(id, employee, region, amount);
SELECT
id,
employee,
amount,
SUM(DISTINCT amount) OVER (PARTITION BY employee) AS sum_distinct_by_employee
FROM sales;
Datafusion result:
+----+----------+--------+--------------------------+
| id | employee | amount | sum_distinct_by_employee |
+----+----------+--------+--------------------------+
| 1 | Alice | 100 | 200 |
| 3 | Alice | 100 | 200 |
| 2 | Bob | 200 | 500 |
| 4 | Bob | 300 | 500 |
| 5 | Carol | 400 | 500 |
| 6 | Carol | 100 | 500 |
+----+----------+--------+--------------------------+
Snowflake result:
ID | EMPLOYEE | AMOUNT | SUM_DISTINCT_BY_EMPLOYEE
-- | -- | -- | --
1 | Alice | 100 | 100
2 | Bob | 200 | 500
3 | Alice | 100 | 100
4 | Bob | 300 | 500
5 | Carol | 400 | 500
6 | Carol | 100 | 500