prql
prql copied to clipboard
Aggregate doesn't respect union's names
What happened?
Group aggregation after append doesn't preserve column alias (nor order) which leads to broken aggregations
PRQL input
prql target:sql.snowflake
let shape = p -> (
p
select {
interaction_day = s"DATE_TRUNC('day', ts)",
user_id = s"substr(user, length('u:') + 1)",
}
)
from table1
shape
append (from table2 | shape)
group { user_id } (
aggregate {
daily_interaction = count interaction_day
}
)
SQL output
WITH table_0 AS (
SELECT
DATE_TRUNC('day', ts) AS interaction_day,
substr(user, length('u:') + 1) AS user_id
FROM
table2
),
table_1 AS (
SELECT
substr(user, length('u:') + 1) AS user_id,
DATE_TRUNC('day', ts) AS _expr_0
FROM
table1
UNION
ALL
SELECT
*
FROM
table_0
)
SELECT
user_id,
COUNT(*) AS daily_interaction
FROM
table_1
GROUP BY
user_id
-- Generated by PRQL compiler version:0.9.5 (https://prql-lang.org)
Expected SQL output
WITH table_0 AS (
SELECT
DATE_TRUNC('day', ts) AS interaction_day,
substr(user, length('u:') + 1) AS user_id
FROM
table2
),
table_1 AS (
SELECT
substr(user, length('u:') + 1) AS user_id,
DATE_TRUNC('day', ts) AS interaction_day
FROM
table1
UNION
ALL
SELECT
*
FROM
table_0
)
SELECT
user_id,
COUNT(*) AS daily_interaction
FROM
table_1
GROUP BY
user_id
-- Generated by PRQL compiler version:0.9.5 (https://prql-lang.org)
MVCE confirmation
- [X] Minimal example
- [ ] New issue
Anything else?
No response
Thanks @avikam .
(For others looking — the error is DATE_TRUNC('day', ts) AS _expr_0)
The naming of the column as _expr_0 is one error. The other is that the order of the columns in the first part of table_1 before the UNION (as was pointed out in the original report).
I believe the Expected SQL Output should actually be:
WITH table_0 AS (
SELECT
DATE_TRUNC('day', ts) AS interaction_day,
substr(user, length('u:') + 1) AS user_id
FROM
table2
),
table_1 AS (
SELECT
DATE_TRUNC('day', ts) AS interaction_day,
substr(user, length('u:') + 1) AS user_id
FROM
table1
UNION
ALL
SELECT
*
FROM
table_0
)
SELECT
user_id,
COUNT(*) AS daily_interaction
FROM
table_1
GROUP BY
user_id
-- Generated by PRQL compiler version:0.9.5 (https://prql-lang.org)
yeah it might be two separate issues.
the order issue can be reproduced with:
from tbl1
select { y, x }
append (
from tbl2
select { y, x }
)
group { x } (
aggregate {
tot = count y
}
)
which currently transpiles as
WITH table_0 AS (
SELECT
y,
x
FROM
tbl2
),
table_1 AS (
SELECT
x,
y
FROM
tbl1
UNION
ALL
SELECT
*
FROM
table_0
)
SELECT
x,
COUNT(*) AS tot
FROM
table_1
GROUP BY
x
looking more into it, i think the ordering issue can be reproduced even without an aggregation. Consider this example:
from tbl1
select { x, y }
append (from tbl2 | select { x, y } )
select { y, x}
This pipeline sequence is not split, and the last select dominates the projection of tbl1
select y, x from tbl1 union all select * from table_0 -- table_0 is a correct CTE
the reason i think it is related to the first issue, is that in that case there is a Union followed by Select, Aggregate which does introduce a split. Then anchor_split appends a projection transform, with columns that their order is determined by the Aggregate's partition and compute arguments. When these are not in the same order as the first select, we get a similar behavior.
all of that should be taken with a grain of salt, as this all is very speculative.