prql
prql copied to clipboard
Cast inside aggregation fails with type error
What's up?
What happened?
Cast inside aggregation fails with type error.
PRQL input
from artists
derive {artist_int = (artist_id | as int)}
aggregate {
total_int = sum artist_int,
total_fail = sum (artist_id | as int),
}
SQL output
Error:
╭─[:5:33]
│
5 │ total_fail = sum (artist_id | as int),
│ ───┬──
│ ╰──── function std.sum, param `column` expected type `array`, but found type `scalar`
│
│ Help: Type `array` expands to `[null]`
───╯
Expected SQL output
SELECT
COALESCE(SUM(CAST(artist_id AS int)), 0) AS total_int,
COALESCE(SUM(CAST(artist_id AS int)), 0) AS total_fail
FROM
artists
-- Generated by PRQL compiler version:0.9.5 (https://prql-lang.org)
MVCE confirmation
- [X] Minimal example
- [X] New issue
Anything else?
No response
@max-sixty I've copied your comment over:
Thanks, this does indeed look like a bug.
For the moment, a workaround is to create the column outside the aggregation (as on the derive line above).
This is a symptom of #2723
as implies that the column is of type scalar (actually it should be int, but we are not there yet).
sum expects an array of scalars.
The compiler is basically saying "so, which one is it? What is the type of (artist_id | as int)?"
I don't think I agree. All column expressions are always expressed as if operating on a single value with the understanding that they are implicitly "mapped"/broadcast over all rows.
What is the type of (artist_id | as int)?"
A column of ints, which can also be seen from the fact that
from artists
derive {artist_int = (artist_id | as int)}
aggregate {
total_int = sum artist_int,
# total_fail = sum (artist_id | as int),
}
compiles to
SELECT
COALESCE(SUM(CAST(artist_id AS int)), 0) AS total_int
FROM
artists
-- Generated by PRQL compiler version:0.10.0 (https://prql-lang.org)
So in my mind, the bug report is about why does it work when defined with a separate derive but not when the same expression is the argument to a sum aggregation?
What is the type of (artist_id | as int)?"
A column of ints, which can also be seen from the fact that
If it a column of ints, then the cast should be artist_id | as [int], no? I think this boils down to what is our understanding of the PRQL type system, which we talk about in #2723, and which has not been closed yet.
Current implementation of the compiler and the type system both stem from my understanding of the language, where [int] and int are two distinct types and there is no "operator broadcasting" that would broadcast as int to each value in the column. But as I said, this is not something that has been decided on, we need to finish #2723
Totally agree about finishing #2723 , but in the meantime how is this different to any of the other type inference that's going on?
What are the types of income and income > 5 in the first few lines of the Playground example?
from invoices
derive {
transaction_fee = 0.8,
income = total - transaction_fee
}
filter income > 5
Assuming <transaction_fee> == float and <total> == [float], then
<income> == [float] - float == [float]?<income > 5> == [float] > int == [bool]?
I think that's not how these things are usually understood though. Usually everything is interpreted at a single row / tuple level and the types are rather:
<transaction_fee> == float<total> == float<income> == float - float == float<income > 5> == float > int == bool
So similarly it should be that <artist_id | as int> == int.
Yes, I very much agree on your conclusion of what the types should be.
Going forward, I would also say that in this query:
from invoices
derive {
transaction_fee = 0.8,
income = total - transaction_fee
}
aggregate { total = sum income }
<total - transaction_fee> == float- type of the
incomewithinderiveisfloat, - type of the
incomewithinaggregateis[float], assumcannot operate on a singlefloat, but requires an array
So that would answer your question of "why can I cast within derive, but not within aggregate". derive operates row-wise, but aggregate operates column-wise; it applies aggregation to the whole column at once.