prql icon indicating copy to clipboard operation
prql copied to clipboard

Cast inside aggregation fails with type error

Open snth opened this issue 2 years ago • 6 comments

What's up?

What happened?

Cast inside aggregation fails with type error.

PRQL input

from artists
derive {artist_int = (artist_id | as int)}
aggregate {
  total_int = sum artist_int,
  total_fail = sum (artist_id | as int),
}

SQL output

Error: 
   ╭─[:5:33]
   │
 5 │   total_fail = sum (artist_id | as int),
   │                                 ───┬──  
   │                                    ╰──── function std.sum, param `column` expected type `array`, but found type `scalar`
   │ 
   │ Help: Type `array` expands to `[null]`
───╯

Expected SQL output

SELECT
  COALESCE(SUM(CAST(artist_id AS int)), 0) AS total_int,
  COALESCE(SUM(CAST(artist_id AS int)), 0) AS total_fail
FROM
  artists

-- Generated by PRQL compiler version:0.9.5 (https://prql-lang.org)

MVCE confirmation

  • [X] Minimal example
  • [X] New issue

Anything else?

No response

snth avatar Sep 18 '23 09:09 snth

@max-sixty I've copied your comment over:

Thanks, this does indeed look like a bug.

For the moment, a workaround is to create the column outside the aggregation (as on the derive line above).

snth avatar Sep 18 '23 09:09 snth

This is a symptom of #2723

as implies that the column is of type scalar (actually it should be int, but we are not there yet).

sum expects an array of scalars.

The compiler is basically saying "so, which one is it? What is the type of (artist_id | as int)?"

aljazerzen avatar Nov 09 '23 08:11 aljazerzen

I don't think I agree. All column expressions are always expressed as if operating on a single value with the understanding that they are implicitly "mapped"/broadcast over all rows.

What is the type of (artist_id | as int)?"

A column of ints, which can also be seen from the fact that

from artists
derive {artist_int = (artist_id | as int)}
aggregate {
  total_int = sum artist_int,
#  total_fail = sum (artist_id | as int),
}

compiles to

SELECT
  COALESCE(SUM(CAST(artist_id AS int)), 0) AS total_int
FROM
  artists

-- Generated by PRQL compiler version:0.10.0 (https://prql-lang.org)

So in my mind, the bug report is about why does it work when defined with a separate derive but not when the same expression is the argument to a sum aggregation?

snth avatar Nov 10 '23 14:11 snth

What is the type of (artist_id | as int)?"

A column of ints, which can also be seen from the fact that

If it a column of ints, then the cast should be artist_id | as [int], no? I think this boils down to what is our understanding of the PRQL type system, which we talk about in #2723, and which has not been closed yet.

Current implementation of the compiler and the type system both stem from my understanding of the language, where [int] and int are two distinct types and there is no "operator broadcasting" that would broadcast as int to each value in the column. But as I said, this is not something that has been decided on, we need to finish #2723

aljazerzen avatar Nov 10 '23 14:11 aljazerzen

Totally agree about finishing #2723 , but in the meantime how is this different to any of the other type inference that's going on?

What are the types of income and income > 5 in the first few lines of the Playground example?

from invoices
derive {
  transaction_fee = 0.8,
  income = total - transaction_fee
}
filter income > 5

Assuming <transaction_fee> == float and <total> == [float], then

  • <income> == [float] - float == [float] ?
  • <income > 5> == [float] > int == [bool] ?

I think that's not how these things are usually understood though. Usually everything is interpreted at a single row / tuple level and the types are rather:

  • <transaction_fee> == float
  • <total> == float
  • <income> == float - float == float
  • <income > 5> == float > int == bool

So similarly it should be that <artist_id | as int> == int.

snth avatar Nov 10 '23 14:11 snth

Yes, I very much agree on your conclusion of what the types should be.

Going forward, I would also say that in this query:

from invoices
derive {
  transaction_fee = 0.8,
  income = total - transaction_fee
}
aggregate { total = sum income }
  • <total - transaction_fee> == float
  • type of the income within derive is float,
  • type of the income within aggregate is [float], as sum cannot operate on a single float, but requires an array

So that would answer your question of "why can I cast within derive, but not within aggregate". derive operates row-wise, but aggregate operates column-wise; it applies aggregation to the whole column at once.

aljazerzen avatar Nov 10 '23 16:11 aljazerzen