Corrections of SQL's aggregation functions
From Postgres docs:
> It should be noted that except for count, these functions return a null value when no rows are selected. In particular, sum of no rows returns null, not zero as one might expect, and array_agg returns null rather than an empty array when there are no input rows. The coalesce function can be used to substitute zero or an empty array for null when necessary.
This behavior is part of the SQL standard:
```
sqlite> SELECT count(a) FROM (SELECT 5 as a WHERE false) t;
+----------+
| count(a) |
+----------+
| 0        |
+----------+
sqlite> SELECT sum(a) FROM (SELECT 5 as a WHERE false) t;
+--------+
| sum(a) |
+--------+
|        |
+--------+
```
I believe that aggregation functions with no input rows should have more reasonable defaults. In most cases this should be the identity element of the operation (the value that has no effect when combined with another): for `sum` that is 0, for `string_agg` it is the empty string.
So I propose that:

```prql
from employees
select (sum salary)
```

... compiles to ...

```sql
SELECT COALESCE(SUM(salary), 0) FROM employees
```
To be more precise, I would use these defaults:
| function     | result on 0 rows    |
|--------------|---------------------|
| `count`      | 0 (same as SQL)     |
| `array_agg`  | `[]`                |
| `sum`        | 0                   |
| `any`        | false               |
| `every`      | true                |
| `string_agg` | `''`                |
| `avg`        | null (same as SQL)  |
| `max`        | null (same as SQL)  |
| `min`        | null (same as SQL)  |
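For illustration, the generated SQL for these could look roughly like the sketch below (assuming Postgres, with hypothetical columns `a`, `b`, `s`; `bool_or`/`bool_and` are the Postgres spellings of `any`/`every`):

```sql
SELECT
  count(a)                         AS count_a,      -- already 0 on no rows
  coalesce(array_agg(a), '{}')     AS array_agg_a,  -- empty array instead of null
  coalesce(sum(a), 0)              AS sum_a,        -- 0 instead of null
  coalesce(bool_or(b), false)      AS any_b,        -- false instead of null
  coalesce(bool_and(b), true)      AS every_b,      -- true instead of null
  coalesce(string_agg(s, ','), '') AS string_agg_s, -- '' instead of null
  avg(a)                           AS avg_a,        -- null, unchanged
  max(a)                           AS max_a,        -- null, unchanged
  min(a)                           AS min_a         -- null, unchanged
FROM t;
```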
Yes, this moves us away from the SQL "standard" a bit and makes the generated SQL a bit less concise, but I'm a 👍 given the practical implications and theoretical elegance.
It's also mostly how the Python ecosystem behaves.
Hmm, interesting. I believe this is also what APL does.
That said, my first instinct is that I'm actually not in favour, mostly because I see it as information loss. If I write something like `from t | aggregate [sum a]`, it would be useful to know that getting `null` means there are no rows (of course, it could also just mean there was a `null` in the data).
I'm also biased against moving away from the SQL standard. Not saying I can't be convinced but my prior is against it.
Can you give an example where this would be useful, just so that I can picture it?
It seems to me that it is difficult to recover the lost information, whereas it's easy enough to guard against it, e.g. `from t | aggregate [(sum a) ?? 0]`.
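For reference, assuming PRQL's `??` maps to SQL's `COALESCE`, that guard would compile to roughly:

```sql
-- from t | aggregate [(sum a) ?? 0]
SELECT COALESCE(SUM(a), 0) FROM t;
```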
What are the default values in other languages? I'm just unsure about `max` and `min`: shouldn't their defaults be `-inf` and `inf` respectively (although obviously only for floats)?
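For floats, Postgres can at least express those identity elements; a sketch of what such defaults could look like (hypothetical column `a` of type `float8`):

```sql
SELECT
  coalesce(max(a), '-infinity'::float8) AS max_a, -- identity element for max
  coalesce(min(a), 'infinity'::float8)  AS min_a  -- identity element for min
FROM t;
```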
The inspiration, or rather the frustration, for this issue came from dealing with a bug in SQL. The code dealt with converting a jsonb array into a Postgres array:
```sql
SELECT array_agg(value) FROM jsonb_array_elements(my_jsonb_array)
```
This converts the jsonb array into a set that can be used as a table, which we then aggregate into an array. As one would expect, this always produces an array, except when we pass in an empty jsonb array: `to_jsonb('[]')`. In that case it produces NULL.
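A `coalesce` guard along the lines proposed above would have avoided the bug; a sketch in Postgres:

```sql
-- array_agg over zero rows yields NULL, so substitute an empty array:
SELECT coalesce(
  (SELECT array_agg(value) FROM jsonb_array_elements('[]'::jsonb)),
  ARRAY[]::jsonb[]  -- returns {} instead of NULL
);
```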
In other languages, aggregation can be implemented with reduction or collectors. For example, in JavaScript:

```js
// sum
my_array.reduce((sum, value) => sum + value, 0)

// array_agg
my_array.reduce((array, value) => array.concat([value]), [])
```
Reduction works by providing:
- a "reducer" function, which combines the previous result with a new value from the iterator, and
- an initial value.

Now, if the iterator has no elements, the initial value is returned.
I believe we should define our aggregation functions so that they have a natural generalization to empty columns. That can be achieved by specifying each function's implementation as a reduction with an initial value.