Corrections of SQL's aggregation functions
From Postgres docs:
> It should be noted that except for count, these functions return a null value when no rows are selected. In particular, sum of no rows returns null, not zero as one might expect, and array_agg returns null rather than an empty array when there are no input rows. The coalesce function can be used to substitute zero or an empty array for null when necessary.
This behavior is part of the SQL standard:
```
sqlite> SELECT count(a) FROM (SELECT 5 as a WHERE false) t;
+----------+
| count(a) |
+----------+
| 0        |
+----------+
sqlite> SELECT sum(a) FROM (SELECT 5 as a WHERE false) t;
+--------+
| sum(a) |
+--------+
|        |
+--------+
```
I believe that aggregation functions with no input rows should have more reasonable defaults. In most cases this should be the identity element of the operation (the value that has no effect when combined with another): for `sum` that is 0, for `string_agg` it is the empty string.
So I propose that:

```prql
from employees
select (sum salary)
```

... compiles to ...

```sql
SELECT COALESCE(SUM(salary), 0) FROM employees
```
To be more precise, I would use these defaults:
| function     | result on 0 rows    |
|--------------|---------------------|
| `count`      | 0 (same as SQL)     |
| `array_agg`  | `[]`                |
| `sum`        | 0                   |
| `any`        | false               |
| `every`      | true                |
| `string_agg` | `''`                |
| `avg`        | null (same as SQL)  |
| `max`        | null (same as SQL)  |
| `min`        | null (same as SQL)  |
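For illustration, the generated SQL for these could look roughly like the sketch below (assuming Postgres, with hypothetical columns `a`, `b`, `s`; `bool_or`/`bool_and` are the Postgres spellings of `any`/`every`):

```sql
SELECT
  count(a)                         AS count_a,      -- already 0 on no rows
  coalesce(array_agg(a), '{}')     AS array_agg_a,  -- empty array instead of null
  coalesce(sum(a), 0)              AS sum_a,        -- 0 instead of null
  coalesce(bool_or(b), false)      AS any_b,        -- false instead of null
  coalesce(bool_and(b), true)      AS every_b,      -- true instead of null
  coalesce(string_agg(s, ','), '') AS string_agg_s, -- '' instead of null
  avg(a)                           AS avg_a,        -- null, unchanged
  max(a)                           AS max_a,        -- null, unchanged
  min(a)                           AS min_a         -- null, unchanged
FROM t;
```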
Yes, this moves us away from the SQL "standard" a bit and makes the generated SQL a bit less concise, but I'm a 👍 given the practical implications and theoretical elegance.
It's also mostly how the Python ecosystem behaves.
Hmm, interesting. I believe this is also what APL does.
That said, my first instinct is that I'm actually not in favour, mostly because I see it as information loss. If I write something like `from t | aggregate [sum a]`, it would be useful to know that getting `null` means there are no rows (of course, it could also just mean there was a `null` in the data).
I'm also biased against moving away from the SQL standard. Not saying I can't be convinced but my prior is against it.
Can you give an example where this would be useful, just so that I can picture it?
It seems to me that it is difficult to recover the lost information, whereas it's easy enough to guard against it, e.g. `from t | aggregate [(sum a) ?? 0]`.
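For reference, assuming PRQL's `??` maps to SQL's `COALESCE`, that guard would compile to roughly:

```sql
-- from t | aggregate [(sum a) ?? 0]
SELECT COALESCE(SUM(a), 0) FROM t;
```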
What are the default values in other languages? I'm just unsure about `max` and `min`: shouldn't their defaults be `-inf` and `inf` respectively (although obviously only for floats)?
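For floats, Postgres can at least express those identity elements; a sketch of what such defaults could look like (hypothetical column `a` of type `float8`):

```sql
SELECT
  coalesce(max(a), '-infinity'::float8) AS max_a, -- identity element for max
  coalesce(min(a), 'infinity'::float8)  AS min_a  -- identity element for min
FROM t;
```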
The inspiration, or rather the frustration, for this issue came from dealing with a bug in SQL. The code dealt with converting a jsonb array into a Postgres array:
```sql
SELECT array_agg(value) FROM jsonb_array_elements(my_jsonb_array)
```
This converts the jsonb array into a set that can be used as a table, which we then aggregate into an array. As one would expect, this always produces an array, except when we pass in an empty jsonb array: `to_jsonb('[]')`. In that case it produces NULL.
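A `coalesce` guard along the lines proposed above would have avoided the bug; a sketch in Postgres:

```sql
-- array_agg over zero rows yields NULL, so substitute an empty array:
SELECT coalesce(
  (SELECT array_agg(value) FROM jsonb_array_elements('[]'::jsonb)),
  ARRAY[]::jsonb[]  -- returns {} instead of NULL
);
```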
In other languages, aggregation can be implemented with reduction or collectors. For example, in JavaScript:

```js
// sum
my_array.reduce((sum, value) => sum + value, 0)

// array_agg
my_array.reduce((array, value) => array.concat([value]), [])
```
Reduction works by providing:
- a "reducer" function, which combines the previous result with a new value from the iterator, and
- an initial value.

Now, if the iterator has no elements, the initial value is returned.
I believe we should define our aggregation functions so that they have a natural generalization to empty columns. That can be achieved by specifying each function's implementation as a reduction with an initial value.