pinot Aggregation query's shouldn't return rows when there isn't any data

trafficstars

We noticed that there is inconsistency in the way aggregation function queries return data. For example

select sum(count_field) from table where application='xxxx'

This will return 1 row with value 0, even when I make up a application value that doesn't exist. If however, I added a time filter to this,

select created_15min, sum(count_field) from table where application = 'xxxx' and created_15min >= 1664582400000
group by 1

this returns 0 rows.

I believe the correct response here should be zero rows, especially since for that first query for example, numDocsScanned is 0

Oct 04 '22 19:10 priyen

note that this table has time and partition pruning. It seems when all segments are pruned, you get back no rows. But if not all segments are pruned, then you get back a single row with the default value even if there were no matching documents.

For this query, we can get away with this behavior since a sum of 0 is effectively the same as no rows found. But for these queries

select min(field) from table where key_field = 'fake_value'
-- Infinity

select max(field) from table where key_field = 'fake_value'
-- -Infinity

this is problematic behavior

Oct 04 '22 19:10 jadami10

What is the standard SQL behavior when no record matches? Does it return null as the aggregation result? cc @walterddr

Oct 04 '22 23:10 Jackie-Jiang

This is the tradition that postgres follows. which is what i know that follows SQL standard the most for aggregation without group by

SELECT SUM(a) FROM tbl WHERE a > 0 AND a < -1 it will return a single row / single column of 0
SELECT MAX(a) FROM tbl WHERE a > 0 AND a < -1 it will return a single row / single column of null

for agg with group by, since there's no group key will match this impossible filter, thus it will return 0 rows

Oct 04 '22 23:10 walterddr

I tested locally with postgres's internal tables. It always returns a single row with null

postgres=# select max(stawidth) from pg_catalog.pg_statistic;
 max
------
 1237
(1 row)

postgres=# select max(stawidth) from pg_catalog.pg_statistic where stawidth=-100;
 max
-----

(1 row)

postgres=# select min(stawidth) from pg_catalog.pg_statistic where stawidth=-100;
 min
-----

(1 row)

postgres=# select sum(stawidth) from pg_catalog.pg_statistic where stawidth=-100;
 sum
-----

(1 row)

postgres=# select coalesce(sum(stawidth), 0) from pg_catalog.pg_statistic where stawidth=-100;
 coalesce
----------
        0
(1 row)

postgres=# select coalesce(max(stawidth), 0) from pg_catalog.pg_statistic where stawidth=-100;
 coalesce
----------
        0
(1 row)

returning any value seems semantically wrong. imagine you have a table like

A	B
meow	-1
meow	1

select sum(B) from table where A='meow' makes sense to return 0 select sum(B) from table where A='woof' would just be wrong if it returns 0

Oct 05 '22 04:10 jadami10

oops. i meant to say COUNT instead of SUM in my previous comment for the non-null return. ^

but yes for these cases one row needs to be returned. consider a semi join example.

SELECT * FROM tbl WHERE key_count IN (SELECT count(*) FROM tbl GROUP BY key);

^ it does make sense to return an empty array for the IN predicate on the RHS, but

SELECT * FROM tbl WHERE count > (SELECT count(*) FROM tbl);

^ it doesn't make sense to me to return an empty object for the > predicate on the RHS

Oct 05 '22 14:10 walterddr

@jadami10 wondering if you saw Rong's last message ^^ He's talking about a case where returning empty array doesn't work.

Oct 13 '22 18:10 icefury71

apologies missed this. Maybe the multi-stage engine supports this, but these are not valid queries at the moment in Pinot, so I don't think they apply.

SELECT * FROM tbl WHERE count > (SELECT count(*) FROM tbl);

I don't really understand this query. What is count? Is it a field in tbl?

Oct 13 '22 22:10 jadami10

apologies missed this. Maybe the multi-stage engine supports this, but these are not valid queries at the moment in Pinot, so I don't think they apply.

yes they are valid queries in v2 engine as well as in ANSI standard SQL semantics.

SELECT * FROM tbl WHERE count > (SELECT count(*) FROM tbl);
I don't really understand this query. What is count? Is it a field in tbl?

yes assume "count" is a column in tbl

Oct 13 '22 22:10 walterddr

I'm honestly not sure what the semantics would be here. I'm thinking more about the semantics of single aggregations. They shouldn't just return the default value if no rows match; it should be null. We should match known sql semantics, and v2 should handle whatever that is instead of the other way around.

Oct 13 '22 22:10 jadami10

yes I agree with you @jadami10. i merely mentioned that

returning a single row of 0 for count
returning a single row of null for sum/min/max
i am not sure what's suppose to be the right return for avg (as it will be a null / 0)

is correct SQL semantics.

i will say to follow postgres behavior as must as possible, please let me know if you agree with my observation

Oct 13 '22 23:10 walterddr

Oh yes. Totally agree with those. Thank you!

Oct 13 '22 23:10 jadami10

yeah also V2 is expecting the postgres behavior. because the same behavior occurs no matter the query is executed at top level or sub-query level. that's why I mentioned

SELECT * FROM tbl WHERE count > (SELECT count(*) FROM tbl);

as an example of why count(*) when no row on tbl should return 1 row of 0 instead of 0 row

thanks for confirming, we are on the same page.

Oct 13 '22 23:10 walterddr

pinot pinot copied to clipboard

Aggregation query's shouldn't return rows when there isn't any data

pinot
pinot copied to clipboard