pinot
pinot copied to clipboard
Aggregation query's shouldn't return rows when there isn't any data
We noticed that there is inconsistency in the way aggregation function queries return data. For example
select sum(count_field) from table where application='xxxx'
This will return 1 row with value 0, even when I make up a application value that doesn't exist. If however, I added a time filter to this,
select created_15min, sum(count_field) from table where application = 'xxxx' and created_15min >= 1664582400000
group by 1
this returns 0 rows.
I believe the correct response here should be zero rows, especially since for that first query for example, numDocsScanned is 0
note that this table has time and partition pruning. It seems when all segments are pruned, you get back no rows. But if not all segments are pruned, then you get back a single row with the default value even if there were no matching documents.
For this query, we can get away with this behavior since a sum of 0 is effectively the same as no rows found. But for these queries
select min(field) from table where key_field = 'fake_value'
-- Infinity
select max(field) from table where key_field = 'fake_value'
-- -Infinity
this is problematic behavior
What is the standard SQL behavior when no record matches? Does it return null as the aggregation result? cc @walterddr
This is the tradition that postgres follows. which is what i know that follows SQL standard the most for aggregation without group by
SELECT SUM(a) FROM tbl WHERE a > 0 AND a < -1it will return a single row / single column of 0SELECT MAX(a) FROM tbl WHERE a > 0 AND a < -1it will return a single row / single column of null
for agg with group by, since there's no group key will match this impossible filter, thus it will return 0 rows
I tested locally with postgres's internal tables. It always returns a single row with null
postgres=# select max(stawidth) from pg_catalog.pg_statistic;
max
------
1237
(1 row)
postgres=# select max(stawidth) from pg_catalog.pg_statistic where stawidth=-100;
max
-----
(1 row)
postgres=# select min(stawidth) from pg_catalog.pg_statistic where stawidth=-100;
min
-----
(1 row)
postgres=# select sum(stawidth) from pg_catalog.pg_statistic where stawidth=-100;
sum
-----
(1 row)
postgres=# select coalesce(sum(stawidth), 0) from pg_catalog.pg_statistic where stawidth=-100;
coalesce
----------
0
(1 row)
postgres=# select coalesce(max(stawidth), 0) from pg_catalog.pg_statistic where stawidth=-100;
coalesce
----------
0
(1 row)
returning any value seems semantically wrong. imagine you have a table like
| A | B |
|---|---|
| meow | -1 |
| meow | 1 |
select sum(B) from table where A='meow' makes sense to return 0
select sum(B) from table where A='woof' would just be wrong if it returns 0
oops. i meant to say COUNT instead of SUM in my previous comment for the non-null return. ^
but yes for these cases one row needs to be returned. consider a semi join example.
SELECT * FROM tbl WHERE key_count IN (SELECT count(*) FROM tbl GROUP BY key);
^ it does make sense to return an empty array for the IN predicate on the RHS, but
SELECT * FROM tbl WHERE count > (SELECT count(*) FROM tbl);
^ it doesn't make sense to me to return an empty object for the > predicate on the RHS
@jadami10 wondering if you saw Rong's last message ^^ He's talking about a case where returning empty array doesn't work.
apologies missed this. Maybe the multi-stage engine supports this, but these are not valid queries at the moment in Pinot, so I don't think they apply.
SELECT * FROM tbl WHERE count > (SELECT count(*) FROM tbl);
I don't really understand this query. What is count? Is it a field in tbl?
apologies missed this. Maybe the multi-stage engine supports this, but these are not valid queries at the moment in Pinot, so I don't think they apply.
yes they are valid queries in v2 engine as well as in ANSI standard SQL semantics.
SELECT * FROM tbl WHERE count > (SELECT count(*) FROM tbl);I don't really understand this query. What is
count? Is it a field intbl?
yes assume "count" is a column in tbl
I'm honestly not sure what the semantics would be here. I'm thinking more about the semantics of single aggregations. They shouldn't just return the default value if no rows match; it should be null. We should match known sql semantics, and v2 should handle whatever that is instead of the other way around.
yes I agree with you @jadami10. i merely mentioned that
- returning a single row of
0for count - returning a single row of
nullfor sum/min/max - i am not sure what's suppose to be the right return for avg (as it will be a null / 0)
is correct SQL semantics.
i will say to follow postgres behavior as must as possible, please let me know if you agree with my observation
Oh yes. Totally agree with those. Thank you!
yeah also V2 is expecting the postgres behavior. because the same behavior occurs no matter the query is executed at top level or sub-query level. that's why I mentioned
SELECT * FROM tbl WHERE count > (SELECT count(*) FROM tbl);
as an example of why count(*) when no row on tbl should return 1 row of 0 instead of 0 row
thanks for confirming, we are on the same page.