Aggregation on array types does not apply filter to the aggreation
This is related to #5069, but a clearly bigger bug. When you filter the multivalue col with an aggregation, you still see all the unique values from the multivalue col.
Table:
| foo | array_col |
|---|---|
| 3 | [1,2] |
select count(foo) where array_col = 1 from table group by array_col;
| array_col | count_foo |
|---|---|
| [1] | 1 |
| [2] | 1 |
There is a working query for this, but it requires specifying the same filter twice and adding an extra column to your result
select VALUEIN(array_col, 1), count(foo) where array_col = 1 from table group by array_col;
| array_col | count_foo |
|---|---|
| [1] | 1 |
cc @xiangfu0, curious if this is intended or actually a bug
IMO this seems like a bug. b/c the behavior of
SELECT mvCol FROM tbl WHERE mvCol = 1
will return all the mvCol which CONTAINS the value 1.
thus the behavior of exploding the mvCol during group by should happen after the filter.
use Postgres as example, the logic can be express (without filter)
-- direct unnest
SELECT unnest(arr) AS arr_item, SUM(id)
FROM test
GROUP BY 1
-- nested subquery unnest
SELECT arr_item, SUM(id) FROM (
SELECT unnest(arr) AS arr_item, id FROM test) t
GROUP BY 1
however it really depending on how MV column is being considered here when filter is added:
-- add to the outer query
SELECT arr_item, SUM(id) FROM (
SELECT unnest(arr) AS arr_item, id FROM test) t
WHERE arr_item = 1
GROUP BY 1
-- add to the inner query
SELECT arr_item, SUM(id) FROM (
SELECT unnest(arr) AS arr_item, id FROM test WHERE 1=ANY(arr) ) t
GROUP BY 1
i felt like the inner query version is more acceptable in the MV context b/c it is not possible to add the arr_item = 1 filter against the direct unnest version of the query b/c unnest(arr) itself is not allowed to appear in filter.
thus the behavior of exploding the mvCol during group by should happen after the filter.
well that is what is happening. It seems we fetch all the rows with the filtered column, then explode it.
we were discussing internally, and technically if the behavior were to change such that the filter removes the exploded values, then there would be no way to say "group by all MV values where any value equaled X". So maybe this is fine as is?
I wouldn't call this intended, but this behavior is expected, and I know people using it as a feature to get relevant values (think of a column storing all the skills of a person, these skills are usually relevant).
The reason for this behavior is that pinot performs filtering and projection separately, and it simply returns the rows matching the filter. VALUE_IN is added to perform another filter after projection.