pinot icon indicating copy to clipboard operation
pinot copied to clipboard

Aggregation on array types does not apply filter to the aggreation

Open jadami10 opened this issue 2 years ago • 4 comments

This is related to #5069, but a clearly bigger bug. When you filter the multivalue col with an aggregation, you still see all the unique values from the multivalue col.

Table:

foo array_col
3 [1,2]
select count(foo) where array_col = 1 from table group by array_col;
array_col count_foo
[1] 1
[2] 1

There is a working query for this, but it requires specifying the same filter twice and adding an extra column to your result

select VALUEIN(array_col, 1), count(foo) where array_col = 1 from table group by array_col;
array_col count_foo
[1] 1

jadami10 avatar Jan 05 '24 17:01 jadami10

cc @xiangfu0, curious if this is intended or actually a bug

jadami10 avatar Jan 05 '24 17:01 jadami10

IMO this seems like a bug. b/c the behavior of

SELECT mvCol FROM tbl WHERE mvCol = 1

will return all the mvCol which CONTAINS the value 1.

thus the behavior of exploding the mvCol during group by should happen after the filter.

use Postgres as example, the logic can be express (without filter)

-- direct unnest
SELECT unnest(arr) AS arr_item, SUM(id) 
FROM test 
GROUP BY 1
-- nested subquery unnest
SELECT arr_item, SUM(id) FROM (
  SELECT unnest(arr) AS arr_item, id FROM test) t 
GROUP BY 1

however it really depending on how MV column is being considered here when filter is added:

-- add to the outer query 
SELECT arr_item, SUM(id) FROM (
  SELECT unnest(arr) AS arr_item, id FROM test) t  
WHERE arr_item = 1
GROUP BY 1

-- add to the inner query
SELECT arr_item, SUM(id) FROM (
  SELECT unnest(arr) AS arr_item, id FROM test WHERE 1=ANY(arr) ) t  
GROUP BY 1

i felt like the inner query version is more acceptable in the MV context b/c it is not possible to add the arr_item = 1 filter against the direct unnest version of the query b/c unnest(arr) itself is not allowed to appear in filter.

walterddr avatar Jan 06 '24 18:01 walterddr

thus the behavior of exploding the mvCol during group by should happen after the filter.

well that is what is happening. It seems we fetch all the rows with the filtered column, then explode it.

we were discussing internally, and technically if the behavior were to change such that the filter removes the exploded values, then there would be no way to say "group by all MV values where any value equaled X". So maybe this is fine as is?

jadami10 avatar Jan 06 '24 23:01 jadami10

I wouldn't call this intended, but this behavior is expected, and I know people using it as a feature to get relevant values (think of a column storing all the skills of a person, these skills are usually relevant). The reason for this behavior is that pinot performs filtering and projection separately, and it simply returns the rows matching the filter. VALUE_IN is added to perform another filter after projection.

Jackie-Jiang avatar Jan 19 '24 06:01 Jackie-Jiang