PERCENTILEEST throws NPE in MultiStage Engine
we found in Pinot 1.1, PERCENTILEEST throws NPE in MultiStage Engine. Does anyone know if this is already fixed in master?
example query:
WITH sub AS (select deviceOS, count(1) as total_count from userAttributes group by 1 limit 100000),
subx AS (select deviceOS, count(1) as total_count from userAttributes group by 1 limit 100000),
union_table AS (select deviceOS, total_count from sub UNION ALL select deviceOS, total_count from subx)
select deviceOS, percentile(total_count, 50) from union_table group by deviceOS
Yes, it should be fixed within #13282
Issue still exists. I am looking into it.
@Jackie-Jiang : looks like current master still can't handle aggregations with literals in some cases. Repro is below.
I took a quick look and see the following issues:
- Calcite is converting RexLiteral to a ref (this is done before plan optimization).
- Calcite is not helpful and pushes down the literal argument to the agg function to the table scan. I feel this might even be considered a bug in Calcite.
- [separate issue] We are generating an exchange for Percentile. Instead we should do a full aggregation directly.
I see a bunch of attempts at fixing this: #13282, #13217 and #11105.
I can take a look at this as well but wondering if you folks are already working on this and have any learnings to share. One potential solution I wanted to play with was:
Identify literals that Calcite converts to a ref and remove them. Wherever the refs are used, we should replace it with a literal. This should be done before exchange and other optimizations are run.
SET useMultistageEngine=true;
with
teamOne as (
select playerName, teamID, sum(runs) as sum_of_runs from baseballStats where teamID = 'SFN' group by playerName, teamID
),
teamTwo as (
select playerName, teamID, sum(runs) as sum_of_runs from baseballStats where teamID = 'CHN' group by playerName, teamID
),
all as (
select playerName, teamID, sum_of_runs from teamOne union all select playerName, teamID, sum_of_runs from teamTwo
)
select
teamID,
-- sum(sum_of_runs)
percentile(sum_of_runs, 50)
from all
group by teamID
You can also try adding this hint: /*+ aggOptions(is_skip_leaf_stage_group_by='true') */ which will still throw an error because the rexliteral gets converted to a ref by Calcite (before optimization).
@ankitsultana Take a look at PinotAggregateExchangeNodeInsertRule, where we should already replace all ref to literal to literal. Maybe there are some corner cases not handled (e.g. the ref is not in a project)