pinot icon indicating copy to clipboard operation
pinot copied to clipboard

PERCENTILEEST throws NPE in MultiStage Engine

Open MeihanLi opened this issue 1 year ago • 4 comments

we found in Pinot 1.1, PERCENTILEEST throws NPE in MultiStage Engine. Does anyone know if this is already fixed in master?

example query:

WITH sub AS (select deviceOS, count(1) as total_count from userAttributes group by 1 limit 100000),
subx AS (select deviceOS, count(1) as total_count from userAttributes group by 1 limit 100000),
union_table AS (select deviceOS, total_count from sub UNION ALL select deviceOS, total_count from subx)
select deviceOS, percentile(total_count, 50) from union_table group by deviceOS 
image

MeihanLi avatar Jun 03 '24 16:06 MeihanLi

Yes, it should be fixed within #13282

Jackie-Jiang avatar Jun 03 '24 23:06 Jackie-Jiang

Issue still exists. I am looking into it.

ankitsultana avatar Jul 02 '24 00:07 ankitsultana

@Jackie-Jiang : looks like current master still can't handle aggregations with literals in some cases. Repro is below.

I took a quick look and see the following issues:

  • Calcite is converting RexLiteral to a ref (this is done before plan optimization).
  • Calcite is not helpful and pushes down the literal argument to the agg function to the table scan. I feel this might even be considered a bug in Calcite.
  • [separate issue] We are generating an exchange for Percentile. Instead we should do a full aggregation directly.

I see a bunch of attempts at fixing this: #13282, #13217 and #11105.

I can take a look at this as well but wondering if you folks are already working on this and have any learnings to share. One potential solution I wanted to play with was:

Identify literals that Calcite converts to a ref and remove them. Wherever the refs are used, we should replace it with a literal. This should be done before exchange and other optimizations are run.

SET useMultistageEngine=true;

with
  teamOne as (
  	select playerName, teamID, sum(runs) as sum_of_runs from baseballStats where teamID = 'SFN' group by playerName, teamID
  ),
  teamTwo as (
	select playerName, teamID, sum(runs) as sum_of_runs from baseballStats where teamID = 'CHN' group by playerName, teamID
  ),
  all as (
	select playerName, teamID, sum_of_runs from teamOne union all select playerName, teamID, sum_of_runs from teamTwo
  )
  select 
    teamID, 
	-- sum(sum_of_runs) 
	percentile(sum_of_runs, 50) 
  from all
  group by teamID

You can also try adding this hint: /*+ aggOptions(is_skip_leaf_stage_group_by='true') */ which will still throw an error because the rexliteral gets converted to a ref by Calcite (before optimization).

ankitsultana avatar Jul 02 '24 16:07 ankitsultana

@ankitsultana Take a look at PinotAggregateExchangeNodeInsertRule, where we should already replace all ref to literal to literal. Maybe there are some corner cases not handled (e.g. the ref is not in a project)

Jackie-Jiang avatar Jul 02 '24 20:07 Jackie-Jiang