trino icon indicating copy to clipboard operation
trino copied to clipboard

Get Null Pointer Exception in some queries doing array_agg

Open singhrajk opened this issue 2 years ago • 9 comments

Hi team, Some of the queries with array_agg are failing with null pointer exception, stack trace given below.

Whats more weird is that if we run the query with more resources (num of workers and worker memory), it runs fine.

Version -: Reproducible in v391

Stack Trace -:

java.lang.NullPointerException: Cannot load from short array because "this.array[io.trino.array.BigArrays.segment(long)]" is null
	at io.trino.array.ShortBigArray.get(ShortBigArray.java:74)
	at io.trino.operator.aggregation.AbstractGroupCollectionAggregationState.isEmpty(AbstractGroupCollectionAggregationState.java:173)
	at io.trino.operator.aggregation.arrayagg.ArrayAggregationFunction.output(ArrayAggregationFunction.java:61)
	at io.trino.$gen.array_aggGroupedAccumulator_20220801_055623_81.evaluateFinal(Unknown Source)
	at io.trino.operator.aggregation.OrderedAccumulatorFactory$OrderingGroupedAccumulator.evaluateFinal(OrderedAccumulatorFactory.java:232)
	at io.trino.operator.aggregation.GroupedAggregator.evaluate(GroupedAggregator.java:96)
	at io.trino.operator.aggregation.builder.InMemoryHashAggregationBuilder.lambda$buildResult$2(InMemoryHashAggregationBuilder.java:289)
	at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:391)
	at io.trino.operator.HashAggregationOperator.getOutput(HashAggregationOperator.java:509)
	at io.trino.operator.Driver.processInternal(Driver.java:410)
	at io.trino.operator.Driver.lambda$process$10(Driver.java:313)
	at io.trino.operator.Driver.tryWithLock(Driver.java:698)
	at io.trino.operator.Driver.process(Driver.java:305)
	at io.trino.operator.Driver.processForDuration(Driver.java:276)
	at io.trino.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:740)
	at io.trino.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:164)
	at io.trino.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:490)
	at io.trino.$gen.Trino_391_2_g5fb3da2____20220801_052834_2.run(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:833)

Diff Stat from our internal build -:

.gitlab-ci.yml  | 151 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 ci-build.sh    |  32 +++++++++++++++++++++++++
 ci-release.sh  |  26 ++++++++++++++++++++
 pom.xml        |   1 +

Please let me know if I can contribute somehow.

singhrajk avatar Aug 02 '22 05:08 singhrajk

cc @sopel39 @skrzypo987 @raunaqmorarka

findepi avatar Aug 02 '22 14:08 findepi

@singhrajk did it start to fail in 391?

sopel39 avatar Aug 02 '22 15:08 sopel39

cc @dain

sopel39 avatar Aug 02 '22 15:08 sopel39

@sopel39 it fails in 374 as well. it works well in 350

singhrajk avatar Aug 02 '22 16:08 singhrajk

@singhrajk can you repro with tpch/tpcds?

sopel39 avatar Aug 02 '22 17:08 sopel39

@singhrajk could you also point to first version that fails?

sopel39 avatar Aug 02 '22 17:08 sopel39

Without a reproduction, I don't think it will be possible to figure this out.

My guess is the problem has to do with the OrderedAccumulatorFactory. The code should work if ensure capacity is called on the aggregator. Where this is called can get complex when you have distributed aggregations, but ordered aggregations are not distributed; all of the work happens in that class. It appears that either prepareFinal is not being called or the max group id in the OrderingGroupedAccumulator is out of sync with groupByHash.getGroupCount(). They both seem unlikely, but the first one seems really unlikely.

dain avatar Aug 02 '22 18:08 dain

Able to reproduce the issue with this query -:

with t1 as (
	select cast(null as varchar) as id, cast(uuid() as varchar) as v, ss_item_sk as t from tpcds.sf100.store_sales limit 90000
), t2 as (
	select cast(uuid() as varchar) as id, cast(null as varchar) as v, ss_item_sk as t from tpcds.sf100.store_sales limit 1000000
), t as (
	select id, v, t from t1
	union all 
	select id, v, t from t2
)
select 
	id,
	array_agg( v order by t desc) filter (where v is not null) as v
from t
group by id

Fails with a 10 worker node cluster (with requests.memory = 85 GB), works with 20 worker node cluster.

singhrajk avatar Aug 05 '22 08:08 singhrajk

Hi @dain @sopel39 @findepi, Any updates on this?

singhrajk avatar Aug 10 '22 03:08 singhrajk

@dain this query indeed fails for me even on single node:

with t1 as (
	select cast(null as varchar) as id, cast(uuid() as varchar) as v, ss_item_sk as t from tpcds.sf100.store_sales limit 90000
), t2 as (
	select cast(uuid() as varchar) as id, cast(null as varchar) as v, ss_item_sk as t from tpcds.sf100.store_sales limit 1000000
), t as (
	select id, v, t from t1
	union all 
	select id, v, t from t2
)
select 
	id,
	array_agg( v order by t desc) filter (where v is not null) as v
from t
group by id


Query 20220811_085947_00000_jj4e8 failed: Cannot load from short array because "this.array[io.trino.array.BigArrays.segment(long)]" is null

sopel39 avatar Aug 11 '22 09:08 sopel39

Thanks for the PR, just wondering why does it work if we throw more resources at it?

singhrajk avatar Aug 12 '22 06:08 singhrajk