trino Get Null Pointer Exception in some queries doing array

Hi team, Some of the queries with array_agg are failing with null pointer exception, stack trace given below.

Whats more weird is that if we run the query with more resources (num of workers and worker memory), it runs fine.

Version -: Reproducible in v391

Stack Trace -:

java.lang.NullPointerException: Cannot load from short array because "this.array[io.trino.array.BigArrays.segment(long)]" is null
	at io.trino.array.ShortBigArray.get(ShortBigArray.java:74)
	at io.trino.operator.aggregation.AbstractGroupCollectionAggregationState.isEmpty(AbstractGroupCollectionAggregationState.java:173)
	at io.trino.operator.aggregation.arrayagg.ArrayAggregationFunction.output(ArrayAggregationFunction.java:61)
	at io.trino.$gen.array_aggGroupedAccumulator_20220801_055623_81.evaluateFinal(Unknown Source)
	at io.trino.operator.aggregation.OrderedAccumulatorFactory$OrderingGroupedAccumulator.evaluateFinal(OrderedAccumulatorFactory.java:232)
	at io.trino.operator.aggregation.GroupedAggregator.evaluate(GroupedAggregator.java:96)
	at io.trino.operator.aggregation.builder.InMemoryHashAggregationBuilder.lambda$buildResult$2(InMemoryHashAggregationBuilder.java:289)
	at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:391)
	at io.trino.operator.HashAggregationOperator.getOutput(HashAggregationOperator.java:509)
	at io.trino.operator.Driver.processInternal(Driver.java:410)
	at io.trino.operator.Driver.lambda$process$10(Driver.java:313)
	at io.trino.operator.Driver.tryWithLock(Driver.java:698)
	at io.trino.operator.Driver.process(Driver.java:305)
	at io.trino.operator.Driver.processForDuration(Driver.java:276)
	at io.trino.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:740)
	at io.trino.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:164)
	at io.trino.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:490)
	at io.trino.$gen.Trino_391_2_g5fb3da2____20220801_052834_2.run(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:833)

Diff Stat from our internal build -:

.gitlab-ci.yml  | 151 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 ci-build.sh    |  32 +++++++++++++++++++++++++
 ci-release.sh  |  26 ++++++++++++++++++++
 pom.xml        |   1 +

Please let me know if I can contribute somehow.

Aug 02 '22 05:08 singhrajk

cc @sopel39 @skrzypo987 @raunaqmorarka

Aug 02 '22 14:08 findepi

@singhrajk did it start to fail in 391?

Aug 02 '22 15:08 sopel39

cc @dain

Aug 02 '22 15:08 sopel39

@sopel39 it fails in 374 as well. it works well in 350

Aug 02 '22 16:08 singhrajk

@singhrajk can you repro with tpch/tpcds?

Aug 02 '22 17:08 sopel39

@singhrajk could you also point to first version that fails?

Aug 02 '22 17:08 sopel39

Without a reproduction, I don't think it will be possible to figure this out.

My guess is the problem has to do with the OrderedAccumulatorFactory. The code should work if ensure capacity is called on the aggregator. Where this is called can get complex when you have distributed aggregations, but ordered aggregations are not distributed; all of the work happens in that class. It appears that either prepareFinal is not being called or the max group id in the OrderingGroupedAccumulator is out of sync with groupByHash.getGroupCount(). They both seem unlikely, but the first one seems really unlikely.

Aug 02 '22 18:08 dain

Able to reproduce the issue with this query -:

with t1 as (
	select cast(null as varchar) as id, cast(uuid() as varchar) as v, ss_item_sk as t from tpcds.sf100.store_sales limit 90000
), t2 as (
	select cast(uuid() as varchar) as id, cast(null as varchar) as v, ss_item_sk as t from tpcds.sf100.store_sales limit 1000000
), t as (
	select id, v, t from t1
	union all 
	select id, v, t from t2
)
select 
	id,
	array_agg( v order by t desc) filter (where v is not null) as v
from t
group by id

Fails with a 10 worker node cluster (with requests.memory = 85 GB), works with 20 worker node cluster.

Aug 05 '22 08:08 singhrajk

Hi @dain @sopel39 @findepi, Any updates on this?

Aug 10 '22 03:08 singhrajk

@dain this query indeed fails for me even on single node:

with t1 as (
	select cast(null as varchar) as id, cast(uuid() as varchar) as v, ss_item_sk as t from tpcds.sf100.store_sales limit 90000
), t2 as (
	select cast(uuid() as varchar) as id, cast(null as varchar) as v, ss_item_sk as t from tpcds.sf100.store_sales limit 1000000
), t as (
	select id, v, t from t1
	union all 
	select id, v, t from t2
)
select 
	id,
	array_agg( v order by t desc) filter (where v is not null) as v
from t
group by id


Query 20220811_085947_00000_jj4e8 failed: Cannot load from short array because "this.array[io.trino.array.BigArrays.segment(long)]" is null

Aug 11 '22 09:08 sopel39

Thanks for the PR, just wondering why does it work if we throw more resources at it?

Aug 12 '22 06:08 singhrajk

trino
trino copied to clipboard

Get Null Pointer Exception in some queries doing array_agg

trino trino copied to clipboard

Get Null Pointer Exception in some queries doing array_agg

trino
trino copied to clipboard