trino
trino copied to clipboard
Get Null Pointer Exception in some queries doing array_agg
Hi team, Some of the queries with array_agg are failing with null pointer exception, stack trace given below.
Whats more weird is that if we run the query with more resources (num of workers and worker memory), it runs fine.
Version -: Reproducible in v391
Stack Trace -:
java.lang.NullPointerException: Cannot load from short array because "this.array[io.trino.array.BigArrays.segment(long)]" is null
at io.trino.array.ShortBigArray.get(ShortBigArray.java:74)
at io.trino.operator.aggregation.AbstractGroupCollectionAggregationState.isEmpty(AbstractGroupCollectionAggregationState.java:173)
at io.trino.operator.aggregation.arrayagg.ArrayAggregationFunction.output(ArrayAggregationFunction.java:61)
at io.trino.$gen.array_aggGroupedAccumulator_20220801_055623_81.evaluateFinal(Unknown Source)
at io.trino.operator.aggregation.OrderedAccumulatorFactory$OrderingGroupedAccumulator.evaluateFinal(OrderedAccumulatorFactory.java:232)
at io.trino.operator.aggregation.GroupedAggregator.evaluate(GroupedAggregator.java:96)
at io.trino.operator.aggregation.builder.InMemoryHashAggregationBuilder.lambda$buildResult$2(InMemoryHashAggregationBuilder.java:289)
at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:391)
at io.trino.operator.HashAggregationOperator.getOutput(HashAggregationOperator.java:509)
at io.trino.operator.Driver.processInternal(Driver.java:410)
at io.trino.operator.Driver.lambda$process$10(Driver.java:313)
at io.trino.operator.Driver.tryWithLock(Driver.java:698)
at io.trino.operator.Driver.process(Driver.java:305)
at io.trino.operator.Driver.processForDuration(Driver.java:276)
at io.trino.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:740)
at io.trino.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:164)
at io.trino.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:490)
at io.trino.$gen.Trino_391_2_g5fb3da2____20220801_052834_2.run(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:833)
Diff Stat from our internal build -:
.gitlab-ci.yml | 151 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
ci-build.sh | 32 +++++++++++++++++++++++++
ci-release.sh | 26 ++++++++++++++++++++
pom.xml | 1 +
Please let me know if I can contribute somehow.
cc @sopel39 @skrzypo987 @raunaqmorarka
@singhrajk did it start to fail in 391?
cc @dain
@sopel39 it fails in 374 as well. it works well in 350
@singhrajk can you repro with tpch/tpcds?
@singhrajk could you also point to first version that fails?
Without a reproduction, I don't think it will be possible to figure this out.
My guess is the problem has to do with the OrderedAccumulatorFactory
. The code should work if ensure capacity is called on the aggregator. Where this is called can get complex when you have distributed aggregations, but ordered aggregations are not distributed; all of the work happens in that class. It appears that either prepareFinal
is not being called or the max group id in the OrderingGroupedAccumulator
is out of sync with groupByHash.getGroupCount()
. They both seem unlikely, but the first one seems really unlikely.
Able to reproduce the issue with this query -:
with t1 as (
select cast(null as varchar) as id, cast(uuid() as varchar) as v, ss_item_sk as t from tpcds.sf100.store_sales limit 90000
), t2 as (
select cast(uuid() as varchar) as id, cast(null as varchar) as v, ss_item_sk as t from tpcds.sf100.store_sales limit 1000000
), t as (
select id, v, t from t1
union all
select id, v, t from t2
)
select
id,
array_agg( v order by t desc) filter (where v is not null) as v
from t
group by id
Fails with a 10 worker node cluster (with requests.memory = 85 GB), works with 20 worker node cluster.
Hi @dain @sopel39 @findepi, Any updates on this?
@dain this query indeed fails for me even on single node:
with t1 as (
select cast(null as varchar) as id, cast(uuid() as varchar) as v, ss_item_sk as t from tpcds.sf100.store_sales limit 90000
), t2 as (
select cast(uuid() as varchar) as id, cast(null as varchar) as v, ss_item_sk as t from tpcds.sf100.store_sales limit 1000000
), t as (
select id, v, t from t1
union all
select id, v, t from t2
)
select
id,
array_agg( v order by t desc) filter (where v is not null) as v
from t
group by id
Query 20220811_085947_00000_jj4e8 failed: Cannot load from short array because "this.array[io.trino.array.BigArrays.segment(long)]" is null
Thanks for the PR, just wondering why does it work if we throw more resources at it?