OpenSearch [AUTOCUT] Gradle Check Flaky Test Report for FilteringAllocationIT

Flaky Test Report for `FilteringAllocationIT`

`FilteringAllocationIT` has flaky tests that failed during post-merge actions.

Details

Git Reference	Merged Pull Request	Build Details	Test Name
8a847f22a157f3cc1d8d851f04d9f7326a70cc78	18301	58128	`org.opensearch.cluster.allocation.FilteringAllocationIT.testDecommissionNodeNoReplicas`

Other pull requests, besides those involved in post-merge actions, with failing tests in the `FilteringAllocationIT` class:

18255
18358

For more details on the failed tests, refer to the OpenSearch Gradle Check Metrics dashboard.

May 15 '25 18:05 opensearch-ci-bot

REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.cluster.allocation.FilteringAllocationIT.testDecommissionNodeNoReplicas" -Dtests.seed=EAFBBB1F7B5F6BFD -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=ar-DJ -Dtests.timezone=Asia/Seoul -Druntime.java=21

FilteringAllocationIT > testDecommissionNodeNoReplicas FAILED
    Failed to execute phase [query], all shards failed
        at app//org.opensearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:775)
        at app//org.opensearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:395)
        at app//org.opensearch.action.search.AbstractSearchAsyncAction.onPhaseDone(AbstractSearchAsyncAction.java:815)
        at app//org.opensearch.action.search.AbstractSearchAsyncAction.onShardFailure(AbstractSearchAsyncAction.java:548)
        at app//org.opensearch.action.search.AbstractSearchAsyncAction.lambda$performPhaseOnShard$0(AbstractSearchAsyncAction.java:290)
        at app//org.opensearch.action.search.AbstractSearchAsyncAction$2.doRun(AbstractSearchAsyncAction.java:373)
        at app//org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
        at app//org.opensearch.threadpool.TaskAwareRunnable.doRun(TaskAwareRunnable.java:78)
        at app//org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
        at app//org.opensearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:59)
        at app//org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:975)
        at app//org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
        at java.****@21.0.7/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
        at java.****@21.0.7/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
        at java.****@21.0.7/java.lang.Thread.run(Thread.java:1583)

    java.lang.NullPointerException: Cannot invoke "org.opensearch.gateway.TransportNodesGatewayStartedShardHelper$GatewayStartedShard.allocationId()" because "shardData" is null
        at org.opensearch.gateway.PrimaryShardBatchAllocator.lambda$adaptToNodeShardStates$0(PrimaryShardBatchAllocator.java:153)
        at java.****/java.util.HashMap.forEach(HashMap.java:1429)
        at org.opensearch.gateway.PrimaryShardBatchAllocator.adaptToNodeShardStates(PrimaryShardBatchAllocator.java:149)
        at org.opensearch.gateway.PrimaryShardBatchAllocator.allocateUnassignedBatch(PrimaryShardBatchAllocator.java:115)
        at org.opensearch.gateway.ShardsBatchGatewayAllocator$3.run(ShardsBatchGatewayAllocator.java:330)
        at org.opensearch.common.util.BatchRunnableExecutor.run(BatchRunnableExecutor.java:54)
        at java.****/java.util.Optional.ifPresent(Optional.java:178)
        at org.opensearch.cluster.routing.allocation.AllocationService.allocateAllUnassignedShards(AllocationService.java:653)
        at org.opensearch.cluster.routing.allocation.AllocationService.allocateExistingUnassignedShards(AllocationService.java:625)
        at org.opensearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:601)
        at org.opensearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:566)
        at org.opensearch.cluster.metadata.MetadataDeleteIndexService.deleteIndices(MetadataDeleteIndexService.java:198)
        at org.opensearch.action.admin.indices.datastream.DeleteDataStreamAction$TransportAction.removeDataStream(DeleteDataStreamAction.java:275)
        at org.opensearch.action.admin.indices.datastream.DeleteDataStreamAction$TransportAction$1.execute(DeleteDataStreamAction.java:226)
        at org.opensearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:67)
        at org.opensearch.cluster.service.ClusterManagerService.executeTasks(ClusterManagerService.java:889)
        at org.opensearch.cluster.service.ClusterManagerService.calculateTaskOutputs(ClusterManagerService.java:441)
        at org.opensearch.cluster.service.ClusterManagerService.runTasks(ClusterManagerService.java:301)
        at org.opensearch.cluster.service.ClusterManagerService$Batcher.run(ClusterManagerService.java:214)
        at org.opensearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:206)
        at org.opensearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:264)
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:916)
        at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedOpenSearchThreadPoolExecutor.java:283)
        at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedOpenSearchThreadPoolExecutor.java:246)
        at java.****/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
        at java.****/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
        at java.****/java.lang.Thread.run(Thread.java:1583)

May 15 '25 21:05 andrross

Looks like a null pointer exception here:

https://github.com/opensearch-project/OpenSearch/blob/125b77300eb783f502a23ec0329c77e3d7170494/server/src/main/java/org/opensearch/gateway/PrimaryShardBatchAllocator.java#L153

May 15 '25 21:05 andrross

@amkhar @gargmanik13 @shwetathareja Can someone take a look here? This might be related to the batch allocation change. I don't see any history of this IT failing in the past.

May 15 '25 21:05 andrross

@SwethaGuptha can you please take a look?

May 20 '25 09:05 shwetathareja

The issue was reproducible when running the test multiple times:

./gradlew ':server:internalClusterTest' --tests "org.opensearch.cluster.allocation.FilteringAllocationIT.testDecommissionNodeNoReplicas" -Dtests.seed=EAFBBB1F7B5F6BFD -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=ar-DJ -Dtests.timezone=Asia/Seoul -Druntime.java=21  -Dtests.iters=150 --info

With some additional logging, I observed that the batch cache entry is empty for node_t1, which does not hold any shard. This looks like a valid state because node_t1 had no shards assigned, so a fix is required to handle an empty shard list.

    [2025-05-29T01:17:24,402][INFO ][o.o.c.a.FilteringAllocationIT] [testDecommissionNodeNoReplicas] --> verify all are allocated on node0 now
    [2025-05-29T01:17:24,403][INFO ][o.o.c.m.MetadataIndexStateService] [node_t0] opening indices [[test/8jonKUN0QqmEkfiXTnk9VQ]]
    [2025-05-29T01:17:24,403][INFO ][o.o.p.PluginsService     ] [node_t0] PluginService:onIndexModule index:[test/8jonKUN0QqmEkfiXTnk9VQ]
    [2025-05-29T01:17:24,404][INFO ][o.o.c.r.a.AllocationService] [node_t0] Applying reroute for reasons: indices opened [[[test/8jonKUN0QqmEkfiXTnk9VQ]]]
    [2025-05-29T01:17:24,436][INFO ][o.o.c.r.a.AllocationService] [node_t0] Applying reroute for reasons: async_shard_batch_fetch
    [2025-05-29T01:17:24,437][INFO ][o.o.g.S.InternalPrimaryBatchShardAllocator] [node_t0] Shard data is null, for node {node_t1}{61civlcPQM-l84nvIDJz8g}{v7lcsrkqSMmfRCdIMpi0pw}{127.0.0.1}{127.0.0.1:62276}{dimr}{shard_indexing_pressure_enabled=true} batch data {{node_t0}{lx4flO3ARI6jrggtrhdGxA}{FvaM1SVZSK6PqRr4KjgIfQ}{127.0.0.1}{127.0.0.1:62275}{dimr}{shard_indexing_pressure_enabled=true}=NodeGatewayStartedShardsBatch{nodeGatewayStartedShardsBatch={[test][0]=NodeGatewayStartedShards[allocationId=08ZAlBBrQKGi6DFYTyg0fw,primary=true]}}, {node_t1}{61civlcPQM-l84nvIDJz8g}{v7lcsrkqSMmfRCdIMpi0pw}{127.0.0.1}{127.0.0.1:62276}{dimr}{shard_indexing_pressure_enabled=true}=NodeGatewayStartedShardsBatch{nodeGatewayStartedShardsBatch={}}} and unassigned shards [test][0], node[null], [P], recovery_source[existing store recovery; bootstrap_history_uuid=false], s[UNASSIGNED], unassigned_info[[reason=INDEX_REOPENED], at[2025-05-28T16:17:24.404Z], delayed=false, allocation_status[fetching_shard_data]]
    [2025-05-29T01:17:24,437][ERROR][o.o.c.r.BatchedRerouteService] [node_t0] unexpected failure during [cluster_reroute(async_shard_batch_fetch)], current state version [14]
    java.lang.NullPointerException: Cannot invoke "org.opensearch.gateway.TransportNodesGatewayStartedShardHelper$GatewayStartedShard.allocationId()" because "shardData" is null
        at org.opensearch.gateway.PrimaryShardBatchAllocator.adaptToNodeShardStates(PrimaryShardBatchAllocator.java:158) ~[main/:?]
        at org.opensearch.gateway.PrimaryShardBatchAllocator.allocateUnassignedBatch(PrimaryShardBatchAllocator.java:115) ~[main/:?]
        at org.opensearch.gateway.ShardsBatchGatewayAllocator$3.run(ShardsBatchGatewayAllocator.java:330) ~[main/:?]
        at org.opensearch.common.util.BatchRunnableExecutor.run(BatchRunnableExecutor.java:54) ~[main/:?]
        at java.base/java.util.Optional.ifPresent(Optional.java:178) ~[?:?]
        at org.opensearch.cluster.routing.allocation.AllocationService.allocateAllUnassignedShards(AllocationService.java:654) ~[main/:?]
        at org.opensearch.cluster.routing.allocation.AllocationService.allocateExistingUnassignedShards(AllocationService.java:626) ~[main/:?]
        at org.opensearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:602) ~[main/:?]
        at org.opensearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:567) ~[main/:?]
        at org.opensearch.cluster.routing.BatchedRerouteService$1.execute(BatchedRerouteService.java:136) ~[main/:?]
        at org.opensearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:67) ~[main/:?]
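
To illustrate the state in the log above, here is a minimal, self-contained sketch of one possible way to tolerate a node whose batch entry is empty. It is not the actual OpenSearch code and not necessarily the fix that was merged: `ShardData`, `BatchStateAdapter`, and `statesForShard` are hypothetical names made up for this example, and plain maps stand in for the real batch cache types. The idea is simply to skip a node when the lookup for the shard returns null instead of dereferencing it.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the per-shard data a node reports (not the real OpenSearch class).
final class ShardData {
    final String allocationId;
    final boolean primary;

    ShardData(String allocationId, boolean primary) {
        this.allocationId = allocationId;
        this.primary = primary;
    }
}

// Hypothetical helper that mimics adapting per-node batch data to per-shard state.
final class BatchStateAdapter {

    // nodeBatches maps node name -> (shard id -> shard data).
    // A node that holds no copy of the shard has an empty inner map,
    // so the lookup for that shard returns null and must be guarded.
    static Map<String, ShardData> statesForShard(Map<String, Map<String, ShardData>> nodeBatches, String shardId) {
        Map<String, ShardData> result = new LinkedHashMap<>();
        nodeBatches.forEach((node, batch) -> {
            ShardData shardData = batch.get(shardId);
            if (shardData == null) {
                // Valid state: this node simply has no data for the shard. Skip it.
                return;
            }
            result.put(node, shardData);
        });
        return result;
    }

    public static void main(String[] args) {
        Map<String, ShardData> node0Batch = new HashMap<>();
        node0Batch.put("[test][0]", new ShardData("08ZAlBBrQKGi6DFYTyg0fw", true));

        Map<String, Map<String, ShardData>> batches = new LinkedHashMap<>();
        batches.put("node_t0", node0Batch);
        batches.put("node_t1", Collections.emptyMap()); // empty batch, as in the log above

        // Only node_t0 contributes a state; node_t1 is skipped without an exception.
        System.out.println(statesForShard(batches, "[test][0]").keySet());
    }
}
```

Whether skipping the node is the right behavior for the real allocator (as opposed to, say, recording an empty state for it) is an assumption of this sketch.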

Jun 12 '25 05:06 SwethaGuptha

Hi @andrross @SwethaGuptha, I first saw this test fail in the Gradle check for #18255, spent some time analyzing it, and submitted a PR with a fix. Can you please take a look?

Jun 17 '25 09:06 guojialiang92