[AUTOCUT] Gradle Check Flaky Test Report for FilteringAllocationIT
Noticed that FilteringAllocationIT has flaky tests that failed during post-merge actions.
Details
| Git Reference | Merged Pull Request | Build Details | Test Name |
|---|---|---|---|
| 8a847f22a157f3cc1d8d851f04d9f7326a70cc78 | 18301 | 58128 | org.opensearch.cluster.allocation.FilteringAllocationIT.testDecommissionNodeNoReplicas |
The other pull requests, besides those involved in post-merge actions, that contain failing tests with the FilteringAllocationIT class are:
For more details on the failed tests, refer to the OpenSearch Gradle Check Metrics dashboard.
REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.cluster.allocation.FilteringAllocationIT.testDecommissionNodeNoReplicas" -Dtests.seed=EAFBBB1F7B5F6BFD -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=ar-DJ -Dtests.timezone=Asia/Seoul -Druntime.java=21
FilteringAllocationIT > testDecommissionNodeNoReplicas FAILED
Failed to execute phase [query], all shards failed
at app//org.opensearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:775)
at app//org.opensearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:395)
at app//org.opensearch.action.search.AbstractSearchAsyncAction.onPhaseDone(AbstractSearchAsyncAction.java:815)
at app//org.opensearch.action.search.AbstractSearchAsyncAction.onShardFailure(AbstractSearchAsyncAction.java:548)
at app//org.opensearch.action.search.AbstractSearchAsyncAction.lambda$performPhaseOnShard$0(AbstractSearchAsyncAction.java:290)
at app//org.opensearch.action.search.AbstractSearchAsyncAction$2.doRun(AbstractSearchAsyncAction.java:373)
at app//org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
at app//org.opensearch.threadpool.TaskAwareRunnable.doRun(TaskAwareRunnable.java:78)
at app//org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
at app//org.opensearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:59)
at app//org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:975)
at app//org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
at java.base@21.0.7/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
at java.base@21.0.7/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
at java.base@21.0.7/java.lang.Thread.run(Thread.java:1583)
java.lang.NullPointerException: Cannot invoke "org.opensearch.gateway.TransportNodesGatewayStartedShardHelper$GatewayStartedShard.allocationId()" because "shardData" is null
at org.opensearch.gateway.PrimaryShardBatchAllocator.lambda$adaptToNodeShardStates$0(PrimaryShardBatchAllocator.java:153)
at java.base/java.util.HashMap.forEach(HashMap.java:1429)
at org.opensearch.gateway.PrimaryShardBatchAllocator.adaptToNodeShardStates(PrimaryShardBatchAllocator.java:149)
at org.opensearch.gateway.PrimaryShardBatchAllocator.allocateUnassignedBatch(PrimaryShardBatchAllocator.java:115)
at org.opensearch.gateway.ShardsBatchGatewayAllocator$3.run(ShardsBatchGatewayAllocator.java:330)
at org.opensearch.common.util.BatchRunnableExecutor.run(BatchRunnableExecutor.java:54)
at java.base/java.util.Optional.ifPresent(Optional.java:178)
at org.opensearch.cluster.routing.allocation.AllocationService.allocateAllUnassignedShards(AllocationService.java:653)
at org.opensearch.cluster.routing.allocation.AllocationService.allocateExistingUnassignedShards(AllocationService.java:625)
at org.opensearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:601)
at org.opensearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:566)
at org.opensearch.cluster.metadata.MetadataDeleteIndexService.deleteIndices(MetadataDeleteIndexService.java:198)
at org.opensearch.action.admin.indices.datastream.DeleteDataStreamAction$TransportAction.removeDataStream(DeleteDataStreamAction.java:275)
at org.opensearch.action.admin.indices.datastream.DeleteDataStreamAction$TransportAction$1.execute(DeleteDataStreamAction.java:226)
at org.opensearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:67)
at org.opensearch.cluster.service.ClusterManagerService.executeTasks(ClusterManagerService.java:889)
at org.opensearch.cluster.service.ClusterManagerService.calculateTaskOutputs(ClusterManagerService.java:441)
at org.opensearch.cluster.service.ClusterManagerService.runTasks(ClusterManagerService.java:301)
at org.opensearch.cluster.service.ClusterManagerService$Batcher.run(ClusterManagerService.java:214)
at org.opensearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:206)
at org.opensearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:264)
at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:916)
at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedOpenSearchThreadPoolExecutor.java:283)
at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedOpenSearchThreadPoolExecutor.java:246)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
at java.base/java.lang.Thread.run(Thread.java:1583)
Looks like a null pointer exception here:
https://github.com/opensearch-project/OpenSearch/blob/125b77300eb783f502a23ec0329c77e3d7170494/server/src/main/java/org/opensearch/gateway/PrimaryShardBatchAllocator.java#L153
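A minimal standalone reproduction of that failure mode, using a hypothetical simplified record in place of `TransportNodesGatewayStartedShardHelper.GatewayStartedShard` (names and shapes here are assumptions for illustration, not the real OpenSearch API):

```java
import java.util.Map;

public class NpeRepro {
    // Hypothetical stand-in for the real GatewayStartedShard type.
    record GatewayStartedShard(String allocationId) {}

    public static void main(String[] args) {
        // node_t1's batch response contains no entry for the shard, so the
        // map lookup returns null.
        Map<String, GatewayStartedShard> node1Shards = Map.of();
        GatewayStartedShard shardData = node1Shards.get("[test][0]");
        try {
            // Dereferencing the null lookup result throws the same
            // "because \"shardData\" is null" NPE seen in the stack trace.
            shardData.allocationId();
        } catch (NullPointerException e) {
            System.out.println("NPE as in the report: " + e.getMessage());
        }
    }
}
```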
@amkhar @gargmanik13 @shwetathareja Can someone take a look here? This might be related to the batch allocation change. I don't see any history of this IT failing in the past.
@SwethaGuptha can you please take a look?
Issue was reproducible on running the test multiple times:
./gradlew ':server:internalClusterTest' --tests "org.opensearch.cluster.allocation.FilteringAllocationIT.testDecommissionNodeNoReplicas" -Dtests.seed=EAFBBB1F7B5F6BFD -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=ar-DJ -Dtests.timezone=Asia/Seoul -Druntime.java=21 -Dtests.iters=150 --info
With some additional logging, I observed that the batch cache entry is empty for node 1, which doesn't hold any shards. This seems like a valid state because node 1 had no shards assigned; a fix is required to handle the empty shard list.
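One way to handle that state is to skip nodes whose batch response carries no data for the shard, instead of unconditionally dereferencing the lookup result. The sketch below mimics the `adaptToNodeShardStates` loop under that assumption; the record and method signatures are simplified stand-ins, not the actual OpenSearch fix:

```java
import java.util.HashMap;
import java.util.Map;

public class EmptyBatchEntrySketch {
    // Hypothetical stand-in for the real GatewayStartedShard type.
    record GatewayStartedShard(String allocationId) {}

    // Collect each node's view of the shard, skipping nodes whose batch
    // response has no entry for it (e.g. a node holding no shards).
    static Map<String, String> adaptToNodeShardStates(
            Map<String, Map<String, GatewayStartedShard>> batch, String shardId) {
        Map<String, String> out = new HashMap<>();
        batch.forEach((node, shards) -> {
            GatewayStartedShard shardData = shards.get(shardId);
            // Without this null check, shardData.allocationId() throws the
            // NullPointerException from the report whenever a node returned
            // an empty shard map, as node_t1 does in the logs above.
            if (shardData != null) {
                out.put(node, shardData.allocationId());
            }
        });
        return out;
    }

    public static void main(String[] args) {
        Map<String, Map<String, GatewayStartedShard>> batch = new HashMap<>();
        batch.put("node_t0", Map.of("[test][0]",
                new GatewayStartedShard("08ZAlBBrQKGi6DFYTyg0fw")));
        batch.put("node_t1", Map.of()); // shard-less node: empty batch entry
        System.out.println(adaptToNodeShardStates(batch, "[test][0]"));
    }
}
```

With the guard in place, the reroute completes and the shard-less node simply contributes no state for the shard.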
[2025-05-29T01:17:24,402][INFO ][o.o.c.a.FilteringAllocationIT] [testDecommissionNodeNoReplicas] --> verify all are allocated on node0 now
[2025-05-29T01:17:24,403][INFO ][o.o.c.m.MetadataIndexStateService] [node_t0] opening indices [[test/8jonKUN0QqmEkfiXTnk9VQ]]
[2025-05-29T01:17:24,403][INFO ][o.o.p.PluginsService ] [node_t0] PluginService:onIndexModule index:[test/8jonKUN0QqmEkfiXTnk9VQ]
[2025-05-29T01:17:24,404][INFO ][o.o.c.r.a.AllocationService] [node_t0] Applying reroute for reasons: indices opened [[[test/8jonKUN0QqmEkfiXTnk9VQ]]]
[2025-05-29T01:17:24,436][INFO ][o.o.c.r.a.AllocationService] [node_t0] Applying reroute for reasons: async_shard_batch_fetch
[2025-05-29T01:17:24,437][INFO ][o.o.g.S.InternalPrimaryBatchShardAllocator] [node_t0] Shard data is null, for node {node_t1}{61civlcPQM-l84nvIDJz8g}{v7lcsrkqSMmfRCdIMpi0pw}{127.0.0.1}{127.0.0.1:62276}{dimr}{shard_indexing_pressure_enabled=true} batch data {{node_t0}{lx4flO3ARI6jrggtrhdGxA}{FvaM1SVZSK6PqRr4KjgIfQ}{127.0.0.1}{127.0.0.1:62275}{dimr}{shard_indexing_pressure_enabled=true}=NodeGatewayStartedShardsBatch{nodeGatewayStartedShardsBatch={[test][0]=NodeGatewayStartedShards[allocationId=08ZAlBBrQKGi6DFYTyg0fw,primary=true]}}, {node_t1}{61civlcPQM-l84nvIDJz8g}{v7lcsrkqSMmfRCdIMpi0pw}{127.0.0.1}{127.0.0.1:62276}{dimr}{shard_indexing_pressure_enabled=true}=NodeGatewayStartedShardsBatch{nodeGatewayStartedShardsBatch={}}} and unassigned shards [test][0], node[null], [P], recovery_source[existing store recovery; bootstrap_history_uuid=false], s[UNASSIGNED], unassigned_info[[reason=INDEX_REOPENED], at[2025-05-28T16:17:24.404Z], delayed=false, allocation_status[fetching_shard_data]]
[2025-05-29T01:17:24,437][ERROR][o.o.c.r.BatchedRerouteService] [node_t0] unexpected failure during [cluster_reroute(async_shard_batch_fetch)], current state version [14]
java.lang.NullPointerException: Cannot invoke "org.opensearch.gateway.TransportNodesGatewayStartedShardHelper$GatewayStartedShard.allocationId()" because "shardData" is null
at org.opensearch.gateway.PrimaryShardBatchAllocator.adaptToNodeShardStates(PrimaryShardBatchAllocator.java:158) ~[main/:?]
at org.opensearch.gateway.PrimaryShardBatchAllocator.allocateUnassignedBatch(PrimaryShardBatchAllocator.java:115) ~[main/:?]
at org.opensearch.gateway.ShardsBatchGatewayAllocator$3.run(ShardsBatchGatewayAllocator.java:330) ~[main/:?]
at org.opensearch.common.util.BatchRunnableExecutor.run(BatchRunnableExecutor.java:54) ~[main/:?]
at java.base/java.util.Optional.ifPresent(Optional.java:178) ~[?:?]
at org.opensearch.cluster.routing.allocation.AllocationService.allocateAllUnassignedShards(AllocationService.java:654) ~[main/:?]
at org.opensearch.cluster.routing.allocation.AllocationService.allocateExistingUnassignedShards(AllocationService.java:626) ~[main/:?]
at org.opensearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:602) ~[main/:?]
at org.opensearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:567) ~[main/:?]
at org.opensearch.cluster.routing.BatchedRerouteService$1.execute(BatchedRerouteService.java:136) ~[main/:?]
at org.opensearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:67) ~[main/:?]
Hi @andrross @SwethaGuptha, I first saw this test failure in #18255's Gradle check, spent some time analyzing it, and submitted a PR with the fix. Can you please take a look?