Varun Bansal

Results 48 comments of Varun Bansal

The recently reported failure is due to a suite timeout:
```
java.lang.Exception: Test abandoned because suite timeout was reached.
    at __randomizedtesting.SeedInfo.seed([ADC655A786161FFE]:0)
```
Found one blocked thread:
```
12) Thread[id=4135, name=opensearch[node_s2][generic][T#3], state=BLOCKED,...
```

The test was unable to terminate node_s2 cleanly; the node lingered for 20 minutes:
```
[2023-12-22T10:14:00,407][INFO ][o.o.t.InternalTestCluster] [testRTSRestoreWithRefreshedDataPrimaryReplicaDown] Closing filtered random node [node_s2]
[2023-12-22T10:14:00,408][INFO ][o.o.n.Node ] [testRTSRestoreWithRefreshedDataPrimaryReplicaDown] stopping ......
```

Looks like the thread was blocked on https://github.com/opensearch-project/OpenSearch/blob/5c82ab885a876d659c9714c3b080488777506027/server/src/main/java/org/opensearch/index/shard/IndexShard.java#L4752-L4763

We terminate the nodes in replica-then-primary order, but we only check shard 0. The replica node for shard 0 would have had primaries for other shards....
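To make the point about shard 0 concrete, here is a minimal sketch. The `ShardCopy` type, the helper names, and the two-shard layout are all hypothetical and are not taken from the OpenSearch test code; the sketch only shows that a check limited to shard 0 can label a node a "replica node" even though it holds a primary for another shard.

```java
import java.util.List;

// Hypothetical model, not OpenSearch code: one shard copy hosted on a node.
final class ShardCopy {
    final int shardId;
    final boolean primary;

    ShardCopy(int shardId, boolean primary) {
        this.shardId = shardId;
        this.primary = primary;
    }
}

final class ShardRoleSketch {

    // True if the node hosts the primary copy of the given shard.
    static boolean holdsPrimaryForShard(List<ShardCopy> copiesOnNode, int shardId) {
        return copiesOnNode.stream().anyMatch(c -> c.shardId == shardId && c.primary);
    }

    // True if the node hosts the primary copy of any shard.
    static boolean holdsAnyPrimary(List<ShardCopy> copiesOnNode) {
        return copiesOnNode.stream().anyMatch(c -> c.primary);
    }

    // Assumed layout: the node holds the replica of shard 0 and the primary of shard 1.
    static List<ShardCopy> exampleNode() {
        return List.of(new ShardCopy(0, false), new ShardCopy(1, true));
    }

    public static void main(String[] args) {
        List<ShardCopy> node = exampleNode();
        System.out.println("primary for shard 0: " + holdsPrimaryForShard(node, 0));
        System.out.println("primary for any shard: " + holdsAnyPrimary(node));
    }
}
```

Under this assumed layout, stopping the node first as "the replica" also takes down the primary of shard 1, which a shard-0-only check would not notice.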

Index shutdown and replica-to-primary promotion are causing a deadlock:
```
"opensearch[node_s2][generic][T#3]" ID=4135 BLOCKED on java.lang.Object@3c0fa470 owned by "opensearch[node_s2][indices_shutdown][T#1]" ID=4183
    at app//org.opensearch.index.shard.IndexShard$11.getSegmentInfosSnapshot(IndexShard.java:4768)
    - blocked on java.lang.Object@3c0fa470
    at app//org.opensearch.index.shard.IndexShard.getSegmentInfosSnapshot(IndexShard.java:5113)
    at app//org.opensearch.index.shard.IndexShard.getLatestSegmentInfosAndCheckpoint(IndexShard.java:1676)...
```
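For readers unfamiliar with this kind of thread dump, the following is a generic sketch of a monitor deadlock and how the JVM can report it. It does not reproduce the actual `IndexShard` locking: the two lock objects and thread names are hypothetical stand-ins, and the opposite-order acquisition is an assumed cause chosen for illustration. The only real API used is the JDK's `ThreadMXBean.findMonitorDeadlockedThreads()`.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

// Generic illustration only; the locks and thread names are hypothetical.
final class MonitorDeadlockSketch {

    // Starts a daemon thread that takes `first`, waits until both threads
    // hold their first monitor, then tries to take `second`.
    private static void startLocker(String name, Object first, Object second,
                                    CountDownLatch bothHoldFirst) {
        Thread t = new Thread(() -> {
            synchronized (first) {
                bothHoldFirst.countDown();
                try {
                    bothHoldFirst.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                synchronized (second) {
                    // Never reached: the other thread already owns `second`.
                }
            }
        }, name);
        t.setDaemon(true); // let the JVM exit even though the thread stays BLOCKED
        t.start();
    }

    // Creates two threads that acquire two monitors in opposite order and
    // returns how many threads the JVM reports as monitor-deadlocked.
    static int countDeadlockedThreads() {
        Object shutdownLock = new Object();
        Object promotionLock = new Object();
        CountDownLatch bothHoldFirst = new CountDownLatch(2);
        startLocker("sketch-shutdown", shutdownLock, promotionLock, bothHoldFirst);
        startLocker("sketch-promotion", promotionLock, shutdownLock, bothHoldFirst);

        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        for (int attempt = 0; attempt < 100; attempt++) {
            long[] ids = threads.findMonitorDeadlockedThreads();
            if (ids != null) {
                return ids.length;
            }
            try {
                Thread.sleep(50); // poll until both threads are blocked
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return 0;
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        System.out.println("deadlocked threads: " + countDeadlockedThreads());
    }
}
```

Each sketch thread ends up `BLOCKED` on a monitor owned by the other, which is the same shape as the "BLOCKED on ... owned by ..." line in the dump above.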

Created a separate bug to track the deadlock, since this bug tracks the other reasons the restore tests are failing: https://github.com/opensearch-project/OpenSearch/issues/11869

I wasn't able to repro this even after 1K iterations. Since there have been only 2 occurrences and the last one was almost 2 months ago, I am closing the issue. Feel free...

All 3 recent failures were for [org.opensearch.remotestore.RemoteStoreRestoreIT.testRTSRestoreWithNoDataPostRefreshPrimaryReplicaDown](https://build.ci.opensearch.org/job/gradle-check/36960/testReport/junit/org.opensearch.remotestore/RemoteStoreRestoreIT/testRTSRestoreWithNoDataPostRefreshPrimaryReplicaDown/). Failure trace:
```
com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=3006, name=opensearch[node_s2][remote_refresh_retry][T#1], state=RUNNABLE, group=TGRP-RemoteStoreRestoreIT]
    at __randomizedtesting.SeedInfo.seed([816E5637A5EABB73:F08AC3EEAC6BC269]:0)
Caused by: org.opensearch.core.concurrency.OpenSearchRejectedExecutionException: rejected execution of java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@7384c70f[Not...
```

I ran around 5000 iterations locally and cannot repro this. I will add a trace-logging annotation to this test so that PR build failures provide more detail to help debug.

> I am guessing the intent of an offline node is to execute offline tasks. The naming is confusing. A node that's offline is ... offline. I think a better...