[BUG] RemoteStoreRestoreIT tests are flaky - Missing cluster-manager, expected nodes

Open peternied opened this issue 1 year ago • 10 comments

Describe the bug

  • org.opensearch.remotestore.RemoteStoreRestoreIT.testRTSRestoreWithRefreshedDataPrimaryReplicaDown
  • org.opensearch.remotestore.RemoteStoreRestoreIT.testRemoteTranslogRestoreWithCommittedData
  • org.opensearch.remotestore.RemoteStoreRestoreIT.testRemoteTranslogRestoreWithRefreshedData
  • org.opensearch.remotestore.RemoteStoreRestoreIT.testRemoteTranslogRestoreWithNoDataPostCommit
  • org.opensearch.remotestore.RemoteStoreRestoreIT.testRTSRestoreWithNoDataPostRefreshPrimaryReplicaDown
  • org.opensearch.remotestore.RemoteStoreRestoreIT.testRTSRestoreDataOnlyInTranslog
  • org.opensearch.remotestore.RemoteStoreRestoreIT.testRTSRestoreWithCommittedDataPrimaryReplicaDown

These tests are flaky.

Stacktrace

This trace is consistent across all of the failing tests except testRTSRestoreWithRefreshedDataPrimaryReplicaDown:

java.lang.AssertionError: Missing cluster-manager, expected nodes: [{node_s1}{GD7uyrgiTxSKXQLlfKgK_A}{EpXKR9vyS6mfjExohbCtsA}{127.0.0.1}{127.0.0.1:39151}{d}{shard_indexing_pressure_enabled=true}, {node_s2}{QOKIN5VyR06merR9uOI4bw}{QteIuzBmRZWq6-fJZkbdEg}{127.0.0.1}{127.0.0.1:37443}{d}{shard_indexing_pressure_enabled=true}, {node_s0}{Kr_DbhdESPyQswOYSNJVQw}{f5hbAyrURuOHqIbtC_b8_Q}{127.0.0.1}{127.0.0.1:35393}{m}{shard_indexing_pressure_enabled=true}] and actual cluster states [cluster uuid: AzELHIGZT6uSizEPPUXcuA [committed: true]
version: 3
state uuid: borCjOpOQFWpk6JLeG0TmQ
from_diff: false
meta data version: 2
   coordination_metadata:
      term: 1
      last_committed_config: VotingConfiguration{Kr_DbhdESPyQswOYSNJVQw}
      last_accepted_config: VotingConfiguration{Kr_DbhdESPyQswOYSNJVQw}
      voting tombstones: []
metadata customs:
   repositories: {"test-remote-store-repo-2":{"type":"fs","settings":{"system_repository":"true","location":"/var/jenkins/workspace/gradle-check/search/server/build/testrun/internalClusterTest/temp/org.opensearch.remotestore.RemoteStoreRestoreIT_A166C2386E979736-001/tempDir-002/repos/SbGDsuuFWS"},"generation":-2,"pending_generation":-1},"test-remote-store-repo":{"type":"fs","settings":{"system_repository":"true","location":"/var/jenkins/workspace/gradle-check/search/server/build/testrun/internalClusterTest/temp/org.opensearch.remotestore.RemoteStoreRestoreIT_A166C2386E979736-001/tempDir-002/repos/DVrmjempuo"},"generation":-2,"pending_generation":-1}}   index-graveyard: IndexGraveyard[[]]
nodes: 
   {node_s0}{Kr_DbhdESPyQswOYSNJVQw}{f5hbAyrURuOHqIbtC_b8_Q}{127.0.0.1}{127.0.0.1:35393}{m}{shard_indexing_pressure_enabled=true}, local, cluster-manager
   {node_s2}{QOKIN5VyR06merR9uOI4bw}{QteIuzBmRZWq6-fJZkbdEg}{127.0.0.1}{127.0.0.1:37443}{d}{shard_indexing_pressure_enabled=true}
routing_table (version 1):
routing_nodes:
-----node_id[QOKIN5VyR06merR9uOI4bw][V]
---- unassigned
, cluster uuid: _na_ [committed: false]
version: 0
state uuid: DDhrUTuzS0-TkQaTHp03EQ
from_diff: false
meta data version: 0
   coordination_metadata:
      term: 0
      last_committed_config: VotingConfiguration{}
      last_accepted_config: VotingConfiguration{}
      voting tombstones: []
metadata customs:
   index-graveyard: IndexGraveyard[[]]
blocks: 
   _global_:
      1,state not recovered / initialized, blocks READ,WRITE,METADATA_READ,METADATA_WRITE,CREATE_INDEX      2,no cluster-manager, blocks METADATA_WRITE
nodes: 
   {node_s1}{GD7uyrgiTxSKXQLlfKgK_A}{EpXKR9vyS6mfjExohbCtsA}{127.0.0.1}{127.0.0.1:39151}{d}{shard_indexing_pressure_enabled=true}, local
routing_table (version 0):
routing_nodes:
-----node_id[GD7uyrgiTxSKXQLlfKgK_A][V]
---- unassigned
, cluster uuid: AzELHIGZT6uSizEPPUXcuA [committed: true]
version: 3
state uuid: borCjOpOQFWpk6JLeG0TmQ
from_diff: false
meta data version: 2
   coordination_metadata:
      term: 1
      last_committed_config: VotingConfiguration{Kr_DbhdESPyQswOYSNJVQw}
      last_accepted_config: VotingConfiguration{Kr_DbhdESPyQswOYSNJVQw}
      voting tombstones: []
metadata customs:
   repositories: {"test-remote-store-repo-2":{"type":"fs","settings":{"system_repository":"true","location":"/var/jenkins/workspace/gradle-check/search/server/build/testrun/internalClusterTest/temp/org.opensearch.remotestore.RemoteStoreRestoreIT_A166C2386E979736-001/tempDir-002/repos/SbGDsuuFWS"},"generation":-2,"pending_generation":-1},"test-remote-store-repo":{"type":"fs","settings":{"system_repository":"true","location":"/var/jenkins/workspace/gradle-check/search/server/build/testrun/internalClusterTest/temp/org.opensearch.remotestore.RemoteStoreRestoreIT_A166C2386E979736-001/tempDir-002/repos/DVrmjempuo"},"generation":-2,"pending_generation":-1}}   index-graveyard: IndexGraveyard[[]]
nodes: 
   {node_s0}{Kr_DbhdESPyQswOYSNJVQw}{f5hbAyrURuOHqIbtC_b8_Q}{127.0.0.1}{127.0.0.1:35393}{m}{shard_indexing_pressure_enabled=true}, cluster-manager
   {node_s2}{QOKIN5VyR06merR9uOI4bw}{QteIuzBmRZWq6-fJZkbdEg}{127.0.0.1}{127.0.0.1:37443}{d}{shard_indexing_pressure_enabled=true}, local
routing_table (version 1):
routing_nodes:
-----node_id[QOKIN5VyR06merR9uOI4bw][V]
---- unassigned
]

From testRTSRestoreWithRefreshedDataPrimaryReplicaDown

java.lang.AssertionError:  inconsistent generation 

To Reproduce

CI - https://build.ci.opensearch.org/job/gradle-check/29522/testReport/

Expected behavior

Test should always pass

peternied avatar Nov 03 '23 16:11 peternied

The most recent reported failure is due to the suite timeout being reached:

java.lang.Exception: Test abandoned because suite timeout was reached.
	at __randomizedtesting.SeedInfo.seed([ADC655A786161FFE]:0)

Found one blocked thread:

 12) Thread[id=4135, name=opensearch[node_s2][generic][T#3], state=BLOCKED, group=TGRP-RemoteStoreRestoreIT]
        at org.opensearch.index.shard.IndexShard$11.getSegmentInfosSnapshot(IndexShard.java:4768)
        at org.opensearch.index.shard.IndexShard.getSegmentInfosSnapshot(IndexShard.java:5113)
        at org.opensearch.index.shard.IndexShard.getLatestSegmentInfosAndCheckpoint(IndexShard.java:1676)
        at org.opensearch.index.shard.IndexShard.updateReplicationCheckpoint(IndexShard.java:4690)
        at org.opensearch.index.shard.IndexShard$ReplicationCheckpointUpdater.afterRefresh(IndexShard.java:4684)
        at org.apache.lucene.search.ReferenceManager.notifyRefreshListenersRefreshed(ReferenceManager.java:275)
        at org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:182)
        at org.apache.lucene.search.ReferenceManager.maybeRefreshBlocking(ReferenceManager.java:240)
        at org.opensearch.index.engine.InternalEngine.refresh(InternalEngine.java:1862)
        at org.opensearch.index.engine.InternalEngine.flush(InternalEngine.java:1977)
        at org.opensearch.index.engine.InternalEngine.recoverFromTranslogInternal(InternalEngine.java:589)
        at org.opensearch.index.engine.InternalEngine.recoverFromTranslog(InternalEngine.java:546)
        at org.opensearch.index.engine.InternalEngine.recoverFromTranslog(InternalEngine.java:146)
        at org.opensearch.index.shard.IndexShard.resetEngineToGlobalCheckpoint(IndexShard.java:4812)
        at org.opensearch.index.shard.IndexShard.lambda$updateShardState$4(IndexShard.java:720)
        at org.opensearch.index.shard.IndexShard$5.onResponse(IndexShard.java:4079)
        at org.opensearch.index.shard.IndexShard$5.onResponse(IndexShard.java:4049)
        at org.opensearch.index.shard.IndexShard.lambda$asyncBlockOperations$33(IndexShard.java:4000)
        at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82)
        at org.opensearch.index.shard.IndexShardOperationPermits$1.doRun(IndexShardOperationPermits.java:157)
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:911)
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
        at java.base/java.lang.Thread.run(Thread.java:1583)

linuxpi avatar Jan 12 '24 05:01 linuxpi

The test was not able to properly terminate node_s2, which lingered for ~20 minutes:

[2023-12-22T10:14:00,407][INFO ][o.o.t.InternalTestCluster] [testRTSRestoreWithRefreshedDataPrimaryReplicaDown] Closing filtered random node [node_s2] 
[2023-12-22T10:14:00,408][INFO ][o.o.n.Node               ] [testRTSRestoreWithRefreshedDataPrimaryReplicaDown] stopping ...
[2023-12-22T10:14:00,408][INFO ][o.o.c.c.Coordinator      ] [node_s2] cluster-manager node [{node_s0}{1JnXbN3XQLO0u-Q7V1Ro2g}{jBjkBSbZSL-3AlK2H7Z0gg}{127.0.0.1}{127.0.0.1:35721}{m}{shard_indexing_pressure_enabled=true}] failed, restarting discovery
org.opensearch.transport.NodeDisconnectedException: [node_s0][127.0.0.1:35721][disconnected] disconnected
[2023-12-22T10:14:00,409][INFO ][o.o.c.c.FollowersChecker ] [node_s0] FollowerChecker{discoveryNode={node_s2}{gTu5ZLOMSFCLMpDQTdtNMA}{nOLSTsBLR7-smx0D64pDvA}{127.0.0.1
...[truncated 1074993 chars]...
v8TB6wroMMHHnStA}{127.0.0.1}{127.0.0.1:38765}{d}{shard_indexing_pressure_enabled=true}]; discovery will continue using [127.0.0.1:34693] from hosts providers and [{node_s0}{1JnXbN3XQLO0u-Q7V1Ro2g}{b0Y7f3J-Q5KjKlgMGYW35Q}{127.0.0.1}{127.0.0.1:34693}{m}{shard_indexing_pressure_enabled=true}] from last-known cluster state; node term 1, last-accepted version 14 in term 1
[2023-12-22T10:24:28,614][WARN ][o.o.c.c.ClusterFormationFailureHelper] [node_s2] cluster-manager not discovered yet: have discovered [{node_s2}{gTu5ZLOMSFCLMpDQTdtNMA}{mEc3_Dv8TB6wroMMHHnStA}{127.0.0.1}{127.0.0.1:38765}{d}{shard_indexing_pressure_enabled=true}]; discovery will continue using [127.0.0.1:34693] from hosts providers and [{node_s0}{1JnXbN3XQLO0u-Q7V1Ro2g}{b0Y7f3J-Q5KjKlgMGYW35Q}{127.0.0.1}{127.0.0.1:34693}{m}{shard_indexing_pressure_enabled=true}] from last-known cluster state; node term 1, last-accepted version 14 in term 1
[2023-12-22T10:24:38,614][WARN ][o.o.c.c.ClusterFormationFailureHelper] [node_s2] cluster-manager not discovered yet: have discovered [{node_s2}{gTu5ZLOMSFCLMpDQTdtNMA}{mEc3_Dv8TB6wroMMHHnStA}{127.0.0.1}{127.0.0.1:38765}{d}{shard_indexing_pressure_enabled=true}]; discovery will continue using [127.0.0.1:34693] from hosts providers and [{node_s0}{1JnXbN3XQLO0u-Q7V1Ro2g}{b0Y7f3J-Q5KjKlgMGYW35Q}{127.0.0.1}{127.0.0.1:34693}{m}{shard_indexing_pressure_enabled=true}] from last-known cluster state; node term 1, last-accepted version 14 in term 1
[2023-12-22T10:24:48,615][WARN ][o.o.c.c.ClusterFormationFailureHelper] [node_s2] cluster-manager not discovered yet: have discovered [{node_s2}{gTu5ZLOMSFCLMpDQTdtNMA}{mEc3_Dv8TB6wroMMHHnStA}{127.0.0.1}{127.0.0.1:38765}{d}{shard_indexing_pressure_enabled=true}]; discovery will continue using [127.0.0.1:34693] from hosts providers and [{node_s0}{1JnXbN3XQLO0u-Q7V1Ro2g}{b0Y7f3J-Q5KjKlgMGYW35Q}{127.0.0.1}{127.0.0.1:34693}{m}{shard_indexing_pressure_enabled=true}] from last-known cluster state; node term 1, last-accepted version 14 in term 1
[2023-12-22T10:24:58,617][WARN ][o.o.c.c.ClusterFormationFailureHelper] [node_s2] cluster-manager not discovered yet: have discovered [{node_s2}{gTu5ZLOMSFCLMpDQTdtNMA}{mEc3_Dv8TB6wroMMHHnStA}{127.0.0.1}{127.0.0.1:38765}{d}{shard_indexing_pressure_enabled=true}]; discovery will continue using [127.0.0.1:34693] from hosts providers and [{node_s0}{1JnXbN3XQLO0u-Q7V1Ro2g}{b0Y7f3J-Q5KjKlgMGYW35Q}{127.0.0.1}{127.0.0.1:34693}{m}{shard_indexing_pressure_enabled=true}] from last-known cluster state; node term 1, last-accepted version 14 in term 1
[2023-12-22T10:25:08,617][WARN ][o.o.c.c.ClusterFormationFailureHelper] [node_s2] cluster-manager not discovered yet: have discovered [{node_s2}{gTu5ZLOMSFCLMpDQTdtNMA}{mEc3_Dv8TB6wroMMHHnStA}{127.0.0.1}{127.0.0.1:38765}{d}{shard_indexing_pressure_enabled=true}]; discovery will continue using [127.0.0.1:34693] from hosts providers and [{node_s0}{1JnXbN3XQLO0u-Q7V1Ro2g}{b0Y7f3J-Q5KjKlgMGYW35Q}{127.0.0.1}{127.0.0.1:34693}{m}{shard_indexing_pressure_enabled=true}] from last-known cluster state; node term 1, last-accepted version 14 in term 1
[2023-12-22T10:25:18,618][WARN ][o.o.c.c.ClusterFormationFailureHelper] [node_s2] cluster-manager not discovered yet: have discovered [{node_s2}{gTu5ZLOMSFCLMpDQTdtNMA}{mEc3_Dv8TB6wroMMHHnStA}{127.0.0.1}{127.0.0.1:38765}{d}{shard_indexing_pressure_enabled=true}]; discovery will continue using [127.0.0.1:34693] from hosts providers and [{node_s0}{1JnXbN3XQLO0u-Q7V1Ro2g}{b0Y7f3J-Q5KjKlgMGYW35Q}{127.0.0.1}{127.0.0.1:34693}{m}{shard_indexing_pressure_enabled=true}] from last-known cluster state; node term 1, last-accepted version 14 in term 1
[2023-12-22T10:25:28,618][WARN ][o.o.c.c.ClusterFormationFailureHelper] [node_s2] cluster-manager not discovered yet: have discovered [{node_s2}{gTu5ZLOMSFCLMpDQTdtNMA}{mEc3_Dv8TB6wroMMHHnStA}{127.0.0.1}{127.0.0.1:38765}{d}{shard_indexing_pressure_enabled=true}]; discovery will continue using [127.0.0.1:34693] from hosts providers and [{node_s0}{1JnXbN3XQLO0u-Q7V1Ro2g}{b0Y7f3J-Q5KjKlgMGYW35Q}{127.0.0.1}{127.0.0.1:34693}{m}{shard_indexing_pressure_enabled=true}] from last-known cluster state; node term 1, last-accepted version 14 in term 1

....


[2023-12-22T10:33:28,637][WARN ][o.o.c.c.ClusterFormationFailureHelper] [node_s2] cluster-manager not discovered yet: have discovered [{node_s2}{gTu5ZLOMSFCLMpDQTdtNMA}{mEc3_Dv8TB6wroMMHHnStA}{127.0.0.1}{127.0.0.1:38765}{d}{shard_indexing_pressure_enabled=true}]; discovery will continue using [127.0.0.1:34693] from hosts providers and [{node_s0}{1JnXbN3XQLO0u-Q7V1Ro2g}{b0Y7f3J-Q5KjKlgMGYW35Q}{127.0.0.1}{127.0.0.1:34693}{m}{shard_indexing_pressure_enabled=true}] from last-known cluster state; node term 1, last-accepted version 14 in term 1
[2023-12-22T10:33:38,637][WARN ][o.o.c.c.ClusterFormationFailureHelper] [node_s2] cluster-manager not discovered yet: have discovered [{node_s2}{gTu5ZLOMSFCLMpDQTdtNMA}{mEc3_Dv8TB6wroMMHHnStA}{127.0.0.1}{127.0.0.1:38765}{d}{shard_indexing_pressure_enabled=true}]; discovery will continue using [127.0.0.1:34693] from hosts providers and [{node_s0}{1JnXbN3XQLO0u-Q7V1Ro2g}{b0Y7f3J-Q5KjKlgMGYW35Q}{127.0.0.1}{127.0.0.1:34693}{m}{shard_indexing_pressure_enabled=true}] from last-known cluster state; node term 1, last-accepted version 14 in term 1
[2023-12-22T10:33:48,637][WARN ][o.o.c.c.ClusterFormationFailureHelper] [node_s2] cluster-manager not discovered yet: have discovered [{node_s2}{gTu5ZLOMSFCLMpDQTdtNMA}{mEc3_Dv8TB6wroMMHHnStA}{127.0.0.1}{127.0.0.1:38765}{d}{shard_indexing_pressure_enabled=true}]; discovery will continue using [127.0.0.1:34693] from hosts providers and [{node_s0}{1JnXbN3XQLO0u-Q7V1Ro2g}{b0Y7f3J-Q5KjKlgMGYW35Q}{127.0.0.1}{127.0.0.1:34693}{m}{shard_indexing_pressure_enabled=true}] from last-known cluster state; node term 1, last-accepted version 14 in term 1
[2023-12-22T10:33:58,638][WARN ][o.o.c.c.ClusterFormationFailureHelper] [node_s2] cluster-manager not discovered yet: have discovered [{node_s2}{gTu5ZLOMSFCLMpDQTdtNMA}{mEc3_Dv8TB6wroMMHHnStA}{127.0.0.1}{127.0.0.1:38765}{d}{shard_indexing_pressure_enabled=true}]; discovery will continue using [127.0.0.1:34693] from hosts providers and [{node_s0}{1JnXbN3XQLO0u-Q7V1Ro2g}{b0Y7f3J-Q5KjKlgMGYW35Q}{127.0.0.1}{127.0.0.1:34693}{m}{shard_indexing_pressure_enabled=true}] from last-known cluster state; node term 1, last-accepted version 14 in term 1
[2023-12-22T10:33:58,924][INFO ][o.o.n.Node               ] [testRTSRestoreWithNoDataPostCommitPrimaryReplicaDown] stopped
[2023-12-22T10:33:58,924][INFO ][o.o.n.Node               ] [testRTSRestoreWithNoDataPostCommitPrimaryReplicaDown] closing ...
[2023-12-22T10:33:58,931][INFO ][o.o.n.Node               ] [testRTSRestoreWithNoDataPostCommitPrimaryReplicaDown] closed
[2023-12-22T10:33:59,438][INFO ][o.o.r.RemoteStoreRestoreIT] [testRTSRestoreWithNoDataPostCommitPrimaryReplicaDown] [RemoteStoreRestoreIT#testRTSRestoreWithNoDataPostCommitPrimaryReplicaDown]: cleaning up after test
[2023-12-22T10:33:59,439][INFO ][o.o.r.RemoteStoreRestoreIT] [testRTSRestoreWithNoDataPostCommitPrimaryReplicaDown] after test

linuxpi avatar Jan 12 '24 05:01 linuxpi

Looks like the thread was blocked on

https://github.com/opensearch-project/OpenSearch/blob/5c82ab885a876d659c9714c3b080488777506027/server/src/main/java/org/opensearch/index/shard/IndexShard.java#L4752-L4763

linuxpi avatar Jan 12 '24 05:01 linuxpi

We terminate the nodes in the order replica then primary, but we only check shard 0. The node holding the replica of shard 0 would also have held primaries for other shards, so terminating it leads to relocation; during relocation we create the read-only engine (code pointer in the previous comment), which is where our locks block the thread. A toy illustration of this shard layout follows the code link below.

https://github.com/opensearch-project/OpenSearch/blob/5c82ab885a876d659c9714c3b080488777506027/server/src/internalClusterTest/java/org/opensearch/remotestore/RemoteStoreRestoreIT.java#L137-L156
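
To make the shard-layout reasoning concrete, here is a toy, self-contained model (plain Java, not the actual test or OpenSearch code; the node names and the 3-shard layout are assumptions) of why stopping the node that holds the replica of shard 0 also takes down primaries of other shards and forces promotions:

import java.util.List;

// Toy model of the situation described above; node names and shard layout are illustrative.
public class ShardLayoutExample {
    record Shard(int id, String primaryNode, String replicaNode) {}

    public static void main(String[] args) {
        // A plausible layout for an index with 3 shards and 1 replica across two data nodes.
        List<Shard> shards = List.of(
            new Shard(0, "node_s1", "node_s2"),
            new Shard(1, "node_s2", "node_s1"),   // node_s2 is the PRIMARY here
            new Shard(2, "node_s2", "node_s1"));  // ...and here

        // The test only looks at shard 0 and stops its replica node first.
        String replicaNodeOfShard0 = shards.get(0).replicaNode();   // node_s2

        // But that same node holds primaries of other shards, so stopping it
        // forces replica-to-primary promotion (and the engine reset seen in the trace).
        shards.stream()
              .filter(s -> s.primaryNode().equals(replicaNodeOfShard0))
              .forEach(s -> System.out.println("stopping " + replicaNodeOfShard0
                      + " forces promotion of shard " + s.id() + " on " + s.replicaNode()));
    }
}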

linuxpi avatar Jan 12 '24 05:01 linuxpi

Index shutdown and replica-to-primary promotion are causing a deadlock:

"opensearch[node_s2][generic][T#3]" ID=4135 BLOCKED on java.lang.Object@3c0fa470 owned by "opensearch[node_s2][indices_shutdown][T#1]" ID=4183
	at app//org.opensearch.index.shard.IndexShard$11.getSegmentInfosSnapshot(IndexShard.java:4768)
	- blocked on java.lang.Object@3c0fa470
	at app//org.opensearch.index.shard.IndexShard.getSegmentInfosSnapshot(IndexShard.java:5113)
	at app//org.opensearch.index.shard.IndexShard.getLatestSegmentInfosAndCheckpoint(IndexShard.java:1676)
	at app//org.opensearch.index.shard.IndexShard.updateReplicationCheckpoint(IndexShard.java:4690)
	at app//org.opensearch.index.shard.IndexShard$ReplicationCheckpointUpdater.afterRefresh(IndexShard.java:4684)
	at app//org.apache.lucene.search.ReferenceManager.notifyRefreshListenersRefreshed(ReferenceManager.java:275)
	at app//org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:182)
	at app//org.apache.lucene.search.ReferenceManager.maybeRefreshBlocking(ReferenceManager.java:240)
	at app//org.opensearch.index.engine.InternalEngine.refresh(InternalEngine.java:1862)
	at app//org.opensearch.index.engine.InternalEngine.flush(InternalEngine.java:1977)
	at app//org.opensearch.index.engine.InternalEngine.recoverFromTranslogInternal(InternalEngine.java:589)
	at app//org.opensearch.index.engine.InternalEngine.recoverFromTranslog(InternalEngine.java:546)
	at app//org.opensearch.index.engine.InternalEngine.recoverFromTranslog(InternalEngine.java:146)
	at app//org.opensearch.index.shard.IndexShard.resetEngineToGlobalCheckpoint(IndexShard.java:4812)
	at app//org.opensearch.index.shard.IndexShard.lambda$updateShardState$4(IndexShard.java:720)
	at app//org.opensearch.index.shard.IndexShard$$Lambda/0x00007f45aca36238.run(Unknown Source)
	at app//org.opensearch.index.shard.IndexShard$5.onResponse(IndexShard.java:4079)
	at app//org.opensearch.index.shard.IndexShard$5.onResponse(IndexShard.java:4049)
	at app//org.opensearch.index.shard.IndexShard.lambda$asyncBlockOperations$33(IndexShard.java:4000)
	at app//org.opensearch.index.shard.IndexShard$$Lambda/0x00007f45aca11738.accept(Unknown Source)
	at app//org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82)
	at app//org.opensearch.index.shard.IndexShardOperationPermits$1.doRun(IndexShardOperationPermits.java:157)
	at app//org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:911)
	at app//org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
	at [email protected]/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at [email protected]/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at [email protected]/java.lang.Thread.runWith(Thread.java:1596)
	at [email protected]/java.lang.Thread.run(Thread.java:1583)
	Locked synchronizers:
	- java.util.concurrent.locks.ReentrantLock$NonfairSync@28675992
	- java.util.concurrent.ThreadPoolExecutor$Worker@1197c3a3
	- java.util.concurrent.locks.ReentrantLock$NonfairSync@57ac517a

  20) Thread[id=4183, name=opensearch[node_s2][indices_shutdown][T#1], state=WAITING, group=TGRP-RemoteStoreRestoreIT]
        at java.base/jdk.internal.misc.Unsafe.park(Native Method)
        at java.base/java.util.concurrent.locks.LockSupport.park(LockSupport.java:221)
        at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:754)
        at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:990)
        at java.base/java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:959)
        at org.opensearch.common.util.concurrent.ReleasableLock.acquire(ReleasableLock.java:69)
        at org.opensearch.index.engine.Engine.close(Engine.java:2070)
        at org.opensearch.common.util.io.IOUtils.close(IOUtils.java:89)
        at org.opensearch.common.util.io.IOUtils.close(IOUtils.java:131)
        at org.opensearch.common.util.io.IOUtils.close(IOUtils.java:81)
        at org.opensearch.index.shard.IndexShard$11.close(IndexShard.java:4785)
        at org.opensearch.index.engine.Engine.flushAndClose(Engine.java:2059)
        at org.opensearch.index.shard.IndexShard.close(IndexShard.java:2003)
        at org.opensearch.index.IndexService.closeShard(IndexService.java:642)
        at org.opensearch.index.IndexService.removeShard(IndexService.java:618)
        at org.opensearch.index.IndexService.close(IndexService.java:389)
        at org.opensearch.indices.IndicesService.removeIndex(IndicesService.java:1047)
        at org.opensearch.indices.IndicesService.lambda$doStop$3(IndicesService.java:542)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
        at java.base/java.lang.Thread.run(Thread.java:1583)

In the IndexShard.close flow we acquire engineMutex and then try to acquire the write lock on the engine. In the replica-to-primary promotion flow we acquire the read lock on the engine first and then try to acquire engineMutex. The read and write locks are shared via a ReentrantReadWriteLock, so the two flows deadlock against each other. A minimal sketch of this lock ordering is below.
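
A minimal, self-contained sketch of that lock ordering (plain Java, not OpenSearch code; the names only mirror engineMutex and the engine's ReentrantReadWriteLock), which reproduces the WAITING/BLOCKED pair seen in the thread dump above:

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Demonstrates the deadlock: one thread takes engineMutex then the write lock,
// the other takes the read lock then engineMutex.
public class LockOrderingDeadlock {
    private static final Object engineMutex = new Object();
    private static final ReentrantReadWriteLock engineLock = new ReentrantReadWriteLock();

    public static void main(String[] args) throws InterruptedException {
        CountDownLatch bothRunning = new CountDownLatch(2);

        // IndexShard.close flow: engineMutex -> engine write lock (Engine.close).
        Thread shutdown = new Thread(() -> {
            synchronized (engineMutex) {
                bothRunning.countDown();
                awaitQuietly(bothRunning);
                engineLock.writeLock().lock();   // parks forever: read lock held by promotion thread
                engineLock.writeLock().unlock();
            }
        }, "indices_shutdown");

        // Promotion flow: engine read lock -> engineMutex (engine reset -> refresh ->
        // getSegmentInfosSnapshot, which needs the mutex per the analysis above).
        Thread promotion = new Thread(() -> {
            engineLock.readLock().lock();
            try {
                bothRunning.countDown();
                awaitQuietly(bothRunning);
                synchronized (engineMutex) { }   // blocks forever: engineMutex held by shutdown thread
            } finally {
                engineLock.readLock().unlock();
            }
        }, "generic");

        shutdown.setDaemon(true);
        promotion.setDaemon(true);
        shutdown.start();
        promotion.start();
        Thread.sleep(2_000);
        System.out.println(shutdown.getName() + ": " + shutdown.getState());   // WAITING (write lock)
        System.out.println(promotion.getName() + ": " + promotion.getState()); // BLOCKED (engineMutex)
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}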

linuxpi avatar Jan 12 '24 08:01 linuxpi

Created a separate issue to track the deadlock, since this issue is tracking the other reasons the restore tests are failing.

https://github.com/opensearch-project/OpenSearch/issues/11869

linuxpi avatar Jan 12 '24 12:01 linuxpi

I wasn't able to repro this even after 1K iterations. Since there have been only 2 occurrences and the last one was almost 2 months ago, closing the issue. Feel free to reopen if you encounter the same issue again.

linuxpi avatar Feb 05 '24 17:02 linuxpi

org.opensearch.remotestore.RemoteStoreRestoreIT.testRTSRestoreWithNoDataPostRefreshPrimaryReplicaDown failed again here - https://github.com/opensearch-project/OpenSearch/pull/12252#issuecomment-2010136870

com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=3006, name=opensearch[node_s2][remote_refresh_retry][T#1], state=RUNNABLE, group=TGRP-RemoteStoreRestoreIT]
	at __randomizedtesting.SeedInfo.seed([816E5637A5EABB73:F08AC3EEAC6BC269]:0)
Caused by: org.opensearch.core.concurrency.OpenSearchRejectedExecutionException: rejected execution of java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@7384c70f[Not completed, task = java.util.concurrent.Executors$RunnableAdapter@57463c3d[Wrapped task = [threaded] org.opensearch.index.shard.ReleasableRetryableRefreshListener$$Lambda/0x00007ff968b34230@56239546]] on org.opensearch.threadpool.Scheduler$SafeScheduledThreadPoolExecutor@294dab55[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 13]
	at __randomizedtesting.SeedInfo.seed([816E5637A5EABB73]:0)
	at app//org.opensearch.common.util.concurrent.OpenSearchAbortPolicy.rejectedExecution(OpenSearchAbortPolicy.java:67)
	at [email protected]/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:841)
	at [email protected]/java.util.concurrent.ScheduledThreadPoolExecutor.delayedExecute(ScheduledThreadPoolExecutor.java:340)
	at [email protected]/java.util.concurrent.ScheduledThreadPoolExecutor.schedule(ScheduledThreadPoolExecutor.java:562)
	at app//org.opensearch.threadpool.ThreadPool.schedule(ThreadPool.java:469)
	at app//org.opensearch.index.shard.ReleasableRetryableRefreshListener.scheduleRetry(ReleasableRetryableRefreshListener.java:125)
	at app//org.opensearch.index.shard.ReleasableRetryableRefreshListener.scheduleRetry(ReleasableRetryableRefreshListener.java:178)
	at app//org.opensearch.index.shard.ReleasableRetryableRefreshListener.runAfterRefreshWithPermit(ReleasableRetryableRefreshListener.java:167)
	at app//org.opensearch.index.shard.ReleasableRetryableRefreshListener.lambda$scheduleRetry$2(ReleasableRetryableRefreshListener.java:126)
	at app//org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:854)
	at [email protected]/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at [email protected]/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at [email protected]/java.lang.Thread.run(Thread.java:1583)

shwetathareja avatar Mar 20 '24 17:03 shwetathareja

All 3 recent failures were for

org.opensearch.remotestore.RemoteStoreRestoreIT.testRTSRestoreWithNoDataPostRefreshPrimaryReplicaDown

Failure trace:

com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=3006, name=opensearch[node_s2][remote_refresh_retry][T#1], state=RUNNABLE, group=TGRP-RemoteStoreRestoreIT]
	at __randomizedtesting.SeedInfo.seed([816E5637A5EABB73:F08AC3EEAC6BC269]:0)
Caused by: org.opensearch.core.concurrency.OpenSearchRejectedExecutionException: rejected execution of java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@7384c70f[Not completed, task = java.util.concurrent.Executors$RunnableAdapter@57463c3d[Wrapped task = [threaded] org.opensearch.index.shard.ReleasableRetryableRefreshListener$$Lambda/0x00007ff968b34230@56239546]] on org.opensearch.threadpool.Scheduler$SafeScheduledThreadPoolExecutor@294dab55[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 13]
	at __randomizedtesting.SeedInfo.seed([816E5637A5EABB73]:0)
	at app//org.opensearch.common.util.concurrent.OpenSearchAbortPolicy.rejectedExecution(OpenSearchAbortPolicy.java:67)
	at [email protected]/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:841)
	at [email protected]/java.util.concurrent.ScheduledThreadPoolExecutor.delayedExecute(ScheduledThreadPoolExecutor.java:340)
	at [email protected]/java.util.concurrent.ScheduledThreadPoolExecutor.schedule(ScheduledThreadPoolExecutor.java:562)
	at app//org.opensearch.threadpool.ThreadPool.schedule(ThreadPool.java:469)
	at app//org.opensearch.index.shard.ReleasableRetryableRefreshListener.scheduleRetry(ReleasableRetryableRefreshListener.java:125)
	at app//org.opensearch.index.shard.ReleasableRetryableRefreshListener.scheduleRetry(ReleasableRetryableRefreshListener.java:178)
	at app//org.opensearch.index.shard.ReleasableRetryableRefreshListener.runAfterRefreshWithPermit(ReleasableRetryableRefreshListener.java:167)
	at app//org.opensearch.index.shard.ReleasableRetryableRefreshListener.lambda$scheduleRetry$2(ReleasableRetryableRefreshListener.java:126)
	at app//org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:854)
	at [email protected]/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at [email protected]/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at [email protected]/java.lang.Thread.run(Thread.java:1583)
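
The exception itself is easy to reproduce in isolation: scheduling on an already-terminated scheduler is rejected by the abort policy. Below is a minimal sketch (plain JDK classes, not the OpenSearch ThreadPool or ReleasableRetryableRefreshListener) of the shutdown race the trace suggests, where a refresh retry is scheduled after the node's scheduler has terminated:

import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Scheduling on a terminated scheduler throws RejectedExecutionException,
// analogous to OpenSearchAbortPolicy rejecting the remote_refresh_retry task above.
public class ScheduleAfterShutdown {
    public static void main(String[] args) throws InterruptedException {
        ScheduledThreadPoolExecutor scheduler = new ScheduledThreadPoolExecutor(1);

        // Node shutdown terminates the scheduler first...
        scheduler.shutdown();
        scheduler.awaitTermination(1, TimeUnit.SECONDS);

        // ...then a refresh listener still tries to schedule its retry.
        try {
            scheduler.schedule(() -> System.out.println("retry remote refresh"),
                    100, TimeUnit.MILLISECONDS);
        } catch (RejectedExecutionException e) {
            System.out.println("rejected: " + e);   // uncaught in the test, which fails it
        }
    }
}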

linuxpi avatar Apr 22 '24 17:04 linuxpi

Ran around 5000 iterations locally and I cannot repro this. Will add a trace logging annotation on this test to get more details from PR build failures to help debug.

linuxpi avatar May 01 '24 17:05 linuxpi