[BUG] RemoteStoreRestoreIT tests are flaky - Missing cluster-manager, expected nodes
Describe the bug
The following tests are flaky:
- org.opensearch.remotestore.RemoteStoreRestoreIT.testRTSRestoreWithRefreshedDataPrimaryReplicaDown
- org.opensearch.remotestore.RemoteStoreRestoreIT.testRemoteTranslogRestoreWithCommittedData
- org.opensearch.remotestore.RemoteStoreRestoreIT.testRemoteTranslogRestoreWithRefreshedData
- org.opensearch.remotestore.RemoteStoreRestoreIT.testRemoteTranslogRestoreWithNoDataPostCommit
- org.opensearch.remotestore.RemoteStoreRestoreIT.testRTSRestoreWithNoDataPostRefreshPrimaryReplicaDown
- org.opensearch.remotestore.RemoteStoreRestoreIT.testRTSRestoreDataOnlyInTranslog
- org.opensearch.remotestore.RemoteStoreRestoreIT.testRTSRestoreWithCommittedDataPrimaryReplicaDown
Stacktrace
The following stack trace is consistent across all of the failing tests except testRTSRestoreWithRefreshedDataPrimaryReplicaDown:
java.lang.AssertionError: Missing cluster-manager, expected nodes: [{node_s1}{GD7uyrgiTxSKXQLlfKgK_A}{EpXKR9vyS6mfjExohbCtsA}{127.0.0.1}{127.0.0.1:39151}{d}{shard_indexing_pressure_enabled=true}, {node_s2}{QOKIN5VyR06merR9uOI4bw}{QteIuzBmRZWq6-fJZkbdEg}{127.0.0.1}{127.0.0.1:37443}{d}{shard_indexing_pressure_enabled=true}, {node_s0}{Kr_DbhdESPyQswOYSNJVQw}{f5hbAyrURuOHqIbtC_b8_Q}{127.0.0.1}{127.0.0.1:35393}{m}{shard_indexing_pressure_enabled=true}] and actual cluster states [cluster uuid: AzELHIGZT6uSizEPPUXcuA [committed: true]
version: 3
state uuid: borCjOpOQFWpk6JLeG0TmQ
from_diff: false
meta data version: 2
coordination_metadata:
term: 1
last_committed_config: VotingConfiguration{Kr_DbhdESPyQswOYSNJVQw}
last_accepted_config: VotingConfiguration{Kr_DbhdESPyQswOYSNJVQw}
voting tombstones: []
metadata customs:
repositories: {"test-remote-store-repo-2":{"type":"fs","settings":{"system_repository":"true","location":"/var/jenkins/workspace/gradle-check/search/server/build/testrun/internalClusterTest/temp/org.opensearch.remotestore.RemoteStoreRestoreIT_A166C2386E979736-001/tempDir-002/repos/SbGDsuuFWS"},"generation":-2,"pending_generation":-1},"test-remote-store-repo":{"type":"fs","settings":{"system_repository":"true","location":"/var/jenkins/workspace/gradle-check/search/server/build/testrun/internalClusterTest/temp/org.opensearch.remotestore.RemoteStoreRestoreIT_A166C2386E979736-001/tempDir-002/repos/DVrmjempuo"},"generation":-2,"pending_generation":-1}} index-graveyard: IndexGraveyard[[]]
nodes:
{node_s0}{Kr_DbhdESPyQswOYSNJVQw}{f5hbAyrURuOHqIbtC_b8_Q}{127.0.0.1}{127.0.0.1:35393}{m}{shard_indexing_pressure_enabled=true}, local, cluster-manager
{node_s2}{QOKIN5VyR06merR9uOI4bw}{QteIuzBmRZWq6-fJZkbdEg}{127.0.0.1}{127.0.0.1:37443}{d}{shard_indexing_pressure_enabled=true}
routing_table (version 1):
routing_nodes:
-----node_id[QOKIN5VyR06merR9uOI4bw][V]
---- unassigned
, cluster uuid: _na_ [committed: false]
version: 0
state uuid: DDhrUTuzS0-TkQaTHp03EQ
from_diff: false
meta data version: 0
coordination_metadata:
term: 0
last_committed_config: VotingConfiguration{}
last_accepted_config: VotingConfiguration{}
voting tombstones: []
metadata customs:
index-graveyard: IndexGraveyard[[]]
blocks:
_global_:
1,state not recovered / initialized, blocks READ,WRITE,METADATA_READ,METADATA_WRITE,CREATE_INDEX 2,no cluster-manager, blocks METADATA_WRITE
nodes:
{node_s1}{GD7uyrgiTxSKXQLlfKgK_A}{EpXKR9vyS6mfjExohbCtsA}{127.0.0.1}{127.0.0.1:39151}{d}{shard_indexing_pressure_enabled=true}, local
routing_table (version 0):
routing_nodes:
-----node_id[GD7uyrgiTxSKXQLlfKgK_A][V]
---- unassigned
, cluster uuid: AzELHIGZT6uSizEPPUXcuA [committed: true]
version: 3
state uuid: borCjOpOQFWpk6JLeG0TmQ
from_diff: false
meta data version: 2
coordination_metadata:
term: 1
last_committed_config: VotingConfiguration{Kr_DbhdESPyQswOYSNJVQw}
last_accepted_config: VotingConfiguration{Kr_DbhdESPyQswOYSNJVQw}
voting tombstones: []
metadata customs:
repositories: {"test-remote-store-repo-2":{"type":"fs","settings":{"system_repository":"true","location":"/var/jenkins/workspace/gradle-check/search/server/build/testrun/internalClusterTest/temp/org.opensearch.remotestore.RemoteStoreRestoreIT_A166C2386E979736-001/tempDir-002/repos/SbGDsuuFWS"},"generation":-2,"pending_generation":-1},"test-remote-store-repo":{"type":"fs","settings":{"system_repository":"true","location":"/var/jenkins/workspace/gradle-check/search/server/build/testrun/internalClusterTest/temp/org.opensearch.remotestore.RemoteStoreRestoreIT_A166C2386E979736-001/tempDir-002/repos/DVrmjempuo"},"generation":-2,"pending_generation":-1}} index-graveyard: IndexGraveyard[[]]
nodes:
{node_s0}{Kr_DbhdESPyQswOYSNJVQw}{f5hbAyrURuOHqIbtC_b8_Q}{127.0.0.1}{127.0.0.1:35393}{m}{shard_indexing_pressure_enabled=true}, cluster-manager
{node_s2}{QOKIN5VyR06merR9uOI4bw}{QteIuzBmRZWq6-fJZkbdEg}{127.0.0.1}{127.0.0.1:37443}{d}{shard_indexing_pressure_enabled=true}, local
routing_table (version 1):
routing_nodes:
-----node_id[QOKIN5VyR06merR9uOI4bw][V]
---- unassigned
]
From testRTSRestoreWithRefreshedDataPrimaryReplicaDown:
java.lang.AssertionError: inconsistent generation
To Reproduce
CI - https://build.ci.opensearch.org/job/gradle-check/29522/testReport/
Expected behavior
The test should always pass.
The most recent reported failure is due to a suite timeout:
java.lang.Exception: Test abandoned because suite timeout was reached.
at __randomizedtesting.SeedInfo.seed([ADC655A786161FFE]:0)
Found one blocked thread:
12) Thread[id=4135, name=opensearch[node_s2][generic][T#3], state=BLOCKED, group=TGRP-RemoteStoreRestoreIT]
at org.opensearch.index.shard.IndexShard$11.getSegmentInfosSnapshot(IndexShard.java:4768)
at org.opensearch.index.shard.IndexShard.getSegmentInfosSnapshot(IndexShard.java:5113)
at org.opensearch.index.shard.IndexShard.getLatestSegmentInfosAndCheckpoint(IndexShard.java:1676)
at org.opensearch.index.shard.IndexShard.updateReplicationCheckpoint(IndexShard.java:4690)
at org.opensearch.index.shard.IndexShard$ReplicationCheckpointUpdater.afterRefresh(IndexShard.java:4684)
at org.apache.lucene.search.ReferenceManager.notifyRefreshListenersRefreshed(ReferenceManager.java:275)
at org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:182)
at org.apache.lucene.search.ReferenceManager.maybeRefreshBlocking(ReferenceManager.java:240)
at org.opensearch.index.engine.InternalEngine.refresh(InternalEngine.java:1862)
at org.opensearch.index.engine.InternalEngine.flush(InternalEngine.java:1977)
at org.opensearch.index.engine.InternalEngine.recoverFromTranslogInternal(InternalEngine.java:589)
at org.opensearch.index.engine.InternalEngine.recoverFromTranslog(InternalEngine.java:546)
at org.opensearch.index.engine.InternalEngine.recoverFromTranslog(InternalEngine.java:146)
at org.opensearch.index.shard.IndexShard.resetEngineToGlobalCheckpoint(IndexShard.java:4812)
at org.opensearch.index.shard.IndexShard.lambda$updateShardState$4(IndexShard.java:720)
at org.opensearch.index.shard.IndexShard$5.onResponse(IndexShard.java:4079)
at org.opensearch.index.shard.IndexShard$5.onResponse(IndexShard.java:4049)
at org.opensearch.index.shard.IndexShard.lambda$asyncBlockOperations$33(IndexShard.java:4000)
at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82)
at org.opensearch.index.shard.IndexShardOperationPermits$1.doRun(IndexShardOperationPermits.java:157)
at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:911)
at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
at java.base/java.lang.Thread.run(Thread.java:1583)
The test was not able to properly terminate node_s2, which lingered on for ~20 minutes:
[2023-12-22T10:14:00,407][INFO ][o.o.t.InternalTestCluster] [testRTSRestoreWithRefreshedDataPrimaryReplicaDown] Closing filtered random node [node_s2]
[2023-12-22T10:14:00,408][INFO ][o.o.n.Node ] [testRTSRestoreWithRefreshedDataPrimaryReplicaDown] stopping ...
[2023-12-22T10:14:00,408][INFO ][o.o.c.c.Coordinator ] [node_s2] cluster-manager node [{node_s0}{1JnXbN3XQLO0u-Q7V1Ro2g}{jBjkBSbZSL-3AlK2H7Z0gg}{127.0.0.1}{127.0.0.1:35721}{m}{shard_indexing_pressure_enabled=true}] failed, restarting discovery
org.opensearch.transport.NodeDisconnectedException: [node_s0][127.0.0.1:35721][disconnected] disconnected
[2023-12-22T10:14:00,409][INFO ][o.o.c.c.FollowersChecker ] [node_s0] FollowerChecker{discoveryNode={node_s2}{gTu5ZLOMSFCLMpDQTdtNMA}{nOLSTsBLR7-smx0D64pDvA}{127.0.0.1
...[truncated 1074993 chars]...
v8TB6wroMMHHnStA}{127.0.0.1}{127.0.0.1:38765}{d}{shard_indexing_pressure_enabled=true}]; discovery will continue using [127.0.0.1:34693] from hosts providers and [{node_s0}{1JnXbN3XQLO0u-Q7V1Ro2g}{b0Y7f3J-Q5KjKlgMGYW35Q}{127.0.0.1}{127.0.0.1:34693}{m}{shard_indexing_pressure_enabled=true}] from last-known cluster state; node term 1, last-accepted version 14 in term 1
[2023-12-22T10:24:28,614][WARN ][o.o.c.c.ClusterFormationFailureHelper] [node_s2] cluster-manager not discovered yet: have discovered [{node_s2}{gTu5ZLOMSFCLMpDQTdtNMA}{mEc3_Dv8TB6wroMMHHnStA}{127.0.0.1}{127.0.0.1:38765}{d}{shard_indexing_pressure_enabled=true}]; discovery will continue using [127.0.0.1:34693] from hosts providers and [{node_s0}{1JnXbN3XQLO0u-Q7V1Ro2g}{b0Y7f3J-Q5KjKlgMGYW35Q}{127.0.0.1}{127.0.0.1:34693}{m}{shard_indexing_pressure_enabled=true}] from last-known cluster state; node term 1, last-accepted version 14 in term 1
[2023-12-22T10:24:38,614][WARN ][o.o.c.c.ClusterFormationFailureHelper] [node_s2] cluster-manager not discovered yet: have discovered [{node_s2}{gTu5ZLOMSFCLMpDQTdtNMA}{mEc3_Dv8TB6wroMMHHnStA}{127.0.0.1}{127.0.0.1:38765}{d}{shard_indexing_pressure_enabled=true}]; discovery will continue using [127.0.0.1:34693] from hosts providers and [{node_s0}{1JnXbN3XQLO0u-Q7V1Ro2g}{b0Y7f3J-Q5KjKlgMGYW35Q}{127.0.0.1}{127.0.0.1:34693}{m}{shard_indexing_pressure_enabled=true}] from last-known cluster state; node term 1, last-accepted version 14 in term 1
[2023-12-22T10:24:48,615][WARN ][o.o.c.c.ClusterFormationFailureHelper] [node_s2] cluster-manager not discovered yet: have discovered [{node_s2}{gTu5ZLOMSFCLMpDQTdtNMA}{mEc3_Dv8TB6wroMMHHnStA}{127.0.0.1}{127.0.0.1:38765}{d}{shard_indexing_pressure_enabled=true}]; discovery will continue using [127.0.0.1:34693] from hosts providers and [{node_s0}{1JnXbN3XQLO0u-Q7V1Ro2g}{b0Y7f3J-Q5KjKlgMGYW35Q}{127.0.0.1}{127.0.0.1:34693}{m}{shard_indexing_pressure_enabled=true}] from last-known cluster state; node term 1, last-accepted version 14 in term 1
[2023-12-22T10:24:58,617][WARN ][o.o.c.c.ClusterFormationFailureHelper] [node_s2] cluster-manager not discovered yet: have discovered [{node_s2}{gTu5ZLOMSFCLMpDQTdtNMA}{mEc3_Dv8TB6wroMMHHnStA}{127.0.0.1}{127.0.0.1:38765}{d}{shard_indexing_pressure_enabled=true}]; discovery will continue using [127.0.0.1:34693] from hosts providers and [{node_s0}{1JnXbN3XQLO0u-Q7V1Ro2g}{b0Y7f3J-Q5KjKlgMGYW35Q}{127.0.0.1}{127.0.0.1:34693}{m}{shard_indexing_pressure_enabled=true}] from last-known cluster state; node term 1, last-accepted version 14 in term 1
[2023-12-22T10:25:08,617][WARN ][o.o.c.c.ClusterFormationFailureHelper] [node_s2] cluster-manager not discovered yet: have discovered [{node_s2}{gTu5ZLOMSFCLMpDQTdtNMA}{mEc3_Dv8TB6wroMMHHnStA}{127.0.0.1}{127.0.0.1:38765}{d}{shard_indexing_pressure_enabled=true}]; discovery will continue using [127.0.0.1:34693] from hosts providers and [{node_s0}{1JnXbN3XQLO0u-Q7V1Ro2g}{b0Y7f3J-Q5KjKlgMGYW35Q}{127.0.0.1}{127.0.0.1:34693}{m}{shard_indexing_pressure_enabled=true}] from last-known cluster state; node term 1, last-accepted version 14 in term 1
[2023-12-22T10:25:18,618][WARN ][o.o.c.c.ClusterFormationFailureHelper] [node_s2] cluster-manager not discovered yet: have discovered [{node_s2}{gTu5ZLOMSFCLMpDQTdtNMA}{mEc3_Dv8TB6wroMMHHnStA}{127.0.0.1}{127.0.0.1:38765}{d}{shard_indexing_pressure_enabled=true}]; discovery will continue using [127.0.0.1:34693] from hosts providers and [{node_s0}{1JnXbN3XQLO0u-Q7V1Ro2g}{b0Y7f3J-Q5KjKlgMGYW35Q}{127.0.0.1}{127.0.0.1:34693}{m}{shard_indexing_pressure_enabled=true}] from last-known cluster state; node term 1, last-accepted version 14 in term 1
[2023-12-22T10:25:28,618][WARN ][o.o.c.c.ClusterFormationFailureHelper] [node_s2] cluster-manager not discovered yet: have discovered [{node_s2}{gTu5ZLOMSFCLMpDQTdtNMA}{mEc3_Dv8TB6wroMMHHnStA}{127.0.0.1}{127.0.0.1:38765}{d}{shard_indexing_pressure_enabled=true}]; discovery will continue using [127.0.0.1:34693] from hosts providers and [{node_s0}{1JnXbN3XQLO0u-Q7V1Ro2g}{b0Y7f3J-Q5KjKlgMGYW35Q}{127.0.0.1}{127.0.0.1:34693}{m}{shard_indexing_pressure_enabled=true}] from last-known cluster state; node term 1, last-accepted version 14 in term 1
....
[2023-12-22T10:33:28,637][WARN ][o.o.c.c.ClusterFormationFailureHelper] [node_s2] cluster-manager not discovered yet: have discovered [{node_s2}{gTu5ZLOMSFCLMpDQTdtNMA}{mEc3_Dv8TB6wroMMHHnStA}{127.0.0.1}{127.0.0.1:38765}{d}{shard_indexing_pressure_enabled=true}]; discovery will continue using [127.0.0.1:34693] from hosts providers and [{node_s0}{1JnXbN3XQLO0u-Q7V1Ro2g}{b0Y7f3J-Q5KjKlgMGYW35Q}{127.0.0.1}{127.0.0.1:34693}{m}{shard_indexing_pressure_enabled=true}] from last-known cluster state; node term 1, last-accepted version 14 in term 1
[2023-12-22T10:33:38,637][WARN ][o.o.c.c.ClusterFormationFailureHelper] [node_s2] cluster-manager not discovered yet: have discovered [{node_s2}{gTu5ZLOMSFCLMpDQTdtNMA}{mEc3_Dv8TB6wroMMHHnStA}{127.0.0.1}{127.0.0.1:38765}{d}{shard_indexing_pressure_enabled=true}]; discovery will continue using [127.0.0.1:34693] from hosts providers and [{node_s0}{1JnXbN3XQLO0u-Q7V1Ro2g}{b0Y7f3J-Q5KjKlgMGYW35Q}{127.0.0.1}{127.0.0.1:34693}{m}{shard_indexing_pressure_enabled=true}] from last-known cluster state; node term 1, last-accepted version 14 in term 1
[2023-12-22T10:33:48,637][WARN ][o.o.c.c.ClusterFormationFailureHelper] [node_s2] cluster-manager not discovered yet: have discovered [{node_s2}{gTu5ZLOMSFCLMpDQTdtNMA}{mEc3_Dv8TB6wroMMHHnStA}{127.0.0.1}{127.0.0.1:38765}{d}{shard_indexing_pressure_enabled=true}]; discovery will continue using [127.0.0.1:34693] from hosts providers and [{node_s0}{1JnXbN3XQLO0u-Q7V1Ro2g}{b0Y7f3J-Q5KjKlgMGYW35Q}{127.0.0.1}{127.0.0.1:34693}{m}{shard_indexing_pressure_enabled=true}] from last-known cluster state; node term 1, last-accepted version 14 in term 1
[2023-12-22T10:33:58,638][WARN ][o.o.c.c.ClusterFormationFailureHelper] [node_s2] cluster-manager not discovered yet: have discovered [{node_s2}{gTu5ZLOMSFCLMpDQTdtNMA}{mEc3_Dv8TB6wroMMHHnStA}{127.0.0.1}{127.0.0.1:38765}{d}{shard_indexing_pressure_enabled=true}]; discovery will continue using [127.0.0.1:34693] from hosts providers and [{node_s0}{1JnXbN3XQLO0u-Q7V1Ro2g}{b0Y7f3J-Q5KjKlgMGYW35Q}{127.0.0.1}{127.0.0.1:34693}{m}{shard_indexing_pressure_enabled=true}] from last-known cluster state; node term 1, last-accepted version 14 in term 1
[2023-12-22T10:33:58,924][INFO ][o.o.n.Node ] [testRTSRestoreWithNoDataPostCommitPrimaryReplicaDown] stopped
[2023-12-22T10:33:58,924][INFO ][o.o.n.Node ] [testRTSRestoreWithNoDataPostCommitPrimaryReplicaDown] closing ...
[2023-12-22T10:33:58,931][INFO ][o.o.n.Node ] [testRTSRestoreWithNoDataPostCommitPrimaryReplicaDown] closed
[2023-12-22T10:33:59,438][INFO ][o.o.r.RemoteStoreRestoreIT] [testRTSRestoreWithNoDataPostCommitPrimaryReplicaDown] [RemoteStoreRestoreIT#testRTSRestoreWithNoDataPostCommitPrimaryReplicaDown]: cleaning up after test
[2023-12-22T10:33:59,439][INFO ][o.o.r.RemoteStoreRestoreIT] [testRTSRestoreWithNoDataPostCommitPrimaryReplicaDown] after test
It looks like the thread was blocked on:
https://github.com/opensearch-project/OpenSearch/blob/5c82ab885a876d659c9714c3b080488777506027/server/src/main/java/org/opensearch/index/shard/IndexShard.java#L4752-L4763
We terminate the nodes in the order replica first, then primary, but we only check the placement for shard 0. The node holding the replica of shard 0 would also have held primaries for other shards, so terminating it leads to relocation, and during relocation we create the read-only engine (code pointer in the last comment), which is where our locks block the thread (a rough sketch of this termination pattern follows the code link below):
https://github.com/opensearch-project/OpenSearch/blob/5c82ab885a876d659c9714c3b080488777506027/server/src/internalClusterTest/java/org/opensearch/remotestore/RemoteStoreRestoreIT.java#L137-L156
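For orientation, here is a minimal sketch of the shutdown ordering described above. The helper names (primaryNodeName, replicaNodeName, InternalTestCluster.nameFilter, ensureYellowAndNoInitializingShards) mirror the internal integration-test framework, but treat the exact calls and the surrounding body as assumptions rather than a copy of RemoteStoreRestoreIT:

```java
// Hedged sketch, assumed to live inside a test class extending the remote-store
// integration test base. Helper names mirror the internal test framework but
// this is not the actual RemoteStoreRestoreIT code.
private void stopReplicaThenPrimaryForShardZero(String indexName) throws Exception {
    // Placement is only resolved for shard 0 of the index...
    String primaryNode = primaryNodeName(indexName);
    String replicaNode = replicaNodeName(indexName);

    // ...then the replica node is stopped first, followed by the primary node.
    // The "replica node" for shard 0 can still host primaries of other shards,
    // so stopping it triggers replica-to-primary promotion / relocation on the
    // surviving nodes, which races with the subsequent shutdown (see the
    // blocked thread above).
    internalCluster().stopRandomNode(InternalTestCluster.nameFilter(replicaNode));
    ensureYellowAndNoInitializingShards(indexName);
    internalCluster().stopRandomNode(InternalTestCluster.nameFilter(primaryNode));
}
```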
Index shutdown and replica-to-primary promotion are causing a deadlock:
"opensearch[node_s2][generic][T#3]" ID=4135 BLOCKED on java.lang.Object@3c0fa470 owned by "opensearch[node_s2][indices_shutdown][T#1]" ID=4183
at app//org.opensearch.index.shard.IndexShard$11.getSegmentInfosSnapshot(IndexShard.java:4768)
- blocked on java.lang.Object@3c0fa470
at app//org.opensearch.index.shard.IndexShard.getSegmentInfosSnapshot(IndexShard.java:5113)
at app//org.opensearch.index.shard.IndexShard.getLatestSegmentInfosAndCheckpoint(IndexShard.java:1676)
at app//org.opensearch.index.shard.IndexShard.updateReplicationCheckpoint(IndexShard.java:4690)
at app//org.opensearch.index.shard.IndexShard$ReplicationCheckpointUpdater.afterRefresh(IndexShard.java:4684)
at app//org.apache.lucene.search.ReferenceManager.notifyRefreshListenersRefreshed(ReferenceManager.java:275)
at app//org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:182)
at app//org.apache.lucene.search.ReferenceManager.maybeRefreshBlocking(ReferenceManager.java:240)
at app//org.opensearch.index.engine.InternalEngine.refresh(InternalEngine.java:1862)
at app//org.opensearch.index.engine.InternalEngine.flush(InternalEngine.java:1977)
at app//org.opensearch.index.engine.InternalEngine.recoverFromTranslogInternal(InternalEngine.java:589)
at app//org.opensearch.index.engine.InternalEngine.recoverFromTranslog(InternalEngine.java:546)
at app//org.opensearch.index.engine.InternalEngine.recoverFromTranslog(InternalEngine.java:146)
at app//org.opensearch.index.shard.IndexShard.resetEngineToGlobalCheckpoint(IndexShard.java:4812)
at app//org.opensearch.index.shard.IndexShard.lambda$updateShardState$4(IndexShard.java:720)
at app//org.opensearch.index.shard.IndexShard$$Lambda/0x00007f45aca36238.run(Unknown Source)
at app//org.opensearch.index.shard.IndexShard$5.onResponse(IndexShard.java:4079)
at app//org.opensearch.index.shard.IndexShard$5.onResponse(IndexShard.java:4049)
at app//org.opensearch.index.shard.IndexShard.lambda$asyncBlockOperations$33(IndexShard.java:4000)
at app//org.opensearch.index.shard.IndexShard$$Lambda/0x00007f45aca11738.accept(Unknown Source)
at app//org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82)
at app//org.opensearch.index.shard.IndexShardOperationPermits$1.doRun(IndexShardOperationPermits.java:157)
at app//org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:911)
at app//org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
at [email protected]/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
at [email protected]/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
at [email protected]/java.lang.Thread.runWith(Thread.java:1596)
at [email protected]/java.lang.Thread.run(Thread.java:1583)
Locked synchronizers:
- java.util.concurrent.locks.ReentrantLock$NonfairSync@28675992
- java.util.concurrent.ThreadPoolExecutor$Worker@1197c3a3
- java.util.concurrent.locks.ReentrantLock$NonfairSync@57ac517a
20) Thread[id=4183, name=opensearch[node_s2][indices_shutdown][T#1], state=WAITING, group=TGRP-RemoteStoreRestoreIT]
at java.base/jdk.internal.misc.Unsafe.park(Native Method)
at java.base/java.util.concurrent.locks.LockSupport.park(LockSupport.java:221)
at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:754)
at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:990)
at java.base/java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:959)
at org.opensearch.common.util.concurrent.ReleasableLock.acquire(ReleasableLock.java:69)
at org.opensearch.index.engine.Engine.close(Engine.java:2070)
at org.opensearch.common.util.io.IOUtils.close(IOUtils.java:89)
at org.opensearch.common.util.io.IOUtils.close(IOUtils.java:131)
at org.opensearch.common.util.io.IOUtils.close(IOUtils.java:81)
at org.opensearch.index.shard.IndexShard$11.close(IndexShard.java:4785)
at org.opensearch.index.engine.Engine.flushAndClose(Engine.java:2059)
at org.opensearch.index.shard.IndexShard.close(IndexShard.java:2003)
at org.opensearch.index.IndexService.closeShard(IndexService.java:642)
at org.opensearch.index.IndexService.removeShard(IndexService.java:618)
at org.opensearch.index.IndexService.close(IndexService.java:389)
at org.opensearch.indices.IndicesService.removeIndex(IndicesService.java:1047)
at org.opensearch.indices.IndicesService.lambda$doStop$3(IndicesService.java:542)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
at java.base/java.lang.Thread.run(Thread.java:1583)
In the IndexShard.close flow we acquire engineMutex and then try to acquire the write lock on the engine. In the replica-to-primary promotion flow we acquire the read lock on the engine first and then try to acquire engineMutex. The read and write locks are shared via a ReentrantReadWriteLock, leading to a deadlock.
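A minimal, standalone sketch of this lock-ordering inversion (plain Java; "engineMutex" and "rwl" are hypothetical stand-ins for the IndexShard monitor and the engine's ReentrantReadWriteLock, not OpenSearch code):

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustration of the lock-ordering inversion described above: one thread takes
// engineMutex then the write lock, the other takes the read lock then engineMutex.
public class LockOrderInversionDemo {
    private static final Object engineMutex = new Object();
    private static final ReentrantReadWriteLock rwl = new ReentrantReadWriteLock();

    public static void main(String[] args) {
        // Thread A ~ IndexShard.close flow: engineMutex -> engine write lock.
        Thread close = new Thread(() -> {
            synchronized (engineMutex) {
                sleep(100);               // widen the race window
                rwl.writeLock().lock();   // blocks: promotion thread holds the read lock
                try { /* close engine */ } finally { rwl.writeLock().unlock(); }
            }
        }, "indices_shutdown");

        // Thread B ~ replica-to-primary promotion: engine read lock -> engineMutex.
        Thread promote = new Thread(() -> {
            rwl.readLock().lock();
            try {
                sleep(100);
                synchronized (engineMutex) { /* getSegmentInfosSnapshot */ } // blocks: close thread holds engineMutex
            } finally { rwl.readLock().unlock(); }
        }, "generic");

        close.start();
        promote.start();
        // With the sleeps above, each thread waits on a lock the other holds: deadlock.
    }

    private static void sleep(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}
```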
Created a separate bug to track the deadlock issue, since this bug is tracking the other reasons why the restore tests are failing:
https://github.com/opensearch-project/OpenSearch/issues/11869
I wasn't able to reproduce this even after 1K iterations. Since there have been only 2 occurrences and the last one was almost 2 months ago, I am closing the issue. Feel free to reopen if you encounter the same issue again.
org.opensearch.remotestore.RemoteStoreRestoreIT.testRTSRestoreWithNoDataPostRefreshPrimaryReplicaDown failed again here - https://github.com/opensearch-project/OpenSearch/pull/12252#issuecomment-2010136870
com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=3006, name=opensearch[node_s2][remote_refresh_retry][T#1], state=RUNNABLE, group=TGRP-RemoteStoreRestoreIT]
at __randomizedtesting.SeedInfo.seed([816E5637A5EABB73:F08AC3EEAC6BC269]:0)
Caused by: org.opensearch.core.concurrency.OpenSearchRejectedExecutionException: rejected execution of java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@7384c70f[Not completed, task = java.util.concurrent.Executors$RunnableAdapter@57463c3d[Wrapped task = [threaded] org.opensearch.index.shard.ReleasableRetryableRefreshListener$$Lambda/0x00007ff968b34230@56239546]] on org.opensearch.threadpool.Scheduler$SafeScheduledThreadPoolExecutor@294dab55[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 13]
at __randomizedtesting.SeedInfo.seed([816E5637A5EABB73]:0)
at app//org.opensearch.common.util.concurrent.OpenSearchAbortPolicy.rejectedExecution(OpenSearchAbortPolicy.java:67)
at [email protected]/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:841)
at [email protected]/java.util.concurrent.ScheduledThreadPoolExecutor.delayedExecute(ScheduledThreadPoolExecutor.java:340)
at [email protected]/java.util.concurrent.ScheduledThreadPoolExecutor.schedule(ScheduledThreadPoolExecutor.java:562)
at app//org.opensearch.threadpool.ThreadPool.schedule(ThreadPool.java:469)
at app//org.opensearch.index.shard.ReleasableRetryableRefreshListener.scheduleRetry(ReleasableRetryableRefreshListener.java:125)
at app//org.opensearch.index.shard.ReleasableRetryableRefreshListener.scheduleRetry(ReleasableRetryableRefreshListener.java:178)
at app//org.opensearch.index.shard.ReleasableRetryableRefreshListener.runAfterRefreshWithPermit(ReleasableRetryableRefreshListener.java:167)
at app//org.opensearch.index.shard.ReleasableRetryableRefreshListener.lambda$scheduleRetry$2(ReleasableRetryableRefreshListener.java:126)
at app//org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:854)
at [email protected]/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
at [email protected]/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
at [email protected]/java.lang.Thread.run(Thread.java:1583)
All 3 recent failures show the same OpenSearchRejectedExecutionException failure trace as above.
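The trace shows ReleasableRetryableRefreshListener scheduling a retry on a thread pool scheduler that the node shutdown has already terminated, so OpenSearchAbortPolicy rejects the task and the uncaught exception fails the test. As a rough illustration of the general pattern of tolerating such rejections during shutdown (the class and method names below are hypothetical and do not mirror the actual listener implementation):

```java
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: swallow scheduler rejections caused by shutdown instead
// of letting them escape as an uncaught exception on the retry thread.
final class RetryScheduler {
    private final ScheduledExecutorService scheduler;

    RetryScheduler(ScheduledExecutorService scheduler) {
        this.scheduler = scheduler;
    }

    void scheduleRetry(Runnable retryTask, long delayMillis) {
        try {
            scheduler.schedule(retryTask, delayMillis, TimeUnit.MILLISECONDS);
        } catch (RejectedExecutionException e) {
            if (!scheduler.isShutdown()) {
                throw e; // unexpected rejection while still running: propagate
            }
            // Scheduler is terminated (node shutting down); dropping the retry
            // is safe because there is nothing left to refresh.
        }
    }
}
```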
Ran around 5000 iterations locally and I cannot reproduce this. Will add the trace-logging annotation to this test to get more details from PR build failures to help debug.
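For reference, the trace-logging annotation mentioned above is typically applied to the test method like this; the logger names below are illustrative, and the exact package/attributes of @TestLogging are stated from memory, so treat them as assumptions:

```java
import org.opensearch.test.junit.annotations.TestLogging;

// Illustrative only: enables TRACE logging for the named loggers while this test
// runs, so the next CI failure captures more detail. Logger names are examples,
// not the ones actually chosen for the PR.
@TestLogging(value = "org.opensearch.index.shard:TRACE,org.opensearch.remotestore:TRACE",
             reason = "debug flaky RemoteStoreRestoreIT failures")
public void testRTSRestoreWithNoDataPostRefreshPrimaryReplicaDown() throws Exception {
    // ... existing test body ...
}
```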