
[BUG] org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock flaky

Open sohami opened this issue 1 year ago • 8 comments

Describe the bug

Test org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock is flaky.

To Reproduce

Sep 11, 2023 1:44:23 PM com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler uncaughtException
WARNING: Uncaught exception in thread: Thread[#339,opensearch[node_t2][clusterApplierService#updateTask][T#1],5,TGRP-MinimumClusterManagerNodesIT]
java.lang.AssertionError: a started primary with non-pending operation term must be in primary mode [test][2], node[IADuWGkCTpuWEnWUFcbkSQ], [P], s[STARTED], a[id=oar4Dv6STMWSzO-FDH4bMA]
	at __randomizedtesting.SeedInfo.seed([7E7C985F304948B0]:0)
	at org.opensearch.index.shard.IndexShard.updateShardState(IndexShard.java:752)
	at org.opensearch.indices.cluster.IndicesClusterStateService.updateShard(IndicesClusterStateService.java:710)
	at org.opensearch.indices.cluster.IndicesClusterStateService.createOrUpdateShards(IndicesClusterStateService.java:650)
	at org.opensearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:293)
	at org.opensearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:606)
	at org.opensearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:593)
	at org.opensearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:561)
	at org.opensearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:484)
	at org.opensearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:186)
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:849)
	at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedOpenSearchThreadPoolExecutor.java:282)
	at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedOpenSearchThreadPoolExecutor.java:245)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base/java.lang.Thread.run(Thread.java:1623)

REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock" -Dtests.seed=7E7C985F304948B0 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=ar-SD -Dtests.timezone=Europe/Lisbon -Druntime.java=20
NOTE: leaving temporary files on disk at: /var/jenkins/workspace/gradle-check/search/server/build/testrun/internalClusterTest/temp/org.opensearch.cluster.MinimumClusterManagerNodesIT_7E7C985F304948B0-001
NOTE: test params are: codec=Asserting(Lucene95), sim=Asserting(RandomSimilarity(queryNorm=false): {}), locale=ar-SD, timezone=Europe/Lisbon
NOTE: Linux 5.15.0-1039-aws amd64/Eclipse Adoptium 20.0.2 (64-bit)/cpus=32,threads=1,free=204825744,total=536870912
NOTE: All tests run in this JVM: [PendingTasksBlocksIT, GetIndexIT, ActiveShardsObserverIT, MinimumClusterManagerNodesIT]
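
When comparing several failure reports, it can help to pull the seed and the other `-D` test properties out of a `REPRODUCE WITH:` line. The following is a minimal sketch, not part of the OpenSearch tooling: the function name `parse_reproduce_line` and the shape of the returned dictionary are assumptions made for illustration.

```python
import shlex

# The REPRODUCE WITH line from the log above, split across literals for readability.
LINE = (
    "REPRODUCE WITH: ./gradlew ':server:internalClusterTest' "
    '--tests "org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock" '
    "-Dtests.seed=7E7C985F304948B0 -Dtests.security.manager=true "
    '-Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" '
    "-Dtests.locale=ar-SD -Dtests.timezone=Europe/Lisbon -Druntime.java=20"
)


def parse_reproduce_line(line: str) -> dict:
    """Split a 'REPRODUCE WITH:' line into command, tasks, test filters, and -D properties.

    Hypothetical helper: the name and return shape are illustrative only.
    """
    prefix = "REPRODUCE WITH:"
    if not line.startswith(prefix):
        raise ValueError("not a REPRODUCE WITH line")
    # shlex.split honours shell-style quoting, so the quoted JVM argline stays one token.
    tokens = shlex.split(line[len(prefix):])
    result = {"command": tokens[0], "tasks": [], "tests": [], "properties": {}}
    remaining = iter(tokens[1:])
    for token in remaining:
        if token == "--tests":
            result["tests"].append(next(remaining))
        elif token.startswith("-D"):
            # Split at the first '=' only; values may contain '=' themselves.
            key, _, value = token[2:].partition("=")
            result["properties"][key] = value
        else:
            result["tasks"].append(token)
    return result


parsed = parse_reproduce_line(LINE)
print(parsed["properties"]["tests.seed"], parsed["properties"]["tests.locale"])
```

Keeping the properties in a dictionary makes it easy to check whether two failures share a seed, locale, or JVM version.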

Expected behavior

Test should always pass.

Plugins

Standard

Screenshots

Host/Environment

https://build.ci.opensearch.org/job/gradle-check/25287/testReport/junit/org.opensearch.cluster/MinimumClusterManagerNodesIT/testThreeNodesNoClusterManagerBlock/

Additional context

https://build.ci.opensearch.org/job/gradle-check/25287/


I (@andrross) am adding the content from this comment to the description here because it has now been buried in the comment stream:

I believe I have traced this back to the commit that introduced the flakiness: 9119b6dc20ea11d95a399c68505f1d858b78e30e (#9105)

The following command will reliably reproduce the failure for me:

./gradlew ':server:internalClusterTest' --tests "org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock" -Dtests.iters=100

If I select the commit immediately preceding 9119b6dc20e then it does not reproduce.

This is a bit concerning because the commit in question is related to the remote store feature but MinimumClusterManagerNodesIT does not do anything related to remote store, so it is possible there is a significant regression here.
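
As a rough guide to why `-Dtests.iters=100` can reproduce a flaky failure reliably, the sketch below computes the chance of seeing at least one failure in a given number of runs. It assumes each iteration fails independently with the same probability, and the 5% rate used in the example is a made-up number; the issue does not report the actual failure rate.

```python
import math


def detection_probability(p_fail: float, iterations: int) -> float:
    """Probability of at least one failure in `iterations` independent runs."""
    return 1.0 - (1.0 - p_fail) ** iterations


def iterations_needed(p_fail: float, confidence: float) -> int:
    """Smallest number of runs that shows a failure with the given confidence."""
    if not 0.0 < p_fail < 1.0 or not 0.0 < confidence < 1.0:
        raise ValueError("p_fail and confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_fail))


# Hypothetical per-run failure rate of 5%; not a measured value.
print(detection_probability(0.05, 100))
print(iterations_needed(0.05, 0.99))
```

Under the same assumptions, a clean run of 100 iterations on the preceding commit is evidence, though not proof, that the failure is absent there.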

sohami • Sep 12 '23 20:09

https://github.com/opensearch-project/OpenSearch/pull/10519#issuecomment-1754410824

andrross • Oct 10 '23 13:10

https://github.com/opensearch-project/OpenSearch/pull/10670#issuecomment-1776629574

shwetathareja • Oct 24 '23 08:10

https://github.com/opensearch-project/OpenSearch/pull/10964#issuecomment-1782966285

amkhar • Oct 27 '23 13:10

https://github.com/opensearch-project/OpenSearch/pull/10986#issuecomment-1789416684

ashking94 • Nov 01 '23 18:11

I believe I have traced this back to the commit that introduced the flakiness: 9119b6dc20ea11d95a399c68505f1d858b78e30e (#9105)

The following command will reliably reproduce the failure for me:

./gradlew ':server:internalClusterTest' --tests "org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock" -Dtests.iters=100

If I select the commit immediately preceding 9119b6dc20e then it does not reproduce.

This is a bit concerning because the commit in question is related to the remote store feature but MinimumClusterManagerNodesIT does not do anything related to remote store, so it is possible there is a significant regression here. @psychbot @gbbafna @Bukhtawar @ashking94 Sorry for the spam folks but you were all involved with the review of the PR so want to make sure you're aware.
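
The comment does not say how the commit was located. One common approach is a binary search over the commit range, sketched below under stated assumptions: `first_bad_commit` and the `fails` callback are hypothetical names, the commit labels in the example are placeholders rather than real hashes, and the failure is assumed to persist in every commit after the one that introduced it.

```python
from typing import Callable, Sequence


def first_bad_commit(commits: Sequence[str], fails: Callable[[str], bool]) -> str:
    """Return the earliest commit for which fails(commit) is True.

    Assumes `commits` is ordered oldest to newest, the last commit is known
    to fail, and every commit after the first bad one also fails.
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[hi] is bad
    while lo < hi:
        mid = (lo + hi) // 2
        if fails(commits[mid]):
            hi = mid        # first bad commit is at mid or earlier
        else:
            lo = mid + 1    # first bad commit is after mid
    return commits[lo]


# Placeholder history: "c3" stands in for the commit that introduced the failure.
history = ["c0", "c1", "c2", "c3", "c4", "c5"]
print(first_bad_commit(history, lambda commit: history.index(commit) >= 3))
```

For a flaky test, `fails` would need to check out the commit and run many iterations (for example the `-Dtests.iters=100` command above), because a single clean run at a bad commit would send the search in the wrong direction.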

andrross • Nov 17 '23 23:11

  • Impacted PR #11193, failure logs.

peternied • Nov 28 '23 23:11

Adding Storage:Remote label as it appears PR #9105 introduced this flakiness

andrross • Feb 21 '24 20:02

@amkhar @gauravruhela @ramaran Over the past 30 days, this test has adversely affected 17 pull requests (PRs), including [#12459, #12382, #12376, #12375, #12374, #12267 (repeated), #12180, #12163 (repeated), #12151, #12133, #12117, #12111 (repeated)].

Please prioritize fixing this test or disabling the test case until it can be fixed.

peternied • Feb 29 '24 22:02

java.lang.AssertionError: Missing cluster-manager, expected nodes: [{node_t4}{4EedRXkRQVKI0fmGCb6Y1Q}{rsoAXMTPQNW7slygLMSvQQ}{127.0.0.1}{127.0.0.1:35601}{dimr}{shard_indexing_pressure_enabled=true}, {node_t3}{NDGm--CAR-6KZLASPentjg}{AYdxiLmJTPKxFQ6pCIBxsA}{127.0.0.1}{127.0.0.1:44999}{dimr}{shard_indexing_pressure_enabled=true}, {node_t2}{ymARqza7Q0eocUFwC_3sbQ}{WhCTBd4tRa2tgW43N9mBnQ}{127.0.0.1}{127.0.0.1:42273}{dimr}{shard_indexing_pressure_enabled=true}] and actual cluster states [cluster uuid: y97dDapYSby5Tqr8dZbPZA [committed: true]
version: 10
state uuid: R5lx6pcBSomVZDCx7jW65Q
from_diff: false
meta data version: 7
   coordination_metadata:
      term: 1
      last_committed_config: VotingConfiguration{ymARqza7Q0eocUFwC_3sbQ,4EedRXkRQVKI0fmGCb6Y1Q,NDGm--CAR-6KZLASPentjg}
      last_accepted_config: VotingConfiguration{ymARqza7Q0eocUFwC_3sbQ,4EedRXkRQVKI0fmGCb6Y1Q,NDGm--CAR-6KZLASPentjg}
      voting tombstones: []
   [test/5EwrhsdDT2ShsBqHn77r-A]: v[7], mv[2], sv[1], av[1]
      0: p_term [1], isa_ids [wYcVyMvcQdOEOfu8mNcWfg, J1Y7tiTKQuCNpyQDskxqMQ]
      1: p_term [1], isa_ids [wMFwg3TiQV6Bq7tS1dmSVQ, zs4uh3wmRqiQs8HO7VEzSQ]
      2: p_term [1], isa_ids [dUkUHeD9QGuCY8do7NSPJg, q-h4jojQSkyi2z-2EroIlQ]
metadata customs:
   index-graveyard: IndexGraveyard[[]]
blocks: 
   _global_:
      2,no cluster-manager, blocks WRITE,METADATA_WRITE
nodes: 
   {node_t2}{ymARqza7Q0eocUFwC_3sbQ}{WhCTBd4tRa2tgW43N9mBnQ}{127.0.0.1}{127.0.0.1:42273}{dimr}{shard_indexing_pressure_enabled=true}, local
   {node_t1}{NDGm--CAR-6KZLASPentjg}{XHgAw-wyToy7ZtQ8rsTQ9g}{127.0.0.1}{127.0.0.1:41977}{dimr}{shard_indexing_pressure_enabled=true}
   {node_t0}{4EedRXkRQVKI0fmGCb6Y1Q}{tiXuHCRERKC7OsVyg0BuTg}{127.0.0.1}{127.0.0.1:37015}{dimr}{shard_indexing_pressure_enabled=true}
routing_table (version 7):
-- index [[test/5EwrhsdDT2ShsBqHn77r-A]]
----shard_id [test][0]
--------[test][0], node[ymARqza7Q0eocUFwC_3sbQ], [P], s[STARTED], a[id=J1Y7tiTKQuCNpyQDskxqMQ]
--------[test][0], node[4EedRXkRQVKI0fmGCb6Y1Q], [R], s[STARTED], a[id=wYcVyMvcQdOEOfu8mNcWfg]
----shard_id [test][1]
--------[test][1], node[4EedRXkRQVKI0fmGCb6Y1Q], [P], s[STARTED], a[id=zs4uh3wmRqiQs8HO7VEzSQ]
--------[test][1], node[NDGm--CAR-6KZLASPentjg], [R], s[STARTED], a[id=wMFwg3TiQV6Bq7tS1dmSVQ]
----shard_id [test][2]
--------[test][2], node[ymARqza7Q0eocUFwC_3sbQ], [R], s[STARTED], a[id=q-h4jojQSkyi2z-2EroIlQ]
--------[test][2], node[NDGm--CAR-6KZLASPentjg], [P], s[STARTED], a[id=dUkUHeD9QGuCY8do7NSPJg]

routing_nodes:
-----node_id[ymARqza7Q0eocUFwC_3sbQ][V]
--------[test][0], node[ymARqza7Q0eocUFwC_3sbQ], [P], s[STARTED], a[id=J1Y7tiTKQuCNpyQDskxqMQ]
--------[test][2], node[ymARqza7Q0eocUFwC_3sbQ], [R], s[STARTED], a[id=q-h4jojQSkyi2z-2EroIlQ]
-----node_id[4EedRXkRQVKI0fmGCb6Y1Q][V]
--------[test][1], node[4EedRXkRQVKI0fmGCb6Y1Q], [P], s[STARTED], a[id=zs4uh3wmRqiQs8HO7VEzSQ]
--------[test][0], node[4EedRXkRQVKI0fmGCb6Y1Q], [R], s[STARTED], a[id=wYcVyMvcQdOEOfu8mNcWfg]
-----node_id[NDGm--CAR-6KZLASPentjg][V]
--------[test][2], node[NDGm--CAR-6KZLASPentjg], [P], s[STARTED], a[id=dUkUHeD9QGuCY8do7NSPJg]
--------[test][1], node[NDGm--CAR-6KZLASPentjg], [R], s[STARTED], a[id=wMFwg3TiQV6Bq7tS1dmSVQ]
---- unassigned
, cluster uuid: y97dDapYSby5Tqr8dZbPZA [committed: true]
version: 11
state uuid: KWBexCcJSmiAcjbR6BWO7w
from_diff: false
meta data version: 8
   coordination_metadata:
      term: 2
      last_committed_config: VotingConfiguration{ymARqza7Q0eocUFwC_3sbQ,NDGm--CAR-6KZLASPentjg,4EedRXkRQVKI0fmGCb6Y1Q}
      last_accepted_config: VotingConfiguration{ymARqza7Q0eocUFwC_3sbQ,NDGm--CAR-6KZLASPentjg,4EedRXkRQVKI0fmGCb6Y1Q}
      voting tombstones: []
   [test/5EwrhsdDT2ShsBqHn77r-A]: v[7], mv[2], sv[1], av[1]
      0: p_term [1], isa_ids [wYcVyMvcQdOEOfu8mNcWfg, J1Y7tiTKQuCNpyQDskxqMQ]
      1: p_term [1], isa_ids [wMFwg3TiQV6Bq7tS1dmSVQ, zs4uh3wmRqiQs8HO7VEzSQ]
      2: p_term [1], isa_ids [dUkUHeD9QGuCY8do7NSPJg, q-h4jojQSkyi2z-2EroIlQ]
metadata customs:
   index-graveyard: IndexGraveyard[[]]
nodes: 
   {node_t2}{ymARqza7Q0eocUFwC_3sbQ}{WhCTBd4tRa2tgW43N9mBnQ}{127.0.0.1}{127.0.0.1:42273}{dimr}{shard_indexing_pressure_enabled=true}, cluster-manager
   {node_t0}{4EedRXkRQVKI0fmGCb6Y1Q}{tiXuHCRERKC7OsVyg0BuTg}{127.0.0.1}{127.0.0.1:37015}{dimr}{shard_indexing_pressure_enabled=true}
   {node_t3}{NDGm--CAR-6KZLASPentjg}{AYdxiLmJTPKxFQ6pCIBxsA}{127.0.0.1}{127.0.0.1:44999}{dimr}{shard_indexing_pressure_enabled=true}, local
routing_table (version 8):
-- index [[test/5EwrhsdDT2ShsBqHn77r-A]]
----shard_id [test][0]
--------[test][0], node[ymARqza7Q0eocUFwC_3sbQ], [P], s[STARTED], a[id=J1Y7tiTKQuCNpyQDskxqMQ]
--------[test][0], node[4EedRXkRQVKI0fmGCb6Y1Q], [R], s[STARTED], a[id=wYcVyMvcQdOEOfu8mNcWfg]
----shard_id [test][1]
--------[test][1], node[4EedRXkRQVKI0fmGCb6Y1Q], [P], s[STARTED], a[id=zs4uh3wmRqiQs8HO7VEzSQ]
--------[test][1], node[null], [R], recovery_source[peer recovery], s[UNASSIGNED], unassigned_info[[reason=NODE_LEFT], at[2024-05-17T14:05:51.567Z], delayed=false, details[node_left [NDGm--CAR-6KZLASPentjg]], allocation_status[no_attempt]]
----shard_id [test][2]
--------[test][2], node[ymARqza7Q0eocUFwC_3sbQ], [P], s[STARTED], a[id=q-h4jojQSkyi2z-2EroIlQ]
--------[test][2], node[null], [R], recovery_source[peer recovery], s[UNASSIGNED], unassigned_info[[reason=NODE_LEFT], at[2024-05-17T14:05:51.567Z], delayed=false, details[node_left [NDGm--CAR-6KZLASPentjg]], allocation_status[no_attempt]]

routing_nodes:
-----node_id[ymARqza7Q0eocUFwC_3sbQ][V]
--------[test][0], node[ymARqza7Q0eocUFwC_3sbQ], [P], s[STARTED], a[id=J1Y7tiTKQuCNpyQDskxqMQ]
--------[test][2], node[ymARqza7Q0eocUFwC_3sbQ], [P], s[STARTED], a[id=q-h4jojQSkyi2z-2EroIlQ]
-----node_id[4EedRXkRQVKI0fmGCb6Y1Q][V]
--------[test][1], node[4EedRXkRQVKI0fmGCb6Y1Q], [P], s[STARTED], a[id=zs4uh3wmRqiQs8HO7VEzSQ]
--------[test][0], node[4EedRXkRQVKI0fmGCb6Y1Q], [R], s[STARTED], a[id=wYcVyMvcQdOEOfu8mNcWfg]
-----node_id[NDGm--CAR-6KZLASPentjg][V]
---- unassigned
--------[test][1], node[null], [R], recovery_source[peer recovery], s[UNASSIGNED], unassigned_info[[reason=NODE_LEFT], at[2024-05-17T14:05:51.567Z], delayed=false, details[node_left [NDGm--CAR-6KZLASPentjg]], allocation_status[no_attempt]]
--------[test][2], node[null], [R], recovery_source[peer recovery], s[UNASSIGNED], unassigned_info[[reason=NODE_LEFT], at[2024-05-17T14:05:51.567Z], delayed=false, details[node_left [NDGm--CAR-6KZLASPentjg]], allocation_status[no_attempt]]
, cluster uuid: _na_ [committed: false]
version: 0
state uuid: ceEdRNQoSp2wgPupsMUYdA
from_diff: false
meta data version: 0
   coordination_metadata:
      term: 0
      last_committed_config: VotingConfiguration{}
      last_accepted_config: VotingConfiguration{}
      voting tombstones: []
metadata customs:
   index-graveyard: IndexGraveyard[[]]
blocks: 
   _global_:
      1,state not recovered / initialized, blocks READ,WRITE,METADATA_READ,METADATA_WRITE,CREATE_INDEX
      2,no cluster-manager, blocks WRITE,METADATA_WRITE
nodes: 
   {node_t4}{4EedRXkRQVKI0fmGCb6Y1Q}{rsoAXMTPQNW7slygLMSvQQ}{127.0.0.1}{127.0.0.1:35601}{dimr}{shard_indexing_pressure_enabled=true}, local
routing_table (version 0):
routing_nodes:
-----node_id[4EedRXkRQVKI0fmGCb6Y1Q][V]
---- unassigned
]
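
The dump above is easier to compare by hand once the node names are grouped by node ID. The sketch below is illustrative only: it assumes the first two brace groups of each node entry are the node name and the node ID, and the helper name `node_names_by_id` is made up. The sample text is copied from the expected-nodes list and the first cluster state above, shortened to the first three brace groups.

```python
import re

# Matches "{node_tN}{<node id>}" at the start of a node entry.
NODE_RE = re.compile(r"\{(node_[^}]+)\}\{([^}]+)\}")

SAMPLE = """
expected nodes: {node_t4}{4EedRXkRQVKI0fmGCb6Y1Q}{rsoAXMTPQNW7slygLMSvQQ}, {node_t3}{NDGm--CAR-6KZLASPentjg}{AYdxiLmJTPKxFQ6pCIBxsA}
nodes:
   {node_t1}{NDGm--CAR-6KZLASPentjg}{XHgAw-wyToy7ZtQ8rsTQ9g}
   {node_t0}{4EedRXkRQVKI0fmGCb6Y1Q}{tiXuHCRERKC7OsVyg0BuTg}
"""


def node_names_by_id(text: str) -> dict:
    """Group the node names seen in a cluster state dump by node ID."""
    mapping = {}
    for name, node_id in NODE_RE.findall(text):
        mapping.setdefault(node_id, set()).add(name)
    return mapping


for node_id, names in node_names_by_id(SAMPLE).items():
    print(node_id, sorted(names))
```

In this dump the same node ID appears under two different node names (for example `node_t1` and `node_t3`), which would be consistent with nodes having been restarted under new names, although the thread does not state that explicitly.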

reta • May 17 '24 15:05

Closing in favor of the autocut #14289

andrross • Jun 17 '24 20:06