OpenSearch
[BUG] StableClusterManagerDisruptionIT.testStaleClusterManagerNotHijackingMajority (Random Test Failure)
Describe the bug: Random test failure. Please dig in and figure out what went wrong :(
Add more information:
https://ci.opensearch.org/logs/ci/workflow/OpenSearch_CI/PR_Checks/Gradle_Check/gradle_check_1066.log
> Task :server:internalClusterTest
REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.discovery.StableMasterDisruptionIT.testStaleMasterNotHijackingMajority" -Dtests.seed=69CC1732A5C19596 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=sv -Dtests.timezone=America/Mexico_City -Druntime.java=17
org.opensearch.discovery.StableMasterDisruptionIT > testStaleMasterNotHijackingMajority FAILED
java.lang.AssertionError: node_t2: [Tuple [v1=node_t1, v2=null]]
at __randomizedtesting.SeedInfo.seed([69CC1732A5C19596:36CA5A3D841A4A9A]:0)
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.assertTrue(Assert.java:41)
at org.opensearch.discovery.StableMasterDisruptionIT.lambda$testStaleMasterNotHijackingMajority$5(StableMasterDisruptionIT.java:253)
at org.opensearch.test.OpenSearchTestCase.assertBusy(OpenSearchTestCase.java:1048)
at org.opensearch.test.OpenSearchTestCase.assertBusy(OpenSearchTestCase.java:1021)
at org.opensearch.discovery.StableMasterDisruptionIT.testStaleMasterNotHijackingMajority(StableMasterDisruptionIT.java:250)
https://github.com/opensearch-project/OpenSearch/pull/2541#issuecomment-1074479459
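The test was renamed to follow the new naming convention, which uses "cluster manager" instead of "master": StableMasterDisruptionIT.testStaleMasterNotHijackingMajority is now StableClusterManagerDisruptionIT.testStaleClusterManagerNotHijackingMajority.
One more occurrence: https://github.com/opensearch-project/OpenSearch/pull/6838#issuecomment-1484167698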
Checking
Ran 5000 iterations of the test locally and did not see any failures:
$ ./gradlew ':server:internalClusterTest' --tests "org.opensearch.discovery.StableClusterManagerDisruptionIT.testStaleClusterManagerNotHijackingMajority" -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=sv -Dtests.timezone=America/Mexico_City -Druntime.java=17 -Dtests.iters=5000 -Dtests.timeoutSuite=180000000!
Starting a Gradle Daemon, 1 busy Daemon could not be reused, use --status for details
> Configure project :
========================= WARNING =========================
Backwards compatibility tests are disabled!
See https://github.com/opensearch-project/OpenSearch/issues/4173
===========================================================
=======================================
OpenSearch Build Hamster says Hello!
Gradle Version : 8.4
OS Info : Mac OS X 14.3.1 (aarch64)
Runtime JDK Version : 17 (Amazon Corretto JDK)
Runtime java.home : /Library/Java/JavaVirtualMachines/amazon-corretto-17.jdk/Contents/Home
Gradle JDK Version : 21 (Amazon Corretto JDK)
Gradle java.home : /Library/Java/JavaVirtualMachines/amazon-corretto-21.jdk/Contents/Home
Random Testing Seed : 9F886D8E98DA3AB1
In FIPS 140 mode : false
=======================================
WARNING: A terminally deprecated method in java.lang.System has been called
WARNING: System::setSecurityManager has been called by org.opensearch.bootstrap.BootstrapForTesting (file:/Users/karajgik/workplace/OpenSearch_karajgik/OpenSearch/test/framework/build/distributions/framework-3.0.0-SNAPSHOT.jar)
WARNING: Please consider reporting this to the maintainers of org.opensearch.bootstrap.BootstrapForTesting
WARNING: System::setSecurityManager will be removed in a future release
WARNING: A terminally deprecated method in java.lang.System has been called
WARNING: System::setSecurityManager has been called by org.gradle.api.internal.tasks.testing.worker.TestWorker (file:/Users/karajgik/.gradle/wrapper/dists/gradle-8.4-all/56r6xik2f6skrm47et0ibifug/gradle-8.4/lib/plugins/gradle-testing-base-8.4.jar)
WARNING: Please consider reporting this to the maintainers of org.gradle.api.internal.tasks.testing.worker.TestWorker
WARNING: System::setSecurityManager will be removed in a future release
BUILD SUCCESSFUL in 17h 59m 55s
55 actionable tasks: 1 executed, 54 up-to-date
The test sets the cluster publish timeout to 1s. I was able to reproduce the failure only after lowering the cluster publish timeout to 10ms.
Although I was not able to reproduce the error with the default values, I will raise a PR to increase the cluster publish timeout in the test to 2s to get rid of the flakiness.
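To illustrate why a very short publish timeout can make the test fail, here is a minimal, self-contained sketch. It is not OpenSearch code: the class PublishTimeoutSketch, the method publishWithin, and the 200ms delay are hypothetical stand-ins for a publication that takes some time to be acknowledged and a publisher that gives up once the timeout elapses.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical stand-in, not OpenSearch code: models a publication that needs
// latencyMillis to be acknowledged and a publisher that waits at most timeoutMillis.
public class PublishTimeoutSketch {

    // Returns true if the simulated publication completes before the timeout.
    public static boolean publishWithin(long timeoutMillis, long latencyMillis) {
        CompletableFuture<Void> publication = CompletableFuture.runAsync(() -> {
            try {
                Thread.sleep(latencyMillis); // simulated network and apply delay
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        try {
            publication.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            return false; // the publisher gives up because the timeout elapsed
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } catch (ExecutionException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        long latencyMillis = 200; // assumed delay on a slow CI host, for illustration only
        for (long timeoutMillis : new long[] {10, 1000, 2000}) {
            System.out.println("timeout=" + timeoutMillis + "ms completed="
                + publishWithin(timeoutMillis, latencyMillis));
        }
    }
}
```

In this sketch the 10ms timeout expires before the simulated publication finishes, while the 1s and 2s timeouts do not. A delay close to 1s on a loaded CI host would behave like the 10ms case does here, which is the reasoning behind giving the test more headroom with 2s.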