OpenSearch
[BUG] Cluster manager bootstrap takes time causing intermittent failures in integration tests (o.o.c.ClusterHealthIT.testHealthOnClusterManagerFailover)
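Describe the bug Cluster manager bootstrap can take long enough that o.o.c.ClusterHealthIT.testHealthOnClusterManagerFailover fails intermittently in integration test runs.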
To Reproduce Steps to reproduce the behavior:
- Run o.o.cluster.ClusterHealthIT.testHealthOnMasterFailover with index creation enabled.
- Reduce the master node timeout to less than 10 seconds.
- Run the test multiple times. It fails in more than 90% of runs when a 1-second timeout is used.
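The "run the test multiple times" step can be scripted. The following is a minimal sketch, not part of the OpenSearch tooling: `count_failures` and `GRADLE_CMD` are hypothetical names, and the Gradle arguments are copied from the REPRODUCE WITH line in the failure output further down, without the seed and JVM options. It assumes a failing test makes `./gradlew` exit with a non-zero status. Gradle may skip a test task it considers up to date, so check that each iteration actually executes the test.

```python
import subprocess


def count_failures(cmd, runs):
    """Run `cmd` `runs` times and return how many runs exited non-zero."""
    failures = 0
    for _ in range(runs):
        # Output is discarded; only the exit status is inspected.
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        if result.returncode != 0:
            failures += 1
    return failures


# Assumption: run from the root of an OpenSearch checkout.
# Arguments taken from the REPRODUCE WITH line in this issue (seed omitted).
GRADLE_CMD = [
    "./gradlew",
    ":server:internalClusterTest",
    "--tests",
    "org.opensearch.cluster.ClusterHealthIT.testHealthOnMasterFailover",
]
```

Calling `count_failures(GRADLE_CMD, 20)` from the repository root would return the number of failed runs out of 20, which can then be compared against the failure percentage reported above.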
Expected behavior Master node boot-up time should stay under 1 minute.
Host/Environment (please complete the following information):
- OS: iOS
Failure in https://github.com/opensearch-project/OpenSearch/pull/1874#issuecomment-1011640866 looks the same
> Task :server:internalClusterTest
REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.cluster.ClusterHealthIT.testHealthOnMasterFailover" -Dtests.seed=2391EC7752804595 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=es-EC -Dtests.timezone=Etc/Greenwich -Druntime.java=17
org.opensearch.cluster.ClusterHealthIT > testHealthOnMasterFailover FAILED
    java.lang.AssertionError: expected same:<RED> was not:<GREEN>
        at __randomizedtesting.SeedInfo.seed([2391EC7752804595:BD7EC35A516A5973]:0)
        at org.junit.Assert.fail(Assert.java:89)
        at org.junit.Assert.failNotSame(Assert.java:829)
        at org.junit.Assert.assertSame(Assert.java:772)
        at org.junit.Assert.assertSame(Assert.java:783)
        at org.opensearch.cluster.ClusterHealthIT.testHealthOnMasterFailover(ClusterHealthIT.java:393)
Similar failure: https://github.com/opensearch-project/OpenSearch/pull/2037 #2047
For details, see issue https://github.com/opensearch-project/OpenSearch/issues/1693, which has the MasterNotDiscoveredException error message.
Another one in https://github.com/opensearch-project/OpenSearch/pull/5354#issuecomment-1325193006
checking
Across 5k runs on a Linux machine with a 2-minute timeout, I saw 5 failures.
Across 5k runs on a Linux machine with a 3-minute timeout, I did not see any failures.
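To put the two run counts above in perspective, here is a rough sketch of the arithmetic. It assumes "5k" means exactly 5000 runs and that runs are independent; the function names are illustrative only. Zero observed failures does not prove a zero failure rate, so the second function computes the largest per-run failure probability that is still consistent with a clean batch at a given confidence level.

```python
def observed_failure_rate(failures, runs):
    """Fraction of runs that failed."""
    return failures / runs


def upper_bound_when_no_failures(runs, confidence=0.95):
    """Upper bound on the per-run failure probability after `runs` clean runs.

    Solves (1 - p) ** runs = 1 - confidence for p, assuming independent runs.
    """
    return 1.0 - (1.0 - confidence) ** (1.0 / runs)


RUNS = 5000  # assumption: "5k" taken as exactly 5000

rate_2min = observed_failure_rate(5, RUNS)
bound_3min = upper_bound_when_no_failures(RUNS)

print(f"2-minute timeout: {rate_2min:.2%} of runs failed")
print(f"3-minute timeout: failure rate below {bound_3min:.3%} at 95% confidence")
```

Under these assumptions the 2-minute timeout fails in about 0.1% of runs, while the clean 3-minute batch only bounds the failure rate at roughly 0.06% per run.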
Raised PR to increase timeout: https://github.com/opensearch-project/OpenSearch/pull/13505