[BUG] Cluster manager bootstrap takes time causing intermittent failures in integration tests (o.o.c.ClusterHealthIT.testHealthOnClusterManagerFailover)

Open dreamer-89 opened this issue 3 years ago • 7 comments

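Describe the bug Cluster manager (master) node bootstrap takes time, which causes intermittent failures in integration tests such as o.o.c.ClusterHealthIT.testHealthOnClusterManagerFailover.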

To Reproduce Steps to reproduce the behavior:

  1. Run o.o.cluster.ClusterHealthIT.testHealthOnMasterFailover with index creation enabled.
  2. Reduce the master node timeout to under 10 seconds.
  3. Run the test multiple times. It fails more than 90% of the time when a 1 second timeout is used.
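
The "run multiple times" step can be scripted. The following is a minimal sketch, not part of the original report: `count_failures` is a hypothetical helper name, and it assumes the command under test exits non-zero on failure (which Gradle does when a test fails).

```shell
# count_failures RUNS CMD [ARGS...]
# Hypothetical helper: runs CMD the given number of times and prints
# how many of those runs exited with a non-zero status.
count_failures() {
  runs=$1
  shift
  failures=0
  i=0
  while [ "$i" -lt "$runs" ]; do
    # Discard the command's output; only its exit status matters here.
    if ! "$@" >/dev/null 2>&1; then
      failures=$((failures + 1))
    fi
    i=$((i + 1))
  done
  echo "$failures"
}
```

For example, `count_failures 20 ./gradlew ':server:internalClusterTest' --tests "org.opensearch.cluster.ClusterHealthIT.testHealthOnMasterFailover"` would print the number of failing runs out of 20. The timeout reduction from step 2 still has to be made in the test code itself; the helper only repeats the run.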

Expected behavior Master node boot-up time should stay under 1 minute.

Host/Environment (please complete the following information):

  • OS: iOS

dreamer-89 avatar Dec 29 '21 20:12 dreamer-89

Failure in https://github.com/opensearch-project/OpenSearch/pull/1874#issuecomment-1011640866 looks the same

> Task :server:internalClusterTest

REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.cluster.ClusterHealthIT.testHealthOnMasterFailover" -Dtests.seed=2391EC7752804595 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=es-EC -Dtests.timezone=Etc/Greenwich -Druntime.java=17

org.opensearch.cluster.ClusterHealthIT > testHealthOnMasterFailover FAILED
    java.lang.AssertionError: expected same:<RED> was not:<GREEN>
        at __randomizedtesting.SeedInfo.seed([2391EC7752804595:BD7EC35A516A5973]:0)
        at org.junit.Assert.fail(Assert.java:89)
        at org.junit.Assert.failNotSame(Assert.java:829)
        at org.junit.Assert.assertSame(Assert.java:772)
        at org.junit.Assert.assertSame(Assert.java:783)
        at org.opensearch.cluster.ClusterHealthIT.testHealthOnMasterFailover(ClusterHealthIT.java:393)

dblock avatar Jan 13 '22 14:01 dblock

Similar failure: https://github.com/opensearch-project/OpenSearch/pull/2037 #2047

saratvemulapalli avatar Feb 02 '22 19:02 saratvemulapalli

For details, see issue https://github.com/opensearch-project/OpenSearch/issues/1693, which includes the MasterNotDiscoveredException error message.

tlfeng avatar Mar 15 '22 03:03 tlfeng

Another one in https://github.com/opensearch-project/OpenSearch/pull/5354#issuecomment-1325193006

dblock avatar Nov 23 '22 16:11 dblock

checking

rahulkarajgikar avatar Apr 18 '24 05:04 rahulkarajgikar

5k runs on a Linux machine with a 2 minute timeout: saw 5 failures.

5k runs on a Linux machine with a 3 minute timeout: did not see any failures.

rahulkarajgikar avatar May 02 '24 11:05 rahulkarajgikar

Raised PR to increase timeout: https://github.com/opensearch-project/OpenSearch/pull/13505

rahulkarajgikar avatar May 02 '24 11:05 rahulkarajgikar