
[BUG] Test timeout waiting for shards to rebalance

Open kkewwei opened this issue 1 year ago • 2 comments

Describe the bug

In the flaky test ClusterRerouteIT.testDelayWithALargeAmountOfShards (#14510), we can clearly see that the shards are rebalanced back and forth: https://build.ci.opensearch.org/blue/rest/organizations/jenkins/pipelines/gradle-check/runs/41527/nodes/18/steps/32/log/?start=0

We can see from the log:

  • 2024-06-22T23:45:36,760: node_t0 is shut down.
  • 2024-06-22T23:45:37,042: node_t3 is elected as new cluster manager.
  • 2024-06-22T23:45:49,898: all the indices are green.
  • 2024-06-22T23:45:49,898 ~ 2024-06-22T23:47:39,526: [test8][4] and [test4][4] are rebalanced back and forth.
  • 2024-06-22T23:47:39,526: ensureGreen timed out.
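To put numbers on the timeline above, here is a small sketch that computes the gaps between the logged timestamps. The timestamps and event descriptions are copied from the summary above; the `seconds_between` helper and the `events` list are names made up for this illustration.

```python
from datetime import datetime

# Timestamp layout used in the log summary above (comma before milliseconds).
FMT = "%Y-%m-%dT%H:%M:%S,%f"

# Events copied from the issue description; nothing here is read from the CI log itself.
events = [
    ("2024-06-22T23:45:36,760", "node_t0 is shut down"),
    ("2024-06-22T23:45:37,042", "node_t3 is elected as new cluster manager"),
    ("2024-06-22T23:45:49,898", "all the indices are green"),
    ("2024-06-22T23:47:39,526", "ensureGreen timed out"),
]


def seconds_between(start: str, end: str) -> float:
    """Return the elapsed seconds between two log timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds()


# Gap between each pair of consecutive events.
gaps = [
    (prev_label, label, seconds_between(prev_ts, ts))
    for (prev_ts, prev_label), (ts, label) in zip(events, events[1:])
]

for prev_label, label, gap in gaps:
    print(f"{prev_label} -> {label}: {gap:.3f}s")
```

By this calculation the cluster reached green about 13 seconds after node_t0 went down, and the two shards then kept moving for roughly 110 seconds until ensureGreen gave up.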

This seems to be a very low-probability bug. I tried to reproduce it several times without success, so I am opening this issue to track it.

Related component

Cluster Manager

kkewwei avatar Jun 26 '24 09:06 kkewwei

My understanding is that ClusterRerouteIT.testDelayWithALargeAmountOfShards created a lot of shards and took down a node. The cluster would turn green at the end of the test, but shard placement wasn't optimal, so the cluster would continue rebalancing towards the optimal state, and sometimes the test would time out waiting for shard rebalancing to stop. @kkewwei Is that right?

The question here is whether this is a more general pattern that is causing flakiness in other test cases.

andrross avatar Jun 26 '24 15:06 andrross

@andrross, yes. The strange thing is that only two shards are rebalanced back and forth, and this goes on for a long time.
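If this shows up again, one way to confirm the back-and-forth pattern would be to scan relocation events for moves that immediately reverse the previous move of the same shard. The sketch below assumes the relocations have already been extracted from the log as (shard, source node, target node) tuples; `find_ping_pong`, the node names, and the sample events are all made up for illustration and are not taken from the CI log.

```python
from collections import defaultdict


def find_ping_pong(moves, min_reversals=2):
    """Count, per shard, how often a move exactly reverses that shard's previous move.

    moves: iterable of (shard, source_node, target_node) tuples in time order.
    Returns {shard: reversal_count} for shards with at least min_reversals.
    """
    last_move = {}
    reversals = defaultdict(int)
    for shard, source, target in moves:
        previous = last_move.get(shard)
        # A reversal is a move from B back to A right after a move from A to B.
        if previous == (target, source):
            reversals[shard] += 1
        last_move[shard] = (source, target)
    return {shard: n for shard, n in reversals.items() if n >= min_reversals}


# Hypothetical sample events (invented, NOT from the build log).
sample_moves = [
    ("[test8][4]", "node_a", "node_b"),
    ("[test4][4]", "node_b", "node_a"),
    ("[test8][4]", "node_b", "node_a"),
    ("[test4][4]", "node_a", "node_b"),
    ("[test8][4]", "node_a", "node_b"),
    ("[test1][0]", "node_a", "node_c"),
]

suspects = find_ping_pong(sample_moves)
print(suspects)
```

A shard that moves once towards a better placement never counts as a reversal, so ordinary rebalancing after a node loss should not be flagged; only shards that keep swapping between the same two nodes are reported.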

kkewwei avatar Jun 30 '24 02:06 kkewwei

[Triage - attendees 1 2 3 4 5 6]

rwali-aws avatar Jul 11 '24 06:07 rwali-aws