OpenSearch
[BUG] Test timeout waiting for shards to rebalance
Describe the bug
In the flaky test ClusterRerouteIT.testDelayWithALargeAmountOfShards (#14510), we can clearly see that shards are rebalanced back and forth:
https://build.ci.opensearch.org/blue/rest/organizations/jenkins/pipelines/gradle-check/runs/41527/nodes/18/steps/32/log/?start=0
We can see from the log:
- 2024-06-22T23:45:36,760: node_t0 is shut down.
- 2024-06-22T23:45:37,042: node_t3 is elected as the new cluster manager.
- 2024-06-22T23:45:49,898: all the indices are green.
- 2024-06-22T23:45:49,898 ~ 2024-06-22T23:47:39,526: [test8][4] and [test4][4] are rebalanced back and forth.
- 2024-06-22T23:47:39,526: ensureGreen timed out.
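To put numbers on the timeline above, here is a minimal sketch that parses the log timestamps and computes how long each phase lasted. The class and method names (`RebalanceWindow`, `between`) are made up for illustration and are not part of OpenSearch; only the timestamps come from the log.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not an OpenSearch class.
class RebalanceWindow {
    // Timestamp layout used in the log excerpt, e.g. 2024-06-22T23:45:49,898
    private static final DateTimeFormatter LOG_TIME =
        DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss,SSS");

    // Elapsed time between two log timestamps.
    static Duration between(String start, String end) {
        return Duration.between(
            LocalDateTime.parse(start, LOG_TIME),
            LocalDateTime.parse(end, LOG_TIME));
    }

    public static void main(String[] args) {
        // node_t0 shut down -> all indices green
        Duration recovery = between("2024-06-22T23:45:36,760", "2024-06-22T23:45:49,898");
        // all indices green -> ensureGreen timed out
        Duration rebalancing = between("2024-06-22T23:45:49,898", "2024-06-22T23:47:39,526");
        System.out.println("recovery:    " + recovery.toMillis() + " ms");    // 13138 ms
        System.out.println("rebalancing: " + rebalancing.toMillis() + " ms"); // 109628 ms
    }
}
```

So the indices went green about 13 seconds after the node was shut down, and the two shards then kept moving for roughly 110 seconds until the timeout.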
This seems to be a very low-probability bug. I tried several times to reproduce it without success, so I am opening this issue to track it.
Related component
Cluster Manager
My understanding is that ClusterRerouteIT.testDelayWithALargeAmountOfShards creates a lot of shards and takes down a node. The cluster turns green at the end of the test, but shard placement isn't optimal, so the cluster keeps rebalancing toward the optimal state, and sometimes the test times out waiting for shard rebalancing to stop. @kkewwei Is that right?
The question here is whether this is a more general pattern that is causing flakiness in other test cases.
@andrross, yes. The strange thing is that only two shards are rebalanced back and forth, and this goes on for a long time.
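If it helps with checking other failed runs for the same pattern, here is a rough sketch of how one might flag shards that bounce between nodes, given a list of relocation events pulled out of a log. Everything in it is an assumption for illustration: the `Move` type, the `oscillating` method, the node names (`nodeA`, `nodeB`, `nodeC`), and the order of the events are invented and not taken from the CI log. Only the two shard names match the ones reported above.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not an OpenSearch class.
class ShardPingPong {
    // One relocation event: a shard moved from one node to another.
    static final class Move {
        final String shard;
        final String from;
        final String to;

        Move(String shard, String from, String to) {
            this.shard = shard;
            this.from = from;
            this.to = to;
        }
    }

    // Returns the shards that were relocated back onto a node they had previously left.
    static Set<String> oscillating(List<Move> moves) {
        Map<String, Set<String>> departed = new HashMap<>(); // shard -> nodes it has left so far
        Set<String> result = new TreeSet<>();
        for (Move m : moves) {
            Set<String> left = departed.computeIfAbsent(m.shard, k -> new HashSet<>());
            if (left.contains(m.to)) {
                result.add(m.shard); // moving back to a node it already left
            }
            left.add(m.from);
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up event sequence; node names are placeholders.
        List<Move> moves = List.of(
            new Move("[test8][4]", "nodeA", "nodeB"),
            new Move("[test4][4]", "nodeB", "nodeA"),
            new Move("[test8][4]", "nodeB", "nodeA"),
            new Move("[test4][4]", "nodeA", "nodeB"),
            new Move("[test1][0]", "nodeA", "nodeC"));
        System.out.println(oscillating(moves)); // [[test4][4], [test8][4]]
    }
}
```

A shard that moves only once, like `[test1][0]` in the made-up data, is not flagged. Only shards that return to a node they already left are reported.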