[AUTOCUT] Gradle Check Flaky Test Report for MinimumClusterManagerNodesIT

Open opensearch-ci-bot opened this issue 1 year ago • 4 comments

Flaky Test Report for MinimumClusterManagerNodesIT

MinimumClusterManagerNodesIT has flaky tests that failed during post-merge actions.

Details

Git Reference Merged Pull Request Build Details Test Name
6049587461bb001dcae616c76399173817ce81ed 14040 40080 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock, org.opensearch.cluster.MinimumClusterManagerNodesIT.classMethod
8cf7f9259e69c90ed42763e17c3e7896f8a41c5c 16033 48311 org.opensearch.cluster.MinimumClusterManagerNodesIT.classMethod, org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
9675c4f6ec0d412993ef361bce44a8b789bff27b 14465 41398 org.opensearch.cluster.MinimumClusterManagerNodesIT.classMethod, org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
0d780b68e900bd99319b2c4ea3b7d567f8b121e5 15121 44058 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
2eb148cdffd32058c40d6703cbb4a06eb2a2cba3 15677 47308 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
3fa710b1ea46eee41130dfab06ccf7cbfb27b8e4 15648 46708 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
50f411e733ad90b54e7dfa69e85702b7e24ebe49 15582 46459 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
725ed36e85bba5b99ee34fb4f0813409247106c5 15783 47574 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
96fdbfdb87c41235e99697e04b9a0cc0adefb7bc 16385 49760 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
9cd2635ec3495a7222bde2137281f78949307f49 15483 45625 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
a05d6d1a0f44920fff93080942f7f5a8d3b10bb9 15905 47730 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
b35690c886f42d2ca01fa3081e80cb4ba4aa19d9 14795 42953 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
c801270b150083c0f15f8c1f70e3c6d8f731cac0 15660 46762 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
d56d8c88e07ae416d41197b05103ea2dba393967 14489 41572 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
fc1bf2c9c7b9858fe60caa3ed7ef09bbd0b30c4f 15759 47451 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
0fc94ca3aa9b12558d898ff05b479b360f71ae0f 13799 39263 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
130500218a794f15df522c3ba5a31acbc77209e4 14851 43091 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
3ef34558d3884f8a055be8f04c6d98da3428dcb9 16388 49737 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
56d0b76ac4c636d473177f4f12e854ce1fa6aa64 14401 41153 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
a021bf98e9e4bbc9fd36b694b4053d693dcedc22 16325 49454 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
acc46316550ee203851d5c622d3b4724646d3f3e 14587 42139 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
afa479b2c5ce9a22220bf2f4de49ae4ca69c3bc7 14748 42455 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
c89a17cecaf8348d936cd42d3000c1a1fa7cf120 13888 40047 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
017f7d461ca865da57948719db0b58f40286427c 15704 47021 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
107f0ce6a3c04ea4a759a8cf980e4d23c88ab1b8 15867 47672 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
1386a9b902c4af0e3cb88a6e7e16861970415b76 13930 39885 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
2e13e9cb5b3507e9e7e85b73012c7ccd84b6844f 14107 40782 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
3a38a6c86e34c5abbb0eb95d919e585e2af78feb 14365 41154 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
43e7597cdc5ba2c1852ec1796628f948633f0c57 16146 48697 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
591940911c052cf977812e3e0948b2ad5c922329 13945 39654 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
67eceaa75d788e20a1e941324210e164939a0991 15617 46561 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
90148942a56fa6a4840ad2afed195071f2d3c8e6 15401 45300 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
a12e3e6212c6103b64346c2c0a3859e467751337 16051 48345 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
a99fe302966cbff576c68bc2cc22dd38bab70000 16074 48420 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
dbdc1517dd1bc885d9204aff75d2e4c9ec13eee6 15589 46329 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
f85a58f64e5aaba76eb519e309881f288aff8fa6 14684 43162 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
fabf9bd596386dd745685a23b6a1dc52d0f84b7b 15293 44791 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
01c5e5642b7450bba2f3a21acdf8cf13539f65eb 15750 47339 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
03b1306b3cf2f4a37634ea6aca89512803541de6 15019 43633 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
06698dd292ecd74e86c4ffbb26270ebeabd7ce31 14922 43325 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
08c19327417afad9003236725920efc8a3abfa9b 16102 48602 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
0c2ff039890c9e891da068ba401a7a77683c4a5b 14230 41005 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
0eb2ec0bdbdb4f9f1f027ed108755fbae0d232f1 16348 49555 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
0ff0439dce988344c76ec0d68643bef528c652b6 16306 49410 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
11f8d79a96494fc6031894e28008da57bc3fe153 14716 42338 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
1562100eeaa9d8e108c6bc21a4030687d729fa1c 15400 45193 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
1bee506f8a6695c235d749ea90676841c3121e3c 15227 44685 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
201c673e980016ebc3c67e85b9a4d0fa684460b0 14458 41410 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
234c4da5d6e679e718c93e303f0b8bf65fbd7d5e 16026 48205 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
23d1c7a55a63250b962c1fad4e6fb962fdd156cc 16282 49484 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
36cb9ebb61f2ac5d0350cffb0cd381a2488d7cd0 16275 49252 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
3a1be63f3445bb38bea5898742a2b195c1c26251 14639 41976 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
3ddb199a77b73364cce725a8dcf594ab572b3d2a 15586 46999 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
4038a3c1e4e6a43460be49f5205e745133bea4c6 14074 40854 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
468f120141b6b472a143034fe59c12fed06b4a35 15724 47162 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
4c7d94cece85b3dd1a6de2df0efd22914c1fb9a5 14839 42902 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
57a597fe2d68f283790a3658d38f7ceb39e25c72 15494 45869 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
5bb2e2851d7a2986e59548826c8d935264f523e4 15200 44343 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
6021bcaa68ef05dc9435ea9e3d8b2eb2aa6e8fad 16280 49278 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
64383dd84bc2ce1370febaec9a3c3c8dea0cf81a 14561 41702 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
71d122b9013f72c5e28a9c3240f4c7f9491aecf2 15554 46063 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
7650e6412056f0b06e069ae6b2936f9ea2da4a7f 14345 41056 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
7a58f5e32fc4ad8c48cb401c4b516fb4cd09856f 16193 48947 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
7dbaf25aa1b1b33b09ef0eeb4df92f41225fd0fc 16176 48836 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
7e7e77504c6a0f20b3ba49786057ab906b9ea880 14864 43002 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
802f2e6e4b21f27ddc6c01e7fc6f6cdcd69138d3 14424 41253 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
887698d22fbca28f29c8ffc0f635228ac209d6a1 15132 44111 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
8e32ed736372aa90db4c0ce3b85888b7b473a337 14394 41336 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
903784b0afe756ee9f3e5eed7120f2289b207682 14414 41239 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
913013bd5c6b43d8337a97a7753bc2f10f36eae4 13948 39666 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
931339e38be8f29281501a5ac8f0dddf2aa2232d 16311 49422 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
a0a7098fda852eb18b0aa7d7aea23c6abdb497e7 14884 43148 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
a17aea599e56c18c07767bf50d5b9603ccf2e315 14710 42244 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
a968790ed5f4e47f96271483246842989520411e 15932 47815 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
ae22e3ff32ef15a6af302c50872f1fa0e8e140fe 16065 48417 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
afeddc228ba7791a549fb7c6ef94349d432c0824 14037 40910 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
b8c78196438897132f6819460ebb7d4222b39297 12782 41391 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
bb013dadc797bd3349a630444d59b6e9b6b96429 13717 39669 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
bde48a7b925b6cb20099c9d31023127288d3fb02 14133 40532 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
bf4367877eab27dff05a74d683d14e820130172d 13809 39614 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
c49eca4061d3af9af77a3eacd28043200343ba98 13721 40576 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
cad81b0e468164f5d58aaa83ca4b3d2f462c4990 15216 45855 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
d1cd7a2b8ba24a5e5ef3278315efd589e8c6eeee 15512 45861 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
d2bc9fc3daaaa33273bace58c4a94d2ae3e7be5c 15656 46780 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
d5c4081100ebe30e9dc84bb9d86003183b489bfd 16130 48628 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
d7b011612014a78283c56425d493550b64ad2b5b 16250 49186 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
e1a632fd8a88b0fad3d11708dd389c88eb0eeaa3 14340 40973 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
e67ced73226453d5a5504c78f3b7d5ae90b4914e 13784 39156 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
edcbfd49f0e047bf34fc88c9aeca4a20fde5ee45 14923 43290 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
eeb2f3997bb84f33f13b848e125051ecf2c2a1c7 16048 48315 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
f9d15df3a14b4ae32aeda3931867fc72dfd990c2 15715 47146 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
fef20032943378c02f8f3424865395058989e186 15181 44308 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock

The other pull requests, besides those involved in post-merge actions, that contain failing tests with the MinimumClusterManagerNodesIT class are:

For more details on the failed tests, refer to the OpenSearch Gradle Check Metrics dashboard.

opensearch-ci-bot • Jun 13 '24 21:06

Adding the Storage:Remote label to this one because I believe it has been traced back to a commit related to that feature. From the original issue:

I believe I have traced this back to the commit that introduced the flakiness: 9119b6dc20ea11d95a399c68505f1d858b78e30e (#9105)

The following command will reliably reproduce the failure for me:

./gradlew ':server:internalClusterTest' --tests "org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock" -Dtests.iters=100

If I select the commit immediately preceding 9119b6dc20e then it does not reproduce.

This is a bit concerning because the commit in question is related to the remote store feature but MinimumClusterManagerNodesIT does not do anything related to remote store, so it is possible there is a significant regression here.
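
As a rough illustration of why -Dtests.iters=100 can turn an intermittent failure into a reliable reproduction, the chance of seeing at least one failure in n runs is 1 - (1 - p)^n for a per-run failure probability p. The sketch below assumes the runs are independent, which may not hold exactly for seeded iterations. The class name `ReproOdds` and the per-run rates are hypothetical, chosen only for illustration, and are not measurements from this issue.

```java
// Hypothetical helper for illustration; not part of OpenSearch or its test framework.
final class ReproOdds {

    // Probability of at least one failure across `iterations` independent runs,
    // given a per-run failure probability `perRunFailure` in [0, 1].
    static double atLeastOneFailure(double perRunFailure, int iterations) {
        return 1.0 - Math.pow(1.0 - perRunFailure, iterations);
    }

    public static void main(String[] args) {
        // Illustrative per-run rates only; the real rate of this test is not known here.
        double[] assumedRates = {0.01, 0.05, 0.10};
        for (double rate : assumedRates) {
            System.out.printf("per-run rate %.2f -> chance of a failure in 100 runs: %.3f%n",
                rate, atLeastOneFailure(rate, 100));
        }
    }
}
```

Under these assumptions, even a per-run failure rate of a few percent makes a 100-iteration run very likely to hit the failure at least once, which fits the command reproducing the problem consistently.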

andrross • Jun 17 '24 20:06

So far I have mostly just added debug logging statements. Here is what they show:

  1. We start out with 3 nodes, [node_t0, node_t1, node_t2].
  2. We find the set of non-CM nodes, [node_t1, node_t0].
  3. We shut down the non-CM nodes, leaving [node_t2].
  4. We reuse the local paths of the two nodes that were shut down to start new nodes; the new nodes have the same UUIDs.
  5. Most of the time when the test passes, the new nodes are renamed, node_t0 -> node_t3 and node_t1 -> node_t4.
  6. When the test fails, it's consistently because the (formerly CM) node still thinks it's in a cluster with node_t0 and node_t1, and its cluster state version is 2 versions behind the other two nodes.
  7. The other two (new) nodes think that node_t2 is the cluster manager, but it hasn't caught up yet.
  8. The second cluster state update is likely the cluster manager assignment, so the root cause is probably the first cluster state update, which is failing on the (formerly cluster manager) node:
java.lang.AssertionError: a started primary with non-pending operation term must be in primary mode [test][1], node[ZdcgPV1JSmut1DojEIhCEw], [P], s[STARTED], a[id=Yl7dClDeQ0Ox4vlafvVO_A]
        at __randomizedtesting.SeedInfo.seed([D54CD0A4D377FB88]:0)
        at org.opensearch.index.shard.IndexShard.updateShardState(IndexShard.java:840)
        at org.opensearch.indices.cluster.IndicesClusterStateService.updateShard(IndicesClusterStateService.java:712)
        at org.opensearch.indices.cluster.IndicesClusterStateService.createOrUpdateShards(IndicesClusterStateService.java:651)
        at org.opensearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:294)
        at org.opensearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:626)
        at org.opensearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:612)
        at org.opensearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:580)
        at org.opensearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:503)
        at org.opensearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:205)
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:923)
        at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedOpenSearchThreadPoolExecutor.java:283)
        at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedOpenSearchThreadPoolExecutor.java:246)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
        at java.base/java.lang.Thread.run(Thread.java:1583)
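
The symptom in steps 6 and 7 (one node's applied cluster state version trailing the others) can be checked for mechanically. The following is a minimal, self-contained sketch of that check, not OpenSearch code: the class `ClusterStateLagCheck`, the `NodeView` type, the node labels, and the version numbers are all hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical toy model for illustration; none of these types exist in OpenSearch.
final class ClusterStateLagCheck {

    // One node's local view: a label and the last cluster state version it applied.
    static final class NodeView {
        final String name;
        final long appliedVersion;

        NodeView(String name, long appliedVersion) {
            this.name = name;
            this.appliedVersion = appliedVersion;
        }
    }

    // Describes every node whose applied version is behind the highest version seen.
    static List<String> laggingNodes(List<NodeView> views) {
        long newest = Long.MIN_VALUE;
        for (NodeView view : views) {
            newest = Math.max(newest, view.appliedVersion);
        }
        List<String> lagging = new ArrayList<>();
        for (NodeView view : views) {
            if (view.appliedVersion < newest) {
                lagging.add(view.name + " (behind by " + (newest - view.appliedVersion) + ")");
            }
        }
        return lagging;
    }

    public static void main(String[] args) {
        // Made-up version numbers mirroring the failing case: the former cluster
        // manager is two versions behind the two restarted nodes.
        List<NodeView> views = Arrays.asList(
            new NodeView("former-cluster-manager", 10),
            new NodeView("restarted-node-a", 12),
            new NodeView("restarted-node-b", 12)
        );
        System.out.println(laggingNodes(views));
    }
}
```

With the made-up values above, only the former cluster manager is reported, two versions behind. In a real investigation the versions would have to come from each node's own cluster state, which this sketch does not attempt to fetch.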

dbwiddis • Aug 24 '24 19:08

Adding a flush after the 2 nodes are randomly dropped seems effective in preventing the flakiness, but it also takes a long time:

client().admin().indices().prepareFlush().execute().actionGet();

Adding a refresh() fails at this point because there is no cluster manager.

dbwiddis • Aug 24 '24 20:08

Placing a refresh() between the two node terminations seems to reduce, but not eliminate, the flakiness.

I'm about at the limit of what debug logging can tell me, but I'd suggest someone with knowledge of the linked PR investigate the interaction of that code with the cluster state.

dbwiddis • Aug 24 '24 20:08