[AUTOCUT] Gradle Check Flaky Test Report for RecoveryWhileUnderLoadIT

Open opensearch-ci-bot opened this issue 1 year ago • 1 comments

Flaky Test Report for RecoveryWhileUnderLoadIT

RecoveryWhileUnderLoadIT has flaky tests that failed during post-merge actions.

Details

| Git Reference | Merged Pull Request | Build Details | Test Name |
| --- | --- | --- | --- |
| d0c2e39ae05454775b8063e09a88dd5f5834c49f | 17797 | 59086 | org.opensearch.recovery.RecoveryWhileUnderLoadIT.classMethod |
| | | | org.opensearch.recovery.RecoveryWhileUnderLoadIT.testRecoverWhileRelocating {p0={"cluster.indices.replication.strategy":"SEGMENT"}} |
| | | | org.opensearch.recovery.RecoveryWhileUnderLoadIT.testRecoverWhileUnderLoadAllocateReplicasRelocatePrimariesTest {p0={"cluster.indices.replication.strategy":"SEGMENT"}} |
| | | | org.opensearch.recovery.RecoveryWhileUnderLoadIT.testRecoverWhileUnderLoadAllocateReplicasTest {p0={"cluster.indices.replication.strategy":"SEGMENT"}} |
| | | | org.opensearch.recovery.RecoveryWhileUnderLoadIT.testRecoverWhileUnderLoadWithDerivedSource {p0={"cluster.indices.replication.strategy":"SEGMENT"}} |
| | | | org.opensearch.recovery.RecoveryWhileUnderLoadIT.testRecoverWhileUnderLoadWithReducedAllowedNodes {p0={"cluster.indices.replication.strategy":"SEGMENT"}} |
| | | | org.opensearch.recovery.RecoveryWhileUnderLoadIT.testRecoverWithRelocationAndDerivedSource {p0={"cluster.indices.replication.strategy":"SEGMENT"}} |
| | | | org.opensearch.recovery.RecoveryWhileUnderLoadIT.testRecoveryWithDerivedSourceEnabled {p0={"cluster.indices.replication.strategy":"SEGMENT"}} |
| | | | org.opensearch.recovery.RecoveryWhileUnderLoadIT.testReplicaRecoveryWithDerivedSourceFromTranslog {p0={"cluster.indices.replication.strategy":"SEGMENT"}} |
| ec5addab82d459743c5c6bb579e6573ecd610e03 | 18500 | 59161 | org.opensearch.recovery.RecoveryWhileUnderLoadIT.classMethod |
| | | | org.opensearch.recovery.RecoveryWhileUnderLoadIT.testRecoverWhileRelocating {p0={"cluster.indices.replication.strategy":"SEGMENT"}} |
| | | | org.opensearch.recovery.RecoveryWhileUnderLoadIT.testRecoverWhileUnderLoadAllocateReplicasRelocatePrimariesTest {p0={"cluster.indices.replication.strategy":"SEGMENT"}} |
| | | | org.opensearch.recovery.RecoveryWhileUnderLoadIT.testRecoverWhileUnderLoadAllocateReplicasTest {p0={"cluster.indices.replication.strategy":"SEGMENT"}} |
| | | | org.opensearch.recovery.RecoveryWhileUnderLoadIT.testRecoverWhileUnderLoadWithDerivedSource {p0={"cluster.indices.replication.strategy":"SEGMENT"}} |
| | | | org.opensearch.recovery.RecoveryWhileUnderLoadIT.testRecoverWhileUnderLoadWithReducedAllowedNodes {p0={"cluster.indices.replication.strategy":"SEGMENT"}} |
| | | | org.opensearch.recovery.RecoveryWhileUnderLoadIT.testRecoverWithRelocationAndDerivedSource {p0={"cluster.indices.replication.strategy":"SEGMENT"}} |
| | | | org.opensearch.recovery.RecoveryWhileUnderLoadIT.testRecoveryWithDerivedSourceEnabled {p0={"cluster.indices.replication.strategy":"SEGMENT"}} |
| | | | org.opensearch.recovery.RecoveryWhileUnderLoadIT.testReplicaRecoveryWithDerivedSourceFromTranslog {p0={"cluster.indices.replication.strategy":"SEGMENT"}} |
| b3ad02aad87370205d8bd80979b44980b64aadc6 | 18421 | 59113 | org.opensearch.recovery.RecoveryWhileUnderLoadIT.classMethod |
| | | | org.opensearch.recovery.RecoveryWhileUnderLoadIT.testRecoverWhileRelocating {p0={"cluster.indices.replication.strategy":"DOCUMENT"}} |
| | | | org.opensearch.recovery.RecoveryWhileUnderLoadIT.testRecoverWhileUnderLoadAllocateReplicasRelocatePrimariesTest {p0={"cluster.indices.replication.strategy":"DOCUMENT"}} |
| | | | org.opensearch.recovery.RecoveryWhileUnderLoadIT.testRecoverWhileUnderLoadAllocateReplicasTest {p0={"cluster.indices.replication.strategy":"DOCUMENT"}} |
| | | | org.opensearch.recovery.RecoveryWhileUnderLoadIT.testRecoverWhileUnderLoadWithDerivedSource {p0={"cluster.indices.replication.strategy":"DOCUMENT"}} |
| | | | org.opensearch.recovery.RecoveryWhileUnderLoadIT.testRecoverWhileUnderLoadWithReducedAllowedNodes {p0={"cluster.indices.replication.strategy":"DOCUMENT"}} |
| 7116a2c0633a425851393288d7cfa59911e10cf8 | 15138 | 45268 | org.opensearch.recovery.RecoveryWhileUnderLoadIT.classMethod |
| | | | org.opensearch.recovery.RecoveryWhileUnderLoadIT.testRecoverWhileUnderLoadAllocateReplicasRelocatePrimariesTest {p0={"cluster.indices.replication.strategy":"SEGMENT"}} |
| 1bb42ecfafad91528d2b869579c0e9e0fbfca130 | 14508 | 41533 | org.opensearch.recovery.RecoveryWhileUnderLoadIT.testRecoverWhileUnderLoadAllocateReplicasRelocatePrimariesTest {p0={"cluster.indices.replication.strategy":"DOCUMENT"}} |
| | | | org.opensearch.recovery.RecoveryWhileUnderLoadIT.testRecoverWhileUnderLoadAllocateReplicasTest {p0={"cluster.indices.replication.strategy":"DOCUMENT"}} |
| 528e2b0073af8c1557c528d1bdf360183ae011a4 | 17855 | 59021 | org.opensearch.recovery.RecoveryWhileUnderLoadIT.classMethod |
| | | | org.opensearch.recovery.RecoveryWhileUnderLoadIT.testRecoverWhileUnderLoadWithDerivedSource {p0={"cluster.indices.replication.strategy":"SEGMENT"}} |
| eb5035398967510165fcab4ff4664fd3e80e2cce | 15418 | 45418 | org.opensearch.recovery.RecoveryWhileUnderLoadIT.testRecoverWhileUnderLoadAllocateReplicasRelocatePrimariesTest {p0={"cluster.indices.replication.strategy":"DOCUMENT"}} |
| | | | org.opensearch.recovery.RecoveryWhileUnderLoadIT.testRecoverWhileUnderLoadAllocateReplicasTest {p0={"cluster.indices.replication.strategy":"DOCUMENT"}} |
| 8d3386cd1f657b0f885d3f5431769a414ff1b43b | 18043 | 57206 | org.opensearch.recovery.RecoveryWhileUnderLoadIT.testRecoverWhileUnderLoadAllocateReplicasRelocatePrimariesTest {p0={"cluster.indices.replication.strategy":"DOCUMENT"}} |
| 9a3fc307d48a800f96241bb26d4c2f46790a3db3 | 18003 | 57018 | org.opensearch.recovery.RecoveryWhileUnderLoadIT.testRecoverWhileUnderLoadAllocateReplicasRelocatePrimariesTest {p0={"cluster.indices.replication.strategy":"DOCUMENT"}} |
| c92b8ea8742b7ae48e5b169c02a255543c4c7b5d | 18435 | 59128 | org.opensearch.recovery.RecoveryWhileUnderLoadIT.testRecoverWhileUnderLoadWithDerivedSource {p0={"cluster.indices.replication.strategy":"SEGMENT"}} |

The other pull requests, besides those involved in post-merge actions, that contain failing tests with the RecoveryWhileUnderLoadIT class are:

For more details on the failed tests, refer to the OpenSearch Gradle Check Metrics dashboard.

opensearch-ci-bot avatar Jun 22 '24 18:06 opensearch-ci-bot

[Triage - attendees 1 2 3] Approving this autocut issue

peternied avatar Jun 26 '24 15:06 peternied

It looks like something changed related to this test about a week ago. See the dashboard.

Image

andrross avatar Jun 16 '25 21:06 andrross

PR #18054 recently added test cases here. @tanik98 @shwetathareja can you take a look?

andrross avatar Jun 16 '25 21:06 andrross

All the recovery-related ITs modified in #18054 now seem to be much more flaky:

Image

andrross avatar Jun 16 '25 21:06 andrross

@msfroh @rishabhmaurya What do you think? Should we revert #18054? I'm seeing quite a lot of failures.

andrross avatar Jun 16 '25 22:06 andrross

> @msfroh @rishabhmaurya What do you think? Should we revert #18054? I'm seeing quite a lot of failures.

I am in favor of reverting. There was an attempt to fix the tests, but it was evidently not sufficient.

msfroh avatar Jun 16 '25 23:06 msfroh

+1

rishabhmaurya avatar Jun 17 '25 04:06 rishabhmaurya

Hey everyone, I've hit this flaky test twice on different gradle checks. Should I wait for a fix/PR before I run gradle check again?

sawansri avatar Jun 17 '25 16:06 sawansri

The failure in testRecoverWhileUnderLoadWithDerivedSource seems to be due to a mismatch in the source. Two translog entries are being compared:

  1. The entry written directly by the replica.
  2. The snapshot received from the peer recovery flow.

The latter derives the source, so its structure may differ from the user-provided source even though the two represent equivalent objects, and that can cause the assertion in the test to fail. We should be able to fix the test with an improved check.
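
As a rough illustration of what such an improved check could look like, here is a minimal Java sketch that compares two sources by content rather than by serialized form. Everything in it is hypothetical: the class, the method names, and the sample fields are not taken from the OpenSearch test, and field order is only one assumed way in which a derived source could differ structurally from the user-provided one. The real test would presumably parse both sources with the project's own utilities before comparing them.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not the actual OpenSearch test code.
class SourceComparison {

    // Stand-in for the parsed source as the user indexed it (assumed fields).
    static Map<String, Object> userProvidedSource() {
        Map<String, Object> source = new LinkedHashMap<>();
        source.put("name", "test");
        source.put("value", 42);
        source.put("tags", List.of("a", "b"));
        return source;
    }

    // Stand-in for a derived source: same fields and values, different field order.
    static Map<String, Object> derivedSource() {
        Map<String, Object> source = new LinkedHashMap<>();
        source.put("tags", List.of("a", "b"));
        source.put("value", 42);
        source.put("name", "test");
        return source;
    }

    // Strict check: compares the serialized form, so it is sensitive to field order.
    static boolean sameSerializedForm(Map<String, Object> a, Map<String, Object> b) {
        return a.toString().equals(b.toString());
    }

    // Relaxed check: Map.equals compares keys and values and ignores entry order.
    static boolean sameContent(Map<String, Object> a, Map<String, Object> b) {
        return a.equals(b);
    }

    public static void main(String[] args) {
        Map<String, Object> user = userProvidedSource();
        Map<String, Object> derived = derivedSource();
        System.out.println("serialized form equal: " + sameSerializedForm(user, derived)); // false
        System.out.println("content equal: " + sameContent(user, derived));                // true
    }
}
```

A check along the lines of sameContent would accept the two translog entries whenever they describe the same document, while still failing if a field or value were actually lost during recovery.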

mgodwan avatar Jun 19 '25 10:06 mgodwan