[AUTOCUT] Gradle Check Flaky Test Report for FullRollingRestartIT
Flaky Test Report for FullRollingRestartIT
Noticed that FullRollingRestartIT has some flaky tests that failed during post-merge actions.
Details
| Git Reference | Merged Pull Request | Build Details | Test Name |
|---|---|---|---|
| 8a59ca1d417cb6022531125df09e274997c27ef3 | 18277 | 59141 | org.opensearch.recovery.FullRollingRestartIT.classMethod, org.opensearch.recovery.FullRollingRestartIT.testDerivedSourceWithConcurrentUpdatesRollingRestart {p0={"cluster.indices.replication.strategy":"DOCUMENT"}} |
| d0c2e39ae05454775b8063e09a88dd5f5834c49f | 17797 | 59086 | org.opensearch.recovery.FullRollingRestartIT.classMethod, org.opensearch.recovery.FullRollingRestartIT.testDerivedSourceWithConcurrentUpdatesRollingRestart {p0={"cluster.indices.replication.strategy":"SEGMENT"}} |
| f7ebe22b39f8c2ccae6f22f9b7776995a2c19be1 | 18488 | 59157 | org.opensearch.recovery.FullRollingRestartIT.classMethod, org.opensearch.recovery.FullRollingRestartIT.testDerivedSourceWithConcurrentUpdatesRollingRestart {p0={"cluster.indices.replication.strategy":"SEGMENT"}} |
| c92b8ea8742b7ae48e5b169c02a255543c4c7b5d | 18435 | 59128 | org.opensearch.recovery.FullRollingRestartIT.testDerivedSourceWithMixedVersionRollingRestart {p0={"cluster.indices.replication.strategy":"DOCUMENT"}} |
Other pull requests, besides those involved in post-merge actions, that contain failing tests from the FullRollingRestartIT class:
For more details on the failed tests, refer to the OpenSearch Gradle Check Metrics dashboard.
A concurrent update on the same doc is causing a sequence number conflict, and while waiting on this the test suite times out.
java.lang.Exception: Test abandoned because suite timeout was reached.
Бэс 10, 2025 9:11:07 ЭК com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler uncaughtException
WARNING: Uncaught exception in thread: Thread[#4762,opensearch[node_t0][generic][T#1],5,TGRP-FullRollingRestartIT]
java.lang.AssertionError: seqNo [100] was processed twice in generation [5], with different data. prvOp [Index{id='76', seqNo=100, primaryTerm=2, version=3, autoGeneratedIdTimestamp=-1}], newOp [Index{id='76', seqNo=100, primaryTerm=2, version=3, autoGeneratedIdTimestamp=-1}]
at __randomizedtesting.SeedInfo.seed([33B7FED8D60FC88D]:0)
at org.opensearch.index.translog.TranslogWriter.assertNoSeqNumberConflict(TranslogWriter.java:341)
at org.opensearch.index.translog.TranslogWriter.add(TranslogWriter.java:289)
at org.opensearch.index.translog.Translog.add(Translog.java:579)
at org.opensearch.index.translog.InternalTranslogManager.add(InternalTranslogManager.java:338)
at org.opensearch.index.engine.NRTReplicationEngine.index(NRTReplicationEngine.java:235)
at org.opensearch.index.shard.IndexShard.applyIndexOperation(IndexShard.java:1176)
at org.opensearch.index.shard.IndexShard.applyTranslogOperation(IndexShard.java:2472)
at org.opensearch.index.shard.IndexShard.applyTranslogOperation(IndexShard.java:2459)
at org.opensearch.indices.recovery.RecoveryTarget.lambda$indexTranslogOperations$4(RecoveryTarget.java:317)
at org.opensearch.core.action.ActionListener.completeWith(ActionListener.java:344)
at org.opensearch.indices.recovery.RecoveryTarget.indexTranslogOperations(RecoveryTarget.java:292)
at
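For context on what the assertion above protects: the translog guards against the same sequence number being written to a generation twice with different payloads. Below is a minimal, self-contained sketch of that kind of guard; the class and method names are illustrative only and this is not the actual TranslogWriter code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Illustrative sketch only: a simplified stand-in for the kind of check behind
// "seqNo [...] was processed twice in generation [...]". Names are hypothetical,
// not the actual OpenSearch TranslogWriter implementation.
final class SeqNoConflictGuard {
    // Payload previously recorded for each sequence number in the current generation.
    private final Map<Long, String> seenOps = new HashMap<>();

    /** Records an operation; fails if the same seqNo was already recorded with different data. */
    void add(long seqNo, String opData) {
        String previous = seenOps.putIfAbsent(seqNo, opData);
        if (previous != null && !Objects.equals(previous, opData)) {
            throw new AssertionError("seqNo [" + seqNo + "] was processed twice with different data. "
                + "prvOp [" + previous + "], newOp [" + opData + "]");
        }
    }

    public static void main(String[] args) {
        SeqNoConflictGuard guard = new SeqNoConflictGuard();
        guard.add(100, "Index{id='76', version=3}");
        guard.add(100, "Index{id='76', version=3}"); // identical replay of the same seqNo: allowed
        try {
            guard.add(100, "Index{id='76', version=4}"); // conflicting replay: rejected
        } catch (AssertionError expected) {
            System.out.println("conflict detected: " + expected.getMessage());
        }
    }
}
```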
@tanik98 Is this still a problem?
It's addressed as part of this PR, though it's not merged yet: https://github.com/opensearch-project/OpenSearch/pull/18565. But it shouldn't be causing the issue currently, as the previous PR was reverted.
@ashking94 The testFullRollingRestart and testFullRollingRestart_withNoRecoveryPayloadAndSource test cases started failing around July 6, with no recorded failures in the previous year. It only happens with segment replication. Unfortunately, I am unable to reproduce it locally. Every failure I've seen is caused by the test timing out in a refreshAndWaitForReplication() call (example failure below). There was a change to segrep in commit 46a0045a2301b3bc51e2da9c5e853f69ac6342c2 on July 3, which is just a few days before the first observed failure. Is there any chance that commit introduced a bug in this rolling restart scenario?
REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests 'org.opensearch.recovery.FullRollingRestartIT' -Dtests.method='testFullRollingRestart_withNoRecoveryPayloadAndSource {p0={"cluster.indices.replication.strategy":"SEGMENT"}}' -Dtests.seed=961918EE833A9E37 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=fr-NC -Dtests.timezone=CTT -Druntime.java=24
FullRollingRestartIT > testFullRollingRestart_withNoRecoveryPayloadAndSource {p0={"cluster.indices.replication.strategy":"SEGMENT"}} FAILED
java.lang.AssertionError: replica shards haven't caught up with primary expected:<20> but was:<17>
at __randomizedtesting.SeedInfo.seed([961918EE833A9E37:A55072AE2A77FF6F]:0)
at org.junit.Assert.fail(Assert.java:89)
at org.junit.Assert.failNotEquals(Assert.java:835)
at org.junit.Assert.assertEquals(Assert.java:647)
at org.opensearch.test.OpenSearchIntegTestCase.lambda$waitForReplication$1(OpenSearchIntegTestCase.java:2550)
at org.opensearch.test.OpenSearchTestCase.assertBusy(OpenSearchTestCase.java:1181)
at org.opensearch.test.OpenSearchIntegTestCase.waitForReplication(OpenSearchIntegTestCase.java:2539)
at org.opensearch.test.OpenSearchIntegTestCase.refreshAndWaitForReplication(OpenSearchIntegTestCase.java:2512)
at org.opensearch.recovery.FullRollingRestartIT.testFullRollingRestart_withNoRecoveryPayloadAndSource(FullRollingRestartIT.java:360)
https://build.ci.opensearch.org/job/gradle-check/61468/consoleText
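For reference, the wait that times out here is an assertBusy-style poll: the test repeatedly compares the replicas' searchable doc count against the primary's until they match or a deadline passes. Below is a minimal, self-contained sketch of that pattern; the class, method names, and timeout are made up for illustration, not the actual OpenSearchIntegTestCase code.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

// Illustrative sketch of an assertBusy-style wait: poll until the replica's doc count
// matches the primary's or a deadline passes. Not the actual
// OpenSearchIntegTestCase.waitForReplication implementation.
final class ReplicationWaiter {

    static void waitForReplicaCatchUp(LongSupplier primaryDocs, LongSupplier replicaDocs,
                                      long timeout, TimeUnit unit) throws InterruptedException {
        final long deadline = System.nanoTime() + unit.toNanos(timeout);
        long expected = primaryDocs.getAsLong();
        long actual = replicaDocs.getAsLong();
        while (actual != expected) {
            if (System.nanoTime() > deadline) {
                throw new AssertionError("replica shards haven't caught up with primary expected:<"
                    + expected + "> but was:<" + actual + ">");
            }
            Thread.sleep(100);                  // back off briefly between polls
            expected = primaryDocs.getAsLong(); // re-read both sides each iteration
            actual = replicaDocs.getAsLong();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Toy usage: the "replica" catches up after ~500 ms, so the wait succeeds.
        final long start = System.currentTimeMillis();
        waitForReplicaCatchUp(() -> 20,
            () -> System.currentTimeMillis() - start > 500 ? 20 : 17,
            5, TimeUnit.SECONDS);
        System.out.println("replica caught up");
    }
}
```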
If I expand the dashboard to include all builds (not just timer and post-merge builds), then the first failure occurs on July 3 with no other failures in the previous year. This is more evidence that the commit from #18636 might be the cause.
I was able to see a failure locally after running this in an infinite loop:
Assertion failure:
java.lang.AssertionError: replica shards haven't caught up with primary for shardId=[test][7] replicaNode=FoWC4kZbRyGvLVEZqS-8xg expected:<27> but was:<24>
Logs:
[2025-08-06T07:16:32,859][DEBUG][o.o.i.r.SegmentReplicationTargetService] [node_t4] Replica received new replication checkpoint from primary [ReplicationCheckpoint{shardId=[test][7], primaryTerm=1, segmentsGen=13, version=27, size=31725, codec=Lucene101, timestamp=1754464592831887761}]
[2025-08-06T07:16:32,859][DEBUG][o.o.c.s.ClusterApplierService] [node_t4] processing [ApplyCommitRequest{term=1, version=38, sourceNode={node_t0}{I2qwpYqWRBylvYDCy6Gpng}{mwNVElZmR8Oc0WSPFN2tfQ}{127.0.0.1}{127.0.0.1:37723}{dimr}}]: took [0s] done applying updated cluster state (version: 38, uuid: qFmNmYXXR8y5td0Oo4ELtQ)
[2025-08-06T07:16:32,859][DEBUG][o.o.i.r.SegmentReplicationTargetService] [node_t4] Ignoring new replication checkpoint - shard is currently replicating to checkpoint ReplicationCheckpoint{shardId=[test][7], primaryTerm=1, segmentsGen=13, version=27, size=31725, codec=Lucene101, timestamp=1754464592831887761}
[2025-08-06T07:16:32,861][DEBUG][o.o.i.r.SegmentReplicator] [node_t4] Finished replicating Id:[46] Checkpoint [ReplicationCheckpoint{shardId=[test][7], primaryTerm=1, segmentsGen=13, version=27, size=31725, codec=Lucene101, timestamp=1754464592831887761}] Shard:[[test][7]] Source:[node_t1] marking as done.
[2025-08-06T07:16:32,861][DEBUG][o.o.i.r.SegmentReplicationTargetService] [node_t4] [shardId 7] [replication id 46] Replication complete to ReplicationCheckpoint{shardId=[test][7], primaryTerm=1, segmentsGen=9, version=24, size=31725, codec=Lucene101, timestamp=1754464592236686785}, timing data: {INIT=0, OVERALL=27, GET_CHECKPOINT_INFO=1, FINALIZE_REPLICATION=16, FILE_DIFF=6, REPLICATING=0, GET_FILES=2}
Refresh triggered:
[2025-08-06T07:16:32,969][INFO ][o.o.r.FullRollingRestartIT] [testFullRollingRestart] --> refreshing and checking data
Based on the analysis so far, the failure appears to happen during primary relocation: on relocation, the SegmentInfos version and generation bump up and a segrep round is triggered, but the replica receiving the new checkpoint (which has not yet learned about the primary change through the cluster state) still talks to the older primary, receives a stale checkpoint in the get-checkpoint-info transport call, and uses it to process the segrep event. The PR https://github.com/opensearch-project/OpenSearch/pull/18636 fixed an infinite-loop issue that had been keeping the currently seen issue hidden. I am trying to see if there is a simple way to fix this issue or the test behaviour.
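One way such a race could be detected (a sketch only, not the code in the linked PR or any actual OpenSearch API): compare the checkpoint that triggered the replication round with the checkpoint returned by the get-checkpoint-info call, and retry against the current primary if the fetched one is behind. The Checkpoint class below is a simplified stand-in for ReplicationCheckpoint, with values mirroring the logs below.

```java
// Illustrative sketch only: a simplified stand-in for ReplicationCheckpoint and the kind
// of staleness comparison that could catch this race. Values in main() mirror the logs
// below; this is not the code from the linked PR or fix commit.
final class Checkpoint {
    final long primaryTerm;
    final long segmentsGen;
    final long version;

    Checkpoint(long primaryTerm, long segmentsGen, long version) {
        this.primaryTerm = primaryTerm;
        this.segmentsGen = segmentsGen;
        this.version = version;
    }

    /** True if this checkpoint is strictly ahead of the other (higher term, then higher version). */
    boolean isAheadOf(Checkpoint other) {
        if (primaryTerm != other.primaryTerm) {
            return primaryTerm > other.primaryTerm;
        }
        return version > other.version;
    }
}

final class StaleCheckpointGuard {
    /**
     * The replica started this replication round because it was notified of {@code notified};
     * the (possibly old) primary then returned {@code fetched} from the checkpoint-info call.
     * If the fetched checkpoint is behind the notified one, the source is stale and the round
     * should be retried against the current primary instead of being finalized.
     */
    static boolean shouldRetry(Checkpoint notified, Checkpoint fetched) {
        return notified.isAheadOf(fetched);
    }

    public static void main(String[] args) {
        Checkpoint notified = new Checkpoint(1, 9, 24); // published by the new primary (node_t4)
        Checkpoint fetched = new Checkpoint(1, 4, 19);  // returned by the old primary (node_t0)
        System.out.println("stale source, retry: " + shouldRetry(notified, fetched));
    }
}
```

The actual fix may take a different approach (for example, re-resolving the primary from cluster state before finalizing); this only illustrates the staleness condition visible in the logs.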
Adding logs that clearly explain the above issue (in chronological order):
node_t0: target
node_t4: source
node_t1: replica node
Primary relocation started
[2025-08-06T22:33:34,915][DEBUG][o.o.i.r.PeerRecoveryTargetService] [node_t4] [test][1] starting recovery from {node_t0}{P5ntM5WNQimjWm3ure1HWQ}{bogfLcGES5uYpcHB_3CYMg}{127.0.0.1}{127.0.0.1:43987}{dimr}{shard_indexing_pressure_enabled=true}
Primary relocation completed and new primary started
[2025-08-06T22:33:35,218][DEBUG][o.o.i.s.IndexShard ] [node_t4] [test][1] state: [RECOVERING]->[POST_RECOVERY], reason [peer recovery done]
[2025-08-06T22:33:35,218][DEBUG][o.o.c.a.s.ShardStateAction] [node_t4] sending [internal:cluster/shard/started] to [P5ntM5WNQimjWm3ure1HWQ] for shard entry [StartedShardEntry{shardId [[test][1]], allocationId [UEwh7OYtRXWlVFHb2kKLdA], primary term [1], message [after peer recovery]}]
[2025-08-06T22:33:35,218][DEBUG][o.o.t.TransportService ] [node_t4] Action: internal:cluster/shard/started
[2025-08-06T22:33:35,218][DEBUG][o.o.i.r.PeerRecoveryTargetService] [node_t4] [test][1] recovery done from [{node_t0}{P5ntM5WNQimjWm3ure1HWQ}{bogfLcGES5uYpcHB_3CYMg}{127.0.0.1}{127.0.0.1:43987}{dimr}{shard_indexing_pressure_enabled=true}], took [365ms]
[2025-08-06T22:33:35,218][DEBUG][o.o.c.a.s.ShardStateAction] [node_t0] [test][1] received shard started for [StartedShardEntry{shardId [[test][1]], allocationId [UEwh7OYtRXWlVFHb2kKLdA], primary term [1], message [after peer recovery]}]
Cluster manager published
[2025-08-06T22:33:35,308][DEBUG][o.o.t.TransportService ] [node_t0] Action: internal:cluster/coordination/commit_state
New Checkpoint updated on node_t4
[2025-08-06T22:33:35,315][DEBUG][o.o.i.s.IndexShard ] [node_t4] [test][1] Recomputed ReplicationCheckpoint for shard ReplicationCheckpoint{shardId=[test][1], primaryTerm=1, segmentsGen=9, version=24, size=37455, codec=Lucene101, timestamp=1754481815315516421}
[2025-08-06T22:33:35,315][INFO ][o.o.i.s.CheckpointRefreshListener] [node_t4] Checkpoint ReplicationCheckpoint{shardId=[test][1], primaryTerm=1, segmentsGen=9, version=24, size=37455, codec=Lucene101, timestamp=1754481815315516421} has been refreshed
New Checkpoint publishing from node_t4
[2025-08-06T22:33:35,315][DEBUG][o.o.t.TransportService ] [node_t4] Action: indices:admin/publishCheckpoint[p]
[2025-08-06T22:33:35,316][DEBUG][o.o.i.r.c.PublishCheckpointAction] [node_t4] shardOperationOnPrimary [ReplicationCheckpoint{shardId=[test][1], primaryTerm=1, segmentsGen=9, version=24, size=37455, codec=Lucene101, timestamp=1754481815315516421}]
New Checkpoint received on replica node_t1
[2025-08-06T22:33:35,316][DEBUG][o.o.t.TransportService ] [node_t4] Action: indices:admin/publishCheckpoint[r]
[2025-08-06T22:33:35,317][DEBUG][o.o.i.r.c.PublishCheckpointAction] [node_t1] doReplicaOperation from primary [ReplicationCheckpoint{shardId=[test][1], primaryTerm=1, segmentsGen=9, version=24, size=37455, codec=Lucene101, timestamp=1754481815315516421}]
[2025-08-06T22:33:35,317][DEBUG][o.o.i.r.SegmentReplicationTargetService] [node_t1] Replica received new replication checkpoint from primary [ReplicationCheckpoint{shardId=[test][1], primaryTerm=1, segmentsGen=9, version=24, size=37455, codec=Lucene101, timestamp=1754481815315516421}]
[2025-08-06T22:33:35,317][DEBUG][o.o.i.r.SegmentReplicator] [node_t1] Added new replication to collection Id:[27] Checkpoint [ReplicationCheckpoint{shardId=[test][1], primaryTerm=1, segmentsGen=9, version=24, size=37455, codec=Lucene101, timestamp=1754481815315516421}] Shard:[[test][1]] Source:[node_t0]
New Checkpoint published from node_t4
[2025-08-06T22:33:35,318][DEBUG][o.o.i.r.c.PublishCheckpointAction] [node_t4] [shardId 1] Completed publishing checkpoint [ReplicationCheckpoint{shardId=[test][1], primaryTerm=1, segmentsGen=9, version=24, size=37455, codec=Lucene101, timestamp=1754481815315516421}], timing: 3
getCheckpointMetadata call being served from node_t0 - returning stale checkpoint
[2025-08-06T22:33:35,319][DEBUG][o.o.i.s.IndexShard ] [node_t0] [test][1] computeReplicationCheckpoint skipped due to same checkpoint ReplicationCheckpoint{shardId=[test][1], primaryTerm=1, segmentsGen=4, version=19, size=37455, codec=Lucene101, timestamp=1754481812207325267}
[2025-08-06T22:33:35,319][DEBUG][o.o.i.r.SegmentReplicationSourceService] [node_t0] [replication id 27] Source node sent checkpoint info [ReplicationCheckpoint{shardId=[test][1], primaryTerm=1, segmentsGen=4, version=19, size=37455, codec=Lucene101, timestamp=1754481812207325267}] to target node [LpT0OV1NSA60BjjRYKKKWA], timing: 0
Stale checkpoint received by node_t1
[2025-08-06T22:33:35,325][DEBUG][o.o.i.r.SegmentReplicationTarget] [node_t1] [test][1] Replication diff for checkpoint ReplicationCheckpoint{shardId=[test][1], primaryTerm=1, segmentsGen=4, version=19, size=37455, codec=Lucene101, timestamp=1754481812207325267} [] []
Cluster update being applied on node_t0
[2025-08-06T22:33:35,326][DEBUG][o.o.c.s.ClusterApplierService] [node_t0] processing [Publication{term=1, version=24}]: execute
[2025-08-06T22:33:35,326][DEBUG][o.o.c.s.ClusterApplierService] [node_t0] cluster state updated, version [24], source [Publication{term=1, version=24}]
[2025-08-06T22:33:35,326][DEBUG][o.o.c.s.ClusterApplierService] [node_t0] applying settings from cluster state with version 24
[2025-08-06T22:33:35,326][DEBUG][o.o.c.s.ClusterApplierService] [node_t0] apply cluster state with version 24
Segrep finalizing on node_t1
[2025-08-06T22:33:35,326][INFO ][o.o.i.r.SegmentReplicationTarget] [node_t1] [test][1] finalizeReplication starting
Shard closing on node_t0
[2025-08-06T22:33:35,326][DEBUG][o.o.i.c.IndicesClusterStateService] [node_t0] [test][1] removing shard (not allocated)
[2025-08-06T22:33:35,326][DEBUG][o.o.i.IndexService ] [node_t0] [test] [1] closing... (reason: [removing shard (not allocated)])
[2025-08-06T22:33:35,327][DEBUG][o.o.i.s.IndexShard ] [node_t0] [test][1] state: [STARTED]->[CLOSED], reason [removing shard (not allocated)]
[2025-08-06T22:33:35,327][DEBUG][o.o.i.e.Engine ] [node_t0] [test][1] close now acquiring writeLock
[2025-08-06T22:33:35,327][DEBUG][o.o.i.e.Engine ] [node_t0] [test][1] close acquired writeLock
[2025-08-06T22:33:35,327][DEBUG][o.o.i.t.LocalTranslog ] [node_t0] [test][1] translog closed
Segrep completed with stale checkpoint on node_t1
[2025-08-06T22:33:35,332][DEBUG][o.o.i.r.SegmentReplicationTargetService] [node_t1] [shardId 1] [replication id 27] Replication complete to ReplicationCheckpoint{shardId=[test][1], primaryTerm=1, segmentsGen=4, version=19, size=37455, codec=Lucene101, timestamp=1754481812744020857}, timing data: {INIT=0, OVERALL=14, GET_CHECKPOINT_INFO=1, FINALIZE_REPLICATION=5, FILE_DIFF=5, REPLICATING=0, GET_FILES=0}
I am trying a simple fix locally - https://github.com/ashking94/OpenSearch/commit/50e9421569ac454762e053586d876b603db7a0e7. If this works, I will raise a PR. As of now the test fails within around 50 iterations; I will run it for 500+ iterations to see if the fix works.