
[AUTOCUT] Gradle Check Flaky Test Report for FullRollingRestartIT

opensearch-ci-bot opened this issue 6 months ago

Flaky Test Report for FullRollingRestartIT

Noticed that FullRollingRestartIT has some flaky tests that failed during post-merge actions.

Details

| Git Reference | Merged Pull Request | Build Details | Test Name |
| --- | --- | --- | --- |
| 8a59ca1d417cb6022531125df09e274997c27ef3 | 18277 | 59141 | org.opensearch.recovery.FullRollingRestartIT.classMethod<br>org.opensearch.recovery.FullRollingRestartIT.testDerivedSourceWithConcurrentUpdatesRollingRestart {p0={"cluster.indices.replication.strategy":"DOCUMENT"}} |
| d0c2e39ae05454775b8063e09a88dd5f5834c49f | 17797 | 59086 | org.opensearch.recovery.FullRollingRestartIT.classMethod<br>org.opensearch.recovery.FullRollingRestartIT.testDerivedSourceWithConcurrentUpdatesRollingRestart {p0={"cluster.indices.replication.strategy":"SEGMENT"}} |
| f7ebe22b39f8c2ccae6f22f9b7776995a2c19be1 | 18488 | 59157 | org.opensearch.recovery.FullRollingRestartIT.classMethod<br>org.opensearch.recovery.FullRollingRestartIT.testDerivedSourceWithConcurrentUpdatesRollingRestart {p0={"cluster.indices.replication.strategy":"SEGMENT"}} |
| c92b8ea8742b7ae48e5b169c02a255543c4c7b5d | 18435 | 59128 | org.opensearch.recovery.FullRollingRestartIT.testDerivedSourceWithMixedVersionRollingRestart {p0={"cluster.indices.replication.strategy":"DOCUMENT"}} |

The other pull requests, besides those involved in post-merge actions, that contain failing tests with the FullRollingRestartIT class are:

For more details on the failed tests, refer to the OpenSearch Gradle Check Metrics dashboard.

opensearch-ci-bot avatar Jun 11 '25 06:06 opensearch-ci-bot

A concurrent update on the same doc is causing a sequence number conflict, and while waiting on this the test suite times out.

java.lang.Exception: Test abandoned because suite timeout was reached.

Jun 10, 2025 9:11:07 PM com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler uncaughtException
WARNING: Uncaught exception in thread: Thread[#4762,opensearch[node_t0][generic][T#1],5,TGRP-FullRollingRestartIT]
java.lang.AssertionError: seqNo [100] was processed twice in generation [5], with different data. prvOp [Index{id='76', seqNo=100, primaryTerm=2, version=3, autoGeneratedIdTimestamp=-1}], newOp [Index{id='76', seqNo=100, primaryTerm=2, version=3, autoGeneratedIdTimestamp=-1}]
	at __randomizedtesting.SeedInfo.seed([33B7FED8D60FC88D]:0)
	at org.opensearch.index.translog.TranslogWriter.assertNoSeqNumberConflict(TranslogWriter.java:341)
	at org.opensearch.index.translog.TranslogWriter.add(TranslogWriter.java:289)
	at org.opensearch.index.translog.Translog.add(Translog.java:579)
	at org.opensearch.index.translog.InternalTranslogManager.add(InternalTranslogManager.java:338)
	at org.opensearch.index.engine.NRTReplicationEngine.index(NRTReplicationEngine.java:235)
	at org.opensearch.index.shard.IndexShard.applyIndexOperation(IndexShard.java:1176)
	at org.opensearch.index.shard.IndexShard.applyTranslogOperation(IndexShard.java:2472)
	at org.opensearch.index.shard.IndexShard.applyTranslogOperation(IndexShard.java:2459)
	at org.opensearch.indices.recovery.RecoveryTarget.lambda$indexTranslogOperations$4(RecoveryTarget.java:317)
	at org.opensearch.core.action.ActionListener.completeWith(ActionListener.java:344)
	at org.opensearch.indices.recovery.RecoveryTarget.indexTranslogOperations(RecoveryTarget.java:292)
	at 
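For readers less familiar with the translog internals, here is a minimal, self-contained sketch of the invariant that trips here (this is NOT the real org.opensearch.index.translog.TranslogWriter code; the class and method names below are invented for illustration): within a translog generation, a sequence number may be written again only with identical operation data.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Simplified model of the "no seqNo conflict" check: replaying the same operation
// is fine, but the same seqNo with different data indicates a bug.
final class SeqNoConflictCheckSketch {

    private final Map<Long, String> seenOps = new HashMap<>();

    void add(long seqNo, String opData) {
        String previous = seenOps.putIfAbsent(seqNo, opData);
        if (previous != null && !Objects.equals(previous, opData)) {
            // Same seqNo seen twice in this generation with different data -> fail.
            throw new AssertionError("seqNo [" + seqNo + "] was processed twice with different data. "
                + "prvOp [" + previous + "], newOp [" + opData + "]");
        }
    }

    public static void main(String[] args) {
        SeqNoConflictCheckSketch writer = new SeqNoConflictCheckSketch();
        writer.add(100, "index {id=76, version=3, source=a}");
        writer.add(100, "index {id=76, version=3, source=a}"); // identical replay: allowed
        writer.add(100, "index {id=76, version=3, source=b}"); // same seqNo, different data: AssertionError
    }
}
```

Note that in the failure above prvOp and newOp print identically; the difference is presumably in data that is not part of the toString (such as the document source), which is consistent with a concurrent update racing with translog replay during recovery.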

tanik98 avatar Jun 11 '25 12:06 tanik98

Catch All Triage - 1 2

@tanik98 Is this still a problem?

andrross avatar Jul 28 '25 16:07 andrross

> Catch All Triage - 1 2
>
> @tanik98 Is this still a problem?

It's addressed as part of this PR, though not merged yet: https://github.com/opensearch-project/OpenSearch/pull/18565. But it shouldn't be causing the issue currently, as the previous PR was reverted.

tanik98 avatar Jul 29 '25 04:07 tanik98

[Image: Gradle Check Metrics dashboard screenshot]

@ashking94 The testFullRollingRestart and testFullRollingRestart_withNoRecoveryPayloadAndSource test cases started failing around July 6 with no recorded failures in the previous year. It only happens with segment replication. I am unable to reproduce it locally unfortunately. Every failure I've seen is caused by the test timing out while doing a refreshAndWaitForReplication() call (example failure below). There was a change to segrep from commit 46a0045a2301b3bc51e2da9c5e853f69ac6342c2 on July 3, which is just a few days before the first observed failure. Is there any chance that commit introduced a bug in this rolling restart scenario?

REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests 'org.opensearch.recovery.FullRollingRestartIT' -Dtests.method='testFullRollingRestart_withNoRecoveryPayloadAndSource {p0={"cluster.indices.replication.strategy":"SEGMENT"}}' -Dtests.seed=961918EE833A9E37 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=fr-NC -Dtests.timezone=CTT -Druntime.java=24

FullRollingRestartIT > testFullRollingRestart_withNoRecoveryPayloadAndSource {p0={"cluster.indices.replication.strategy":"SEGMENT"}} FAILED
    java.lang.AssertionError: replica shards haven't caught up with primary expected:<20> but was:<17>
        at __randomizedtesting.SeedInfo.seed([961918EE833A9E37:A55072AE2A77FF6F]:0)
        at org.junit.Assert.fail(Assert.java:89)
        at org.junit.Assert.failNotEquals(Assert.java:835)
        at org.junit.Assert.assertEquals(Assert.java:647)
        at org.opensearch.test.OpenSearchIntegTestCase.lambda$waitForReplication$1(OpenSearchIntegTestCase.java:2550)
        at org.opensearch.test.OpenSearchTestCase.assertBusy(OpenSearchTestCase.java:1181)
        at org.opensearch.test.OpenSearchIntegTestCase.waitForReplication(OpenSearchIntegTestCase.java:2539)
        at org.opensearch.test.OpenSearchIntegTestCase.refreshAndWaitForReplication(OpenSearchIntegTestCase.java:2512)
        at org.opensearch.recovery.FullRollingRestartIT.testFullRollingRestart_withNoRecoveryPayloadAndSource(FullRollingRestartIT.java:360)

https://build.ci.opensearch.org/job/gradle-check/61468/consoleText
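For context on what the timeout is waiting for: judging from the assertion message, waitForReplication() polls (via assertBusy) until each replica reports the same replication checkpoint/segment version as its primary, and fails once the timeout elapses. Below is a minimal sketch of that polling pattern; ShardView and its accessors are hypothetical stand-ins, not the real OpenSearchIntegTestCase API.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the polling behind refreshAndWaitForReplication():
// keep re-checking until the replica has applied the same version as the primary,
// or fail with the "haven't caught up" assertion once the deadline passes.
final class WaitForReplicationSketch {

    interface ShardView {
        long primaryVersion();   // version the primary has published
        long replicaVersion();   // version the replica has applied
    }

    static void waitForReplication(ShardView shard, long timeoutMillis) throws InterruptedException {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (true) {
            long expected = shard.primaryVersion();
            long actual = shard.replicaVersion();
            if (expected == actual) {
                return; // replica has caught up with the primary
            }
            if (System.nanoTime() > deadline) {
                throw new AssertionError(
                    "replica shards haven't caught up with primary expected:<" + expected + "> but was:<" + actual + ">");
            }
            Thread.sleep(100); // back off before polling again, roughly what assertBusy does
        }
    }
}
```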

andrross avatar Aug 01 '25 23:08 andrross

[Image: Gradle Check Metrics dashboard screenshot, expanded to all builds]

If I expand the dashboard to include all builds (not just timer and post-merge builds), then the first failure occurs on July 3 with no other failures in the previous year. This is more evidence that the commit from #18636 might be the cause.

andrross avatar Aug 01 '25 23:08 andrross

I was able to see a failure locally after running the test in an infinite loop.

Assertion failure:

java.lang.AssertionError: replica shards haven't caught up with primary for shardId=[test][7] replicaNode=FoWC4kZbRyGvLVEZqS-8xg expected:<27> but was:<24>

Logs:

[2025-08-06T07:16:32,859][DEBUG][o.o.i.r.SegmentReplicationTargetService] [node_t4] Replica received new replication checkpoint from primary [ReplicationCheckpoint{shardId=[test][7], primaryTerm=1, segmentsGen=13, version=27, size=31725, codec=Lucene101, timestamp=1754464592831887761}]
[2025-08-06T07:16:32,859][DEBUG][o.o.c.s.ClusterApplierService] [node_t4] processing [ApplyCommitRequest{term=1, version=38, sourceNode={node_t0}{I2qwpYqWRBylvYDCy6Gpng}{mwNVElZmR8Oc0WSPFN2tfQ}{127.0.0.1}{127.0.0.1:37723}{dimr}}]: took [0s] done applying updated cluster state (version: 38, uuid: qFmNmYXXR8y5td0Oo4ELtQ)
[2025-08-06T07:16:32,859][DEBUG][o.o.i.r.SegmentReplicationTargetService] [node_t4] Ignoring new replication checkpoint - shard is currently replicating to checkpoint ReplicationCheckpoint{shardId=[test][7], primaryTerm=1, segmentsGen=13, version=27, size=31725, codec=Lucene101, timestamp=1754464592831887761}
[2025-08-06T07:16:32,861][DEBUG][o.o.i.r.SegmentReplicator] [node_t4] Finished replicating Id:[46] Checkpoint [ReplicationCheckpoint{shardId=[test][7], primaryTerm=1, segmentsGen=13, version=27, size=31725, codec=Lucene101, timestamp=1754464592831887761}] Shard:[[test][7]] Source:[node_t1] marking as done.
[2025-08-06T07:16:32,861][DEBUG][o.o.i.r.SegmentReplicationTargetService] [node_t4] [shardId 7] [replication id 46] Replication complete to ReplicationCheckpoint{shardId=[test][7], primaryTerm=1, segmentsGen=9, version=24, size=31725, codec=Lucene101, timestamp=1754464592236686785}, timing data: {INIT=0, OVERALL=27, GET_CHECKPOINT_INFO=1, FINALIZE_REPLICATION=16, FILE_DIFF=6, REPLICATING=0, GET_FILES=2}

Refresh triggered:

[2025-08-06T07:16:32,969][INFO ][o.o.r.FullRollingRestartIT] [testFullRollingRestart] --> refreshing and checking data

ashking94 avatar Aug 06 '25 07:08 ashking94

Based on the analysis so far, it appears to happen during primary relocation: after the relocation the new primary's SegmentInfos version and generation bump up and a segrep round is triggered, but the replica that receives the new checkpoint (and has not yet learned about the primary change through cluster state) still talks to the older primary, receives a stale checkpoint in the getCheckpointInfo transport call, and uses that for processing the segrep event. The PR https://github.com/opensearch-project/OpenSearch/pull/18636 fixed an infinite-loop issue that had been keeping this problem hidden. I am trying to see if there is a simple way to fix this issue or the test behaviour.
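To make the failure concrete: in the trace above, the replica kicked off a round for checkpoint {segmentsGen=13, version=27} but marked it complete at {segmentsGen=9, version=24}, because the checkpoint metadata was served by the old primary. Below is a rough sketch of the kind of staleness guard this suggests; the names are hypothetical and this is not the code from the PRs referenced here: compare the checkpoint returned by the source with the checkpoint that triggered the round, and reject anything older so a later round can retry against the correct primary.

```java
// Hypothetical ordering over checkpoints: newer primary term, then newer segments
// generation, then newer SegmentInfos version.
record CheckpointSketch(long primaryTerm, long segmentsGen, long version) {

    boolean isAheadOf(CheckpointSketch other) {
        if (primaryTerm != other.primaryTerm) return primaryTerm > other.primaryTerm;
        if (segmentsGen != other.segmentsGen) return segmentsGen > other.segmentsGen;
        return version > other.version;
    }
}

final class StaleCheckpointGuardSketch {

    /** Proceed only if the fetched checkpoint is at least as new as the one that triggered the round. */
    static boolean shouldProceed(CheckpointSketch triggering, CheckpointSketch fetchedFromSource) {
        return !triggering.isAheadOf(fetchedFromSource);
    }

    public static void main(String[] args) {
        CheckpointSketch published = new CheckpointSketch(1, 13, 27); // checkpoint that started the round
        CheckpointSketch stale = new CheckpointSketch(1, 9, 24);      // what the demoted primary answered with
        System.out.println(shouldProceed(published, stale));          // false -> abort and retry later
    }
}
```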

ashking94 avatar Aug 06 '25 12:08 ashking94

Adding logs that clearly explain the above issue (in chronological order):

node_t0: old primary (relocation source)
node_t4: new primary (relocation target)
node_t1: replica node

Primary relocation started

[2025-08-06T22:33:34,915][DEBUG][o.o.i.r.PeerRecoveryTargetService] [node_t4] [test][1] starting recovery from {node_t0}{P5ntM5WNQimjWm3ure1HWQ}{bogfLcGES5uYpcHB_3CYMg}{127.0.0.1}{127.0.0.1:43987}{dimr}{shard_indexing_pressure_enabled=true}

Primary relocation completed and new primary started

[2025-08-06T22:33:35,218][DEBUG][o.o.i.s.IndexShard ] [node_t4] [test][1] state: [RECOVERING]->[POST_RECOVERY], reason [peer recovery done]
[2025-08-06T22:33:35,218][DEBUG][o.o.c.a.s.ShardStateAction] [node_t4] sending [internal:cluster/shard/started] to [P5ntM5WNQimjWm3ure1HWQ] for shard entry [StartedShardEntry{shardId [[test][1]], allocationId [UEwh7OYtRXWlVFHb2kKLdA], primary term [1], message [after peer recovery]}]
[2025-08-06T22:33:35,218][DEBUG][o.o.t.TransportService ] [node_t4] Action: internal:cluster/shard/started
[2025-08-06T22:33:35,218][DEBUG][o.o.i.r.PeerRecoveryTargetService] [node_t4] [test][1] recovery done from [{node_t0}{P5ntM5WNQimjWm3ure1HWQ}{bogfLcGES5uYpcHB_3CYMg}{127.0.0.1}{127.0.0.1:43987}{dimr}{shard_indexing_pressure_enabled=true}], took [365ms]
[2025-08-06T22:33:35,218][DEBUG][o.o.c.a.s.ShardStateAction] [node_t0] [test][1] received shard started for [StartedShardEntry{shardId [[test][1]], allocationId [UEwh7OYtRXWlVFHb2kKLdA], primary term [1], message [after peer recovery]}]

Cluster manager published

[2025-08-06T22:33:35,308][DEBUG][o.o.t.TransportService ] [node_t0] Action: internal:cluster/coordination/commit_state

New Checkpoint updated on node_t4

[2025-08-06T22:33:35,315][DEBUG][o.o.i.s.IndexShard ] [node_t4] [test][1] Recomputed ReplicationCheckpoint for shard ReplicationCheckpoint{shardId=[test][1], primaryTerm=1, segmentsGen=9, version=24, size=37455, codec=Lucene101, timestamp=1754481815315516421}
[2025-08-06T22:33:35,315][INFO ][o.o.i.s.CheckpointRefreshListener] [node_t4] Checkpoint ReplicationCheckpoint{shardId=[test][1], primaryTerm=1, segmentsGen=9, version=24, size=37455, codec=Lucene101, timestamp=1754481815315516421} has been refreshed

New Checkpoint publishing from node_t4

[2025-08-06T22:33:35,315][DEBUG][o.o.t.TransportService ] [node_t4] Action: indices:admin/publishCheckpoint[p]
[2025-08-06T22:33:35,316][DEBUG][o.o.i.r.c.PublishCheckpointAction] [node_t4] shardOperationOnPrimary [ReplicationCheckpoint{shardId=[test][1], primaryTerm=1, segmentsGen=9, version=24, size=37455, codec=Lucene101, timestamp=1754481815315516421}]

New Checkpoint received on replica node_t1

[2025-08-06T22:33:35,316][DEBUG][o.o.t.TransportService ] [node_t4] Action: indices:admin/publishCheckpoint[r]
[2025-08-06T22:33:35,317][DEBUG][o.o.i.r.c.PublishCheckpointAction] [node_t1] doReplicaOperation from primary [ReplicationCheckpoint{shardId=[test][1], primaryTerm=1, segmentsGen=9, version=24, size=37455, codec=Lucene101, timestamp=1754481815315516421}]
[2025-08-06T22:33:35,317][DEBUG][o.o.i.r.SegmentReplicationTargetService] [node_t1] Replica received new replication checkpoint from primary [ReplicationCheckpoint{shardId=[test][1], primaryTerm=1, segmentsGen=9, version=24, size=37455, codec=Lucene101, timestamp=1754481815315516421}]
[2025-08-06T22:33:35,317][DEBUG][o.o.i.r.SegmentReplicator] [node_t1] Added new replication to collection Id:[27] Checkpoint [ReplicationCheckpoint{shardId=[test][1], primaryTerm=1, segmentsGen=9, version=24, size=37455, codec=Lucene101, timestamp=1754481815315516421}] Shard:[[test][1]] Source:[node_t0]

New Checkpoint published from node_t4

[2025-08-06T22:33:35,318][DEBUG][o.o.i.r.c.PublishCheckpointAction] [node_t4] [shardId 1] Completed publishing checkpoint [ReplicationCheckpoint{shardId=[test][1], primaryTerm=1, segmentsGen=9, version=24, size=37455, codec=Lucene101, timestamp=1754481815315516421}], timing: 3

getCheckpointMetadata call being served from node_t0 - returning stale checkpoint

[2025-08-06T22:33:35,319][DEBUG][o.o.i.s.IndexShard ] [node_t0] [test][1] computeReplicationCheckpoint skipped due to same checkpoint ReplicationCheckpoint{shardId=[test][1], primaryTerm=1, segmentsGen=4, version=19, size=37455, codec=Lucene101, timestamp=1754481812207325267}
[2025-08-06T22:33:35,319][DEBUG][o.o.i.r.SegmentReplicationSourceService] [node_t0] [replication id 27] Source node sent checkpoint info [ReplicationCheckpoint{shardId=[test][1], primaryTerm=1, segmentsGen=4, version=19, size=37455, codec=Lucene101, timestamp=1754481812207325267}] to target node [LpT0OV1NSA60BjjRYKKKWA], timing: 0

Stale checkpoint received by node_t1

[2025-08-06T22:33:35,325][DEBUG][o.o.i.r.SegmentReplicationTarget] [node_t1] [test][1] Replication diff for checkpoint ReplicationCheckpoint{shardId=[test][1], primaryTerm=1, segmentsGen=4, version=19, size=37455, codec=Lucene101, timestamp=1754481812207325267} [] []

Cluster update being applied on node_t0

[2025-08-06T22:33:35,326][DEBUG][o.o.c.s.ClusterApplierService] [node_t0] processing [Publication{term=1, version=24}]: execute
[2025-08-06T22:33:35,326][DEBUG][o.o.c.s.ClusterApplierService] [node_t0] cluster state updated, version [24], source [Publication{term=1, version=24}]
[2025-08-06T22:33:35,326][DEBUG][o.o.c.s.ClusterApplierService] [node_t0] applying settings from cluster state with version 24
[2025-08-06T22:33:35,326][DEBUG][o.o.c.s.ClusterApplierService] [node_t0] apply cluster state with version 24

Segrep finalizing on node_t1

[2025-08-06T22:33:35,326][INFO ][o.o.i.r.SegmentReplicationTarget] [node_t1] [test][1] finalizeReplication starting

Shard closing on node_t0

[2025-08-06T22:33:35,326][DEBUG][o.o.i.c.IndicesClusterStateService] [node_t0] [test][1] removing shard (not allocated)
[2025-08-06T22:33:35,326][DEBUG][o.o.i.IndexService ] [node_t0] [test] [1] closing... (reason: [removing shard (not allocated)])
[2025-08-06T22:33:35,327][DEBUG][o.o.i.s.IndexShard ] [node_t0] [test][1] state: [STARTED]->[CLOSED], reason [removing shard (not allocated)]
[2025-08-06T22:33:35,327][DEBUG][o.o.i.e.Engine ] [node_t0] [test][1] close now acquiring writeLock
[2025-08-06T22:33:35,327][DEBUG][o.o.i.e.Engine ] [node_t0] [test][1] close acquired writeLock
[2025-08-06T22:33:35,327][DEBUG][o.o.i.t.LocalTranslog ] [node_t0] [test][1] translog closed

Segrep completed on replica node_t1 with the stale checkpoint (from node_t0)

[2025-08-06T22:33:35,332][DEBUG][o.o.i.r.SegmentReplicationTargetService] [node_t1] [shardId 1] [replication id 27] Replication complete to ReplicationCheckpoint{shardId=[test][1], primaryTerm=1, segmentsGen=4, version=19, size=37455, codec=Lucene101, timestamp=1754481812744020857}, timing data: {INIT=0, OVERALL=14, GET_CHECKPOINT_INFO=1, FINALIZE_REPLICATION=5, FILE_DIFF=5, REPLICATING=0, GET_FILES=0}

ashking94 avatar Aug 06 '25 13:08 ashking94

I am trying a simple fix locally - https://github.com/ashking94/OpenSearch/commit/50e9421569ac454762e053586d876b603db7a0e7. If it works, I will raise a PR. As of now the test fails within around 50 iterations; I will run it for 500+ iterations to check that the fix holds.

ashking94 avatar Aug 06 '25 13:08 ashking94