OpenSearch
OpenSearch copied to clipboard
Fix flaky SegRep test testScrollCreatedOnReplica
Description
This change fixes flakiness with segrep test for scroll requests. The flakiness was with an assertion that all segments snapshotted by the scroll were deleted from disk after the scroll was cleared. The failures were caused because the snapshotted segments were still referenced by the latest on-disk commit point that is always preserved.
The test was doing a lot of crazy random merges/deletes etc that made it difficult to follow why those files were sneaking in as part of the latest commit. To make this easier to reason about I've simplified the test flow to:
- Index 1 doc
- Create the scroll
- Index a few more docs
- Refresh to create segments
- Copy out segments
- force merge to 1 segment
- Copy out post force merge
- Assert segments still exist on-disk
- Clear scroll & assert segments are cleared
I've run this ~6k times without failure.
This PR also contains a bug fix to ReplicaFileDeleter to take a Consumer<String> instead of a BiConsumer<String, String>. We use Store::deleteQuiet to purge files from disk which accepts a varargs of file names and does not have a reason parameter. This ensures we aren't passing the literal "delete unreferenced" as a file to be deleted. This would get passed to deleteQuiet and a NoSuchFileException would get swallowed.
Related Issues
Resolves #10769
Check List
- [x] New functionality includes testing.
- [x] All tests pass
- [x] New functionality has been documented.
- [x] New functionality has javadoc added
- [x] Failing checks are inspected and point to the corresponding known issue(s) (See: Troubleshooting Failing Builds)
- [x] Commits are signed per the DCO using --signoff
- [x] Commit changes are listed out in CHANGELOG.md file (See: Changelog)
- [x] Public documentation issue/PR created
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following Developer Certificate of Origin and signing off your commits, please check here.
Compatibility status:
Checks if related components are compatible with change f9184b9
Incompatible components
Incompatible components: [https://github.com/opensearch-project/performance-analyzer.git, https://github.com/opensearch-project/performance-analyzer-rca.git, https://github.com/opensearch-project/cross-cluster-replication.git]
Skipped components
Compatible components
Compatible components: [https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/custom-codecs.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/sql.git]
Grr, hit a failure here locally after about ~1k more iterations for the remote store version of this test where segments are still part of the latest local commit on the replica. I see this error in logs but i'm not sure its related... debugging:
[2024-01-30T01:03:06,153][ERROR][o.o.i.s.RemoteStoreRefreshListener] [node_t2] [test-idx-1][0] Exception in RemoteStoreRefreshListener.afterRefresh()
java.lang.AssertionError: already started
at org.opensearch.indices.replication.common.ReplicationTimer.start(ReplicationTimer.java:51) ~[classes/:?]
at org.opensearch.index.seqno.ReplicationTracker.lambda$startReplicationLagTimers$22(ReplicationTracker.java:1285) ~[classes/:?]
at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183) ~[?:?]
at java.base/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:179) ~[?:?]
at java.base/java.util.HashMap$EntrySpliterator.forEachRemaining(HashMap.java:1850) ~[?:?]
at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509) ~[?:?]
at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499) ~[?:?]
at java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150) ~[?:?]
at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173) ~[?:?]
at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:?]
at java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:596) ~[?:?]
at org.opensearch.index.seqno.ReplicationTracker.startReplicationLagTimers(ReplicationTracker.java:1277) ~[classes/:?]
at org.opensearch.index.shard.IndexShard.onCheckpointPublished(IndexShard.java:1936) ~[classes/:?]
at org.opensearch.indices.replication.checkpoint.SegmentReplicationCheckpointPublisher.publish(SegmentReplicationCheckpointPublisher.java:39) ~[classes/:?]
at org.opensearch.index.shard.RemoteStoreRefreshListener.onSuccessfulSegmentsSync(RemoteStoreRefreshListener.java:309) ~[classes/:?]
at org.opensearch.index.shard.RemoteStoreRefreshListener$1.onResponse(RemoteStoreRefreshListener.java:229) ~[classes/:?]
at org.opensearch.index.shard.RemoteStoreRefreshListener$1.onResponse(RemoteStoreRefreshListener.java:220) ~[classes/:?]
at org.opensearch.action.LatchedActionListener.onResponse(LatchedActionListener.java:58) ~[classes/:?]
at org.opensearch.index.shard.RemoteStoreRefreshListener.uploadNewSegments(RemoteStoreRefreshListener.java:378) ~[classes/:?]
at org.opensearch.index.shard.RemoteStoreRefreshListener.syncSegments(RemoteStoreRefreshListener.java:254) [classes/:?]
at org.opensearch.index.shard.RemoteStoreRefreshListener.performAfterRefreshWithPermit(RemoteStoreRefreshListener.java:152) [classes/:?]
at org.opensearch.index.shard.ReleasableRetryableRefreshListener.runAfterRefreshWithPermit(ReleasableRetryableRefreshListener.java:160) [classes/:?]
at org.opensearch.index.shard.ReleasableRetryableRefreshListener.afterRefresh(ReleasableRetryableRefreshListener.java:66) [classes/:?]
at org.apache.lucene.search.ReferenceManager.notifyRefreshListenersRefreshed(ReferenceManager.java:275) [lucene-core-9.9.2.jar:9.9.2 a2939784c4ca60bc28bf488b5479c02fc2e5e22c - 2024-01-25 09:51:09]
at org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:182) [lucene-core-9.9.2.jar:9.9.2 a2939784c4ca60bc28bf488b5479c02fc2e5e22c - 2024-01-25 09:51:09]
at org.apache.lucene.search.ReferenceManager.maybeRefreshBlocking(ReferenceManager.java:240) [lucene-core-9.9.2.jar:9.9.2 a2939784c4ca60bc28bf488b5479c02fc2e5e22c - 2024-01-25 09:51:09]
at org.opensearch.index.engine.InternalEngine.refresh(InternalEngine.java:1771) [classes/:?]
at org.opensearch.index.engine.InternalEngine.flush(InternalEngine.java:1886) [classes/:?]
at org.opensearch.index.engine.InternalEngine.forceMerge(InternalEngine.java:2025) [classes/:?]
at org.opensearch.index.shard.IndexShard.forceMerge(IndexShard.java:1521) [classes/:?]
at org.opensearch.action.admin.indices.forcemerge.TransportForceMergeAction.shardOperation(TransportForceMergeAction.java:113) [classes/:?]
at org.opensearch.action.admin.indices.forcemerge.TransportForceMergeAction.shardOperation(TransportForceMergeAction.java:60) [classes/:?]
at org.opensearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.onShardOperation(TransportBroadcastByNodeAction.java:495) [classes/:?]
at org.opensearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:469) [classes/:?]
at org.opensearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:456) [classes/:?]
at org.opensearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:106) [classes/:?]
at org.opensearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:480) [classes/:?]
at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:913) [classes/:?]
at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [classes/:?]
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
at java.base/java.lang.Thread.run(Thread.java:840) [?:?]
:white_check_mark: Gradle check result for a12eee11fa30f225cfc46d078fef145e20ed557a: SUCCESS
Don't think that trace is the cause here. The issue is the replica is not making a new local commit after receiving the new segments. I think this may be that we need a force flush. Going to leave this running overnight to confirm.
Ok the issue here is the refresh upon force merge is refreshing on the newly created merged into segment but the commit has not yet been made.
With some added logs in InternalEngine ---
[2024-01-30T16:16:19,237][INFO ][o.o.i.e.Engine ] [node_t2] [test-idx-1][0] After refresh Segments gen: 4 Segments version: 23 Files: [_5.kdm, _5.fdt, _5_Lucene90_0.dvm, _5_Lucene99_0.tmd, _5.kdi, _5.fdm, _5.kdd, _5.fnm, _5_Lucene90_0.dvd, _5_Lucene99_0.tip, _5_Lucene99_0.tim, _5_Lucene99_0.doc, _5.si, _5.fdx]
Segment 5 is our newly force merge'd down to 1 segment, but commit gen is still 4. A competing refresh is running after the new segments are created but the flush has not finished. The replica will see commit gen is not yet updated so it has no reason to commit locally.
The fix here is to disable scheduled refresh. We can also repro this more consistently by reducing refresh interval to 1ms.
:x: Gradle check result for d5002a2dee02cc9a404bce5f795a9654dc1206f4:
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
❌ Gradle check result for d5002a2:
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
https://github.com/opensearch-project/OpenSearch/issues/9191 https://github.com/opensearch-project/OpenSearch/issues/10755
:x: Gradle check result for 8200c983ad153c5fdd8267faf17e269330bc6fbc: FAILURE
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
:grey_exclamation: Gradle check result for d5002a2dee02cc9a404bce5f795a9654dc1206f4: UNSTABLE
- TEST FAILURES:
1 org.opensearch.search.SearchWeightedRoutingIT.testMultiGetWithNetworkDisruption_FailOpenEnabled
1 org.opensearch.remotestore.RemoteIndexPrimaryRelocationIT.testPrimaryRelocationWhileIndexing
Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.
❕ Gradle check result for d5002a2: UNSTABLE
- TEST FAILURES:
1 org.opensearch.search.SearchWeightedRoutingIT.testMultiGetWithNetworkDisruption_FailOpenEnabled 1 org.opensearch.remotestore.RemoteIndexPrimaryRelocationIT.testPrimaryRelocationWhileIndexingPlease review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.
https://github.com/opensearch-project/OpenSearch/issues/10755 https://github.com/opensearch-project/OpenSearch/issues/9191
:x: Gradle check result for f9184b94c86748c17cd6c8814a882b729568ee15: FAILURE
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
❌ Gradle check result for f9184b9: FAILURE
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
This one is a bit odd - I don't think this is related to this PR but I don't think its test specific flakiness - cut https://github.com/opensearch-project/OpenSearch/issues/12114 with more
:white_check_mark: Gradle check result for f9184b94c86748c17cd6c8814a882b729568ee15: SUCCESS