OpenSearch icon indicating copy to clipboard operation
OpenSearch copied to clipboard

Fix flaky SegRep test testScrollCreatedOnReplica

Open mch2 opened this issue 1 year ago • 13 comments

Description

This change fixes flakiness with segrep test for scroll requests. The flakiness was with an assertion that all segments snapshotted by the scroll were deleted from disk after the scroll was cleared. The failures were caused because the snapshotted segments were still referenced by the latest on-disk commit point that is always preserved.

The test was doing a lot of crazy random merges/deletes etc that made it difficult to follow why those files were sneaking in as part of the latest commit. To make this easier to reason about I've simplified the test flow to:

  1. Index 1 doc
  2. Create the scroll
  3. Index a few more docs
  4. Refresh to create segments
  5. Copy out segments
  6. force merge to 1 segment
  7. Copy out post force merge
  8. Assert segments still exist on-disk
  9. Clear scroll & assert segments are cleared

I've run this ~6k times without failure.

This PR also contains a bug fix to ReplicaFileDeleter to take a Consumer<String> instead of a BiConsumer<String, String>. We use Store::deleteQuiet to purge files from disk which accepts a varargs of file names and does not have a reason parameter. This ensures we aren't passing the literal "delete unreferenced" as a file to be deleted. This would get passed to deleteQuiet and a NoSuchFileException would get swallowed.

Related Issues

Resolves #10769

Check List

  • [x] New functionality includes testing.
    • [x] All tests pass
  • [x] New functionality has been documented.
    • [x] New functionality has javadoc added
  • [x] Failing checks are inspected and point to the corresponding known issue(s) (See: Troubleshooting Failing Builds)
  • [x] Commits are signed per the DCO using --signoff
  • [x] Commit changes are listed out in CHANGELOG.md file (See: Changelog)
  • [x] Public documentation issue/PR created

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following Developer Certificate of Origin and signing off your commits, please check here.

mch2 avatar Jan 29 '24 23:01 mch2

Compatibility status:

Checks if related components are compatible with change f9184b9

Incompatible components

Incompatible components: [https://github.com/opensearch-project/performance-analyzer.git, https://github.com/opensearch-project/performance-analyzer-rca.git, https://github.com/opensearch-project/cross-cluster-replication.git]

Skipped components

Compatible components

Compatible components: [https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/custom-codecs.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/sql.git]

github-actions[bot] avatar Jan 30 '24 00:01 github-actions[bot]

Grr, hit a failure here locally after about ~1k more iterations for the remote store version of this test where segments are still part of the latest local commit on the replica. I see this error in logs but i'm not sure its related... debugging:

[2024-01-30T01:03:06,153][ERROR][o.o.i.s.RemoteStoreRefreshListener] [node_t2] [test-idx-1][0] Exception in RemoteStoreRefreshListener.afterRefresh()
java.lang.AssertionError: already started
	at org.opensearch.indices.replication.common.ReplicationTimer.start(ReplicationTimer.java:51) ~[classes/:?]
	at org.opensearch.index.seqno.ReplicationTracker.lambda$startReplicationLagTimers$22(ReplicationTracker.java:1285) ~[classes/:?]
	at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183) ~[?:?]
	at java.base/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:179) ~[?:?]
	at java.base/java.util.HashMap$EntrySpliterator.forEachRemaining(HashMap.java:1850) ~[?:?]
	at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509) ~[?:?]
	at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499) ~[?:?]
	at java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150) ~[?:?]
	at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173) ~[?:?]
	at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:?]
	at java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:596) ~[?:?]
	at org.opensearch.index.seqno.ReplicationTracker.startReplicationLagTimers(ReplicationTracker.java:1277) ~[classes/:?]
	at org.opensearch.index.shard.IndexShard.onCheckpointPublished(IndexShard.java:1936) ~[classes/:?]
	at org.opensearch.indices.replication.checkpoint.SegmentReplicationCheckpointPublisher.publish(SegmentReplicationCheckpointPublisher.java:39) ~[classes/:?]
	at org.opensearch.index.shard.RemoteStoreRefreshListener.onSuccessfulSegmentsSync(RemoteStoreRefreshListener.java:309) ~[classes/:?]
	at org.opensearch.index.shard.RemoteStoreRefreshListener$1.onResponse(RemoteStoreRefreshListener.java:229) ~[classes/:?]
	at org.opensearch.index.shard.RemoteStoreRefreshListener$1.onResponse(RemoteStoreRefreshListener.java:220) ~[classes/:?]
	at org.opensearch.action.LatchedActionListener.onResponse(LatchedActionListener.java:58) ~[classes/:?]
	at org.opensearch.index.shard.RemoteStoreRefreshListener.uploadNewSegments(RemoteStoreRefreshListener.java:378) ~[classes/:?]
	at org.opensearch.index.shard.RemoteStoreRefreshListener.syncSegments(RemoteStoreRefreshListener.java:254) [classes/:?]
	at org.opensearch.index.shard.RemoteStoreRefreshListener.performAfterRefreshWithPermit(RemoteStoreRefreshListener.java:152) [classes/:?]
	at org.opensearch.index.shard.ReleasableRetryableRefreshListener.runAfterRefreshWithPermit(ReleasableRetryableRefreshListener.java:160) [classes/:?]
	at org.opensearch.index.shard.ReleasableRetryableRefreshListener.afterRefresh(ReleasableRetryableRefreshListener.java:66) [classes/:?]
	at org.apache.lucene.search.ReferenceManager.notifyRefreshListenersRefreshed(ReferenceManager.java:275) [lucene-core-9.9.2.jar:9.9.2 a2939784c4ca60bc28bf488b5479c02fc2e5e22c - 2024-01-25 09:51:09]
	at org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:182) [lucene-core-9.9.2.jar:9.9.2 a2939784c4ca60bc28bf488b5479c02fc2e5e22c - 2024-01-25 09:51:09]
	at org.apache.lucene.search.ReferenceManager.maybeRefreshBlocking(ReferenceManager.java:240) [lucene-core-9.9.2.jar:9.9.2 a2939784c4ca60bc28bf488b5479c02fc2e5e22c - 2024-01-25 09:51:09]
	at org.opensearch.index.engine.InternalEngine.refresh(InternalEngine.java:1771) [classes/:?]
	at org.opensearch.index.engine.InternalEngine.flush(InternalEngine.java:1886) [classes/:?]
	at org.opensearch.index.engine.InternalEngine.forceMerge(InternalEngine.java:2025) [classes/:?]
	at org.opensearch.index.shard.IndexShard.forceMerge(IndexShard.java:1521) [classes/:?]
	at org.opensearch.action.admin.indices.forcemerge.TransportForceMergeAction.shardOperation(TransportForceMergeAction.java:113) [classes/:?]
	at org.opensearch.action.admin.indices.forcemerge.TransportForceMergeAction.shardOperation(TransportForceMergeAction.java:60) [classes/:?]
	at org.opensearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.onShardOperation(TransportBroadcastByNodeAction.java:495) [classes/:?]
	at org.opensearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:469) [classes/:?]
	at org.opensearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:456) [classes/:?]
	at org.opensearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:106) [classes/:?]
	at org.opensearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:480) [classes/:?]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:913) [classes/:?]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [classes/:?]
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
	at java.base/java.lang.Thread.run(Thread.java:840) [?:?]

mch2 avatar Jan 30 '24 00:01 mch2

:white_check_mark: Gradle check result for a12eee11fa30f225cfc46d078fef145e20ed557a: SUCCESS

github-actions[bot] avatar Jan 30 '24 00:01 github-actions[bot]

Don't think that trace is the cause here. The issue is the replica is not making a new local commit after receiving the new segments. I think this may be that we need a force flush. Going to leave this running overnight to confirm.

mch2 avatar Jan 30 '24 00:01 mch2

Ok the issue here is the refresh upon force merge is refreshing on the newly created merged into segment but the commit has not yet been made.

With some added logs in InternalEngine ---

[2024-01-30T16:16:19,237][INFO ][o.o.i.e.Engine           ] [node_t2] [test-idx-1][0] After refresh Segments gen: 4 Segments version: 23 Files: [_5.kdm, _5.fdt, _5_Lucene90_0.dvm, _5_Lucene99_0.tmd, _5.kdi, _5.fdm, _5.kdd, _5.fnm, _5_Lucene90_0.dvd, _5_Lucene99_0.tip, _5_Lucene99_0.tim, _5_Lucene99_0.doc, _5.si, _5.fdx]

Segment 5 is our newly force merge'd down to 1 segment, but commit gen is still 4. A competing refresh is running after the new segments are created but the flush has not finished. The replica will see commit gen is not yet updated so it has no reason to commit locally.

The fix here is to disable scheduled refresh. We can also repro this more consistently by reducing refresh interval to 1ms.

mch2 avatar Jan 30 '24 20:01 mch2

:x: Gradle check result for d5002a2dee02cc9a404bce5f795a9654dc1206f4:

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Jan 30 '24 22:01 github-actions[bot]

❌ Gradle check result for d5002a2:

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

https://github.com/opensearch-project/OpenSearch/issues/9191 https://github.com/opensearch-project/OpenSearch/issues/10755

mch2 avatar Jan 30 '24 23:01 mch2

:x: Gradle check result for 8200c983ad153c5fdd8267faf17e269330bc6fbc: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Jan 31 '24 00:01 github-actions[bot]

:grey_exclamation: Gradle check result for d5002a2dee02cc9a404bce5f795a9654dc1206f4: UNSTABLE

  • TEST FAILURES:
      1 org.opensearch.search.SearchWeightedRoutingIT.testMultiGetWithNetworkDisruption_FailOpenEnabled
      1 org.opensearch.remotestore.RemoteIndexPrimaryRelocationIT.testPrimaryRelocationWhileIndexing

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

github-actions[bot] avatar Jan 31 '24 00:01 github-actions[bot]

❕ Gradle check result for d5002a2: UNSTABLE

  • TEST FAILURES:
      1 org.opensearch.search.SearchWeightedRoutingIT.testMultiGetWithNetworkDisruption_FailOpenEnabled
      1 org.opensearch.remotestore.RemoteIndexPrimaryRelocationIT.testPrimaryRelocationWhileIndexing

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

https://github.com/opensearch-project/OpenSearch/issues/10755 https://github.com/opensearch-project/OpenSearch/issues/9191

mch2 avatar Jan 31 '24 17:01 mch2

:x: Gradle check result for f9184b94c86748c17cd6c8814a882b729568ee15: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Jan 31 '24 18:01 github-actions[bot]

❌ Gradle check result for f9184b9: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

This one is a bit odd - I don't think this is related to this PR but I don't think its test specific flakiness - cut https://github.com/opensearch-project/OpenSearch/issues/12114 with more

mch2 avatar Jan 31 '24 19:01 mch2

:white_check_mark: Gradle check result for f9184b94c86748c17cd6c8814a882b729568ee15: SUCCESS

github-actions[bot] avatar Jan 31 '24 20:01 github-actions[bot]