[Segment Replication] [BUG] Primary hits NoSuchFileException computing metadata snapshot.
Describe the bug
While running some perf tests for Segment Replication I hit an exception while computing a metadata snapshot off of the primary's latest segment infos. I think this is caused by the primary missing a segments_N file on disk that is still referenced by the latest SegmentInfos returned from getLatestSegmentInfos. The exception is hit with the primary in the following state:
Segment Infos files: [_1.cfs, _0.cfe, _0.si, _1.cfe, _1.si, _2.si, _0.cfs, _2.cfe, _2.cfs, segments_2]
On disk files: [_0.cfe, _0.cfs, _0.si, _1.cfe, _1.cfs, _1.si, _2.cfe, _2.cfs, _2.si, segments_3, write.lock]
This means the primary needs to refresh its reader before we compute the metadata, or we could hit this situation.
Further, we are invoking indexWriter.incRefDeleter while fetching the latest SegmentInfos and returning it as a closeable to release the files. However, incRefDeleter does not incRef the segments_N file (see https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/IndexWriter.java#L5831). This means that file could be removed before we finish copying it to a replica.
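As a rough illustration of that mismatch (not a proposed fix), a check like the following, using only public Lucene APIs, surfaces the state shown in the file listings above; the class and helper names are made up for this sketch.

    import java.io.IOException;
    import java.util.Arrays;
    import org.apache.lucene.index.SegmentInfos;
    import org.apache.lucene.store.Directory;

    final class CommitPointCheck {
        // Compare the segments_N name held by the latest in-memory SegmentInfos against what is
        // actually on disk; getLastCommitGeneration reads the highest segments_N generation present.
        static boolean commitPointOnDisk(SegmentInfos latestInfos, Directory directory) throws IOException {
            final String referenced = latestInfos.getSegmentsFileName();                  // e.g. "segments_2"
            final long onDiskGeneration = SegmentInfos.getLastCommitGeneration(directory); // e.g. 3
            // Returning false here, together with onDiskGeneration > latestInfos.getGeneration(),
            // is exactly the state described in this issue.
            return Arrays.asList(directory.listAll()).contains(referenced);
        }
    }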
Trace:
[2022-08-09T18:10:36,763][WARN ][o.o.c.r.a.AllocationService] [node-3] failing shard [failed shard, shard [so][2], node[QKw3L21KRQeGJ8tmF1ENpQ], [R], s[STARTED], a[id=7ETNwg5VQ9GEpOPHnX1UXw], message [shard failure, reason [replication failure]], failure [OpenSearchException[Segment Replication failed]; nested: RemoteTransportException[[node-3][:9300][internal:index/shard/replication/get_checkpoint_info]]; nested: NoSuchFileException[/home/ec2-user/opensearch-3.0.0/data/nodes/0/indices/sBLLTSH0QtqQ5OVQKJqUDQ/2/index/segments_2]; ], markAsStale [true]]
org.opensearch.OpenSearchException: Segment Replication failed
at org.opensearch.indices.replication.SegmentReplicationTargetService$3.onFailure(SegmentReplicationTargetService.java:235) ~[opensearch-3.0.0.jar:3.0.0]
at org.opensearch.action.ActionListener$1.onFailure(ActionListener.java:88) ~[opensearch-3.0.0.jar:3.0.0]
at org.opensearch.action.ActionRunnable.onFailure(ActionRunnable.java:103) ~[opensearch-3.0.0.jar:3.0.0]
at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:54) ~[opensearch-3.0.0.jar:3.0.0]
at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:341) ~[opensearch-3.0.0.jar:3.0.0]
at org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:120) ~[opensearch-3.0.0.jar:3.0.0]
at org.opensearch.common.util.concurrent.ListenableFuture.lambda$done$0(ListenableFuture.java:112) ~[opensearch-3.0.0.jar:3.0.0]
at java.util.ArrayList.forEach(ArrayList.java:1511) ~[?:?]
at org.opensearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:112) ~[opensearch-3.0.0.jar:3.0.0]
at org.opensearch.common.util.concurrent.BaseFuture.setException(BaseFuture.java:178) ~[opensearch-3.0.0.jar:3.0.0]
at org.opensearch.common.util.concurrent.ListenableFuture.onFailure(ListenableFuture.java:149) ~[opensearch-3.0.0.jar:3.0.0]
at org.opensearch.action.StepListener.innerOnFailure(StepListener.java:82) ~[opensearch-3.0.0.jar:3.0.0]
at org.opensearch.action.NotifyOnceListener.onFailure(NotifyOnceListener.java:62) ~[opensearch-3.0.0.jar:3.0.0]
at org.opensearch.action.ActionListener$4.onFailure(ActionListener.java:190) ~[opensearch-3.0.0.jar:3.0.0]
at org.opensearch.action.ActionListener$6.onFailure(ActionListener.java:309) ~[opensearch-3.0.0.jar:3.0.0]
at org.opensearch.action.support.RetryableAction$RetryingListener.onFinalFailure(RetryableAction.java:201) ~[opensearch-3.0.0.jar:3.0.0]
at org.opensearch.action.support.RetryableAction$RetryingListener.onFailure(RetryableAction.java:193) ~[opensearch-3.0.0.jar:3.0.0]
at org.opensearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:74) ~[opensearch-3.0.0.jar:3.0.0]
at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1370) ~[opensearch-3.0.0.jar:3.0.0]
at org.opensearch.transport.InboundHandler.lambda$handleException$3(InboundHandler.java:420) ~[opensearch-3.0.0.jar:3.0.0]
at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:747) ~[opensearch-3.0.0.jar:3.0.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
at java.lang.Thread.run(Thread.java:833) [?:?]
Caused by: org.opensearch.transport.RemoteTransportException: [node-3][:9300][internal:index/shard/replication/get_checkpoint_info]
Caused by: java.nio.file.NoSuchFileException: /home/ec2-user/opensearch-3.0.0/data/nodes/0/indices/sBLLTSH0QtqQ5OVQKJqUDQ/2/index/segments_2
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:92) ~[?:?]
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106) ~[?:?]
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111) ~[?:?]
at sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:181) ~[?:?]
at java.nio.channels.FileChannel.open(FileChannel.java:298) ~[?:?]
at java.nio.channels.FileChannel.open(FileChannel.java:357) ~[?:?]
at org.apache.lucene.store.NIOFSDirectory.openInput(NIOFSDirectory.java:78) ~[lucene-core-9.3.0.jar:9.3.0 d25cebcef7a80369f4dfb9285ca7360a810b75dc - ivera - 2022-07-25 12:30:23]
at org.opensearch.index.store.FsDirectoryFactory$HybridDirectory.openInput(FsDirectoryFactory.java:166) ~[opensearch-3.0.0.jar:3.0.0]
at org.apache.lucene.store.FilterDirectory.openInput(FilterDirectory.java:101) ~[lucene-core-9.3.0.jar:9.3.0 d25cebcef7a80369f4dfb9285ca7360a810b75dc - ivera - 2022-07-25 12:30:23]
at org.apache.lucene.store.FilterDirectory.openInput(FilterDirectory.java:101) ~[lucene-core-9.3.0.jar:9.3.0 d25cebcef7a80369f4dfb9285ca7360a810b75dc - ivera - 2022-07-25 12:30:23]
at org.opensearch.index.store.Store$MetadataSnapshot.checksumFromLuceneFile(Store.java:1046) ~[opensearch-3.0.0.jar:3.0.0]
at org.opensearch.index.store.Store$MetadataSnapshot.loadMetadata(Store.java:1032) ~[opensearch-3.0.0.jar:3.0.0]
at org.opensearch.index.store.Store$MetadataSnapshot.<init>(Store.java:895) ~[opensearch-3.0.0.jar:3.0.0]
at org.opensearch.index.store.Store.getMetadata(Store.java:333) ~[opensearch-3.0.0.jar:3.0.0]
at org.opensearch.indices.replication.common.CopyState.<init>(CopyState.java:58) ~[opensearch-3.0.0.jar:3.0.0]
at org.opensearch.indices.replication.OngoingSegmentReplications.getCachedCopyState(OngoingSegmentReplications.java:83) ~[opensearch-3.0.0.jar:3.0.0]
at org.opensearch.indices.replication.OngoingSegmentReplications.prepareForReplication(OngoingSegmentReplications.java:158) ~[opensearch-3.0.0.jar:3.0.0]
at org.opensearch.indices.replication.SegmentReplicationSourceService$CheckpointInfoRequestHandler.messageReceived(SegmentReplicationSourceService.java:103) ~[opensearch-3.0.0.jar:3.0.0]
at org.opensearch.indices.replication.SegmentReplicationSourceService$CheckpointInfoRequestHandler.messageReceived(SegmentReplicationSourceService.java:86) ~[opensearch-3.0.0.jar:3.0.0]
at org.opensearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:106) ~[opensearch-3.0.0.jar:3.0.0]
at org.opensearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:453) ~[opensearch-3.0.0.jar:3.0.0]
at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:806) ~[opensearch-3.0.0.jar:3.0.0]
at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[opensearch-3.0.0.jar:3.0.0]
... 3 more
Looking into it.
@mch2: In order to repro this issue, I am trying to run integration tests in SegmentReplicationIT.java (please ignore other changes added for debugging), but I am not able to hit this issue (I saw it a couple of times previously, but not right now). The only issues I am consistently hitting are #4205 and #4216.
I will keep this issue open and come back in case I see this failure again. Until then I am focusing on #4216, which is failing more consistently.
./gradlew ':server:internalClusterTest' --tests "org.opensearch.indices.replication.SegmentReplicationIT.*" -Dtests.seed=9EFE1B4F25B3092E -Dtests.iters=100 -Dtests.opensearch.logger.level=INFO --rerun-tasks
I'm able to see this consistently when running benchmarks, but it's more difficult to get the primary into this state within an IT. I think at a minimum we should add retries here on the replica instead of failing the shard.
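At a minimum, that retry could look something like the generic sketch below. None of these names are existing OpenSearch symbols, and it assumes maxAttempts >= 1; in the real code path the NoSuchFileException arrives on the replica wrapped in a RemoteTransportException, so an actual change would need to unwrap the cause first.

    import java.nio.file.NoSuchFileException;
    import java.util.concurrent.Callable;

    final class CheckpointFetchRetry {
        static <T> T fetchWithRetry(Callable<T> fetch, int maxAttempts, long backoffMillis) throws Exception {
            NoSuchFileException last = null;
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                try {
                    return fetch.call();                    // e.g. the get_checkpoint_info transport call
                } catch (NoSuchFileException e) {
                    last = e;                               // primary's infos referenced a deleted segments_N
                    Thread.sleep(backoffMillis * attempt);  // simple linear backoff between attempts
                }
            }
            throw last;                                     // fail the shard only after retries are exhausted
        }
    }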
OK, I've figured out a bit more of what's going on here. I think there are two things lacking:
- This should not fail the replica; the replica should gracefully retry. However, even after a retry the primary could still be in this state: a primary refresh will not necessarily resolve this if no new segments are flushed.
- This happens here when computing a metadata snapshot on the primary:
final String segmentsFile = segmentInfos.getSegmentsFileName();
checksumFromLuceneFile(directory, segmentsFile, builder, logger, maxVersion, true);
The primary is in a state where the latest in memory infos has a segments file that is no longer on disk. The on-disk version is one generation ahead. I think in this case we can send the replica the latest segments_N and ignore the missing file. The replica does not actually need the missing _N file, particularly if the primary doesn't either, because we send the SegmentInfos bytes over the wire. I think in this case we could skip the metadata for the missing file and continue?
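A hedged sketch of that "skip and continue" idea, wrapping the fragment from Store.MetadataSnapshot.loadMetadata shown above; this is an illustration only, not the actual change.

    final String segmentsFile = segmentInfos.getSegmentsFileName();
    try {
        checksumFromLuceneFile(directory, segmentsFile, builder, logger, maxVersion, true);
    } catch (java.nio.file.NoSuchFileException | java.io.FileNotFoundException e) {
        // The commit point referenced by the in-memory infos has already been deleted on the primary.
        // The replica receives the SegmentInfos bytes over the wire, so proceed without metadata
        // for segments_N rather than failing the snapshot.
        logger.trace("commit point [{}] is missing from disk, skipping its metadata", segmentsFile);
    }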
I'm thinking we could do something like this on the primary to return the on-disk infos if the generation on disk is higher than what's in memory:
@Override
public GatedCloseable<SegmentInfos> getSegmentInfosSnapshot() {
    final SegmentInfos latestInMemoryInfos = getLatestSegmentInfos();
    SegmentInfos result;
    try {
        final long lastCommitGeneration = SegmentInfos.getLastCommitGeneration(store.directory());
        final long generation = latestInMemoryInfos.getGeneration();
        logger.info("Latest gen {} - {}", generation, lastCommitGeneration);
        if (generation < lastCommitGeneration) {
            // the latest in-memory infos is behind disk, read the latest SegmentInfos from disk and return that.
            final SegmentInfos latestCommit = SegmentInfos.readLatestCommit(store.directory());
            result = latestCommit;
        } else {
            result = latestInMemoryInfos;
        }
        indexWriter.incRefDeleter(result);
    } catch (IOException e) {
        throw new EngineException(shardId, e.getMessage(), e);
    }
    return new GatedCloseable<>(result, () -> indexWriter.decRefDeleter(result));
}
@mch2: This bug seems interesting. Is the in-memory SegmentInfos falling behind the on-disk copy an expected state? Do we need to handle this state while fetching the actual files in get_segment_files, where we filter in on-disk copies?
Can this inconsistency between the in-memory and on-disk state be problematic for segment replication?
> This bug seems interesting. Is the in-memory SegmentInfos falling behind the on-disk copy an expected state?
I think this is happening when the primary commits without creating any new segments, because a subsequent call to refresh does not actually refresh the active reader. It happens quite frequently on the primary in testing. If we return the on-disk version here there is no risk. We'll send the infos byte[] and recreate it on the replica to load into its reader.
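To make that concrete, here is a standalone Lucene 9.x sketch (plain Lucene, not engine code; the class name is made up): a commit that flushes no new segments, only new commit user data, still writes a new segments_N and bumps the on-disk generation, while an already-open NRT reader keeps an in-memory SegmentInfos naming the old commit point.

    import java.util.Map;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.SegmentInfos;
    import org.apache.lucene.index.StandardDirectoryReader;
    import org.apache.lucene.store.ByteBuffersDirectory;
    import org.apache.lucene.store.Directory;

    public final class StaleCommitPointDemo {
        public static void main(String[] args) throws Exception {
            try (Directory dir = new ByteBuffersDirectory();
                 IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
                writer.addDocument(new Document());
                writer.commit();                                            // writes segments_1
                try (DirectoryReader reader = DirectoryReader.open(writer)) {
                    // Commit again with only user-data changes (OpenSearch commits translog metadata
                    // this way), so no new segments are flushed but a new segments_N is written.
                    writer.setLiveCommitData(Map.of("example_key", "example_value").entrySet());
                    writer.commit();                                        // writes segments_2; segments_1 is
                                                                            // now deletable since (per this
                                                                            // issue) it is not incRef'd
                    SegmentInfos readerInfos = ((StandardDirectoryReader) reader).getSegmentInfos();
                    System.out.println("reader references " + readerInfos.getSegmentsFileName());      // segments_1
                    System.out.println("on-disk generation " + SegmentInfos.getLastCommitGeneration(dir)); // 2
                }
            }
        }
    }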
> Do we need to handle this state while fetching the actual files in get_segment_files, where we filter in on-disk copies?
Hmm, I think we need to throw & retry replication if the missing file is not segments_N. All files other than segments_N will be incRef'd until replication completes, so if one of those files is missing there is a bug. If segments_N is missing, the replica will still fetch the byte[] and be able to read any new segments; however, in this case we can't clear the replica's xlog and must preserve the latest on-disk commit point, which is done here.
> Can this inconsistency between the in-memory and on-disk state be problematic for segment replication?
With the primary-only version of segrep we need to ensure we have a valid commit point fsync'd on disk at all times that we can potentially restart/recover from & replay ops from xlog.
I am using the test below, where the NoSuchFileException happens without the fix. I saw one instance of the NoSuchFileException, but I am not able to reproduce it anymore after running the test multiple times.
public void testDropPrimaryDuringReplication() throws Exception {
    final Settings settings = Settings.builder()
        .put(indexSettings())
        .put(IndexMetadata.SETTING_NUMBER_OF_REPLICAS, 6)
        .put(IndexMetadata.SETTING_REPLICATION_TYPE, ReplicationType.SEGMENT)
        .build();
    final String clusterManagerNode = internalCluster().startClusterManagerOnlyNode();
    final String primaryNode = internalCluster().startDataOnlyNode(Settings.EMPTY);
    createIndex(INDEX_NAME, settings);
    internalCluster().startDataOnlyNodes(6);
    int initialDocCount = scaledRandomIntBetween(100, 200);
    try (
        BackgroundIndexer indexer = new BackgroundIndexer(
            INDEX_NAME,
            "_doc",
            client(),
            -1,
            RandomizedTest.scaledRandomIntBetween(2, 5),
            false,
            random()
        )
    ) {
        indexer.start(initialDocCount);
        waitForDocs(initialDocCount, indexer);
        refresh(INDEX_NAME);
        // don't wait for replication to complete, stop the primary immediately.
        internalCluster().stopRandomNode(InternalTestCluster.nameFilter(primaryNode));
        ensureYellow(INDEX_NAME);
        // start another replica.
        internalCluster().startDataOnlyNode();
        ensureGreen(INDEX_NAME);
        // index another doc and refresh - without this the new replica won't catch up.
        client().prepareIndex(INDEX_NAME).setId("1").setSource("foo", "bar").get();
        flushAndRefresh(INDEX_NAME);
        waitForReplicaUpdate();
        assertSegmentStats(6);
    }
}
One observation from a code scan: during replica promotion, the replica commits SegmentInfos using Store.commitSegmentInfos, and after the IndexWriter commit the segment generation starts to differ between the in-memory and on-disk copies of SegmentInfos in the failing tests. This difference happens only in a few cases. ~~Not sure if this inconsistency in segment generation can lead to other side effects.~~
@dreamer-89 I've drafted a PR that could solve this by only sending segments over the wire and ignoring commit points: https://github.com/opensearch-project/OpenSearch/pull/4366. I think this is OK because we send the byte[] of the SegmentInfos to replicas and can commit that if needed for durability. Right now the replica only "commits" if it is being promoted to primary, but we'll want to perform commits when a new segment generation is received from the primary and whenever the engine closes. My draft doesn't do that right now; I'm looking at adding it but may want to split this up into two changes.
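A minimal sketch of that idea, assuming a hypothetical checkpointFiles list of the file names the primary would otherwise send (the actual diff in #4366 may look different): drop commit points and rely on the SegmentInfos bytes that are already shipped with the checkpoint info.

    import java.util.List;
    import java.util.stream.Collectors;
    import org.apache.lucene.index.IndexFileNames;

    final class CommitPointFilter {
        // checkpointFiles: candidate file names the primary would otherwise copy to the replica.
        static List<String> withoutCommitPoints(List<String> checkpointFiles) {
            return checkpointFiles.stream()
                .filter(name -> name.startsWith(IndexFileNames.SEGMENTS) == false) // drop segments_N
                .collect(Collectors.toList());
        }
    }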