OpenSearch icon indicating copy to clipboard operation
OpenSearch copied to clipboard

[Segment replication] Remove unnecessary fsync calls

Open Poojita-Raj opened this issue 3 years ago • 8 comments

Signed-off-by: Poojita Raj [email protected]

Description

This change removes unnecessary fsync calls not required in segment replication and only keeps the calls needed when we receive a new commit point.

Issues Resolved

Resolves #2333

Check List

  • [ ] New functionality includes testing.
    • [ ] All tests pass
  • [ ] New functionality has been documented.
    • [ ] New functionality has javadoc added
  • [ ] Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Poojita-Raj avatar Jul 27 '22 18:07 Poojita-Raj

Gradle Check (Jenkins) Run Completed with:

  • RESULT: FAILURE :x:
  • URL: https://build.ci.opensearch.org/job/gradle-check/1045/
  • CommitID: 892a6a2348b6458ea685be8a704d993380b1d30f

github-actions[bot] avatar Jul 27 '22 18:07 github-actions[bot]

Gradle Check (Jenkins) Run Completed with:

  • RESULT: FAILURE :x:
  • URL: https://build.ci.opensearch.org/job/gradle-check/1122/
  • CommitID: 48e3bb5e7d096f7948af7bca276002c63d08ff8f

github-actions[bot] avatar Jul 29 '22 22:07 github-actions[bot]

Gradle Check (Jenkins) Run Completed with:

  • RESULT: FAILURE :x:
  • URL: https://build.ci.opensearch.org/job/gradle-check/1194/
  • CommitID: cf22c136d3e82c459a1296da62d2789369cce575

github-actions[bot] avatar Aug 01 '22 17:08 github-actions[bot]

Gradle Check (Jenkins) Run Completed with:

  • RESULT: SUCCESS :white_check_mark:
  • URL: https://build.ci.opensearch.org/job/gradle-check/1321/
  • CommitID: 7495f543faef8f15c08c80123fb141bebe475c28

github-actions[bot] avatar Aug 03 '22 00:08 github-actions[bot]

Codecov Report

Merging #4026 (2b00e91) into main (66c24ff) will increase coverage by 0.01%. The diff coverage is 100.00%.

@@             Coverage Diff              @@
##               main    #4026      +/-   ##
============================================
+ Coverage     70.56%   70.58%   +0.01%     
- Complexity    56938    56953      +15     
============================================
  Files          4588     4588              
  Lines        274132   274136       +4     
  Branches      40178    40180       +2     
============================================
+ Hits         193454   193494      +40     
+ Misses        64452    64358      -94     
- Partials      16226    16284      +58     
Impacted Files Coverage Δ
...g/opensearch/indices/recovery/MultiFileWriter.java 85.00% <100.00%> (+1.16%) :arrow_up:
...ces/replication/common/ReplicationLuceneIndex.java 75.32% <100.00%> (-0.56%) :arrow_down:
...g/opensearch/index/analysis/CharFilterFactory.java 0.00% <0.00%> (-100.00%) :arrow_down:
.../java/org/opensearch/node/NodeClosedException.java 50.00% <0.00%> (-50.00%) :arrow_down:
.../index/shard/IndexShardNotRecoveringException.java 0.00% <0.00%> (-50.00%) :arrow_down:
...pensearch/indices/breaker/CircuitBreakerStats.java 27.77% <0.00%> (-41.67%) :arrow_down:
...h/action/ingest/SimulateDocumentVerboseResult.java 60.71% <0.00%> (-39.29%) :arrow_down:
...nsearch/index/shard/IndexShardClosedException.java 66.66% <0.00%> (-33.34%) :arrow_down:
...search/search/aggregations/pipeline/EwmaModel.java 24.44% <0.00%> (-28.89%) :arrow_down:
...rc/main/java/org/opensearch/ingest/IngestInfo.java 51.72% <0.00%> (-27.59%) :arrow_down:
... and 480 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

codecov-commenter avatar Aug 03 '22 00:08 codecov-commenter

Gradle Check (Jenkins) Run Completed with:

  • RESULT: SUCCESS :white_check_mark:
  • URL: https://build.ci.opensearch.org/job/gradle-check/1474/
  • CommitID: 2b00e9138643a9f5896ccae54a40479a05b055a8

github-actions[bot] avatar Aug 05 '22 02:08 github-actions[bot]

Gradle Check (Jenkins) Run Completed with:

  • RESULT: FAILURE :x:
  • URL: https://build.ci.opensearch.org/job/gradle-check/1622/
  • CommitID: 75b7d62521af7647f9830b31260e4be03501d5a9

github-actions[bot] avatar Aug 09 '22 17:08 github-actions[bot]

Gradle Check (Jenkins) Run Completed with:

  • RESULT: SUCCESS :white_check_mark:
  • URL: https://build.ci.opensearch.org/job/gradle-check/1623/
  • CommitID: ebd748672219a91d20083d1c8b75da47c601270a

github-actions[bot] avatar Aug 09 '22 17:08 github-actions[bot]

Gradle Check (Jenkins) Run Completed with:

  • RESULT: :x:
  • URL: https://build.ci.opensearch.org/job/gradle-check/1893/
  • CommitID: e14d75246bea651601b0c1f9d950bd5d96513a48

github-actions[bot] avatar Aug 18 '22 18:08 github-actions[bot]

Gradle Check (Jenkins) Run Completed with:

  • RESULT: FAILURE :x:
  • URL: https://build.ci.opensearch.org/job/gradle-check/1933/
  • CommitID: 311f0b7b7e0897f9fd5f78505dbc7e072449efde

github-actions[bot] avatar Aug 19 '22 09:08 github-actions[bot]

Gradle Check (Jenkins) Run Completed with:

  • RESULT: FAILURE :x:
  • URL: https://build.ci.opensearch.org/job/gradle-check/1937/
  • CommitID: aeb5a504f070066b5ae00944f30e0bd018e74ad1

github-actions[bot] avatar Aug 19 '22 15:08 github-actions[bot]

Last gradle check failed with below, which looks like not related to this PR changes. Refiring!

REPRODUCE WITH: ./gradlew ':server:test' --tests "org.opensearch.index.shard.SegmentReplicationIndexShardTests.testNRTReplicaPromotedAsPrimary" -Dtests.seed=F35F7D4B4025B70D -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=es-BO -Dtests.timezone=America/Argentina/Ushuaia -Druntime.java=17

org.opensearch.index.shard.SegmentReplicationIndexShardTests > testNRTReplicaPromotedAsPrimary FAILED
    java.io.UncheckedIOException: org.apache.lucene.index.CorruptIndexException: Unexpected file read error while reading index. (resource=BufferedChecksumIndexInput(MockIndexInputWrapper(NIOFSIndexInput(path="/var/jenkins/workspace/gradle-check/search/server/build/testrun/test/temp/org.opensearch.index.shard.SegmentReplicationIndexShardTests_F35F7D4B4025B70D-001/tempDir-006/indices/uuid/0/index/segments_5"))))
        at __randomizedtesting.SeedInfo.seed([F35F7D4B4025B70D:38A876E2DCB34CF0]:0)
        at app//org.opensearch.index.store.Store.closeInternal(Store.java:479)
        at app//org.opensearch.index.store.Store$1.closeInternal(Store.java:187)
        at app//org.opensearch.common.util.concurrent.AbstractRefCounted.decRef(AbstractRefCounted.java:76)
        at app//org.opensearch.index.store.Store.decRef(Store.java:449)
        at app//org.opensearch.index.store.Store.close(Store.java:456)
        at app//org.opensearch.core.internal.io.IOUtils.close(IOUtils.java:87)
        at app//org.opensearch.core.internal.io.IOUtils.close(IOUtils.java:129)
        at app//org.opensearch.core.internal.io.IOUtils.close(IOUtils.java:79)
        at app//org.opensearch.index.shard.IndexShardTestCase.closeShard(IndexShardTestCase.java:720)
        at app//org.opensearch.index.shard.IndexShardTestCase.closeShards(IndexShardTestCase.java:727)
        at app//org.opensearch.index.replication.OpenSearchIndexLevelReplicationTestCase.access$1000(OpenSearchIndexLevelReplicationTestCase.java:128)
        at app//org.opensearch.index.replication.OpenSearchIndexLevelReplicationTestCase$ReplicationGroup.close(OpenSearchIndexLevelReplicationTestCase.java:580)
        at app//org.opensearch.index.shard.SegmentReplicationIndexShardTests.testNRTReplicaPromotedAsPrimary(SegmentReplicationIndexShardTests.java:241)

        Caused by:
        org.apache.lucene.index.CorruptIndexException: Unexpected file read error while reading index. (resource=BufferedChecksumIndexInput(MockIndexInputWrapper(NIOFSIndexInput(path="/var/jenkins/workspace/gradle-check/search/server/build/testrun/test/temp/org.opensearch.index.shard.SegmentReplicationIndexShardTests_F35F7D4B4025B70D-001/tempDir-006/indices/uuid/0/index/segments_5"))))
            at app//org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:301)
            at app//org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:595)
            at app//org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:552)
            at app//org.apache.lucene.tests.util.TestUtil.checkIndex(TestUtil.java:343)
            at app//org.apache.lucene.tests.store.MockDirectoryWrapper.close(MockDirectoryWrapper.java:909)
            at app//org.apache.lucene.store.FilterDirectory.close(FilterDirectory.java:111)
            at app//org.apache.lucene.store.FilterDirectory.close(FilterDirectory.java:111)
            at app//org.opensearch.index.store.Store$StoreDirectory.innerClose(Store.java:879)
            at app//org.opensearch.index.store.Store.closeInternal(Store.java:474)
            ... 12 more

            Caused by:
            java.nio.file.NoSuchFileException: _0.si in dir=NIOFSDirectory@/var/jenkins/workspace/gradle-check/search/server/build/testrun/test/temp/org.opensearch.index.shard.SegmentReplicationIndexShardTests_F35F7D4B4025B70D-001/tempDir-006/indices/uuid/0/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@7fe11310
                at org.apache.lucene.tests.store.MockDirectoryWrapper.openInput(MockDirectoryWrapper.java:803)
                at org.apache.lucene.store.Directory.openChecksumInput(Directory.java:156)
                at org.apache.lucene.tests.store.MockDirectoryWrapper.openChecksumInput(MockDirectoryWrapper.java:1120)
                at org.apache.lucene.codecs.lucene90.Lucene90SegmentInfoFormat.read(Lucene90SegmentInfoFormat.java:102)
                at org.apache.lucene.index.SegmentInfos.parseSegmentInfos(SegmentInfos.java:406)
                at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:363)
                at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:299)
                ... 20 more

dreamer-89 avatar Aug 19 '22 18:08 dreamer-89

Gradle Check (Jenkins) Run Completed with:

  • RESULT: FAILURE :x:
  • URL: https://build.ci.opensearch.org/job/gradle-check/1945/
  • CommitID: aeb5a504f070066b5ae00944f30e0bd018e74ad1

github-actions[bot] avatar Aug 19 '22 18:08 github-actions[bot]

@Poojita-Raj So the failed tests in SegmentReplicationIndexShardTests are because we are using LuceneTestCase's newFSDirectory, which on close runs a checkIndex that will run various tests to identify corruption. We aren't executing fsync, so checksum validation will fail.

Its OK to not fsync on every replication event, but we should probably trigger a sync before the NRT engine closes, wdyt?

Edit - after thinking on this I think committing the latest infos by invoking commitSegmentInfos(); from NRTReplicationEngine's closeNoLock would be even better. This method introduced with failover support creates a commit point that references all active segments and fsyncs them, so when the shard starts back up it is able to read the latest commit and start. If the shard starts back up as a replica we'll have a mismatch while computing a replication diff on the segments_N with the one created on the primary, but we can recognize that case in SegmentReplicationTarget. All other segments will be the same cksum.

mch2 avatar Aug 23 '22 21:08 mch2

Gradle Check (Jenkins) Run Completed with:

  • RESULT: FAILURE :x:
  • URL: https://build.ci.opensearch.org/job/gradle-check/2077/
  • CommitID: 130eff9f5d8f6a25446f8f5f9182f9e599d53e88

github-actions[bot] avatar Aug 24 '22 16:08 github-actions[bot]

I think we can actually look at disabling fsyncs entirely on replicas if we are triggering a commit/fsync when the shard closes. That would make this implementation much simpler by overriding sync inside of StoreDirectory to noOp?. Sending incremental segments_N files over the wire is also then not required.

mch2 avatar Aug 25 '22 00:08 mch2