[Segment replication] Remove unnecessary fsync calls
Signed-off-by: Poojita Raj [email protected]
Description
This change removes unnecessary fsync calls not required in segment replication and only keeps the calls needed when we receive a new commit point.
Issues Resolved
Resolves #2333
Check List
- [ ] New functionality includes testing.
- [ ] All tests pass
- [ ] New functionality has been documented.
- [ ] New functionality has javadoc added
- [ ] Commits are signed per the DCO using --signoff
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following Developer Certificate of Origin and signing off your commits, please check here.
Gradle Check (Jenkins) Run Completed with:
- RESULT: FAILURE :x:
- URL: https://build.ci.opensearch.org/job/gradle-check/1045/
- CommitID: 892a6a2348b6458ea685be8a704d993380b1d30f
Gradle Check (Jenkins) Run Completed with:
- RESULT: FAILURE :x:
- URL: https://build.ci.opensearch.org/job/gradle-check/1122/
- CommitID: 48e3bb5e7d096f7948af7bca276002c63d08ff8f
Gradle Check (Jenkins) Run Completed with:
- RESULT: FAILURE :x:
- URL: https://build.ci.opensearch.org/job/gradle-check/1194/
- CommitID: cf22c136d3e82c459a1296da62d2789369cce575
Gradle Check (Jenkins) Run Completed with:
- RESULT: SUCCESS :white_check_mark:
- URL: https://build.ci.opensearch.org/job/gradle-check/1321/
- CommitID: 7495f543faef8f15c08c80123fb141bebe475c28
Codecov Report
Merging #4026 (2b00e91) into main (66c24ff) will increase coverage by
0.01%. The diff coverage is100.00%.
@@ Coverage Diff @@
## main #4026 +/- ##
============================================
+ Coverage 70.56% 70.58% +0.01%
- Complexity 56938 56953 +15
============================================
Files 4588 4588
Lines 274132 274136 +4
Branches 40178 40180 +2
============================================
+ Hits 193454 193494 +40
+ Misses 64452 64358 -94
- Partials 16226 16284 +58
| Impacted Files | Coverage Δ | |
|---|---|---|
| ...g/opensearch/indices/recovery/MultiFileWriter.java | 85.00% <100.00%> (+1.16%) |
:arrow_up: |
| ...ces/replication/common/ReplicationLuceneIndex.java | 75.32% <100.00%> (-0.56%) |
:arrow_down: |
| ...g/opensearch/index/analysis/CharFilterFactory.java | 0.00% <0.00%> (-100.00%) |
:arrow_down: |
| .../java/org/opensearch/node/NodeClosedException.java | 50.00% <0.00%> (-50.00%) |
:arrow_down: |
| .../index/shard/IndexShardNotRecoveringException.java | 0.00% <0.00%> (-50.00%) |
:arrow_down: |
| ...pensearch/indices/breaker/CircuitBreakerStats.java | 27.77% <0.00%> (-41.67%) |
:arrow_down: |
| ...h/action/ingest/SimulateDocumentVerboseResult.java | 60.71% <0.00%> (-39.29%) |
:arrow_down: |
| ...nsearch/index/shard/IndexShardClosedException.java | 66.66% <0.00%> (-33.34%) |
:arrow_down: |
| ...search/search/aggregations/pipeline/EwmaModel.java | 24.44% <0.00%> (-28.89%) |
:arrow_down: |
| ...rc/main/java/org/opensearch/ingest/IngestInfo.java | 51.72% <0.00%> (-27.59%) |
:arrow_down: |
| ... and 480 more |
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.
Gradle Check (Jenkins) Run Completed with:
- RESULT: SUCCESS :white_check_mark:
- URL: https://build.ci.opensearch.org/job/gradle-check/1474/
- CommitID: 2b00e9138643a9f5896ccae54a40479a05b055a8
Gradle Check (Jenkins) Run Completed with:
- RESULT: FAILURE :x:
- URL: https://build.ci.opensearch.org/job/gradle-check/1622/
- CommitID: 75b7d62521af7647f9830b31260e4be03501d5a9
Gradle Check (Jenkins) Run Completed with:
- RESULT: SUCCESS :white_check_mark:
- URL: https://build.ci.opensearch.org/job/gradle-check/1623/
- CommitID: ebd748672219a91d20083d1c8b75da47c601270a
Gradle Check (Jenkins) Run Completed with:
- RESULT: :x:
- URL: https://build.ci.opensearch.org/job/gradle-check/1893/
- CommitID: e14d75246bea651601b0c1f9d950bd5d96513a48
Gradle Check (Jenkins) Run Completed with:
- RESULT: FAILURE :x:
- URL: https://build.ci.opensearch.org/job/gradle-check/1933/
- CommitID: 311f0b7b7e0897f9fd5f78505dbc7e072449efde
Gradle Check (Jenkins) Run Completed with:
- RESULT: FAILURE :x:
- URL: https://build.ci.opensearch.org/job/gradle-check/1937/
- CommitID: aeb5a504f070066b5ae00944f30e0bd018e74ad1
Last gradle check failed with below, which looks like not related to this PR changes. Refiring!
REPRODUCE WITH: ./gradlew ':server:test' --tests "org.opensearch.index.shard.SegmentReplicationIndexShardTests.testNRTReplicaPromotedAsPrimary" -Dtests.seed=F35F7D4B4025B70D -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=es-BO -Dtests.timezone=America/Argentina/Ushuaia -Druntime.java=17
org.opensearch.index.shard.SegmentReplicationIndexShardTests > testNRTReplicaPromotedAsPrimary FAILED
java.io.UncheckedIOException: org.apache.lucene.index.CorruptIndexException: Unexpected file read error while reading index. (resource=BufferedChecksumIndexInput(MockIndexInputWrapper(NIOFSIndexInput(path="/var/jenkins/workspace/gradle-check/search/server/build/testrun/test/temp/org.opensearch.index.shard.SegmentReplicationIndexShardTests_F35F7D4B4025B70D-001/tempDir-006/indices/uuid/0/index/segments_5"))))
at __randomizedtesting.SeedInfo.seed([F35F7D4B4025B70D:38A876E2DCB34CF0]:0)
at app//org.opensearch.index.store.Store.closeInternal(Store.java:479)
at app//org.opensearch.index.store.Store$1.closeInternal(Store.java:187)
at app//org.opensearch.common.util.concurrent.AbstractRefCounted.decRef(AbstractRefCounted.java:76)
at app//org.opensearch.index.store.Store.decRef(Store.java:449)
at app//org.opensearch.index.store.Store.close(Store.java:456)
at app//org.opensearch.core.internal.io.IOUtils.close(IOUtils.java:87)
at app//org.opensearch.core.internal.io.IOUtils.close(IOUtils.java:129)
at app//org.opensearch.core.internal.io.IOUtils.close(IOUtils.java:79)
at app//org.opensearch.index.shard.IndexShardTestCase.closeShard(IndexShardTestCase.java:720)
at app//org.opensearch.index.shard.IndexShardTestCase.closeShards(IndexShardTestCase.java:727)
at app//org.opensearch.index.replication.OpenSearchIndexLevelReplicationTestCase.access$1000(OpenSearchIndexLevelReplicationTestCase.java:128)
at app//org.opensearch.index.replication.OpenSearchIndexLevelReplicationTestCase$ReplicationGroup.close(OpenSearchIndexLevelReplicationTestCase.java:580)
at app//org.opensearch.index.shard.SegmentReplicationIndexShardTests.testNRTReplicaPromotedAsPrimary(SegmentReplicationIndexShardTests.java:241)
Caused by:
org.apache.lucene.index.CorruptIndexException: Unexpected file read error while reading index. (resource=BufferedChecksumIndexInput(MockIndexInputWrapper(NIOFSIndexInput(path="/var/jenkins/workspace/gradle-check/search/server/build/testrun/test/temp/org.opensearch.index.shard.SegmentReplicationIndexShardTests_F35F7D4B4025B70D-001/tempDir-006/indices/uuid/0/index/segments_5"))))
at app//org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:301)
at app//org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:595)
at app//org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:552)
at app//org.apache.lucene.tests.util.TestUtil.checkIndex(TestUtil.java:343)
at app//org.apache.lucene.tests.store.MockDirectoryWrapper.close(MockDirectoryWrapper.java:909)
at app//org.apache.lucene.store.FilterDirectory.close(FilterDirectory.java:111)
at app//org.apache.lucene.store.FilterDirectory.close(FilterDirectory.java:111)
at app//org.opensearch.index.store.Store$StoreDirectory.innerClose(Store.java:879)
at app//org.opensearch.index.store.Store.closeInternal(Store.java:474)
... 12 more
Caused by:
java.nio.file.NoSuchFileException: _0.si in dir=NIOFSDirectory@/var/jenkins/workspace/gradle-check/search/server/build/testrun/test/temp/org.opensearch.index.shard.SegmentReplicationIndexShardTests_F35F7D4B4025B70D-001/tempDir-006/indices/uuid/0/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@7fe11310
at org.apache.lucene.tests.store.MockDirectoryWrapper.openInput(MockDirectoryWrapper.java:803)
at org.apache.lucene.store.Directory.openChecksumInput(Directory.java:156)
at org.apache.lucene.tests.store.MockDirectoryWrapper.openChecksumInput(MockDirectoryWrapper.java:1120)
at org.apache.lucene.codecs.lucene90.Lucene90SegmentInfoFormat.read(Lucene90SegmentInfoFormat.java:102)
at org.apache.lucene.index.SegmentInfos.parseSegmentInfos(SegmentInfos.java:406)
at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:363)
at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:299)
... 20 more
Gradle Check (Jenkins) Run Completed with:
- RESULT: FAILURE :x:
- URL: https://build.ci.opensearch.org/job/gradle-check/1945/
- CommitID: aeb5a504f070066b5ae00944f30e0bd018e74ad1
@Poojita-Raj So the failed tests in SegmentReplicationIndexShardTests are because we are using LuceneTestCase's newFSDirectory, which on close runs a checkIndex that will run various tests to identify corruption. We aren't executing fsync, so checksum validation will fail.
Its OK to not fsync on every replication event, but we should probably trigger a sync before the NRT engine closes, wdyt?
Edit - after thinking on this I think committing the latest infos by invoking commitSegmentInfos(); from NRTReplicationEngine's closeNoLock would be even better. This method introduced with failover support creates a commit point that references all active segments and fsyncs them, so when the shard starts back up it is able to read the latest commit and start. If the shard starts back up as a replica we'll have a mismatch while computing a replication diff on the segments_N with the one created on the primary, but we can recognize that case in SegmentReplicationTarget. All other segments will be the same cksum.
Gradle Check (Jenkins) Run Completed with:
- RESULT: FAILURE :x:
- URL: https://build.ci.opensearch.org/job/gradle-check/2077/
- CommitID: 130eff9f5d8f6a25446f8f5f9182f9e599d53e88
I think we can actually look at disabling fsyncs entirely on replicas if we are triggering a commit/fsync when the shard closes. That would make this implementation much simpler by overriding sync inside of StoreDirectory to noOp?. Sending incremental segments_N files over the wire is also then not required.