Add support for incremental download of translog files
Description
Currently, the download workflow for remote-backed storage downloads the same translog files multiple times, deleting all older local files before each download. This wastes significant network bandwidth and increases the time it takes for the shard to become active.
This change adds support for downloading translog files incrementally, skipping any file that is already present locally with a matching checksum.
Implementation
The key changes include:
Incremental Download Logic in RemoteFsTranslog:
- The download logic now checks if the local translog files are present and have the same checksum as the remote files.
- If the local and remote checksums match, the download is skipped for that generation.
- If the checksums do not match, the download is performed for that generation.
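The per-generation decision above can be sketched roughly as follows. This is a minimal illustration, not the code in RemoteFsTranslog: the class `IncrementalDownloadPlanner`, its method name, and the use of string checksums keyed by generation are all assumptions made for the example. A generation missing from the local map stands for a translog file that is not present locally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper for illustration; not part of RemoteFsTranslog. */
final class IncrementalDownloadPlanner {
    private IncrementalDownloadPlanner() {}

    /**
     * Returns the generations that still need to be downloaded, in ascending order.
     * A generation is skipped only when a local checksum exists and equals the remote one.
     */
    static List<Long> generationsToDownload(Map<Long, String> remoteChecksums,
                                            Map<Long, String> localChecksums) {
        List<Long> toDownload = new ArrayList<>();
        // TreeMap gives a deterministic, ascending iteration order over generations.
        for (Map.Entry<Long, String> remote : new TreeMap<>(remoteChecksums).entrySet()) {
            String local = localChecksums.get(remote.getKey());
            boolean matches = local != null && local.equals(remote.getValue());
            if (!matches) {
                // Missing locally, or present with a different checksum: download it.
                toDownload.add(remote.getKey());
            }
        }
        return toDownload;
    }
}
```

For example, with remote generations 1 to 3 and local copies of generations 1 (matching) and 2 (mismatching), only generations 2 and 3 would be fetched.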
Translog Footer Handling:
- The TranslogFooter class has been added to handle the writing and reading of the translog footer, which contains the checksum of the translog data.
- The footer is written when the TranslogWriter is closed, and the checksum is stored in the TranslogReader.
- The TranslogReader is updated to handle reading the footer and ensuring that read requests do not overlap with the footer.
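To illustrate the footer idea, here is a minimal sketch that appends a checksum footer to a byte array, reads it back, and rejects reads that would run into the footer. The layout is an assumption for the example (a fixed 8-byte footer holding a CRC32 of the data); the actual TranslogFooter format and the names `FooterSketch`, `appendFooter`, `readChecksum`, and `readData` are hypothetical and not taken from the PR.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

/** Hypothetical sketch of a checksum footer; not the real TranslogFooter. */
final class FooterSketch {
    /** Assumed layout: the footer is a single 8-byte long holding the CRC32 value. */
    static final int FOOTER_LENGTH = Long.BYTES;

    private FooterSketch() {}

    /** Returns data followed by a footer containing the CRC32 of data. */
    static byte[] appendFooter(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return ByteBuffer.allocate(data.length + FOOTER_LENGTH)
            .put(data)
            .putLong(crc.getValue())
            .array();
    }

    /** Reads the checksum stored in the last FOOTER_LENGTH bytes. */
    static long readChecksum(byte[] file) {
        if (file.length < FOOTER_LENGTH) {
            throw new IllegalArgumentException("file too short to contain a footer");
        }
        return ByteBuffer.wrap(file, file.length - FOOTER_LENGTH, FOOTER_LENGTH).getLong();
    }

    /** Reads length bytes starting at position, refusing any range that overlaps the footer. */
    static byte[] readData(byte[] file, int position, int length) {
        if (file.length < FOOTER_LENGTH) {
            throw new IllegalArgumentException("file too short to contain a footer");
        }
        int dataEnd = file.length - FOOTER_LENGTH;
        if (position < 0 || length < 0 || position > dataEnd - length) {
            throw new IllegalArgumentException("read request overlaps the footer");
        }
        byte[] out = new byte[length];
        System.arraycopy(file, position, out, 0, length);
        return out;
    }
}
```

Keeping the bounds check in the read path means callers never see footer bytes as if they were translog operations.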
Generation-to-Checksum Mapping in TranslogTransferMetadata:
- The TranslogTransferMetadata now includes a mapping between the translog generation and the corresponding checksum.
- This mapping is used to compare the local and remote checksums during the incremental download process.
- The TranslogTransferMetadataHandler is updated to read and write this generation-to-checksum mapping.
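Reading and writing such a mapping could look roughly like the sketch below. The encoding is an assumption chosen for the example (an entry count followed by generation and checksum pairs, written with `DataOutputStream`); it does not reflect the wire format used by TranslogTransferMetadataHandler, and `ChecksumMapCodec` is a hypothetical name.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical codec for a generation-to-checksum map; the format is illustrative only. */
final class ChecksumMapCodec {
    private ChecksumMapCodec() {}

    /** Encodes the map as: entry count, then (generation, checksum) pairs in ascending order. */
    static byte[] write(Map<Long, String> generationToChecksum) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(generationToChecksum.size());
            for (Map.Entry<Long, String> e : new TreeMap<>(generationToChecksum).entrySet()) {
                out.writeLong(e.getKey());
                out.writeUTF(e.getValue());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Decodes a map previously produced by write. */
    static Map<Long, String> read(byte[] encoded) {
        Map<Long, String> result = new TreeMap<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(encoded))) {
            int size = in.readInt();
            for (int i = 0; i < size; i++) {
                long generation = in.readLong();
                result.put(generation, in.readUTF());
            }
        } catch (IOException e) {
            // Truncated or malformed input surfaces here (for example as EOFException).
            throw new UncheckedIOException(e);
        }
        return result;
    }
}
```

A round trip through `write` and `read` should return a map equal to the input, which is the property the comparison step relies on.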
Testing
Manual Testing
Manually tested the failover and restore workflows on a 3-node EC2 cluster. Also tested backward compatibility: the initial cluster was created and indexed with an older OpenSearch process, and was then switched to incremental download.
Newly added tests
- The RemoteFsTranslogTests class has been updated to include new test cases for the incremental download logic:
  - testIncrementalDownloadWithMatchingChecksum tests the scenario where the local and remote translog files have the same checksum, and the download is skipped.
  - testIncrementalDownloadWithDifferentChecksum tests the scenario where the local and remote translog files have different checksums, and only the missing generation is downloaded.
- The TranslogFooterTests class has been added to verify the functionality of the TranslogFooter class, including writing and reading the footer.
- The TranslogTransferMetadataHandlerTests class has been updated to include test cases for handling the generation-to-checksum map in the TranslogTransferMetadata.
Related Issues
One of the optimisations for https://github.com/opensearch-project/OpenSearch/issues/15277
Check List
- [x] Functionality includes testing.
- [ ] API changes companion pull request created, if applicable.
- [ ] Public documentation issue/PR created, if applicable.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following Developer Certificate of Origin and signing off your commits, please check here.
:x: Gradle check result for cd08d89d2dcff2305c2861e2a4281b674477bff6: FAILURE
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
:white_check_mark: Gradle check result for f17a378440278eb50efb5af504524ed1551f4d2f: SUCCESS
Codecov Report
:x: Patch coverage is 81.95489% with 24 lines in your changes missing coverage. Please review.
:white_check_mark: Project coverage is 72.08%. Comparing base (35c366d) to head (1211703).
:warning: Report is 905 commits behind head on main.
Additional details and impacted files
@@ Coverage Diff @@
## main #16204 +/- ##
============================================
- Coverage 72.10% 72.08% -0.03%
+ Complexity 64862 64858 -4
============================================
Files 5307 5308 +1
Lines 302606 302720 +114
Branches 43717 43734 +17
============================================
+ Hits 218208 218228 +20
- Misses 66541 66542 +1
- Partials 17857 17950 +93
:x: Gradle check result for b8f56ea3cd6046203c81dc7bf0c30f3d5f18c909: FAILURE
:x: Gradle check result for 535bc28e6929046a6dabd99b09ffee02c171ce8e: FAILURE
:x: Gradle check result for ce18c9bc945e59aa279f3083a455c18244697f8f: FAILURE
:x: Gradle check result for 6ea2adbeb30841fc1f0954e4e4ce4f5954f47320: FAILURE
Rebased onto mainline and addressed the review comments that required refactoring.
:grey_exclamation: Gradle check result for 121170316316b431854e7237b51566df6920a3f5: UNSTABLE
- TEST FAILURES:
1 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.
This PR is stalled because it has been open for 30 days with no activity.