Add support for incremental download of translog files

Open rawahars opened this issue 1 year ago • 12 comments

Description

Currently, the download workflow for remote-backed storage downloads the same translog files multiple times, deleting all the older local files before each download. This wastes significant network bandwidth and increases the time taken for the shard to become active.

This change adds support for downloading translog files incrementally, skipping any file that is already present locally with a matching checksum.

Implementation

The key changes include:

Incremental Download Logic in RemoteFsTranslog:

  • The download logic now checks if the local translog files are present and have the same checksum as the remote files.
    • If the local and remote checksums match, the download is skipped for that generation.
    • If the checksums do not match, the download is performed for that generation.
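As a rough illustration of the per-generation decision described above, here is a minimal sketch. The class name, method name, and the use of plain maps from generation to checksum string are assumptions made for this example and are not taken from the PR's code in RemoteFsTranslog.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of OpenSearch: decides which generations to fetch.
class IncrementalDownloadSketch {

    /**
     * Returns, in ascending order, the generations that still need downloading.
     * A generation is skipped only when a local checksum exists and equals the remote one.
     */
    static List<Long> generationsToDownload(Map<Long, String> remoteChecksums,
                                            Map<Long, String> localChecksums) {
        List<Long> toDownload = new ArrayList<>();
        // TreeMap gives a deterministic, ascending iteration order over generations.
        for (Map.Entry<Long, String> remote : new TreeMap<>(remoteChecksums).entrySet()) {
            String local = localChecksums.get(remote.getKey());
            // Missing local file or checksum mismatch: download this generation.
            if (local == null || !local.equals(remote.getValue())) {
                toDownload.add(remote.getKey());
            }
        }
        return toDownload;
    }
}
```

With remote generations 1 to 3 and local copies of 1 (matching) and 2 (mismatching), this sketch would select generations 2 and 3 for download.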

Translog Footer Handling:

  • The TranslogFooter class has been added to handle the writing and reading of the translog footer, which contains the checksum of the translog data.
  • The footer is written when the TranslogWriter is closed, and the checksum is stored in the TranslogReader.
  • The TranslogReader is updated to handle reading the footer and ensuring that read requests do not overlap with the footer.
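The footer idea can be sketched as follows. This is an illustration under stated assumptions, not the PR's TranslogFooter: the 8-byte footer layout, the use of `java.util.zip.CRC32`, and all class and method names are hypothetical choices for this example.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.zip.CRC32;

// Hypothetical sketch: a file is modelled as payload bytes followed by an 8-byte checksum.
class TranslogFooterSketch {
    static final int FOOTER_LENGTH = Long.BYTES; // assumed layout: one long holding a CRC32

    static long checksumOf(byte[] data, int length) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, length);
        return crc.getValue();
    }

    // Mimics "write the footer when the writer is closed".
    static byte[] appendFooter(byte[] payload) {
        ByteBuffer buffer = ByteBuffer.allocate(payload.length + FOOTER_LENGTH);
        buffer.put(payload);
        buffer.putLong(checksumOf(payload, payload.length));
        return buffer.array();
    }

    static long readFooterChecksum(byte[] file) {
        if (file.length < FOOTER_LENGTH) {
            throw new IllegalArgumentException("file too short to contain a footer");
        }
        return ByteBuffer.wrap(file, file.length - FOOTER_LENGTH, FOOTER_LENGTH).getLong();
    }

    // Rejects any read that would extend into the footer region.
    static byte[] readPayload(byte[] file, int position, int length) {
        int payloadEnd = file.length - FOOTER_LENGTH;
        if (position < 0 || length < 0 || (long) position + length > payloadEnd) {
            throw new IllegalArgumentException("read overlaps the footer");
        }
        return Arrays.copyOfRange(file, position, position + length);
    }

    static boolean verify(byte[] file) {
        return readFooterChecksum(file) == checksumOf(file, file.length - FOOTER_LENGTH);
    }
}
```

Keeping the checksum at the end of the file means it can be computed while operations are appended and written once on close, which is why reads must be bounded to the payload region.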

Generation-to-Checksum Mapping in TranslogTransferMetadata:

  • The TranslogTransferMetadata now includes a mapping between the translog generation and the corresponding checksum.
  • This mapping is used to compare the local and remote checksums during the incremental download process.
  • The TranslogTransferMetadataHandler is updated to read and write this generation-to-checksum mapping.
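To make the generation-to-checksum mapping concrete, the sketch below round-trips such a map through a byte array. The wire format (an entry count followed by generation and checksum pairs) and the class name are assumptions for illustration only; the actual format used by TranslogTransferMetadataHandler is not shown in this description.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of serializing a generation-to-checksum map.
class GenerationChecksumMapSketch {

    static byte[] write(Map<Long, String> generationToChecksum) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(generationToChecksum.size());
            // Sort by generation so the output is deterministic.
            for (Map.Entry<Long, String> entry : new TreeMap<>(generationToChecksum).entrySet()) {
                out.writeLong(entry.getKey());
                out.writeUTF(entry.getValue());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    static Map<Long, String> read(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int size = in.readInt();
            Map<Long, String> result = new TreeMap<>();
            for (int i = 0; i < size; i++) {
                long generation = in.readLong();
                result.put(generation, in.readUTF());
            }
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A map read back this way can be compared entry by entry against locally computed checksums during the incremental download.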

Testing

Manual Testing

Manually tested the failover and restore workflows on a 3-node EC2 cluster. Also tested backward compatibility: the initial cluster was created and indexed with an older OpenSearch process, then switched to incremental download.

Newly added tests

  • The RemoteFsTranslogTests class has been updated to include new test cases for the incremental download logic:
    • testIncrementalDownloadWithMatchingChecksum tests the scenario where the local and remote translog files have the same checksum, and the download is skipped.
    • testIncrementalDownloadWithDifferentChecksum tests the scenario where the local and remote translog files have different checksums, and only the missing generation is downloaded.
  • The TranslogFooterTests class has been added to verify the functionality of the TranslogFooter class, including writing and reading the footer.
  • The TranslogTransferMetadataHandlerTests class has been updated to include test cases for handling the generation-to-checksum map in the TranslogTransferMetadata.

Related Issues

One of the optimisations for https://github.com/opensearch-project/OpenSearch/issues/15277

Check List

  • [x] Functionality includes testing.
  • [ ] API changes companion pull request created, if applicable.
  • [ ] Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following Developer Certificate of Origin and signing off your commits, please check here.

rawahars avatar Oct 07 '24 10:10 rawahars

:x: Gradle check result for cd08d89d2dcff2305c2861e2a4281b674477bff6: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Oct 07 '24 11:10 github-actions[bot]

:white_check_mark: Gradle check result for f17a378440278eb50efb5af504524ed1551f4d2f: SUCCESS

github-actions[bot] avatar Oct 07 '24 13:10 github-actions[bot]

Codecov Report

:x: Patch coverage is 81.95489% with 24 lines in your changes missing coverage. Please review.
:white_check_mark: Project coverage is 72.08%. Comparing base (35c366d) to head (1211703).
:warning: Report is 905 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| .../org/opensearch/index/translog/TranslogFooter.java | 70.96% | 6 Missing and 3 partials :warning: |
| .../org/opensearch/index/translog/TranslogWriter.java | 73.68% | 2 Missing and 3 partials :warning: |
| ...rg/opensearch/index/translog/RemoteFsTranslog.java | 84.00% | 3 Missing and 1 partial :warning: |
| .../org/opensearch/index/translog/TranslogReader.java | 76.47% | 1 Missing and 3 partials :warning: |
| ...n/java/org/opensearch/index/translog/Translog.java | 75.00% | 1 Missing :warning: |
| ...dex/translog/transfer/TranslogTransferManager.java | 90.90% | 0 Missing and 1 partial :warning: |

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #16204      +/-   ##
============================================
- Coverage     72.10%   72.08%   -0.03%     
+ Complexity    64862    64858       -4     
============================================
  Files          5307     5308       +1     
  Lines        302606   302720     +114     
  Branches      43717    43734      +17     
============================================
+ Hits         218208   218228      +20     
- Misses        66541    66542       +1     
- Partials      17857    17950      +93     

codecov[bot] avatar Oct 07 '24 13:10 codecov[bot]

:x: Gradle check result for b8f56ea3cd6046203c81dc7bf0c30f3d5f18c909: FAILURE

github-actions[bot] avatar Oct 08 '24 09:10 github-actions[bot]

:x: Gradle check result for 535bc28e6929046a6dabd99b09ffee02c171ce8e: FAILURE

github-actions[bot] avatar Oct 09 '24 06:10 github-actions[bot]

:x: Gradle check result for ce18c9bc945e59aa279f3083a455c18244697f8f: FAILURE

github-actions[bot] avatar Oct 09 '24 10:10 github-actions[bot]

:x: Gradle check result for 6ea2adbeb30841fc1f0954e4e4ce4f5954f47320: FAILURE

github-actions[bot] avatar Oct 10 '24 06:10 github-actions[bot]

Rebased over mainline and addressed the comments which needed any refactoring.

rawahars avatar Oct 15 '24 08:10 rawahars

:grey_exclamation: Gradle check result for 121170316316b431854e7237b51566df6920a3f5: UNSTABLE

  • TEST FAILURES:
      1 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

github-actions[bot] avatar Oct 15 '24 09:10 github-actions[bot]

This PR is stalled because it has been open for 30 days with no activity.
