OpenSearch
[Remote Store] Permit backed futures to prevent timeouts during upload bursts
Description
Bursts of uploads typically occur during finalize recovery in cases such as shrink, split, and force merge, where many segments or large segments become available for upload at once. During such a burst, a large number of requests queue up behind the connection pool. This results either in timeouts from failing to acquire a connection, or in idle connection timeouts, where an in-flight request takes too long to read and compute data for upload because of the high wait time to acquire a thread in the stream reader pool. The problem is more prevalent in the async flow, since the main thread doesn't wait for the response and everything ends up being submitted for upload. Neither the sync nor the async S3 SDK APIs currently provide a way to handle such bursts.
This PR resolves these problems by applying natural backpressure on the main thread with the help of backing permits. It also adds retries on the future in case of an SDK exception or a failure to acquire a permit. This means that in a multi-part upload, a failing part can be retried independently.
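To illustrate the idea, here is a minimal sketch of a permit-backed future with per-request retries. It is not the code in this PR: the class `PermitBackedUploader`, its constructor parameters, and the helper `retryOrFail` are hypothetical names, and a plain `java.util.concurrent.Semaphore` stands in for whatever permit mechanism the change actually uses. The caller blocks in `submit` until a permit is free, which is what produces the backpressure, and a failed attempt (an exception from the request or a permit acquisition timeout) is retried up to a fixed number of times.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Hypothetical sketch of permit-backed futures; names are illustrative only. */
public class PermitBackedUploader {
    private final Semaphore permits;
    private final int maxAttempts;
    private final long acquireTimeoutMillis;

    public PermitBackedUploader(int maxPermits, int maxAttempts, long acquireTimeoutMillis) {
        this.permits = new Semaphore(maxPermits);
        this.maxAttempts = maxAttempts;
        this.acquireTimeoutMillis = acquireTimeoutMillis;
    }

    /** Blocks the calling thread until a permit is free, then starts the async request. */
    public <T> CompletableFuture<T> submit(Supplier<CompletableFuture<T>> request) {
        CompletableFuture<T> result = new CompletableFuture<>();
        attempt(request, 1, result);
        return result;
    }

    public int availablePermits() {
        return permits.availablePermits();
    }

    private <T> void attempt(Supplier<CompletableFuture<T>> request, int attemptNo,
                             CompletableFuture<T> result) {
        boolean acquired;
        try {
            // Waiting here is the backpressure: no permit, no new in-flight request.
            acquired = permits.tryAcquire(acquireTimeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            result.completeExceptionally(e);
            return;
        }
        if (!acquired) {
            retryOrFail(request, attemptNo, result,
                new IllegalStateException("could not acquire a permit in time"));
            return;
        }
        CompletableFuture<T> inFlight;
        try {
            inFlight = request.get();
        } catch (RuntimeException e) {
            permits.release();
            retryOrFail(request, attemptNo, result, e);
            return;
        }
        inFlight.whenComplete((value, error) -> {
            permits.release(); // always return the permit before deciding what to do next
            if (error == null) {
                result.complete(value);
            } else {
                retryOrFail(request, attemptNo, result, error);
            }
        });
    }

    private <T> void retryOrFail(Supplier<CompletableFuture<T>> request, int attemptNo,
                                 CompletableFuture<T> result, Throwable cause) {
        if (attemptNo < maxAttempts) {
            attempt(request, attemptNo + 1, result); // only this request is retried
        } else {
            result.completeExceptionally(cause);
        }
    }
}
```

In a multi-part upload, each part would be passed to `submit` as its own supplier, so a failing part is retried without resubmitting the others. Note that in this sketch a retry runs on whichever thread completed the failed future and has no backoff; a production implementation would likely hand retries off to a scheduler.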
Testing
- During post-recovery of a 98 GB shard on an r7g.medium box after a split of the nyc_taxis index, I did not observe any IO timeouts. Concurrent execution of the so workload benchmark also did not produce any timeout errors.
- No impact on indexing performance was observed when running benchmarks on the main build and on a build with this PR.
Related Issues
Resolves #[Issue number to be closed when this PR is merged]
Check List
- [x] New functionality includes testing.
- [x] All tests pass
- [ ] ~~New functionality has been documented.~~
- [ ] ~~New functionality has javadoc added~~
- [x] Failing checks are inspected and point to the corresponding known issue(s) (See: Troubleshooting Failing Builds)
- [x] Commits are signed per the DCO using --signoff
- [ ] ~~Commit changes are listed out in CHANGELOG.md file (See: Changelog)~~
- [ ] ~~Public documentation issue/PR created~~
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following Developer Certificate of Origin and signing off your commits, please check here.
:x: Gradle check result for d22dbe123c2722d32154d8dd0af083b3093f69d1: FAILURE
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
:x: Gradle check result for 8005bdfe5ed40624d440f045dee5444ec9b9d271: FAILURE
Compatibility status:
Checks if related components are compatible with change 7cb04d8
Incompatible components
Skipped components
Compatible components: [https://github.com/opensearch-project/custom-codecs.git, https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/flow-framework.git, https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/performance-analyzer-rca.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/sql.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/performance-analyzer.git]
:x: Gradle check result for 2fbb44022e1c5b7ffdd60e4da79ec70b7bc122e3: FAILURE
:x: Gradle check result for 06cc6090f09f5dad50d592f3af952230f92fae00: FAILURE
:x: Gradle check result for bc512008781169c5316e6d6d2a036e44641bc4fe: FAILURE
:x: Gradle check result for 62b1d15568f3a7b5eb8a07df1e7d2a70587ca39c: FAILURE
:x: Gradle check result for c5b551227b786cb1b4bc9fa0fe1c52f650c5f08a: FAILURE
:x: Gradle check result for 808cb65dcb3e07ed7c8f47bb7ee2727c2e4665ec: FAILURE
:x: Gradle check result for 3e58685ac90b0cebfe34d0186e98d1d79957d07d: FAILURE
@vikasvb90, please confirm the release target: is it 2.13 or 2.14?
@rramachand21 The target is 2.13, but that depends on how the review goes.
:x: Gradle check result for cec5768066fd2ea54f6503c596be6b075fa8940e: FAILURE
:x: Gradle check result for 74c07eddf257a52661825ed34bdff124d086cddc: FAILURE
:x: Gradle check result for 68b8a3daf4ac9d4b7b516d32e2f70e4876a47150: FAILURE
:x: Gradle check result for 7cb04d8bcadef730aeb731774ce0a5056cc52896: FAILURE
:x: Gradle check result for 31d0e1fd77258e61d1c2e6ba41fa2ad843c83bc7: FAILURE
:x: Gradle check result for ea8903af061bbb81f77856bf58efcf5b8f488924: FAILURE
:x: Gradle check result for b967eb7ebab8106772e009c55c1bb7b351801d75: FAILURE
:x: Gradle check result for 0c88d57adb4b62d61cbc49e64e9a79539201bbc4: FAILURE
:white_check_mark: Gradle check result for 635124d4d90b99de91750a962d6819ce0b7d3b1a: SUCCESS
Codecov Report
Attention: Patch coverage is 71.91011%, with 100 lines in your changes missing coverage. Please review.
Project coverage is 71.66%. Comparing base (b15cb0c) to head (c394e90). Report is 275 commits behind head on main.
Additional details and impacted files
Coverage Diff

| Metric | main | #12159 | +/- |
|---|---|---|---|
| Coverage | 71.42% | 71.66% | +0.24% |
| Complexity | 59978 | 61159 | +1181 |
| Files | 4985 | 5056 | +71 |
| Lines | 282275 | 287445 | +5170 |
| Branches | 40946 | 41640 | +694 |
| Hits | 201603 | 206009 | +4406 |
| Misses | 63999 | 64374 | +375 |
| Partials | 16673 | 17062 | +389 |
:white_check_mark: Gradle check result for 1b6033278f0e324cdd67e7d2a11e561f77186868: SUCCESS
:white_check_mark: Gradle check result for a613eb52b4910543eef7bee0be6d1758d438056c: SUCCESS
:white_check_mark: Gradle check result for 8fbc661f6143e73d7bf7847d6f572f0b77c98586: SUCCESS
:white_check_mark: Gradle check result for a518a0881e5483736a931476879aa8ca09f97f6c: SUCCESS
[Storage Triage - attendees 1 2 3 4 5 6 7 8 9 10 11 12 13]
@vikasvb90 Thanks for opening this. Looking forward to getting this improvement merged.
:x: Gradle check result for 85ee3086faf713aed6a6a5ab04b27da2651ccc10: FAILURE
:x: Gradle check result for 377c1700d5d1868e91348fff82051da23ca7e88a: FAILURE
:x: Gradle check result for 4aabe473bb7f8e820a1cd038106a743a4d2696d5: FAILURE