OpenSearch
Bug/sbp cancellation
Description
This PR fixes the bug reported in https://github.com/opensearch-project/OpenSearch/issues/13295
Changes
- Refactor SearchBackpressureService to introduce resource-wise cancellation when the node is in duress because of that resource
- Move all resourceTrackers into a single class
- Put the logic that determines whether a task's resource usage is breaching behind an interface, and make it an instance member
- Add a unit test (UT) covering the bug scenario from the linked issue
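To illustrate the idea behind the changes listed above, here is a minimal sketch of resource-wise cancellation. It is not the code from this PR: the class, interface, and method names (`ResourceWiseCancellationSketch`, `UsageBreachEvaluator`, `ResourceTrackers`, `selectTasksToCancel`) and the threshold-based breach check are all assumptions made for illustration. The sketch assumes that only the trackers for resources currently in duress should be consulted when picking tasks to cancel.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only: every name below is hypothetical, not the OpenSearch API.
final class ResourceWiseCancellationSketch {

    enum Resource { CPU, HEAP }

    /** A running task and its usage per resource (arbitrary units). */
    static final class Task {
        final String id;
        final Map<Resource, Long> usage;

        Task(String id, Map<Resource, Long> usage) {
            this.id = id;
            this.usage = usage;
        }
    }

    /** The breach check sits behind an interface so each tracker can supply its own. */
    interface UsageBreachEvaluator {
        boolean isBreaching(Task task);
    }

    /** Single holder for all resource trackers, keyed by resource. */
    static final class ResourceTrackers {
        private final Map<Resource, UsageBreachEvaluator> evaluators = new EnumMap<>(Resource.class);

        /** Registers a simple threshold-based evaluator for one resource (an assumed policy). */
        void register(Resource resource, long threshold) {
            evaluators.put(resource, task -> task.usage.getOrDefault(resource, 0L) > threshold);
        }

        UsageBreachEvaluator evaluatorFor(Resource resource) {
            return evaluators.get(resource);
        }
    }

    /** Returns the ids of tasks that breach the limit of at least one resource in duress. */
    static List<String> selectTasksToCancel(List<Task> tasks, Set<Resource> inDuress, ResourceTrackers trackers) {
        List<String> toCancel = new ArrayList<>();
        for (Task task : tasks) {
            // Only resources that are in duress are consulted; other trackers are ignored.
            for (Resource resource : inDuress) {
                UsageBreachEvaluator evaluator = trackers.evaluatorFor(resource);
                if (evaluator != null && evaluator.isBreaching(task)) {
                    toCancel.add(task.id);
                    break; // one breaching resource is enough to select the task
                }
            }
        }
        return toCancel;
    }
}
```

In this sketch, a CPU-heavy task is left alone when only heap is in duress, because the CPU tracker is never consulted in that case.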
New Logic for Cancellation
Related Issues
Resolves #13295
Check List
- [X] New functionality includes testing.
- [X] All tests pass
- [X] New functionality has been documented.
- [X] New functionality has javadoc added
- [ ] Failing checks are inspected and point to the corresponding known issue(s) (See: Troubleshooting Failing Builds)
- [X] Commits are signed per the DCO using --signoff
- [ ] Commit changes are listed out in CHANGELOG.md file (See: Changelog)
- [ ] Public documentation issue/PR created
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following Developer Certificate of Origin and signing off your commits, please check here.
:x: Gradle check result for aa4fd2b714d381a38fabe67d1089631f83b967d0: FAILURE
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
:x: Gradle check result for bf11c85337d344b82615d2ae3fd79ab62d83ca43: FAILURE
:x: Gradle check result for 5bcac55b25445275434d1f78f0b748cf39386897: FAILURE
Hi @kaushalmahi12, thank you for submitting this PR. Would you mind also creating a cancellation logic diagram similar to the one you posted previously at https://github.com/opensearch-project/OpenSearch/issues/13295#issuecomment-2078162354? It would really help us grasp the changes to search backpressure.
:x: Gradle check result for 6b1c65815e06ae3451cabf806a03bc630f5f4e85: FAILURE
:x: Gradle check result for cd3e65bb300288d23c15116f2a9f6a54a9f83328: FAILURE
:x: Gradle check result for 2c3c4bc2674c38c6019a4d2ba51d6666f610af68: FAILURE
:x: Gradle check result for 7646691b2810cbc000b84886334728ef5dfe6510: FAILURE
:x: Gradle check result for 49d9501961c8d17dadb6f28ba9233bce8ac1625d: FAILURE
:x: Gradle check result for f9e7c5ba55fc4955e676ecd26a7c8f3d2d70954b: FAILURE
:x: Gradle check result for cd98f5a3f09c860c5e2bf3335cee3e549a2dbef4: FAILURE
:x: Gradle check result for f3b20f95950acf23c6658f60fbbdc19d7d9f06db: FAILURE
:x: Gradle check result for f49011abc0e243015f74b0c313431dec920cfccc: FAILURE
:x: Gradle check result for 135b4c6099c12f674d6e41c20e67e95dbb23f1bd: FAILURE
Can you still create a documentation issue to add details about the cancellation count stats?
Do you mean explaining the task-level stats (SearchTask, SearchShardTask)?
I mean the difference between the cancellation stats at the resource tracker level and those at the top level.
@kaushalmahi12 Changes LGTM. Can you please fix the conflicts in CHANGELOG.md and resolve all the workflow failures?
I created this issue in the documentation repo: https://github.com/opensearch-project/documentation-website/issues/7409
:x: Gradle check result for 0c7043deca6d4423377971ae9bf8adb2e5b6e16f: FAILURE
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
Tests with failures:
- org.opensearch.index.shard.RemoteIndexShardTests.testSegmentInfosAndReplicationCheckpointTuple
- org.opensearch.index.shard.RemoteIndexShardTests.classMethod
:x: Gradle check result for 3c3c64e4a40b5b4715479e63d9e9002c8e6ded7a: FAILURE
:grey_exclamation: Gradle check result for 214febafc55c7ebf6b292869741c6a03d750f2eb: UNSTABLE
Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.
:x: Gradle check result for becc022e06cd9216d6a636470d72a7f85d05227f: FAILURE
:white_check_mark: Gradle check result for bd55e42440c48b5feb5714eb91818135de2b0f40: SUCCESS
:white_check_mark: Gradle check result for 0e38dee0d6c27ae12ca3175b1b75dea384126877: SUCCESS
❌ Gradle check result for becc022: FAILURE
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
The failure is related to an IndicesRequestCacheIT timeout, which was recently fixed by https://github.com/opensearch-project/OpenSearch/pull/14369