Bug/sbp cancellation

Open kaushalmahi12 opened this issue 9 months ago • 4 comments

Description

This PR fixes the bug reported in https://github.com/opensearch-project/OpenSearch/issues/13295

Changes

  • Refactor SearchBackpressureService to introduce resource-wise cancellation when the node is in duress because of a specific resource
  • Move all resourceTrackers into a single class
  • Put the logic that determines whether a task's resource usage is breaching behind an interface and make it an instance member
  • Add a unit test covering the bug scenario mentioned above
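
The changes listed above can be illustrated with a minimal, self-contained sketch. It is not the code from this PR: every name in it (`Resource`, `Task`, `BreachEvaluator`, `ResourceTracker`, `ResourceTrackers`, the 0.5 thresholds) is hypothetical and chosen only for illustration. It assumes the intent is that, when the node is in duress because of one resource, only the tracker for that resource is consulted when picking tasks to cancel.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; none of these types are the actual OpenSearch classes.
final class ResourceWiseCancellationSketch {

    // Resources a node can be in duress for (assumed set).
    enum Resource { CPU, HEAP }

    // Minimal stand-in for a search task with per-resource usage in [0, 1].
    static final class Task {
        final String id;
        final Map<Resource, Double> usage;

        Task(String id, double cpu, double heap) {
            this.id = id;
            this.usage = new EnumMap<>(Resource.class);
            this.usage.put(Resource.CPU, cpu);
            this.usage.put(Resource.HEAP, heap);
        }
    }

    // The "is this task breaching?" decision sits behind an interface.
    interface BreachEvaluator {
        boolean isBreaching(Task task);
    }

    // Each tracker keeps its evaluator as an instance member.
    static final class ResourceTracker {
        private final BreachEvaluator evaluator;

        ResourceTracker(Resource resource, double threshold) {
            this.evaluator = task -> task.usage.getOrDefault(resource, 0.0) > threshold;
        }

        List<String> breachingTaskIds(List<Task> tasks) {
            List<String> ids = new ArrayList<>();
            for (Task task : tasks) {
                if (evaluator.isBreaching(task)) {
                    ids.add(task.id);
                }
            }
            return ids;
        }
    }

    // A single class that owns all trackers, keyed by resource.
    static final class ResourceTrackers {
        private final Map<Resource, ResourceTracker> trackers = new EnumMap<>(Resource.class);

        void register(Resource resource, double threshold) {
            trackers.put(resource, new ResourceTracker(resource, threshold));
        }

        // Only the tracker for the resource in duress is consulted.
        List<String> cancellationCandidates(Resource inDuress, List<Task> tasks) {
            ResourceTracker tracker = trackers.get(inDuress);
            return tracker == null ? Collections.<String>emptyList() : tracker.breachingTaskIds(tasks);
        }
    }

    static List<String> demo(Resource inDuress) {
        ResourceTrackers trackers = new ResourceTrackers();
        trackers.register(Resource.CPU, 0.5);
        trackers.register(Resource.HEAP, 0.5);

        List<Task> tasks = new ArrayList<>();
        tasks.add(new Task("cpu-heavy", 0.9, 0.1));
        tasks.add(new Task("heap-heavy", 0.1, 0.8));
        return trackers.cancellationCandidates(inDuress, tasks);
    }

    public static void main(String[] args) {
        System.out.println(demo(Resource.HEAP));
        System.out.println(demo(Resource.CPU));
    }
}
```

In this sketch, a node in duress because of heap selects only the `heap-heavy` task, and a node in duress because of CPU selects only `cpu-heavy`; a task that is heavy on an unrelated resource is left alone.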

New Logic for Cancellation

SBP_cancellation

Related Issues

Resolves #13295

Check List

  • [X] New functionality includes testing.
    • [X] All tests pass
  • [X] New functionality has been documented.
    • [X] New functionality has javadoc added
  • [ ] Failing checks are inspected and point to the corresponding known issue(s) (See: Troubleshooting Failing Builds)
  • [X] Commits are signed per the DCO using --signoff
  • [ ] Commit changes are listed out in CHANGELOG.md file (See: Changelog)
  • [ ] Public documentation issue/PR created

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following Developer Certificate of Origin and signing off your commits, please check here.

kaushalmahi12 avatar Apr 30 '24 17:04 kaushalmahi12

:x: Gradle check result for aa4fd2b714d381a38fabe67d1089631f83b967d0: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Apr 30 '24 18:04 github-actions[bot]

:x: Gradle check result for bf11c85337d344b82615d2ae3fd79ab62d83ca43: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Apr 30 '24 18:04 github-actions[bot]

:x: Gradle check result for 5bcac55b25445275434d1f78f0b748cf39386897: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Apr 30 '24 19:04 github-actions[bot]

Hi @kaushalmahi12, thank you for submitting this PR. Would you mind also creating a cancellation logic diagram similar to what you've previously done https://github.com/opensearch-project/OpenSearch/issues/13295#issuecomment-2078162354? It would really help us grasp the changes for search backpressure.

ticheng-aws avatar May 01 '24 21:05 ticheng-aws

:x: Gradle check result for 6b1c65815e06ae3451cabf806a03bc630f5f4e85: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar May 21 '24 18:05 github-actions[bot]

:x: Gradle check result for cd3e65bb300288d23c15116f2a9f6a54a9f83328: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar May 21 '24 20:05 github-actions[bot]

:x: Gradle check result for 2c3c4bc2674c38c6019a4d2ba51d6666f610af68: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar May 21 '24 20:05 github-actions[bot]

:x: Gradle check result for 7646691b2810cbc000b84886334728ef5dfe6510: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar May 30 '24 23:05 github-actions[bot]

:x: Gradle check result for 49d9501961c8d17dadb6f28ba9233bce8ac1625d: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Jun 03 '24 23:06 github-actions[bot]

:x: Gradle check result for f9e7c5ba55fc4955e676ecd26a7c8f3d2d70954b: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Jun 04 '24 00:06 github-actions[bot]

:x: Gradle check result for cd98f5a3f09c860c5e2bf3335cee3e549a2dbef4: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Jun 04 '24 20:06 github-actions[bot]

:x: Gradle check result for f3b20f95950acf23c6658f60fbbdc19d7d9f06db: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Jun 12 '24 15:06 github-actions[bot]

:x: Gradle check result for f49011abc0e243015f74b0c313431dec920cfccc: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Jun 12 '24 22:06 github-actions[bot]

:x: Gradle check result for 135b4c6099c12f674d6e41c20e67e95dbb23f1bd: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Jun 17 '24 20:06 github-actions[bot]

Can you still create a documentation issue to add details about cancellation count stats?

Do you mean explaining the task-level stats (SearchTask, SearchShardTask)?

kaushalmahi12 avatar Jun 18 '24 12:06 kaushalmahi12

Can you still create a documentation issue to add details about cancellation count stats?

Do you mean explaining the task-level stats (SearchTask, SearchShardTask)?

I mean the difference between the cancellation stats at the resource tracker level and those at the top level.

sohami avatar Jun 19 '24 05:06 sohami

@kaushalmahi12 Changes LGTM. Can you please fix the conflicts in CHANGELOG.md and resolve all the workflow failures?

sohami avatar Jun 19 '24 05:06 sohami

Created this issue in the documentation repo: https://github.com/opensearch-project/documentation-website/issues/7409

kaushalmahi12 avatar Jun 19 '24 19:06 kaushalmahi12

:x: Gradle check result for 0c7043deca6d4423377971ae9bf8adb2e5b6e16f: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Jun 19 '24 21:06 github-actions[bot]

Tests with failures:

  • org.opensearch.index.shard.RemoteIndexShardTests.testSegmentInfosAndReplicationCheckpointTuple
  • org.opensearch.index.shard.RemoteIndexShardTests.classMethod

kaushalmahi12 avatar Jun 20 '24 18:06 kaushalmahi12

:x: Gradle check result for 3c3c64e4a40b5b4715479e63d9e9002c8e6ded7a: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Jun 20 '24 18:06 github-actions[bot]

:grey_exclamation: Gradle check result for 214febafc55c7ebf6b292869741c6a03d750f2eb: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

github-actions[bot] avatar Jun 20 '24 23:06 github-actions[bot]

:x: Gradle check result for becc022e06cd9216d6a636470d72a7f85d05227f: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Jun 20 '24 23:06 github-actions[bot]

:white_check_mark: Gradle check result for bd55e42440c48b5feb5714eb91818135de2b0f40: SUCCESS

github-actions[bot] avatar Jun 21 '24 14:06 github-actions[bot]

:white_check_mark: Gradle check result for 0e38dee0d6c27ae12ca3175b1b75dea384126877: SUCCESS

github-actions[bot] avatar Jun 21 '24 14:06 github-actions[bot]

❌ Gradle check result for becc022: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

The failure is related to an IndicesRequestCacheIT timeout, which was recently fixed by https://github.com/opensearch-project/OpenSearch/pull/14369

sohami avatar Jun 21 '24 16:06 sohami