OpenSearch icon indicating copy to clipboard operation
OpenSearch copied to clipboard

[Remote Store] Change remote purge threadpool to fixed instead of scaling to limit i…

Open gbbafna opened this issue 1 year ago • 11 comments

…t to a bounded size

Description

  1. Change remote purge threadpool to fixed threadpool
  2. Handle Runtime Exceptions in unhandled Translog deletion paths

Related Issues

Resolves [#12253]

Check List

  • [x] New functionality includes testing.
    • [x] All tests pass
  • [x] New functionality has been documented.
    • [x] New functionality has javadoc added
  • [x] Failing checks are inspected and point to the corresponding known issue(s) (See: Troubleshooting Failing Builds)
  • [x] Commits are signed per the DCO using --signoff
  • [ ] ~Commit changes are listed out in CHANGELOG.md file (See: Changelog)~
  • [ ] ~Public documentation issue/PR created~

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following Developer Certificate of Origin and signing off your commits, please check here.

gbbafna avatar Feb 08 '24 05:02 gbbafna

:x: Gradle check result for 9882c0cb7f322fbce70355b5007d38ab5f851b23: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Feb 08 '24 05:02 github-actions[bot]

Compatibility status:

Checks if related components are compatible with change 699b571

Incompatible components

Skipped components

Compatible components

Compatible components: [https://github.com/opensearch-project/custom-codecs.git, https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/flow-framework.git, https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/performance-analyzer-rca.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/sql.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/performance-analyzer.git]

github-actions[bot] avatar Feb 08 '24 05:02 github-actions[bot]

:x: Gradle check result for 7fd858198dbb5f916ce145601e08e36042178aa3: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Feb 08 '24 06:02 github-actions[bot]

:x: Gradle check result for 7fd858198dbb5f916ce145601e08e36042178aa3: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Feb 08 '24 07:02 github-actions[bot]

:x: Gradle check result for 3d08b4c4d7de3b667c172355626bd6c8496fdcf4: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Feb 08 '24 11:02 github-actions[bot]

Making this pool bounded creates different problems - namely these tasks fail - how will they be retried, what if they aren't retried?

If there is something generating this large number of tasks, can it regulate itself by inspecting the pool since it seems specifically associated.

Good point . It doesn't create different problem . That problem already exists today and will be solved via https://github.com/opensearch-project/OpenSearch/issues/8469 . Even if we submit the task and it fails, today there is no retry mechanism .

gbbafna avatar Feb 09 '24 04:02 gbbafna

:x: Gradle check result for 50dcbf291fdab59dd9081a60cf1316cf636978fd: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Feb 09 '24 06:02 github-actions[bot]

:x: Gradle check result for 699b571c014a68f2f9b2ee6ca44c3798562c7743: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Feb 09 '24 10:02 github-actions[bot]

I'm not convinced we should move forward with this pull request, it looks like a small bandaid on widely impacting scenario, please help me better understand why this change shouldn't be built upon to better handle the root cause.

Key areas of concern:

  • Not signaling task queue rejection of tasks - exception as flow of control is expensive and undocumented exceptional behavior is going to lead to inconsistent behavior around task management. https://github.com/opensearch-project/OpenSearch/pull/12247#discussion_r1484563970
  • Doesn't address the root cause of generating millions of what become unactionable tasks.

peternied avatar Feb 09 '24 16:02 peternied

I'm not convinced we should move forward with this pull request, it looks like a small bandaid on widely impacting scenario, please help me better understand why this change shouldn't be built upon to better handle the root cause.

Key areas of concern:

* Not signaling task queue rejection of tasks - exception as flow of control is expensive and undocumented exceptional behavior is going to lead to inconsistent behavior around task management. [[Remote Store] Change remote purge threadpool to fixed instead of scaling to limit i… #12247 (comment)](https://github.com/opensearch-project/OpenSearch/pull/12247#discussion_r1484563970)

* Doesn't address the root cause of generating millions of what become unactionable tasks.

We do want to solve this problem in the right way. However that needs to be designed and implemented and is not a quick fix kind of work . In the meantime, this solution prevents the queue to become unmanageable in the first place . We need this life saving first-aid for now, till we do the surgery.

gbbafna avatar Feb 14 '24 06:02 gbbafna

@gbbafna Thanks for your thoughts. In the issue @harishbhakuni mentioned an approach to execute the deletes in batches rather than individual tasks. This approach sounds easier to deliver in the short term and would not introduce risk around exception handling. What do you think about pivoting in that direction?

Yes, we would need that change as well as a first level of defense . But this change will be a second level of defense as the problem can still happen . Even the batch deletes can take a lot of time to execute. The chances of threadpool size going very high would still be there, though it be reduced . Making it bounded will help us get upfront rejections, which we can use to reject the snapshot deletion as well . @harishbhakuni will be making that change as well, which relies on using a FixedThreadPool .

gbbafna avatar Feb 15 '24 04:02 gbbafna

This PR is stalled because it has been open for 30 days with no activity.

Closing this issue as we are relying on @harishbhakuni 's fix https://github.com/opensearch-project/OpenSearch/pull/12319 . Will revisit later if we see any issues .

gbbafna avatar Apr 23 '24 04:04 gbbafna