OpenSearch

Query-level resource usages tracking

Open ansjcy opened this issue 10 months ago • 7 comments

Description

  • Instrument resource usage before a task finishes on a data node. More specifically, capture resource usage at the last step (serialization) before a phase response is sent from the query/fetch/... phase.
  • Piggyback the resource usage data on the phase results.
  • Gather resource usage for all shard search tasks on the coordinator node (in the query insights plugin) to get the query-level resource usage.
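
The three steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `QueryUsageSketch`, `ShardTaskUsage`, `PhaseResponse`, `finishPhase`, and `aggregate` are hypothetical names, and the CPU and memory fields are an assumed shape for the usage data.

```java
// Hypothetical sketch; none of these types are the actual OpenSearch classes.
final class QueryUsageSketch {

    /** Usage reported by one shard-level search task (assumed shape). */
    static final class ShardTaskUsage {
        final String nodeId;
        final String phase;
        final long cpuNanos;
        final long memoryBytes;

        ShardTaskUsage(String nodeId, String phase, long cpuNanos, long memoryBytes) {
            this.nodeId = nodeId;
            this.phase = phase;
            this.cpuNanos = cpuNanos;
            this.memoryBytes = memoryBytes;
        }
    }

    /** A phase result with the usage snapshot piggybacked onto it. */
    static final class PhaseResponse {
        final String payload;
        final ShardTaskUsage usage;

        PhaseResponse(String payload, ShardTaskUsage usage) {
            this.payload = payload;
            this.usage = usage;
        }
    }

    /** Data node side: attach the usage snapshot as the last step before the response is sent. */
    static PhaseResponse finishPhase(String payload, String nodeId, String phase,
                                     long cpuNanos, long memoryBytes) {
        return new PhaseResponse(payload, new ShardTaskUsage(nodeId, phase, cpuNanos, memoryBytes));
    }

    /** Coordinator side: sum usage over all shard task responses; returns {cpuNanos, memoryBytes}. */
    static long[] aggregate(PhaseResponse... responses) {
        long cpu = 0L;
        long mem = 0L;
        for (PhaseResponse r : responses) {
            cpu += r.usage.cpuNanos;
            mem += r.usage.memoryBytes;
        }
        return new long[] { cpu, mem };
    }

    public static void main(String[] args) {
        // Made-up values, for illustration only.
        long[] total = aggregate(
            finishPhase("query-result", "node-1", "query", 100L, 2048L),
            finishPhase("fetch-result", "node-2", "fetch", 250L, 1024L));
        System.out.println("cpuNanos=" + total[0] + " memoryBytes=" + total[1]);
    }
}
```

Carrying the snapshot inside the response means the coordinator needs no extra round trip to the data nodes; it only sums what arrives with each phase result.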

Related Issues

Resolves https://github.com/opensearch-project/OpenSearch/issues/12399

Benchmark tests

I ran extensive benchmark tests and merged the results by averaging across multiple runs. Here are the test results:

baseline-resulsts.txt feature-ressults.txt

I don't see a significant impact on search latency with this change.
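
For readers who want to reproduce the merge step, averaging one metric across runs could look like the sketch below. The class name, the method name, and the sample values are assumptions made for illustration; they are not taken from the attached result files.

```java
// Hypothetical helper; the sample values in main are made up for illustration.
final class BenchmarkAverageSketch {

    /** Arithmetic mean of one metric (for example, a latency percentile) across runs. */
    static double mean(double... values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("need at least one run");
        }
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    public static void main(String[] args) {
        // Three made-up runs of the same operation.
        System.out.println(mean(100.0, 110.0, 120.0)); // prints 110.0
    }
}
```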

Check List

  • [ ] New functionality includes testing.
    • [ ] All tests pass
  • [ ] New functionality has been documented.
    • [ ] New functionality has javadoc added
  • [ ] Failing checks are inspected and point to the corresponding known issue(s) (See: Troubleshooting Failing Builds)
  • [ ] Commits are signed per the DCO using --signoff
  • [ ] Commit changes are listed out in CHANGELOG.md file (See: Changelog)
  • [ ] Public documentation issue/PR created

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following Developer Certificate of Origin and signing off your commits, please check here.

ansjcy avatar Apr 12 '24 01:04 ansjcy

:x: Gradle check result for 32f9a756e2bd1eaa026b5aa736e2881e6d7e16d5: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Apr 12 '24 01:04 github-actions[bot]

Compatibility status:

Checks if related components are compatible with change 32f9a75

Incompatible components

Incompatible components: [https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/flow-framework.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/performance-analyzer.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/sql.git]

Skipped components

Compatible components

Compatible components: [https://github.com/opensearch-project/custom-codecs.git, https://github.com/opensearch-project/performance-analyzer-rca.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git, https://github.com/opensearch-project/common-utils.git]

github-actions[bot] avatar Apr 12 '24 02:04 github-actions[bot]

What version of OpenSearch are we planning this for? cc: @ansjcy

rramachand21 avatar Apr 16 '24 17:04 rramachand21

Hey @rramachand21, we are targeting 2.15 for this change.

ansjcy avatar Apr 16 '24 20:04 ansjcy

:x: Gradle check result for 3987580f00d4e77797e87b2c627957a44fba2e08: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Apr 16 '24 21:04 github-actions[bot]

:x: Gradle check result for 27813ba80d95718c1bfda0a6348a46a2e3992345: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Apr 16 '24 21:04 github-actions[bot]

:x: Gradle check result for 32c9b49bdb3a1e4448333f04d4f673ffd5a716ac: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Apr 16 '24 23:04 github-actions[bot]

> Did extensive benchmark tests, merged the tests results by calculating average on multiple runs, and here are the test results: I don't see significant impact on search latency with this change.

@ansjcy - I went through the benchmark numbers, and some of them show a >5% increase in p99 latency. For example: keyword-terms (117 -> 126), keyword-terms-low-cardinality (110 -> 120), and range (130 -> 137). These may be one-off run variations, but it would be good to do a few more runs and report the results side by side. Also, did we notice any regressions in CPU/JVM usage?

jainankitk avatar May 15 '24 00:05 jainankitk
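
The p99 figures quoted in the comment above can be checked with a short relative-change calculation. This is only a sketch: the class and method names are hypothetical, and the inputs are the three before/after pairs cited in the comment, in whatever units the attached results use.

```java
// Hypothetical helper for checking the quoted p99 changes.
final class P99ChangeSketch {

    /** Relative change from baseline to feature, in percent. */
    static double percentChange(double baseline, double feature) {
        return (feature - baseline) / baseline * 100.0;
    }

    public static void main(String[] args) {
        String[] names = { "keyword-terms", "keyword-terms-low-cardinality", "range" };
        double[][] pairs = { { 117, 126 }, { 110, 120 }, { 130, 137 } };
        for (int i = 0; i < names.length; i++) {
            double pct = percentChange(pairs[i][0], pairs[i][1]);
            // Roughly 7.69, 9.09, and 5.38 percent respectively.
            System.out.println(String.format(java.util.Locale.ROOT, "%s: +%.2f%%", names[i], pct));
        }
    }
}
```

All three changes exceed 5%, which is consistent with the reviewer's request for more runs.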

:x: Gradle check result for b68bebd800b1b2f5a0d274666eeca8d3510d7066: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar May 17 '24 22:05 github-actions[bot]

:x: Gradle check result for 7c450c05baa952241327bcb0f9fe8698d95e8551: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar May 17 '24 22:05 github-actions[bot]

:x: Gradle check result for b89678c5a1838919ed54edae1a49c9f751ebc51f: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar May 17 '24 22:05 github-actions[bot]

:x: Gradle check result for c11efc04d59b907d603e297b314f3f2ee50f08d3: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar May 20 '24 21:05 github-actions[bot]

:x: Gradle check result for 8bc0951e29f3b633f83e575339a56572d7f3ae02: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar May 20 '24 23:05 github-actions[bot]

:white_check_mark: Gradle check result for 3fe476721af3466122313b5a50b2f9825c2efca9: SUCCESS

github-actions[bot] avatar May 22 '24 20:05 github-actions[bot]

Codecov Report

Attention: Patch coverage is 85.45455% with 24 lines in your changes missing coverage. Please review.

Project coverage is 71.64%. Comparing base (b15cb0c) to head (69629ff). Report is 367 commits behind head on main.

| Files | Patch % | Lines |
|---|---|---|
| .../opensearch/tasks/TaskResourceTrackingService.java | 69.76% | 8 Missing and 5 partials :warning: |
| ...h/core/tasks/resourcetracker/TaskResourceInfo.java | 93.05% | 1 Missing and 4 partials :warning: |
| ...erver/src/main/java/org/opensearch/tasks/Task.java | 57.14% | 1 Missing and 2 partials :warning: |
| ...action/search/SearchRequestOperationsListener.java | 75.00% | 2 Missing :warning: |
| ...opensearch/action/search/SearchRequestContext.java | 88.88% | 1 Missing :warning: |
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #13172      +/-   ##
============================================
+ Coverage     71.42%   71.64%   +0.22%     
- Complexity    59978    61587    +1609     
============================================
  Files          4985     5082      +97     
  Lines        282275   289231    +6956     
  Branches      40946    41852     +906     
============================================
+ Hits         201603   207210    +5607     
- Misses        63999    64957     +958     
- Partials      16673    17064     +391     


codecov[bot] avatar May 22 '24 20:05 codecov[bot]

@jainankitk Thanks for the feedback!

> Maybe they are one off run variations, but will be good to do few runs and report the results side by side. Also, did we notice any regressions in the CPU/JVM usage?

I didn't notice any regressions in CPU/JVM usage, but your point is valid. Let me do more runs to check whether the latency increases we saw are one-off run variations.

ansjcy avatar May 29 '24 22:05 ansjcy

:x: Gradle check result for 3a6e51f33e538c1dd1b647166dba953256ac5535: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar May 30 '24 00:05 github-actions[bot]

@msfroh @jainankitk @kaushalmahi12 Thanks, folks, for your review! In the latest code, I added resource usage instrumentation for failed phases as well. Let me know if you have any other concerns. Thanks a lot.

ansjcy avatar May 30 '24 22:05 ansjcy

:x: Gradle check result for 8c9109a01b5cd50a3b6f4b686adeba396f4ff9b4: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar May 30 '24 23:05 github-actions[bot]

:x: Gradle check result for e45809b3faf8860b28cbf53b6c81a255d67710a2: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Jun 04 '24 21:06 github-actions[bot]

:x: Gradle check result for 5bf3df42dc8454f19fcb1c331d6eb2007befd957: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Jun 04 '24 22:06 github-actions[bot]

:x: Gradle check result for 8bab7d03ea86a1339bb969d29fc8d7d083533335: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Jun 05 '24 00:06 github-actions[bot]

:x: Gradle check result for 1d2f80414c56fe0267e8f47c16f6b54bc8d29b70: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Jun 05 '24 00:06 github-actions[bot]

:x: Gradle check result for c21171a9cc07e8c5e8d13c55dacafb5ba4ebc343: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Jun 05 '24 19:06 github-actions[bot]

:x: Gradle check result for 6337af529860d55a86250afa295bd730cba09aff: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Jun 05 '24 23:06 github-actions[bot]

:x: Gradle check result for 6337af529860d55a86250afa295bd730cba09aff: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Jun 06 '24 00:06 github-actions[bot]

:x: Gradle check result for 0f2a765fadbcb211e84be340c1fe79b321865e3d: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Jun 06 '24 00:06 github-actions[bot]

Added more tests for TaskResourceTrackingService as well. This PR should be good to merge now.

ansjcy avatar Jun 06 '24 02:06 ansjcy

:x: Gradle check result for 2205ea69ac0a3b72452470836dcd6c377fe8203e: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Jun 06 '24 02:06 github-actions[bot]

:x: Gradle check result for 2205ea69ac0a3b72452470836dcd6c377fe8203e: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Jun 06 '24 03:06 github-actions[bot]