OpenSearch
Query-level resource usages tracking
Description
- Instrument resource usage before a task finishes on a data node. More specifically, capture resource usage at the last step (serialization) before a phase response is sent from the query/fetch/.. phase.
- Piggyback the resource usage data on the phase results.
- Gather resource usage for all shard search tasks on the coordinator node (in the query insights plugin) to get the query-level resource usage.
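The coordinator-side step in the list above can be pictured with a minimal sketch. It is not the code in this PR: `QueryResourceUsageSketch`, `ShardTaskUsage`, `QueryUsage`, and `aggregate` are hypothetical names chosen for illustration, and the sketch assumes each phase response carries one CPU-time and one memory figure per shard task.

```java
import java.util.List;

public class QueryResourceUsageSketch {

    /** Hypothetical usage record that a data node attaches to a phase result. */
    public static final class ShardTaskUsage {
        public final long taskId;
        public final String nodeId;
        public final long cpuTimeNanos;
        public final long memoryBytes;

        public ShardTaskUsage(long taskId, String nodeId, long cpuTimeNanos, long memoryBytes) {
            this.taskId = taskId;
            this.nodeId = nodeId;
            this.cpuTimeNanos = cpuTimeNanos;
            this.memoryBytes = memoryBytes;
        }
    }

    /** Hypothetical query-level total computed on the coordinator node. */
    public static final class QueryUsage {
        public final long cpuTimeNanos;
        public final long memoryBytes;
        public final int shardTasks;

        public QueryUsage(long cpuTimeNanos, long memoryBytes, int shardTasks) {
            this.cpuTimeNanos = cpuTimeNanos;
            this.memoryBytes = memoryBytes;
            this.shardTasks = shardTasks;
        }
    }

    /** Sums the per-shard-task figures collected from all phase responses. */
    public static QueryUsage aggregate(List<ShardTaskUsage> usages) {
        long cpu = 0L;
        long mem = 0L;
        for (ShardTaskUsage u : usages) {
            cpu += u.cpuTimeNanos;
            mem += u.memoryBytes;
        }
        return new QueryUsage(cpu, mem, usages.size());
    }

    public static void main(String[] args) {
        // Made-up figures for two shard tasks on two data nodes.
        List<ShardTaskUsage> fromShards = List.of(
            new ShardTaskUsage(1L, "node-a", 100L, 10L),
            new ShardTaskUsage(2L, "node-b", 250L, 30L)
        );
        QueryUsage total = aggregate(fromShards);
        System.out.println("cpuTimeNanos=" + total.cpuTimeNanos
            + " memoryBytes=" + total.memoryBytes
            + " shardTasks=" + total.shardTasks);
    }
}
```

Summing is only one possible way to combine the per-shard figures; the real plugin may keep per-phase or per-node breakdowns as well.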
Related Issues
Resolves https://github.com/opensearch-project/OpenSearch/issues/12399
Benchmark tests
I ran extensive benchmark tests and merged the results by averaging across multiple runs. Here are the results:
baseline-resulsts.txt feature-ressults.txt
I don't see a significant impact on search latency from this change.
Check List
- [ ] New functionality includes testing.
- [ ] All tests pass
- [ ] New functionality has been documented.
- [ ] New functionality has javadoc added
- [ ] Failing checks are inspected and point to the corresponding known issue(s) (See: Troubleshooting Failing Builds)
- [ ] Commits are signed per the DCO using --signoff
- [ ] Commit changes are listed out in CHANGELOG.md file (See: Changelog)
- [ ] Public documentation issue/PR created
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following Developer Certificate of Origin and signing off your commits, please check here.
:x: Gradle check result for 32f9a756e2bd1eaa026b5aa736e2881e6d7e16d5: FAILURE
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
Compatibility status:
Checks if related components are compatible with change 32f9a75
Incompatible components
Incompatible components: [https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/flow-framework.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/performance-analyzer.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/sql.git]
Skipped components
Compatible components
Compatible components: [https://github.com/opensearch-project/custom-codecs.git, https://github.com/opensearch-project/performance-analyzer-rca.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git, https://github.com/opensearch-project/common-utils.git]
What version of OpenSearch are we planning this for? cc: @ansjcy
Hey @rramachand21, we are targeting 2.15 for this change.
:x: Gradle check result for 3987580f00d4e77797e87b2c627957a44fba2e08: FAILURE
:x: Gradle check result for 27813ba80d95718c1bfda0a6348a46a2e3992345: FAILURE
:x: Gradle check result for 32c9b49bdb3a1e4448333f04d4f673ffd5a716ac: FAILURE
> Did extensive benchmark tests, merged the tests results by calculating average on multiple runs, and here are the test results: I don't see significant impact on search latency with this change.
@ansjcy - I went through the benchmark numbers, and some of them show a >5% increase in p99 latency. For example: keyword-terms (117 -> 126), keyword-terms-low-cardinality (110 -> 120), and range (130 -> 137). These may be one-off run variations, but it would be good to do a few more runs and report the results side by side. Also, did we notice any regressions in CPU/JVM usage?
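For reference, the relative increases behind the three examples in this comment can be worked out as below. This is a small sketch for checking the arithmetic only: the class and method names are hypothetical, and the figures are the p99 values quoted in the comment.

```java
import java.util.Locale;

public class LatencyDelta {

    /** Relative change from baseline to candidate, in percent. */
    public static double percentIncrease(double baseline, double candidate) {
        return (candidate - baseline) / baseline * 100.0;
    }

    public static void main(String[] args) {
        String[] operations = {"keyword-terms", "keyword-terms-low-cardinality", "range"};
        // {baseline p99, feature p99} as quoted in the review comment.
        double[][] p99 = {{117, 126}, {110, 120}, {130, 137}};
        for (int i = 0; i < operations.length; i++) {
            double delta = percentIncrease(p99[i][0], p99[i][1]);
            System.out.printf(Locale.ROOT, "%s: %.1f%%%n", operations[i], delta);
        }
    }
}
```

The three operations come out at roughly 7.7%, 9.1%, and 5.4%, so each is above the 5% mark mentioned in the comment.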
:x: Gradle check result for b68bebd800b1b2f5a0d274666eeca8d3510d7066: FAILURE
:x: Gradle check result for 7c450c05baa952241327bcb0f9fe8698d95e8551: FAILURE
:x: Gradle check result for b89678c5a1838919ed54edae1a49c9f751ebc51f: FAILURE
:x: Gradle check result for c11efc04d59b907d603e297b314f3f2ee50f08d3: FAILURE
:x: Gradle check result for 8bc0951e29f3b633f83e575339a56572d7f3ae02: FAILURE
:white_check_mark: Gradle check result for 3fe476721af3466122313b5a50b2f9825c2efca9: SUCCESS
Codecov Report
Attention: Patch coverage is 85.45455% with 24 lines in your changes missing coverage. Please review.
Project coverage is 71.64%. Comparing base (b15cb0c) to head (69629ff). Report is 367 commits behind head on main.
Additional details and impacted files
@@ Coverage Diff @@
## main #13172 +/- ##
============================================
+ Coverage 71.42% 71.64% +0.22%
- Complexity 59978 61587 +1609
============================================
Files 4985 5082 +97
Lines 282275 289231 +6956
Branches 40946 41852 +906
============================================
+ Hits 201603 207210 +5607
- Misses 63999 64957 +958
- Partials 16673 17064 +391
@jainankitk Thanks for the feedback!
> Maybe they are one off run variations, but will be good to do few runs and report the results side by side. Also, did we notice any regressions in the CPU/JVM usage?
I didn't notice any regressions in CPU/JVM usage, but I think your point is valid. Let me do more runs to confirm that the latency increases we saw are one-off run variations.
:x: Gradle check result for 3a6e51f33e538c1dd1b647166dba953256ac5535: FAILURE
@msfroh @jainankitk @kaushalmahi12 Thanks, folks, for your review! In the latest code, I added resource usage instrumentation for failed phases as well. Let me know if you have any other concerns. Thanks a lot.
:x: Gradle check result for 8c9109a01b5cd50a3b6f4b686adeba396f4ff9b4: FAILURE
:x: Gradle check result for e45809b3faf8860b28cbf53b6c81a255d67710a2: FAILURE
:x: Gradle check result for 5bf3df42dc8454f19fcb1c331d6eb2007befd957: FAILURE
:x: Gradle check result for 8bab7d03ea86a1339bb969d29fc8d7d083533335: FAILURE
:x: Gradle check result for 1d2f80414c56fe0267e8f47c16f6b54bc8d29b70: FAILURE
:x: Gradle check result for c21171a9cc07e8c5e8d13c55dacafb5ba4ebc343: FAILURE
:x: Gradle check result for 6337af529860d55a86250afa295bd730cba09aff: FAILURE
:x: Gradle check result for 0f2a765fadbcb211e84be340c1fe79b321865e3d: FAILURE
Added more tests in TaskResourceTrackingService as well. This PR should be good to merge now.
:x: Gradle check result for 2205ea69ac0a3b72452470836dcd6c377fe8203e: FAILURE