OpenSearch
Add a dynamic setting to change skip_cache_factor and min_frequency for querycache
Description
This change adds dynamic cluster settings to adjust `skip_cache_factor` and `min_frequency` for the indices query cache, so the query caching policy can be tuned at runtime without a node restart.
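A minimal sketch of how such dynamic cluster settings could be wired up; the setting keys, defaults, and class name here are illustrative assumptions, not necessarily what this PR uses:

```java
import java.util.function.Consumer;

import org.opensearch.common.settings.ClusterSettings;
import org.opensearch.common.settings.Setting;

public class QueryCacheSettings {

    // Hypothetical setting key and default, for illustration only.
    public static final Setting<Float> SKIP_CACHE_FACTOR_SETTING = Setting.floatSetting(
        "indices.queries.cache.skip_cache_factor",
        10f,  // assumed default
        1f,   // minimum; a skip factor below 1 would never skip anything
        Setting.Property.Dynamic,
        Setting.Property.NodeScope
    );

    // Hypothetical setting key and default, for illustration only.
    public static final Setting<Integer> MIN_FREQUENCY_SETTING = Setting.intSetting(
        "indices.queries.cache.min_frequency",
        5,    // assumed default: cache a query after it has been seen 5 times
        1,
        Setting.Property.Dynamic,
        Setting.Property.NodeScope
    );

    // Registering update consumers is what makes the settings take effect
    // without a node restart: new values apply to subsequent caching decisions.
    public QueryCacheSettings(ClusterSettings clusterSettings,
                              Consumer<Float> onSkipCacheFactor,
                              Consumer<Integer> onMinFrequency) {
        clusterSettings.addSettingsUpdateConsumer(SKIP_CACHE_FACTOR_SETTING, onSkipCacheFactor);
        clusterSettings.addSettingsUpdateConsumer(MIN_FREQUENCY_SETTING, onMinFrequency);
    }
}
```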
Related Issues
Resolves #17736
Check List
- [x] Functionality includes testing.
- [x] API changes companion pull request created, if applicable.
- [ ] Public documentation issue/PR created, if applicable.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following Developer Certificate of Origin and signing off your commits, please check here.
:white_check_mark: Gradle check result for bb809f70da49ae9fdf7c115c6757b3e988d6afb5: SUCCESS
Codecov Report
:x: Patch coverage is 95.74468% with 2 lines in your changes missing coverage. Please review.
:white_check_mark: Project coverage is 72.84%. Comparing base (c9eb7fb) to head (8d62e50).
:warning: Report is 2 commits behind head on main.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...java/org/opensearch/indices/IndicesQueryCache.java | 95.34% | 0 Missing and 2 partials :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## main #18351 +/- ##
============================================
- Coverage 73.00% 72.84% -0.16%
+ Complexity 69518 69427 -91
============================================
Files 5648 5648
Lines 319230 319272 +42
Branches 46174 46183 +9
============================================
- Hits 233039 232567 -472
- Misses 67346 67929 +583
+ Partials 18845 18776 -69
:umbrella: View full report in Codecov by Sentry.
:grey_exclamation: Gradle check result for 36c928211035dc2b92ee8f85eb3e212a2dda3389: UNSTABLE
Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.
:x: Gradle check result for 36c928211035dc2b92ee8f85eb3e212a2dda3389: FAILURE
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
❌ Gradle check result for 36c9282: FAILURE
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
SearchIT.testSearchWithMatrixStats #18129
:white_check_mark: Gradle check result for 58e444af6ff4645223988b93d04f2787c1b11d4a: SUCCESS
:grey_exclamation: Gradle check result for 43ec8995bf301bdac1d5795ea6cdc68d6619c47e: UNSTABLE
Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.
@sgup432 Please help review this in your spare time.
:white_check_mark: Gradle check result for 43ec8995bf301bdac1d5795ea6cdc68d6619c47e: SUCCESS
:white_check_mark: Gradle check result for 404881875177de98691b3b3f1447e0f0a943aa1a: SUCCESS
:x: Gradle check result for 12f419282602ea09a47b157887095a79811b04be: FAILURE
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
@sandeshkr419 Can you also help review this in your spare time? If there are no problems, I will merge the PR.
:white_check_mark: Gradle check result for 12f419282602ea09a47b157887095a79811b04be: SUCCESS
:x: Gradle check result for b2f961d4b86e0f5857edb08d7f7bd3dfb43834ac: FAILURE
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
:white_check_mark: Gradle check result for b2f961d4b86e0f5857edb08d7f7bd3dfb43834ac: SUCCESS
@cwperks @andrross @msfroh Sorry to bother you; could you help merge the PR in your spare time? Only the conflicted CHANGELOG.md needs to be updated.
:x: Gradle check result for 414d63080ff2d3ce8553d5fd718166a9aadbb650: FAILURE
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
:x: Gradle check result for 0404661dc9674c7fdddd69e7ab2b4bf620d6ad79: FAILURE
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
❌ Gradle check result for 0404661: FAILURE
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
InternalEngineTests.testNewChangesSnapshotWithDerivedSource #18485
InternalEngineTests.testNewChangesSnapshotWithDeleteAndUpdate #18485
ExampleSystemIngestProcessorClientYamlTestSuiteIT.test #18484
Hey -- sorry @kkewwei for not noticing this one earlier. Overall, I like exposing these parameters.
I guess one question I have is whether it makes more sense to handle this as a cluster setting or an index setting? In particular, if it's a dynamic cluster setting, we would need to recreate every cache for every local shard.
I'm also kind of thinking that there are cases where you would want different configurations for different indexes. For example, you can probably cache more aggressively on a read-only shard versus a shard that will still see updates.
What do you think?
@msfroh I can probably answer some of the questions
In particular, if it's a dynamic cluster setting, we would need to recreate every cache for every local shard.
I am not sure I understand this correctly. This cache is a node-level cache that caches segment-level docId sets, so we don't have to recreate it for every shard when the setting changes; the cache is only initialized once during node boot-up. Changing these settings dynamically would change the caching policy for subsequent incoming queries.
I guess one question I have is whether it makes more sense to handle this as a cluster setting or an index setting?
I believe I was the one who initially suggested setting these configurations at the cluster level to keep things simple. For context, the query cache includes other nuanced policies (such as only caching segments with more than 10k docs and not caching non-filter queries), so adjusting just the settings in this PR might not always lead to noticeable effects. It might be difficult for many users to understand all these nuances and adjust the settings appropriately at the index level. That's why I felt that modifying or relaxing the cluster-level settings would be a simpler and more effective approach overall if someone wants higher cache usage.
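For reference, both knobs originate in Lucene: `skip_cache_factor` is a constructor argument of `LRUQueryCache`, and the frequency threshold lives in `UsageTrackingQueryCachingPolicy`. A hedged sketch of how they could be made configurable; the subclass and the 10k-docs predicate below mirror the nuances described above but are assumptions about one possible implementation, not necessarily this PR's code:

```java
import java.util.function.Predicate;

import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.LRUQueryCache;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.UsageTrackingQueryCachingPolicy;

public class ConfigurableQueryCaching {

    // skipCacheFactor: skip caching on a segment when building the cache entry
    // would cost more than skipCacheFactor times the cost of the lead iterator.
    public static LRUQueryCache newCache(int maxQueries, long maxRamBytes, float skipCacheFactor) {
        // Mirrors the "only cache reasonably large segments" nuance.
        Predicate<LeafReaderContext> leavesToCache = ctx -> ctx.reader().maxDoc() > 10_000;
        return new LRUQueryCache(maxQueries, maxRamBytes, leavesToCache, skipCacheFactor);
    }

    // A policy that replaces the hard-coded frequency thresholds with one
    // adjustable value; purely illustrative.
    public static class ConfigurableFrequencyPolicy extends UsageTrackingQueryCachingPolicy {
        private volatile int minFrequency;

        public ConfigurableFrequencyPolicy(int minFrequency) {
            this.minFrequency = minFrequency;
        }

        public void setMinFrequency(int minFrequency) {
            this.minFrequency = minFrequency; // applies to subsequent shouldCache() decisions
        }

        @Override
        protected int minFrequencyToCache(Query query) {
            return minFrequency;
        }
    }
}
```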
I am not sure I understand this correctly. This cache is a node-level cache that caches segment-level docId sets, so we don't have to recreate it for every shard when the setting changes; the cache is only initialized once during node boot-up. Changing these settings dynamically would change the caching policy for subsequent incoming queries.
Sorry -- you're correct. It just recreates the UsageTrackingQueryCachingPolicy. But the QueryCachingPolicy is managed at the shard level.
Maybe we could do both? You could have a cluster-wide setting for the default parameters used by new indexes, and then an index-level setting that's applied dynamically? Of course, that's kind of what index templates are for.
As is, I think modifying these parameters will be a very expert-level thing. I'm not convinced that we need to make it easy.
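One way the "do both" idea could look, sketched under the assumption of hypothetical setting keys: a node-scoped cluster default plus a dynamic index-scoped override, resolved per index:

```java
import org.opensearch.common.settings.Setting;
import org.opensearch.common.settings.Settings;

public final class SkipCacheFactorSettings {

    // Cluster-wide default (hypothetical key, assumed default).
    public static final Setting<Float> CLUSTER_SKIP_CACHE_FACTOR = Setting.floatSetting(
        "indices.queries.cache.skip_cache_factor", 10f, 1f,
        Setting.Property.Dynamic, Setting.Property.NodeScope);

    // Per-index override (hypothetical key), dynamic so it can follow the shard
    // lifecycle, e.g. relaxed once an index becomes read-only.
    public static final Setting<Float> INDEX_SKIP_CACHE_FACTOR = Setting.floatSetting(
        "index.queries.cache.skip_cache_factor", 10f, 1f,
        Setting.Property.Dynamic, Setting.Property.IndexScope);

    // Resolution: the index value wins when explicitly set, the cluster value otherwise.
    public static float resolve(Settings indexSettings, Settings clusterSettings) {
        if (INDEX_SKIP_CACHE_FACTOR.exists(indexSettings)) {
            return INDEX_SKIP_CACHE_FACTOR.get(indexSettings);
        }
        return CLUSTER_SKIP_CACHE_FACTOR.get(clusterSettings);
    }
}
```

As noted above, index templates already cover part of this, since they can pre-set the index-level key at creation time.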
Maybe we could do both?
Having both might make sense, but it again seems like overkill to me, especially for such a nuanced, expert-level caching policy. Better to go with one of these settings instead of both, IMO.
And I believe you mean keeping both the cluster and index settings dynamic? Keeping either of them non-dynamic would not make sense, since the original change in Lucene was intended for that.
As is, I think modifying these parameters will be a very expert-level thing.
Yeah, pretty expert-level for sure! Also, to make use of this setting (if done at the index level), one would usually check (at least that is how I would do it) index-level cache stats, and then proceed to relax the limits for the desired indices accordingly. That in itself might be cumbersome compared to just checking cluster-level cache stats as a whole and relaxing the cluster-level setting once.
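The workflow described here (check cache stats, then decide whether to relax the limits) could be scripted against the server-side stats objects. A rough sketch; the hit-rate threshold is arbitrary, and the package locations assume recent OpenSearch main:

```java
import org.opensearch.core.index.shard.ShardId;
import org.opensearch.index.cache.query.QueryCacheStats;
import org.opensearch.indices.IndicesQueryCache;

public class CacheTuningCheck {

    // Inspect per-shard query cache stats and report whether the cache looks
    // underused, i.e. whether relaxing skip_cache_factor / min_frequency might help.
    public static boolean looksUnderused(IndicesQueryCache cache, ShardId shardId) {
        QueryCacheStats stats = cache.getStats(shardId);
        long hits = stats.getHitCount();
        long misses = stats.getMissCount();
        long total = hits + misses;
        if (total == 0) {
            return true; // nothing looked up yet
        }
        double hitRate = (double) hits / total;
        return hitRate < 0.10; // arbitrary threshold for illustration
    }
}
```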
I don't have a clear opinion on increasing the cluster or index setting.
Here’s our production practice: when we observe notably low query cache usage, we directly adjust the cluster setting. There are also instances where the query cache is full; in such cases, adjusting the setting for an important index lets it cache more.
Increasing granularity to the index level has shown effectiveness in that scenario, and the changes are minimal. Should we support both as a first step?
Here’s our production practice: when we observe notably low query cache usage, we directly adjust the cluster setting
Yeah, that's how I was thinking of using it as well.
There are also instances where the query cache is full; in such cases, adjusting the setting for an important index lets it cache more.
Makes sense.
I am okay with having both.
But I would personally still prefer adding the cluster-level setting as a start; if needed, we can add the index-level setting later.
@msfroh can weigh on this as well?
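As a usage sketch, relaxing the (hypothetical) cluster-level keys dynamically might look like this from the node client; setting names and values are illustrative:

```java
import org.opensearch.action.admin.cluster.settings.ClusterUpdateSettingsRequest;
import org.opensearch.common.settings.Settings;
import org.opensearch.transport.client.Client; // org.opensearch.client.Client on OpenSearch 2.x

public class RelaxQueryCachePolicy {

    // Relax the query cache knobs cluster-wide without a node restart;
    // new queries pick up the updated policy.
    public static void relax(Client client) {
        ClusterUpdateSettingsRequest request = new ClusterUpdateSettingsRequest().persistentSettings(
            Settings.builder()
                .put("indices.queries.cache.skip_cache_factor", 20.0f) // illustrative values
                .put("indices.queries.cache.min_frequency", 2)
                .build()
        );
        client.admin().cluster().updateSettings(request).actionGet();
    }
}
```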
:white_check_mark: Gradle check result for 0b1f6af2585e4692ead5648cc0e03eafdb7f474b: SUCCESS
@kkewwei I think we can probably merge this change as a start as everyone agrees to have a cluster setting. Maybe we can address the index level setting later if needed as a separate PR?
@kkewwei I think we can probably merge this change as a start as everyone agrees to have a cluster setting. Maybe we can address the index level setting later if needed as a separate PR?
Of course. The merge button is grey on my side due to failing merge requirements.
:white_check_mark: Gradle check result for 8259aea90558f9a4c911f971503ec7abb3662021: SUCCESS