OpenSearch
Add a dynamic setting to change skip_cache_factor and min_frequency for querycache
Description
This change adds dynamic cluster settings to adjust `skip_cache_factor` and `min_frequency` for the indices query cache, so the query caching policy can be tuned at runtime without a node restart.
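A minimal sketch of how such dynamic cluster settings could be wired up; the setting keys, defaults, and class name here are illustrative assumptions, not necessarily what this PR uses:

```java
import java.util.function.Consumer;

import org.opensearch.common.settings.ClusterSettings;
import org.opensearch.common.settings.Setting;

public class QueryCacheSettings {

    // Hypothetical setting key and default, for illustration only.
    public static final Setting<Float> SKIP_CACHE_FACTOR_SETTING = Setting.floatSetting(
        "indices.queries.cache.skip_cache_factor",
        10f,  // assumed default
        1f,   // minimum; a skip factor below 1 would never skip anything
        Setting.Property.Dynamic,
        Setting.Property.NodeScope
    );

    // Hypothetical setting key and default, for illustration only.
    public static final Setting<Integer> MIN_FREQUENCY_SETTING = Setting.intSetting(
        "indices.queries.cache.min_frequency",
        5,    // assumed default: cache a query after it has been seen 5 times
        1,
        Setting.Property.Dynamic,
        Setting.Property.NodeScope
    );

    // Registering update consumers is what makes the settings take effect
    // without a node restart: new values apply to subsequent caching decisions.
    public QueryCacheSettings(ClusterSettings clusterSettings,
                              Consumer<Float> onSkipCacheFactor,
                              Consumer<Integer> onMinFrequency) {
        clusterSettings.addSettingsUpdateConsumer(SKIP_CACHE_FACTOR_SETTING, onSkipCacheFactor);
        clusterSettings.addSettingsUpdateConsumer(MIN_FREQUENCY_SETTING, onMinFrequency);
    }
}
```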
Related Issues
Resolves #17736
Check List
- [x] Functionality includes testing.
- [x] API changes companion pull request created, if applicable.
- [ ] Public documentation issue/PR created, if applicable.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following Developer Certificate of Origin and signing off your commits, please check here.
:white_check_mark: Gradle check result for bb809f70da49ae9fdf7c115c6757b3e988d6afb5: SUCCESS
Codecov Report
:x: Patch coverage is 95.74468% with 2 lines in your changes missing coverage. Please review.
:white_check_mark: Project coverage is 72.84%. Comparing base (c9eb7fb) to head (8d62e50).
:warning: Report is 2 commits behind head on main.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...java/org/opensearch/indices/IndicesQueryCache.java | 95.34% | 0 Missing and 2 partials :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## main #18351 +/- ##
============================================
- Coverage 73.00% 72.84% -0.16%
+ Complexity 69518 69427 -91
============================================
Files 5648 5648
Lines 319230 319272 +42
Branches 46174 46183 +9
============================================
- Hits 233039 232567 -472
- Misses 67346 67929 +583
+ Partials 18845 18776 -69
:umbrella: View full report in Codecov by Sentry.
:grey_exclamation: Gradle check result for 36c928211035dc2b92ee8f85eb3e212a2dda3389: UNSTABLE
Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.
:x: Gradle check result for 36c928211035dc2b92ee8f85eb3e212a2dda3389: FAILURE
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
❌ Gradle check result for 36c9282: FAILURE
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
SearchIT.testSearchWithMatrixStats #18129
:white_check_mark: Gradle check result for 58e444af6ff4645223988b93d04f2787c1b11d4a: SUCCESS
:grey_exclamation: Gradle check result for 43ec8995bf301bdac1d5795ea6cdc68d6619c47e: UNSTABLE
Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.
@sgup432 Please help review this in your spare time.
:white_check_mark: Gradle check result for 43ec8995bf301bdac1d5795ea6cdc68d6619c47e: SUCCESS
:white_check_mark: Gradle check result for 404881875177de98691b3b3f1447e0f0a943aa1a: SUCCESS
:x: Gradle check result for 12f419282602ea09a47b157887095a79811b04be: FAILURE
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
@sandeshkr419 Can you also help review this in your spare time? If there are no problems, I will merge the PR.
:white_check_mark: Gradle check result for 12f419282602ea09a47b157887095a79811b04be: SUCCESS
:x: Gradle check result for b2f961d4b86e0f5857edb08d7f7bd3dfb43834ac: FAILURE
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
:white_check_mark: Gradle check result for b2f961d4b86e0f5857edb08d7f7bd3dfb43834ac: SUCCESS
@cwperks @andrross @msfroh Sorry to bother you; could you help merge the PR in your spare time? Only the conflicted CHANGELOG.md needs to be updated.
:x: Gradle check result for 414d63080ff2d3ce8553d5fd718166a9aadbb650: FAILURE
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
:x: Gradle check result for 0404661dc9674c7fdddd69e7ab2b4bf620d6ad79: FAILURE
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
❌ Gradle check result for 0404661: FAILURE
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
InternalEngineTests.testNewChangesSnapshotWithDerivedSource #18485
InternalEngineTests.testNewChangesSnapshotWithDeleteAndUpdate #18485
ExampleSystemIngestProcessorClientYamlTestSuiteIT.test #18484
Hey -- sorry @kkewwei for not noticing this one earlier. Overall, I like exposing these parameters.
I guess one question I have is whether it makes more sense to handle this as a cluster setting or an index setting? In particular, if it's a dynamic cluster setting, we would need to recreate every cache for every local shard.
I'm also kind of thinking that there are cases where you would want different configurations for different indexes. For example, you can probably cache more aggressively on a read-only shard versus a shard that will still see updates.
What do you think?
@msfroh I can probably answer some of the questions
In particular, if it's a dynamic cluster setting, we would need to recreate every cache for every local shard.
I am not sure I understand this correctly. This cache is a node-level cache that caches segment-level docId sets, so we don't have to recreate it for every shard when the setting changes; the cache is only initialized once during node boot-up. Changing these settings dynamically would change the caching policy for subsequent incoming queries.
I guess one question I have is whether it makes more sense to handle this as a cluster setting or an index setting?
I believe I was the one who initially suggested setting these configurations at the cluster level to keep things simple. For context, the query cache includes other nuanced policies (such as only caching segments with more than 10k docs and not caching non-filter queries), so adjusting just the settings in this PR might not always lead to noticeable effects. It might be difficult for many users to understand all these nuances and adjust the settings appropriately at the index level. That's why I felt that modifying or relaxing the cluster-level settings would be a simpler and more effective approach overall if someone wants higher cache usage.
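For reference, both knobs originate in Lucene: `skip_cache_factor` is a constructor argument of `LRUQueryCache`, and the frequency threshold lives in `UsageTrackingQueryCachingPolicy`. A hedged sketch of how they could be made configurable; the subclass and the 10k-docs predicate below mirror the nuances described above but are assumptions about one possible implementation, not necessarily this PR's code:

```java
import java.util.function.Predicate;

import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.LRUQueryCache;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.UsageTrackingQueryCachingPolicy;

public class ConfigurableQueryCaching {

    // skipCacheFactor: skip caching on a segment when building the cache entry
    // would cost more than skipCacheFactor times the cost of the lead iterator.
    public static LRUQueryCache newCache(int maxQueries, long maxRamBytes, float skipCacheFactor) {
        // Mirrors the "only cache reasonably large segments" nuance.
        Predicate<LeafReaderContext> leavesToCache = ctx -> ctx.reader().maxDoc() > 10_000;
        return new LRUQueryCache(maxQueries, maxRamBytes, leavesToCache, skipCacheFactor);
    }

    // A policy that replaces the hard-coded frequency thresholds with one
    // adjustable value; purely illustrative.
    public static class ConfigurableFrequencyPolicy extends UsageTrackingQueryCachingPolicy {
        private volatile int minFrequency;

        public ConfigurableFrequencyPolicy(int minFrequency) {
            this.minFrequency = minFrequency;
        }

        public void setMinFrequency(int minFrequency) {
            this.minFrequency = minFrequency; // applies to subsequent shouldCache() decisions
        }

        @Override
        protected int minFrequencyToCache(Query query) {
            return minFrequency;
        }
    }
}
```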
I am not sure I understand this correctly. This cache is a node-level cache that caches segment-level docId sets, so we don't have to recreate it for every shard when the setting changes; the cache is only initialized once during node boot-up. Changing these settings dynamically would change the caching policy for subsequent incoming queries.
Sorry -- you're correct. It just recreates the UsageTrackingQueryCachingPolicy. But the QueryCachingPolicy is managed at the shard level.
Maybe we could do both? You could have a cluster-wide setting for the default parameters used by new indexes, and then an index-level setting that's applied dynamically? Of course, that's kind of what index templates are for.
As is, I think modifying these parameters will be a very expert-level thing. I'm not convinced that we need to make it easy.
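One way the "do both" idea could look, sketched under the assumption of hypothetical setting keys: a node-scoped cluster default plus a dynamic index-scoped override, resolved per index:

```java
import org.opensearch.common.settings.Setting;
import org.opensearch.common.settings.Settings;

public final class SkipCacheFactorSettings {

    // Cluster-wide default (hypothetical key, assumed default).
    public static final Setting<Float> CLUSTER_SKIP_CACHE_FACTOR = Setting.floatSetting(
        "indices.queries.cache.skip_cache_factor", 10f, 1f,
        Setting.Property.Dynamic, Setting.Property.NodeScope);

    // Per-index override (hypothetical key), dynamic so it can follow the shard
    // lifecycle, e.g. relaxed once an index becomes read-only.
    public static final Setting<Float> INDEX_SKIP_CACHE_FACTOR = Setting.floatSetting(
        "index.queries.cache.skip_cache_factor", 10f, 1f,
        Setting.Property.Dynamic, Setting.Property.IndexScope);

    // Resolution: the index value wins when explicitly set, the cluster value otherwise.
    public static float resolve(Settings indexSettings, Settings clusterSettings) {
        if (INDEX_SKIP_CACHE_FACTOR.exists(indexSettings)) {
            return INDEX_SKIP_CACHE_FACTOR.get(indexSettings);
        }
        return CLUSTER_SKIP_CACHE_FACTOR.get(clusterSettings);
    }
}
```

As noted above, index templates already cover part of this, since they can pre-set the index-level key at creation time.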
Maybe we could do both?
Having both might make sense, but it again seems like overkill to me, especially for such a nuanced, expert-level caching policy. Better to go with one of these settings instead of both, IMO.
And I believe you mean keeping both the cluster and index settings dynamic? Keeping either of them non-dynamic would not make sense, since the original change in Lucene was intended for that.
As is, I think modifying these parameters will be a very expert-level thing.
Yeah, pretty expert-level for sure! Also, to make use of this setting (if done at the index level), one would usually check (at least that is how I would do it) index-level cache stats, and then proceed to relax the limits for the desired indices accordingly. That in itself might be cumbersome compared to just checking cluster-level cache stats as a whole and relaxing the cluster-level setting once.
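The workflow described here (check cache stats, then decide whether to relax the limits) could be scripted against the server-side stats objects. A rough sketch; the hit-rate threshold is arbitrary, and the package locations assume recent OpenSearch main:

```java
import org.opensearch.core.index.shard.ShardId;
import org.opensearch.index.cache.query.QueryCacheStats;
import org.opensearch.indices.IndicesQueryCache;

public class CacheTuningCheck {

    // Inspect per-shard query cache stats and report whether the cache looks
    // underused, i.e. whether relaxing skip_cache_factor / min_frequency might help.
    public static boolean looksUnderused(IndicesQueryCache cache, ShardId shardId) {
        QueryCacheStats stats = cache.getStats(shardId);
        long hits = stats.getHitCount();
        long misses = stats.getMissCount();
        long total = hits + misses;
        if (total == 0) {
            return true; // nothing looked up yet
        }
        double hitRate = (double) hits / total;
        return hitRate < 0.10; // arbitrary threshold for illustration
    }
}
```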
I don't have a clear opinion on increasing the cluster or index setting.
Here’s our production practice: when we observe notably low query cache usage, we directly adjust the cluster setting. There are also instances where the query cache is full; in such cases, adjusting the setting for an important index lets it cache more.
Increasing granularity to the index level has shown effectiveness in that scenario, and the changes are minimal. Should we support both as a first step?
Here’s our production practice: when we observe notably low query cache usage, we directly adjust the cluster setting
Yeah, that's how I was thinking of using it as well.
There are also instances where the query cache is full; in such cases, adjusting the setting for an important index lets it cache more.
Makes sense.
I am okay with having both.
But I would personally still prefer adding the cluster-level setting as a start; if needed, we can add the index-level setting later.
@msfroh can weigh on this as well?
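As a usage sketch, relaxing the (hypothetical) cluster-level keys dynamically might look like this from the node client; setting names and values are illustrative:

```java
import org.opensearch.action.admin.cluster.settings.ClusterUpdateSettingsRequest;
import org.opensearch.common.settings.Settings;
import org.opensearch.transport.client.Client; // org.opensearch.client.Client on OpenSearch 2.x

public class RelaxQueryCachePolicy {

    // Relax the query cache knobs cluster-wide without a node restart;
    // new queries pick up the updated policy.
    public static void relax(Client client) {
        ClusterUpdateSettingsRequest request = new ClusterUpdateSettingsRequest().persistentSettings(
            Settings.builder()
                .put("indices.queries.cache.skip_cache_factor", 20.0f) // illustrative values
                .put("indices.queries.cache.min_frequency", 2)
                .build()
        );
        client.admin().cluster().updateSettings(request).actionGet();
    }
}
```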
:white_check_mark: Gradle check result for 0b1f6af2585e4692ead5648cc0e03eafdb7f474b: SUCCESS
@kkewwei I think we can probably merge this change as a start as everyone agrees to have a cluster setting. Maybe we can address the index level setting later if needed as a separate PR?
@kkewwei I think we can probably merge this change as a start as everyone agrees to have a cluster setting. Maybe we can address the index level setting later if needed as a separate PR?
Of course. The merge button is grey on my side due to failing merge requirements.
:white_check_mark: Gradle check result for 8259aea90558f9a4c911f971503ec7abb3662021: SUCCESS