OpenSearch [BUG] Updating some search backpressure settings crashes the cluster

Describe the bug

This issue comes from the forum: https://forum.opensearch.org/t/unable-to-start-opensearch-loop-failed-to-apply-settings-and-rate-must-be-greater-than-zero/20908.

When updating the setting search_backpressure.cancellation_burst (deprecated), search_backpressure.search_task.cancellation_burst, or search_backpressure.search_shard_task.cancellation_burst to a non-default value, the cluster fails to apply the settings and throws org.opensearch.OpenSearchException: java.lang.IllegalArgumentException: rate must be greater than zero. The cluster gets stuck in this state and all operations on the master node fail; even restarting the cluster doesn't help.

Related component

Cluster Manager

To Reproduce

Update the setting:

PUT _cluster/settings
{
  "persistent": {
    "search_backpressure.search_task.cancellation_burst": 11
  }
}

To avoid leaving your cluster unable to come back even after a restart, you can change persistent to transient.
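
A minimal sketch of the transient variant of the request above, using only the Python standard library. The endpoint http://localhost:9200 and the helper name build_settings_request are assumptions for illustration, not part of the original report; TLS and authentication are omitted. The helper only builds the request, so nothing is sent unless the commented lines are enabled against a disposable test cluster.

```python
import json
import urllib.request

# Hypothetical endpoint; adjust for your cluster (TLS and auth are omitted here).
BASE_URL = "http://localhost:9200"


def build_settings_request(key, value, scope="transient"):
    """Build a PUT _cluster/settings request without sending it.

    scope="transient" keeps the value out of the persisted cluster state,
    so a full cluster restart discards it.
    """
    if scope not in ("transient", "persistent"):
        raise ValueError("scope must be 'transient' or 'persistent'")
    body = json.dumps({scope: {key: value}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/_cluster/settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


request = build_settings_request(
    "search_backpressure.search_task.cancellation_burst", 11
)
# To actually send it against a disposable test cluster:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```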

Expected behavior

Fix the bug.

Aug 29 '24 07:08 gaobinlong

@kaushalmahi12 - Can you look into this? While @gaobinlong already has a PR for the cancellation_burst setting, let us validate the other settings for Search Backpressure and Workload Management.

Aug 29 '24 21:08 jainankitk

[Triage - attendees 1 2 3] - @jainankitk / @gaobinlong - Can we add more details/stack traces on why the cluster manager fails to come back after a restart?

Sep 11 '24 06:09 rajiv-kv

@gaobinlong would you mind updating the documentation for these settings [1]? Thank you.

[1] https://github.com/opensearch-project/documentation-website/blob/main/_tuning-your-cluster/availability-and-recovery/search-backpressure.md

Sep 19 '24 14:09 reta

> @gaobinlong mind please updating the documentation for these settings [1], thank you
>
> [1] https://github.com/opensearch-project/documentation-website/blob/main/_tuning-your-cluster/availability-and-recovery/search-backpressure.md

Thanks @reta, I've created a documentation PR for it: https://github.com/opensearch-project/documentation-website/pull/8555.

Oct 17 '24 04:10 gaobinlong

> [Triage - attendees 1 2 3] - @jainankitk / @gaobinlong - Can we add more details/stacktraces around as to why the cluster-manager fails to come back after restart ?

Here are the stack traces:

[2024-08-20T09:22:27,818][INFO ][o.o.c.s.ClusterApplierService] [opensearch-master-data-node-33] cluster-manager node changed {previous [{opensearch-master-data-node-33}{yOd-Z9CZR82IUxPxee3KrQ}{ik8U02GyQfSyYwQd_JqNNw}{172.24.0.33}{172.24.0.33:9300}{dimr}{shard_indexing_pressure_enabled=true}], current []}, term: 21261, version: 96514, reason: becoming candidate: clusterApplier#onNewClusterState
[2024-08-20T09:22:27,819][INFO ][o.o.c.s.ClusterSettings  ] [opensearch-master-data-node-33] updating [cluster.metadata.perf_analyzer.state] from [] to [0]
[2024-08-20T09:22:27,819][INFO ][o.o.c.s.ClusterSettings  ] [opensearch-master-data-node-33] updating [cluster.routing.allocation.cluster_concurrent_rebalance] from [2] to [5]
[2024-08-20T09:22:27,819][INFO ][o.o.c.s.ClusterSettings  ] [opensearch-master-data-node-33] updating [cluster.routing.allocation.node_concurrent_incoming_recoveries] from [2] to [8]
[2024-08-20T09:22:27,819][INFO ][o.o.c.s.ClusterSettings  ] [opensearch-master-data-node-33] updating [cluster.routing.allocation.node_concurrent_outgoing_recoveries] from [2] to [8]
[2024-08-20T09:22:27,819][INFO ][o.o.c.s.ClusterSettings  ] [opensearch-master-data-node-33] updating [indices.recovery.max_bytes_per_sec] from [41943040b] to [500mb]
[2024-08-20T09:22:27,819][INFO ][o.o.c.s.ClusterSettings  ] [opensearch-master-data-node-33] updating [indices.recovery.max_concurrent_file_chunks] from [2] to [5]
[2024-08-20T09:22:27,819][INFO ][o.o.c.s.ClusterSettings  ] [opensearch-master-data-node-33] updating [indices.recovery.max_concurrent_operations] from [1] to [4]
[2024-08-20T09:22:27,819][INFO ][o.o.c.s.ClusterSettings  ] [opensearch-master-data-node-33] updating [cluster.max_shards_per_node] from [1000] to [3000]
[2024-08-20T09:22:27,819][INFO ][o.o.c.s.ClusterSettings  ] [opensearch-master-data-node-33] updating [plugins.index_state_management.template_migration.control] from [0] to [-1]
[2024-08-20T09:22:27,819][INFO ][o.o.c.s.ClusterSettings  ] [opensearch-master-data-node-33] updating [search_backpressure.cancellation_burst] from [10.0] to [10]
[2024-08-20T09:22:27,819][WARN ][o.o.c.s.ClusterSettings  ] [opensearch-master-data-node-33] failed to apply settings
org.opensearch.OpenSearchException: java.lang.IllegalArgumentException: rate must be greater than zero
	at org.opensearch.ExceptionsHelper.maybeThrowRuntimeAndSuppress(ExceptionsHelper.java:209) ~[opensearch-core-2.12.0.jar:2.12.0]
	at org.opensearch.search.backpressure.settings.SearchShardTaskSettings.notifyListeners(SearchShardTaskSettings.java:275) ~[opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.search.backpressure.settings.SearchShardTaskSettings.setCancellationBurst(SearchShardTaskSettings.java:257) ~[opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.common.settings.Setting$Updater.apply(Setting.java:1254) ~[opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.common.settings.AbstractScopedSettings$SettingUpdater.lambda$updater$0(AbstractScopedSettings.java:696) ~[opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.common.settings.AbstractScopedSettings.applySettings(AbstractScopedSettings.java:232) [opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:558) [opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:486) [opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:188) [opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:854) [opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedOpenSearchThreadPoolExecutor.java:283) [opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedOpenSearchThreadPoolExecutor.java:246) [opensearch-2.12.0.jar:2.12.0]
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
	at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
Caused by: java.lang.IllegalArgumentException: rate must be greater than zero
	at org.opensearch.common.util.TokenBucket.<init>(TokenBucket.java:52) ~[opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.common.util.TokenBucket.<init>(TokenBucket.java:47) ~[opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.search.backpressure.SearchBackpressureState.onRateChanged(SearchBackpressureState.java:95) ~[opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.search.backpressure.SearchBackpressureState.onBurstChanged(SearchBackpressureState.java:101) ~[opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.search.backpressure.settings.SearchShardTaskSettings.lambda$setCancellationBurst$2(SearchShardTaskSettings.java:257) ~[opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.search.backpressure.settings.SearchShardTaskSettings.notifyListeners(SearchShardTaskSettings.java:269) ~[opensearch-2.12.0.jar:2.12.0]
	... 13 more

The cluster manager is not able to apply the invalid settings because the cluster state is corrupt. After running ./bin/opensearch-node remove-settings search_backpressure.cancellation_burst to remove the invalid settings from the cluster state, the cluster comes back.
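
In the trace, the exception is raised while a settings listener rebuilds a rate limiter: onBurstChanged calls onRateChanged, which constructs a TokenBucket whose constructor rejects the rate. The sketch below models that failure mode in Python under stated assumptions; it is not the OpenSearch implementation. The class and function names are simplified stand-ins, and the zero rate is an assumed trigger chosen only to reproduce the message, since the trace does not show the value that was passed.

```python
class TokenBucket:
    """Simplified stand-in for a token bucket that validates its arguments."""

    def __init__(self, rate, burst):
        if rate <= 0.0:
            raise ValueError("rate must be greater than zero")
        if burst <= 0.0:
            raise ValueError("burst must be greater than zero")
        self.rate = rate
        self.burst = burst


class BackpressureState:
    """Hypothetical listener that rebuilds its limiter when a setting changes."""

    def __init__(self, rate, burst):
        self.limiter = TokenBucket(rate, burst)

    def on_burst_changed(self, rate, burst):
        # Rebuilding validates both values, so a bad rate fails a burst update.
        self.limiter = TokenBucket(rate, burst)


def apply_setting(state, rate, burst):
    """Apply an update; report failure instead of propagating it."""
    try:
        state.on_burst_changed(rate, burst)
        return "applied"
    except ValueError as error:
        return f"failed to apply settings: {error}"


state = BackpressureState(rate=0.003, burst=10.0)
ok = apply_setting(state, rate=0.003, burst=11.0)
bad = apply_setting(state, rate=0.0, burst=11.0)  # assumed trigger
```

In this model the listener fails on every attempt to apply the stored value, so a persisted setting would reproduce the same error each time the cluster state is applied. That is consistent with the restart loop described in this issue, though the actual root cause should be confirmed against the OpenSearch source.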

Oct 17 '24 04:10 gaobinlong