[BUG] Updating some search backpressure settings crashes the cluster
Describe the bug
This issue comes from the forum: https://forum.opensearch.org/t/unable-to-start-opensearch-loop-failed-to-apply-settings-and-rate-must-be-greater-than-zero/20908.
When any of the settings search_backpressure.cancellation_burst (deprecated), search_backpressure.search_task.cancellation_burst, or search_backpressure.search_shard_task.cancellation_burst is updated to a non-default value, the cluster fails to apply the settings and throws org.opensearch.OpenSearchException: java.lang.IllegalArgumentException: rate must be greater than zero. The cluster then gets stuck in this state: all operations on the master node fail, and even restarting the cluster doesn't help.
Related component
Cluster Manager
To Reproduce
- Update setting
PUT _cluster/settings
{
  "persistent": {
    "search_backpressure.search_task.cancellation_burst": 11
  }
}
To avoid leaving your cluster unable to come back even after a restart, you can change persistent to transient.
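For anyone who wants to script the reproduction, here is a minimal sketch that builds the same request with the transient scope suggested above. It only constructs the request and prints it; nothing is sent. The helper name build_settings_request and the base URL http://localhost:9200 are assumptions for illustration and are not taken from this issue.

```python
import json
import urllib.request


def build_settings_request(key, value, scope="transient",
                           base_url="http://localhost:9200"):
    """Build (but do not send) a PUT _cluster/settings request.

    base_url is an assumed local endpoint; adjust it for your cluster.
    """
    if scope not in ("transient", "persistent"):
        raise ValueError("scope must be 'transient' or 'persistent'")
    body = json.dumps({scope: {key: value}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/_cluster/settings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Same setting and value as in the reproduction step, but transient.
request = build_settings_request(
    "search_backpressure.search_task.cancellation_burst", 11
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))

# To actually send it (only against a disposable test cluster):
# urllib.request.urlopen(request)
```

Keeping the send step commented out is deliberate: on an affected version the update is expected to break the cluster, so it should only be run against a throwaway test cluster.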
Expected behavior
Fix the bug.
@kaushalmahi12 - Can you look into this? While @gaobinlong already has a PR for the cancellation_burst setting, let us validate the other settings for Search Backpressure and Workload Management.
[Triage - attendees 1 2 3] - @jainankitk / @gaobinlong - Can we add more details/stacktraces as to why the cluster-manager fails to come back after restart?
@gaobinlong would you mind updating the documentation for these settings [1]? Thank you.
[1] https://github.com/opensearch-project/documentation-website/blob/main/_tuning-your-cluster/availability-and-recovery/search-backpressure.md
Thanks @reta, I've created a documentation PR for it: https://github.com/opensearch-project/documentation-website/pull/8555.
Here are the stack traces:
[2024-08-20T09:22:27,818][INFO ][o.o.c.s.ClusterApplierService] [opensearch-master-data-node-33] cluster-manager node changed {previous [{opensearch-master-data-node-33}{yOd-Z9CZR82IUxPxee3KrQ}{ik8U02GyQfSyYwQd_JqNNw}{172.24.0.33}{172.24.0.33:9300}{dimr}{shard_indexing_pressure_enabled=true}], current []}, term: 21261, version: 96514, reason: becoming candidate: clusterApplier#onNewClusterState
[2024-08-20T09:22:27,819][INFO ][o.o.c.s.ClusterSettings ] [opensearch-master-data-node-33] updating [cluster.metadata.perf_analyzer.state] from [] to [0]
[2024-08-20T09:22:27,819][INFO ][o.o.c.s.ClusterSettings ] [opensearch-master-data-node-33] updating [cluster.routing.allocation.cluster_concurrent_rebalance] from [2] to [5]
[2024-08-20T09:22:27,819][INFO ][o.o.c.s.ClusterSettings ] [opensearch-master-data-node-33] updating [cluster.routing.allocation.node_concurrent_incoming_recoveries] from [2] to [8]
[2024-08-20T09:22:27,819][INFO ][o.o.c.s.ClusterSettings ] [opensearch-master-data-node-33] updating [cluster.routing.allocation.node_concurrent_outgoing_recoveries] from [2] to [8]
[2024-08-20T09:22:27,819][INFO ][o.o.c.s.ClusterSettings ] [opensearch-master-data-node-33] updating [indices.recovery.max_bytes_per_sec] from [41943040b] to [500mb]
[2024-08-20T09:22:27,819][INFO ][o.o.c.s.ClusterSettings ] [opensearch-master-data-node-33] updating [indices.recovery.max_concurrent_file_chunks] from [2] to [5]
[2024-08-20T09:22:27,819][INFO ][o.o.c.s.ClusterSettings ] [opensearch-master-data-node-33] updating [indices.recovery.max_concurrent_operations] from [1] to [4]
[2024-08-20T09:22:27,819][INFO ][o.o.c.s.ClusterSettings ] [opensearch-master-data-node-33] updating [cluster.max_shards_per_node] from [1000] to [3000]
[2024-08-20T09:22:27,819][INFO ][o.o.c.s.ClusterSettings ] [opensearch-master-data-node-33] updating [plugins.index_state_management.template_migration.control] from [0] to [-1]
[2024-08-20T09:22:27,819][INFO ][o.o.c.s.ClusterSettings ] [opensearch-master-data-node-33] updating [search_backpressure.cancellation_burst] from [10.0] to [10]
[2024-08-20T09:22:27,819][WARN ][o.o.c.s.ClusterSettings ] [opensearch-master-data-node-33] failed to apply settings
org.opensearch.OpenSearchException: java.lang.IllegalArgumentException: rate must be greater than zero
at org.opensearch.ExceptionsHelper.maybeThrowRuntimeAndSuppress(ExceptionsHelper.java:209) ~[opensearch-core-2.12.0.jar:2.12.0]
at org.opensearch.search.backpressure.settings.SearchShardTaskSettings.notifyListeners(SearchShardTaskSettings.java:275) ~[opensearch-2.12.0.jar:2.12.0]
at org.opensearch.search.backpressure.settings.SearchShardTaskSettings.setCancellationBurst(SearchShardTaskSettings.java:257) ~[opensearch-2.12.0.jar:2.12.0]
at org.opensearch.common.settings.Setting$Updater.apply(Setting.java:1254) ~[opensearch-2.12.0.jar:2.12.0]
at org.opensearch.common.settings.AbstractScopedSettings$SettingUpdater.lambda$updater$0(AbstractScopedSettings.java:696) ~[opensearch-2.12.0.jar:2.12.0]
at org.opensearch.common.settings.AbstractScopedSettings.applySettings(AbstractScopedSettings.java:232) [opensearch-2.12.0.jar:2.12.0]
at org.opensearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:558) [opensearch-2.12.0.jar:2.12.0]
at org.opensearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:486) [opensearch-2.12.0.jar:2.12.0]
at org.opensearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:188) [opensearch-2.12.0.jar:2.12.0]
at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:854) [opensearch-2.12.0.jar:2.12.0]
at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedOpenSearchThreadPoolExecutor.java:283) [opensearch-2.12.0.jar:2.12.0]
at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedOpenSearchThreadPoolExecutor.java:246) [opensearch-2.12.0.jar:2.12.0]
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
Caused by: java.lang.IllegalArgumentException: rate must be greater than zero
at org.opensearch.common.util.TokenBucket.<init>(TokenBucket.java:52) ~[opensearch-2.12.0.jar:2.12.0]
at org.opensearch.common.util.TokenBucket.<init>(TokenBucket.java:47) ~[opensearch-2.12.0.jar:2.12.0]
at org.opensearch.search.backpressure.SearchBackpressureState.onRateChanged(SearchBackpressureState.java:95) ~[opensearch-2.12.0.jar:2.12.0]
at org.opensearch.search.backpressure.SearchBackpressureState.onBurstChanged(SearchBackpressureState.java:101) ~[opensearch-2.12.0.jar:2.12.0]
at org.opensearch.search.backpressure.settings.SearchShardTaskSettings.lambda$setCancellationBurst$2(SearchShardTaskSettings.java:257) ~[opensearch-2.12.0.jar:2.12.0]
at org.opensearch.search.backpressure.settings.SearchShardTaskSettings.notifyListeners(SearchShardTaskSettings.java:269) ~[opensearch-2.12.0.jar:2.12.0]
... 13 more
The cluster manager is not able to apply the invalid settings because the cluster state is corrupt. After executing ./bin/opensearch-node remove-settings search_backpressure.cancellation_burst to remove the invalid settings from the cluster state, the cluster comes back.
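The exception in the stack trace is raised by the TokenBucket constructor while a setting-change listener rebuilds the rate limiter. As a rough illustration of that failure mode only, here is a small Python sketch of a token bucket with the same kind of constructor check. It is not the OpenSearch implementation: the class RateLimiterHolder, the burst check, and the numeric values are assumptions made for the example, and it does not attempt to reproduce why the rate ended up non-positive in OpenSearch.

```python
import time


class TokenBucket:
    """Illustrative token bucket; not the OpenSearch class."""

    def __init__(self, rate, burst, clock=time.monotonic):
        # Same style of guard as the one that fails in the stack trace.
        if rate <= 0:
            raise ValueError("rate must be greater than zero")
        # Assumed extra guard, added for the sketch.
        if burst <= 0:
            raise ValueError("burst must be greater than zero")
        self.rate = rate      # tokens added per second
        self.burst = burst    # maximum number of stored tokens
        self.clock = clock
        self.tokens = burst
        self.last = clock()

    def request(self):
        """Consume one token if available; return whether it succeeded."""
        now = self.clock()
        refill = (now - self.last) * self.rate
        self.tokens = min(self.burst, self.tokens + refill)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class RateLimiterHolder:
    """Hypothetical holder that rebuilds its bucket when a setting changes."""

    def __init__(self, rate, burst):
        self.bucket = TokenBucket(rate, burst)

    def update(self, rate, burst):
        # The new bucket is constructed before assignment, so if the
        # constructor raises, the previous bucket stays in place.
        self.bucket = TokenBucket(rate, burst)


holder = RateLimiterHolder(rate=0.5, burst=10)
error_message = None
try:
    holder.update(rate=0, burst=11)  # a non-positive rate is rejected
except ValueError as err:
    error_message = str(err)

print(error_message)
print(holder.bucket.burst)
```

In this sketch the caller catches the error and keeps the old limiter. In the logs above, the equivalent exception surfaces while the cluster state is being applied, which is consistent with the report that the failure repeats until the setting is removed from the cluster state.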