OpenSearch icon indicating copy to clipboard operation
OpenSearch copied to clipboard

[BUG] Updating some search backpressure settings crash the cluster

Open gaobinlong opened this issue 1 year ago • 5 comments

Describe the bug

This issue comes from the forum: https://forum.opensearch.org/t/unable-to-start-opensearch-loop-failed-to-apply-settings-and-rate-must-be-greater-than-zero/20908.

When update the setting search_backpressure.cancellation_burst(deprecated), search_backpressure.search_task.cancellation_burst or search_backpressure.search_shard_task.cancellation_burst to an non-default value, the cluster fails to apply the settings and throws org.opensearch.OpenSearchException: java.lang.IllegalArgumentException: rate must be greater than zero, the cluster gets stuck in it and all operations on the master node fail, even restarting the cluster doesn't work.

Related component

Cluster Manager

To Reproduce

  1. Update setting
PUT _cluster/settings
{
  "persistent": {
    "search_backpressure.search_task.cancellation_burst": 11
  }
}

, to avoid making your cluster never come back even after restarting it, you can change persistent to transient.

Expected behavior

Fix the bug.

Additional Details

Plugins Please list all plugins currently enabled.

Screenshots If applicable, add screenshots to help explain your problem.

Host/Environment (please complete the following information):

  • OS: [e.g. iOS]
  • Version [e.g. 22]

Additional context Add any other context about the problem here.

gaobinlong avatar Aug 29 '24 07:08 gaobinlong

@kaushalmahi12 - Can you look into this? While @gaobinlong already has PR for cancellation_burst setting, let us validate other settings for Search Backpressure and Workload Management

jainankitk avatar Aug 29 '24 21:08 jainankitk

[Triage - attendees 1 2 3] - @jainankitk / @gaobinlong - Can we add more details/stacktraces around as to why the cluster-manager fails to come back after restart ?

rajiv-kv avatar Sep 11 '24 06:09 rajiv-kv

@gaobinlong mind please updating the documentation for these settings [1], thank you

[1] https://github.com/opensearch-project/documentation-website/blob/main/_tuning-your-cluster/availability-and-recovery/search-backpressure.md

reta avatar Sep 19 '24 14:09 reta

@gaobinlong mind please updating the documentation for these settings [1], thank you

[1] https://github.com/opensearch-project/documentation-website/blob/main/_tuning-your-cluster/availability-and-recovery/search-backpressure.md

Thanks @reta, I've created a documentation PR for it: https://github.com/opensearch-project/documentation-website/pull/8555.

gaobinlong avatar Oct 17 '24 04:10 gaobinlong

[Triage - attendees 1 2 3] - @jainankitk / @gaobinlong - Can we add more details/stacktraces around as to why the cluster-manager fails to come back after restart ?

Here're the stacktraces:

[2024-08-20T09:22:27,818][INFO ][o.o.c.s.ClusterApplierService] [opensearch-master-data-node-33] cluster-manager node changed {previous [{opensearch-master-data-node-33}{yOd-Z9CZR82IUxPxee3KrQ}{ik8U02GyQfSyYwQd_JqNNw}{172.24.0.33}{172.24.0.33:9300}{dimr}{shard_indexing_pressure_enabled=true}], current []}, term: 21261, version: 96514, reason: becoming candidate: clusterApplier#onNewClusterState
[2024-08-20T09:22:27,819][INFO ][o.o.c.s.ClusterSettings  ] [opensearch-master-data-node-33] updating [cluster.metadata.perf_analyzer.state] from [] to [0]
[2024-08-20T09:22:27,819][INFO ][o.o.c.s.ClusterSettings  ] [opensearch-master-data-node-33] updating [cluster.routing.allocation.cluster_concurrent_rebalance] from [2] to [5]
[2024-08-20T09:22:27,819][INFO ][o.o.c.s.ClusterSettings  ] [opensearch-master-data-node-33] updating [cluster.routing.allocation.node_concurrent_incoming_recoveries] from [2] to [8]
[2024-08-20T09:22:27,819][INFO ][o.o.c.s.ClusterSettings  ] [opensearch-master-data-node-33] updating [cluster.routing.allocation.node_concurrent_outgoing_recoveries] from [2] to [8]
[2024-08-20T09:22:27,819][INFO ][o.o.c.s.ClusterSettings  ] [opensearch-master-data-node-33] updating [indices.recovery.max_bytes_per_sec] from [41943040b] to [500mb]
[2024-08-20T09:22:27,819][INFO ][o.o.c.s.ClusterSettings  ] [opensearch-master-data-node-33] updating [indices.recovery.max_concurrent_file_chunks] from [2] to [5]
[2024-08-20T09:22:27,819][INFO ][o.o.c.s.ClusterSettings  ] [opensearch-master-data-node-33] updating [indices.recovery.max_concurrent_operations] from [1] to [4]
[2024-08-20T09:22:27,819][INFO ][o.o.c.s.ClusterSettings  ] [opensearch-master-data-node-33] updating [cluster.max_shards_per_node] from [1000] to [3000]
[2024-08-20T09:22:27,819][INFO ][o.o.c.s.ClusterSettings  ] [opensearch-master-data-node-33] updating [plugins.index_state_management.template_migration.control] from [0] to [-1]
[2024-08-20T09:22:27,819][INFO ][o.o.c.s.ClusterSettings  ] [opensearch-master-data-node-33] updating [search_backpressure.cancellation_burst] from [10.0] to [10]
[2024-08-20T09:22:27,819][WARN ][o.o.c.s.ClusterSettings  ] [opensearch-master-data-node-33] failed to apply settings
org.opensearch.OpenSearchException: java.lang.IllegalArgumentException: rate must be greater than zero
	at org.opensearch.ExceptionsHelper.maybeThrowRuntimeAndSuppress(ExceptionsHelper.java:209) ~[opensearch-core-2.12.0.jar:2.12.0]
	at org.opensearch.search.backpressure.settings.SearchShardTaskSettings.notifyListeners(SearchShardTaskSettings.java:275) ~[opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.search.backpressure.settings.SearchShardTaskSettings.setCancellationBurst(SearchShardTaskSettings.java:257) ~[opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.common.settings.Setting$Updater.apply(Setting.java:1254) ~[opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.common.settings.AbstractScopedSettings$SettingUpdater.lambda$updater$0(AbstractScopedSettings.java:696) ~[opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.common.settings.AbstractScopedSettings.applySettings(AbstractScopedSettings.java:232) [opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:558) [opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:486) [opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:188) [opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:854) [opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedOpenSearchThreadPoolExecutor.java:283) [opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedOpenSearchThreadPoolExecutor.java:246) [opensearch-2.12.0.jar:2.12.0]
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
	at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
Caused by: java.lang.IllegalArgumentException: rate must be greater than zero
	at org.opensearch.common.util.TokenBucket.<init>(TokenBucket.java:52) ~[opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.common.util.TokenBucket.<init>(TokenBucket.java:47) ~[opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.search.backpressure.SearchBackpressureState.onRateChanged(SearchBackpressureState.java:95) ~[opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.search.backpressure.SearchBackpressureState.onBurstChanged(SearchBackpressureState.java:101) ~[opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.search.backpressure.settings.SearchShardTaskSettings.lambda$setCancellationBurst$2(SearchShardTaskSettings.java:257) ~[opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.search.backpressure.settings.SearchShardTaskSettings.notifyListeners(SearchShardTaskSettings.java:269) ~[opensearch-2.12.0.jar:2.12.0]
	... 13 more

, the cluster_manger is not able to apply the invalid settings because the cluster state is corrupt, after execute ./bin/opensearch-node remove-settings search_backpressure.cancellation_burst, to remove the invalid settings from the cluster state, then the cluster comes back.

gaobinlong avatar Oct 17 '24 04:10 gaobinlong