OpenSearch [Feature Request] Add a cluster.routing.allocation.balance option that takes shard load into account

Is your feature request related to a problem? Please describe

It is possible for the existing shard rebalance algorithm to put the cluster in a state where the shards are balanced relatively evenly across the nodes in the cluster, however, one of the nodes has shards that are much more actively written to, resulting in a load imbalance.

Describe the solution you'd like

It would be great if, in addition to the current shard and index balance weightings, there was an additional load factor taken into account when balancing decisions are made.

Related component

ShardManagement:Placement

Describe alternatives you've considered

No response

Additional context

No response

Apr 03 '24 11:04 tronboto

[Triage - attendees 1 2 3 4 5 6 7 8] @tronboto Thanks for filing, we would like to review a pull request around this issue.

Apr 03 '24 15:04 peternied

This would be highly welcome feature. This would help to prevent hotspots in larger clusters and solve many problematic topics in write intensive clusters.

As an example, something similar in elastic: cluster.routing.allocation.balance.write_load

(float, Dynamic) Defines the weight factor for the write load of each shard, in terms of the estimated number of indexing threads needed by the shard. Defaults to 10.0f. Raising this value increases the tendency of Elasticsearch to equalize the total write load across nodes ahead of the other balancing variables.

Apr 09 '24 06:04 mr-mqe

+1 to this! I've been researching how to solve this very problem of "HOT" shards where some OpenSearch nodes in a cluster will have significantly more load (typically more write load but could also be read load), even as much as double the load when compared to other nodes in the cluster.

I read through all of the Allocation Deciders hoping to unlock some secrets but didn't find anything that would help with this.

Very excited to see this feature!

Jun 28 '24 18:06 drewmiranda-gl

I strongly support this feature request, as it addresses a significant issue that can arise in bigger OpenSearch clusters. The current shard rebalance does a good job of distributing shards evenly across nodes, but it doesn't really take into account the actual load that those shards might generate. This can lead to situations where certain nodes become hot spots because they are handling much more write activity than others, despite having a similar number of shards.

Aug 27 '24 09:08 Urpojugend

I have not seen anything that would resolve this problem still. Maybe I am looking at this in a wrong way? In many multitenant environments where different indexes behave very differently, this causes easily situations where some nodes are overloaded and most are idling. The amount of shards could be even, but the load is very different. This has been solved at elastic side already some time ago by the "cluster.routing.allocation.balance.write_load" setting.

Anyone got good ideas how to get around this?

May 22 '25 14:05 mr-mqe

In a balance tool we use internally, I use a linear function based on read/write counts and time (fitted using actual load data) to measure the load of a shard.

May 22 '25 15:05 ViggoC