OpenSearch
[Feature Request] [Segment Replication] Balanced primary count across all nodes during rebalancing
Is your feature request related to a problem? Please describe
Currently, we only have a constraint for primary shard rebalancing at the index level, which was introduced in #6422. However, there are cases where we need to consider the overall primary shard count on the nodes. For example:
Initial configuration: let's assume we have a 5-node setup and 5 indices, each with 5 primary shards and 1 replica.
| NodeID | Primary | Replica |
|---|---|---|
| Node1 | 5 | 5 |
| Node2 | 5 | 5 |
| Node3 | 5 | 5 |
| Node4 | 5 | 5 |
| Node5 | 5 | 5 |
Case 1: When we drop one node, the new distribution looks like this:
| NodeID | Primary | Replica |
|---|---|---|
| Node1 | 10 | 2 |
| Node2 | 5 | 8 |
| Node3 | 5 | 7 |
| Node4 | 5 | 8 |
A better distribution could be:
| NodeID | Primary | Replica |
|---|---|---|
| Node1 | 6 | 6 |
| Node2 | 6 | 6 |
| Node3 | 6 | 7 |
| Node4 | 7 | 6 |
Case 2: If we instead add another node to the initial configuration, the distribution looks like this:
| NodeID | Primary | Replica |
|---|---|---|
| Node1 | 5 | 3 |
| Node2 | 5 | 3 |
| Node3 | 5 | 4 |
| Node4 | 5 | 3 |
| Node5 | 5 | 3 |
| Node6 | 0 | 9 |
Similarly, a better distribution for this case could be:
| NodeID | Primary | Replica |
|---|---|---|
| Node1 | 4 | 4 |
| Node2 | 4 | 4 |
| Node3 | 4 | 4 |
| Node4 | 4 | 4 |
| Node5 | 4 | 5 |
| Node6 | 5 | 4 |
We can clearly see that the primary shards are skewed in both distributions, and that rebalancing could produce a better spread across the nodes.
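As a quick worked check of that claim: with 25 primaries on 4 nodes, a balanced layout holds 6 or 7 primaries per node, so Node1's 10 primaries in Case 1 is well above target. Below is a minimal standalone calculation (illustration only, not OpenSearch code):

```java
import java.util.Map;

public class PrimarySkewExample {
    public static void main(String[] args) {
        // Case 1 distribution after the node drop: node -> primary count
        Map<String, Integer> primaries = Map.of("Node1", 10, "Node2", 5, "Node3", 5, "Node4", 5);

        int total = primaries.values().stream().mapToInt(Integer::intValue).sum(); // 25
        int target = (int) Math.ceil((double) total / primaries.size());           // ceil(25 / 4) = 7

        primaries.forEach((node, count) ->
            System.out.printf("%s: %d primaries (target %d, deviation %+d)%n", node, count, target, count - target));
        // Node1 sits 3 primaries above the target while the other nodes sit below it,
        // which is the skew a cluster-level primary constraint would correct.
    }
}
```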
Describe the solution you'd like
Related component
Storage:Performance
Describe alternatives you've considered
No response
Additional context
No response
@dreamer-89 @ashking94 please provide your inputs on the issue. I will add more details on how the existing relocation algorithm works and what improvements we can make. Thanks!
The current imbalance arises primarily because we do not consider the overall per-node primary count during rebalancing. With segment replication this causes more issues, since primaries do the majority of the heavy lifting.
Rather than doing another round of rebalancing as discussed in #6642, @dreamer-89 I'm thinking of the following:
- Let's add another constraint for cluster-level primary shard balance, so that nodes with a higher primary count in the cluster are given more weight during relocation, similar to what is done here for allocation. We will need to change the logic here, since we must allow relocation when the primary shards are balanced at the index level but uneven at the cluster level. I'm thinking of the following condition instead (a standalone sketch of this check is included after this list):
```diff
 if (preferPrimaryBalance == true
     && shard.primary()
-    && maxNode.numPrimaryShards(shard.getIndexName()) - minNode.numPrimaryShards(shard.getIndexName()) < 2) {
+    && maxNode.numPrimaryShards() - minNode.numPrimaryShards() < 2) {
     continue;
 }
```
- When we relocate shards here, we can add logic to check whether a replica exists on the other node and promote that copy instead, as suggested in #6481.
I think this will reduce the total number of relocations, since we will be considering both constraints in the same iteration of rebalanceByWeights and handling the relocation of primary and replica shards together.
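To make the semantics of the modified condition concrete, here is a minimal standalone sketch (not the actual BalancedShardsAllocator code): the primary rebalance step for a shard is skipped only when the cluster-level spread between the node with the most primaries and the node with the fewest is already below 2.

```java
import java.util.Collection;

/** Minimal sketch of the cluster-level primary balance check expressed by the diff above. */
final class ClusterPrimaryBalanceSketch {

    /**
     * Mirrors `maxNode.numPrimaryShards() - minNode.numPrimaryShards() < 2`: returns true
     * when primaries are already balanced across the cluster and relocation can be skipped.
     */
    static boolean primariesBalanced(Collection<Integer> primaryCountPerNode) {
        int max = primaryCountPerNode.stream().mapToInt(Integer::intValue).max().orElse(0);
        int min = primaryCountPerNode.stream().mapToInt(Integer::intValue).min().orElse(0);
        return max - min < 2;
    }
}
```

For the Case 1 distribution above (10, 5, 5, 5) this returns false, so primaries keep relocating until the spread is at most 1, whereas the existing index-level check can already be satisfied there, which is exactly the gap described in this issue.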
Thoughts @dreamer-89 @ashking94 ?
Did a POC on the above idea. Here are the initial results:
For simplicity, let's take a 4-node cluster with 4 indices, each having 4 primary and 4 replica shards.
```
===================================================
NODE_T1
P 4
R 4
NODE_T2
P 4
R 4
NODE_T3
P 4
R 4
NODE_T4
P 4
R 4
Unassigned (P)0 (R)0
Total Shards (P)16 (R)16
```
Now we drop one of the nodes; this is what the shard distribution in the cluster looks like, as per the current algorithm:
```
===================================================
NODE_T2
P 8
R 2
NODE_T3
P 4
R 7
NODE_T4
P 4
R 7
Unassigned (P)0 (R)0
Total Shards (P)16 (R)16
```
With the changes mentioned above (only part 1):
```
===================================================
NODE_T1
P 6
R 4
NODE_T2
P 5
R 6
NODE_T4
P 5
R 6
Unassigned (P)0 (R)0
Total Shards (P)16 (R)16
```
Thanks @Arpit-Bandejiya for putting this up. Did we check if we can reuse the allocation constraints mechanism to achieve this - https://github.com/elastic/elasticsearch/issues/43350?
Yes, @imRishN. This approach extends the same allocation constraints mechanism to rank the nodes during rebalancing.
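For readers who haven't looked at the constraints mechanism referenced above: the idea (see elastic/elasticsearch#43350 and the existing index-level constraint from #6422) is that every constraint a node breaches adds a large fixed penalty to that node's weight, so breaching nodes rank last as allocation/rebalance targets. The sketch below is a simplified, hypothetical model of that ranking, not the actual OpenSearch AllocationConstraints class; the second predicate stands in for the cluster-level primary constraint this issue proposes.

```java
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical, simplified model of constraint-based node ranking (illustration only). */
final class ConstraintRankingSketch {

    /** Minimal per-node stats needed by the example predicates. */
    record NodeStats(int primaries, int shardsForIndex, int avgPrimariesPerNode, int avgShardsPerIndexPerNode) {}

    // A large constant penalty per breached constraint, so breaching nodes rank last.
    private static final float CONSTRAINT_WEIGHT = 1000f;

    // Index-level style constraint plus the proposed cluster-level primary constraint;
    // both predicates here are hypothetical simplifications for illustration.
    private static final List<Predicate<NodeStats>> CONSTRAINTS = List.<Predicate<NodeStats>>of(
        stats -> stats.shardsForIndex() > stats.avgShardsPerIndexPerNode(), // per-index shards-per-node breached
        stats -> stats.primaries() > stats.avgPrimariesPerNode()            // cluster primaries-per-node breached (proposed)
    );

    /** Base weight (e.g. the existing shard-count weight) plus a penalty per breached constraint. */
    static float constrainedWeight(float baseWeight, NodeStats stats) {
        long breached = CONSTRAINTS.stream().filter(c -> c.test(stats)).count();
        return baseWeight + breached * CONSTRAINT_WEIGHT;
    }
}
```

Nodes that breach no constraints keep their base weight and therefore rank first as targets, which is how a "prefer nodes with fewer primaries" preference can be layered onto the existing weight function without a separate rebalancing pass.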
@Arpit-Bandejiya Thanks for the POC. This looks promising. Regarding the change, I believe there are 2 settings that control shard placement per index per node and the total shards per node in a cluster, so this change should be relatively unintrusive. We should also ensure the following -
- If the node count increases (let's say to 1k, hypothetically), how many relocations will happen with and without this change? Let's check more combinations of index and node counts and compare the results with and without the change.
- You only mentioned this - if there are more reroutes, does the routing change with each reroute?
Before discussing how many shard relocations happen, we need to understand how the shards are assigned in the initial state. For example, let's assume we have a 3-node cluster with 3 indices, each having 3 primary and 3 replica shards. The shard-level assignment looks like this (a small sketch of the assignment pattern follows the tables):
Index - 1
| ShardID | Primary Node | Replica Node |
|---|---|---|
| 0 | N1 | N2 |
| 1 | N2 | N3 |
| 2 | N3 | N1 |
Index - 2
| ShardID | Primary Node | Replica Node |
|---|---|---|
| 0 | N1 | N2 |
| 1 | N2 | N3 |
| 2 | N3 | N1 |
Index - 3
| ShardID | Primary Node | Replica Node |
|---|---|---|
| 0 | N1 | N2 |
| 1 | N2 | N3 |
| 2 | N3 | N1 |
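The tables above follow a simple ring pattern: shard i's primary lands on node i (mod N) and its replica on the next node in the ring. A tiny standalone illustration of that pattern (purely for this example, not the allocator's actual logic):

```java
public class RingAssignmentExample {
    public static void main(String[] args) {
        int nodes = 3;   // N1..N3
        int shards = 3;  // shard ids 0..2 per index

        // Reproduces the per-index tables above: primary on node (i % N),
        // replica on the next node in the ring ((i + 1) % N).
        for (int shardId = 0; shardId < shards; shardId++) {
            int primaryNode = shardId % nodes;
            int replicaNode = (shardId + 1) % nodes;
            System.out.printf("shard %d -> primary N%d, replica N%d%n",
                shardId, primaryNode + 1, replicaNode + 1);
        }
        // Because every primary on N1 has its replica on N2, losing N1 promotes
        // all of those replicas on N2, which creates the primary skew shown next.
    }
}
```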
Now let's assume node N1 goes down. As we can see above, the replica of every primary shard assigned to N1 lies on N2. So when the unassigned shards get reassigned due to failover, we get the following distribution:
| Node | Primary count | Replica count |
|---|---|---|
| N2 | 3 -> 6 (due to promotion) | 3 |
| N3 | 3 | 3 -> 6 (due to balancing the shard count) |
With the above distribution, the cluster now enters the rebalancing phase. Since the primaries are already skewed, we need more relocations to balance them.
So for the above case, when we rebalance:
Rebalancing with existing logic (shard balance): 0
Rebalancing with new logic (primary balance): 2
For the case where we have 4 nodes with 4 indices, each with 4 primaries and 4 replica shards, we got the following:
Rebalancing with existing logic (shard balance): 2
Rebalancing with new logic (primary balance): 6
Initial state:
| Node | Primary count | Replica count |
|---|---|---|
| N1 | 4 | 4 |
| N2 | 4 | 4 |
| N3 | 4 | 4 |
| N4 | 4 | 4 |
Intermediate state (after the node drop):
| Node | Primary count | Replica count |
|---|---|---|
| N2 | 8 | 0 |
| N3 | 4 | 8 |
| N4 | 4 | 8 |
Final state (current approach of shard balance):
| Node | Primary count | Replica count |
|---|---|---|
| N2 | 8 | 2 |
| N3 | 4 | 7 |
| N4 | 4 | 7 |
Total relocation: 2
Final state (new approach of primary balance):
| Node | Primary count | Replica count |
|---|---|---|
| N2 | 6 | 4 |
| N3 | 5 | 6 |
| N4 | 5 | 6 |
Total relocation: 6
As can be seen above, if we want to rebalance shards based on the primary shard count across the cluster, we need a better allocation strategy for shards. Currently we pick the node for an unassigned shard in decideAllocateUnassigned. When node weights tie, we break the tie with a fixed pattern, which can sometimes cause the primary skew we have seen above.
To avoid this, we try to randomly select among the nodes that have the minimum weight. I added the logic below for a quick POC:
```java
// Maintain the list of nodes which have the minimum weight
List<BalancedShardsAllocator.ModelNode> minNodes = new ArrayList<>();
for (BalancedShardsAllocator.ModelNode node : nodes.values()) {
    if (node.containsShard(shard) && explain == false) {
        // decision is NO without needing to check anything further, so short circuit
        continue;
    }

    // weight of this index currently on the node
    float currentWeight = weight.weightWithAllocationConstraints(this, node, shard.getIndexName());
    // moving the shard would not improve the balance, and we are not in explain mode, so short circuit
    if (currentWeight > minWeight && explain == false) {
        continue;
    }

    Decision currentDecision = allocation.deciders().canAllocate(shard, node.getRoutingNode(), allocation);
    if (explain) {
        nodeExplanationMap.put(node.getNodeId(), new NodeAllocationResult(node.getRoutingNode().node(), currentDecision, 0));
        nodeWeights.add(Tuple.tuple(node.getNodeId(), currentWeight));
    }
    if (currentDecision.type() == Decision.Type.YES || currentDecision.type() == Decision.Type.THROTTLE) {
        final boolean updateMinNode;
        final boolean updateMinNodeList;
        if (currentWeight == minWeight) {
            // debug it more
            /* we have an equal weight tie breaking:
             * 1. if one decision is YES prefer it
             * 2. prefer the node that holds the primary for this index with the next id in the ring ie.
             * for the 3 shards 2 replica case we try to build up:
             * 1 2 0
             * 2 0 1
             * 0 1 2
             * such that if we need to tie-break we try to prefer the node holding a shard with the minimal id greater
             * than the id of the shard we need to assign. This works fine when new indices are created since
             * primaries are added first and we only add one shard set at a time in this algorithm.
             */
            if (currentDecision.type() == decision.type()) {
                final int repId = shard.id();
                final int nodeHigh = node.highestPrimary(shard.index().getName());
                final int minNodeHigh = minNode.highestPrimary(shard.getIndexName());
                updateMinNode = ((((nodeHigh > repId && minNodeHigh > repId) || (nodeHigh < repId && minNodeHigh < repId))
                    && (nodeHigh < minNodeHigh)) || (nodeHigh > repId && minNodeHigh < repId));
                updateMinNodeList = true;
                // add the node to the set of candidates that can be picked in the tie-break
                minNodes.add(node);
            } else {
                updateMinNode = currentDecision.type() == Decision.Type.YES;
                if (updateMinNode) {
                    updateMinNodeList = true;
                    minNodes.clear();
                    minNodes.add(node);
                }
            }
        } else {
            updateMinNode = currentWeight < minWeight;
            if (updateMinNode) {
                updateMinNodeList = true;
                minNodes.clear(); // clear the candidate list once we find a new minWeight
                minNodes.add(node);
            }
        }
        if (updateMinNode) {
            minNode = node;
            minWeight = currentWeight;
            decision = currentDecision;
        }
    }
}
if (decision == null) {
    // decision was not set and a node was not assigned, so treat it as a NO decision
    decision = Decision.NO;
}
List<NodeAllocationResult> nodeDecisions = null;
if (explain) {
    nodeDecisions = new ArrayList<>();
    // fill in the correct weight ranking, once we've been through all nodes
    nodeWeights.sort((nodeWeight1, nodeWeight2) -> Float.compare(nodeWeight1.v2(), nodeWeight2.v2()));
    int weightRanking = 0;
    for (Tuple<String, Float> nodeWeight : nodeWeights) {
        NodeAllocationResult current = nodeExplanationMap.get(nodeWeight.v1());
        nodeDecisions.add(new NodeAllocationResult(current.getNode(), current.getCanAllocateDecision(), ++weightRanking));
    }
}
if (minNodes.isEmpty()) {
    minNode = null;
} else {
    minNode = minNodes.get(new Random().nextInt(minNodes.size())); // pick one of the minimum-weight nodes at random
}
return AllocateUnassignedDecision.fromDecision(decision, minNode != null ? minNode.getRoutingNode().node() : null, nodeDecisions);
```
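One small note on the tie-breaker above: calling `new Random()` on every allocation decision is unseeded, so repeated runs will not be reproducible in tests. A seeded, injectable source would keep the random choice deterministic; below is a minimal sketch using only plain JDK classes (a hypothetical helper, not part of the POC; the real change would presumably reuse whatever seeded randomness the allocator already has access to):

```java
import java.util.List;
import java.util.Random;

/** Hypothetical helper: a seeded tie-breaker instead of a fresh Random per decision. */
final class SeededTieBreaker<T> {
    private final Random random;

    SeededTieBreaker(long seed) {
        this.random = new Random(seed); // fixed seed => reproducible picks in tests
    }

    /** Picks one candidate uniformly at random, or null when the list is empty. */
    T pick(List<T> candidates) {
        return candidates.isEmpty() ? null : candidates.get(random.nextInt(candidates.size()));
    }
}
```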
When we allocate shards based on the above logic and then rebalance the primary shards, we see that the number of relocations is generally reduced. For example, in the earlier example of a 4-node cluster with 4 indices, each with 4 primaries and 4 replica shards, we saw the following in the test:
- Total relocations: 2 --> current state, where we do not rebalance primaries
- Total relocations: 6 --> when we rebalance primaries with the current allocation strategy
- Total relocations: 3-4 --> when we allocate shards to a randomly chosen minimum-weight node and then rebalance