[Segment Replication] Primary promotion on shard failing during node removal in RoutingNodes#failShard
Coming from https://github.com/opensearch-project/OpenSearch/issues/3988, where RoutingNodes#failShard was identified as another workflow in which the master eagerly promotes a replica as part of the node removal workflow. The failShard method also handles cluster state updates (e.g. assigned shards).
The work is broken down into the following sub-tasks:
- [x] Identify the cause of the two different failover mechanisms, i.e. PrimaryShardAllocator & RoutingNodes#failShard
- [x] Code scan to identify cause of separate failover handling
- [x] Test if returning empty from activeReplicaWithHighestVersion doesn't cause shard failure (red cluster) and shard promotion works (with minor delay)
- [x] Identify if RoutingNodes#failShard can be removed altogether.
- [ ] Write a unit test with 1 primary and 2 replicas on different checkpoints. Fail the primary and check that a replica is promoted irrespective of its checkpoint state.
- [ ] ~~Add transport calls to active replicas to fetch shard state via AsyncShardFetch. This can be achieved by abstracting logic added in https://github.com/opensearch-project/OpenSearch/pull/4041~~ Need to add sync calls to obtain the shard metadata from active shards
- [ ] Update RoutingNodes#failShard to use the ReplicationCheckpoint info to select the replica with the highest checkpoint (see the sketch after this list)
- [ ] Ensure the unit tests written initially pass; add more unit tests
- [ ] Milestone 1. Primary promotion with segrep happy path works
- [ ] Wait for https://github.com/opensearch-project/OpenSearch/issues/3989 or add stub to prevent failure during primary promotions
- [ ] Write thorough integration tests around replica promotion as it's a core change
- [ ] Fix new bugs if any
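For the failShard sub-task above, here is a minimal sketch of what the selection could look like once checkpoint data is available. The helper class, the `checkpoints` map (populated by the sync calls mentioned in the sub-tasks), and the fallback parameter are illustrative assumptions, and the comparison order (primary term, then segment infos version) is an assumption rather than a settled design:

```java
// Sketch only: pick the active replica with the highest ReplicationCheckpoint,
// falling back to the existing node-version based choice when no checkpoint data was fetched.
import java.util.Comparator;
import java.util.List;
import java.util.Map;

import org.opensearch.cluster.routing.ShardRouting;
import org.opensearch.indices.replication.checkpoint.ReplicationCheckpoint;

final class ReplicaPromotionSketch {

    // Hypothetical helper; "checkpoints" would be filled by sync calls to the active replicas.
    static ShardRouting selectReplicaToPromote(
        List<ShardRouting> activeReplicas,
        Map<ShardRouting, ReplicationCheckpoint> checkpoints,
        ShardRouting fallbackByNodeVersion
    ) {
        // Assumed ordering: higher primary term wins, then higher segment infos version.
        Comparator<ReplicationCheckpoint> byCheckpoint = Comparator
            .comparingLong(ReplicationCheckpoint::getPrimaryTerm)
            .thenComparingLong(ReplicationCheckpoint::getSegmentInfosVersion);

        return activeReplicas.stream()
            .filter(checkpoints::containsKey)
            .max(Comparator.comparing(checkpoints::get, byCheckpoint))
            .orElse(fallbackByNodeVersion); // best effort: keep current behaviour if nothing was fetched
    }
}
```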
The RoutingNodes#failShard method cannot be removed because:
- It is used by the master to immediately promote one of the active shards (from cluster state) to avoid shard failure.
- The method is followed up with cluster balancing actions, which need an active primary.
Removing the failShard method would require a lot of core-level changes in shard allocation, which is not intended as part of this issue.
There are three options for handling RoutingNodes#failShard so that it takes ReplicationCheckpoint into account:
- Include ReplicationCheckpoint in ClusterState. During failover, the cluster-manager can simply choose the shard copy with the highest ReplicationCheckpoint. This is not correct, as ReplicationCheckpoint is updated on index refreshes and is not a good fit for ClusterState. It also looks like a much bigger change and is not necessary, since checkpoints are only used during failover to identify the furthest-ahead replica.
- Pull the active shards' ReplicationCheckpoint synchronously. This can be problematic for big clusters (with high replica & shard counts), where some delay in shard promotion may be observed.
- [Proposed initially on this issue] Use AsyncShardFetch to fetch shard data asynchronously. This is problematic because the cluster-manager will not be able to promote an active shard copy immediately, yet there are assertions on primary nodes post-reroute as part of the cluster shard balancing actions.
I am planning to move forward with option 2 above.
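As a rough illustration of option 2, below is a sketch of a bounded, best-effort synchronous fetch of checkpoint data from the active replicas before promotion. The fetcher class, the node-id keyed map, and `fetchCheckpointAsync` are hypothetical stand-ins for a real transport call; the point is only that the wait is capped by a timeout so a slow or unreachable replica cannot stall promotion:

```java
// Sketch only (hypothetical types): best-effort checkpoint fetch bounded by a timeout.
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class CheckpointFetchSketch {

    // Returns whatever checkpoint versions could be collected within the time budget.
    static Map<String, Long> fetchCheckpoints(List<String> replicaNodeIds, long timeoutMillis) {
        Map<String, CompletableFuture<Long>> pending = new HashMap<>();
        for (String nodeId : replicaNodeIds) {
            pending.put(nodeId, fetchCheckpointAsync(nodeId)); // fire all requests up front
        }

        Map<String, Long> results = new HashMap<>();
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        for (Map.Entry<String, CompletableFuture<Long>> entry : pending.entrySet()) {
            long remainingNanos = deadline - System.nanoTime();
            try {
                // Wait only for the remaining budget; replicas that miss the deadline are skipped.
                results.put(entry.getKey(), entry.getValue().get(Math.max(0, remainingNanos), TimeUnit.NANOSECONDS));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break; // stop waiting, use whatever was collected so far
            } catch (TimeoutException | ExecutionException e) {
                // Best effort: a missing answer just means this replica is not considered.
            }
        }
        return results;
    }

    private static CompletableFuture<Long> fetchCheckpointAsync(String nodeId) {
        // Placeholder for a transport request returning the replica's latest segment infos version.
        return CompletableFuture.completedFuture(0L);
    }
}
```

Keeping the timeout small bounds the promotion delay on large clusters, at the cost of more often falling back to the current node-version based selection.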
CC @Bukhtawar @mch2 @andrross
I think 2 is the best option given we want this as a best effort. I also wouldn't be worried about it delaying shard promotion right now; we can set timeouts to reduce that impact and measure the total time.
During a standup discussion with @mch2 and @kartg, we decided to take up this issue as part of the next minor release and to focus on the existing open bugs in https://github.com/opensearch-project/OpenSearch/issues/3969, which have higher priority. This work is basically an optimization that reduces segment file copies among replicas.
CC @CEHENKLE @anasalkouz @mch2
Discussed this during team standup, where we identified that we need data on segment replication performance when the furthest-ahead replica is not chosen. This is also to evaluate the trade-off we would get from implementing this core change.
CC @mch2 @anasalkouz @Bukhtawar
@dreamer-89 @mch2 this is tagged for 2.5. Can we make it ?
Thank you @saratvemulapalli for bringing this up. This work will not make it into the 2.5.0 release, so I am removing the tag.
From previous discussion, this is an optimization task that tries to select the replica with the highest checkpoint (to ensure minimal file copy operations from the newly selected primary & prevent segment conflicts). We also don't have data on how bad this I/O can get if we do not select the replica with the highest replication checkpoint. Segment conflicts are avoided today by bumping the SegGen on the selected primary.
Even with approach 2 above (a sync call to replicas to fetch the highest replication checkpoint), this solution will be best effort and can't guarantee selection of the furthest-ahead replica, which leaves room for segment conflicts. Based on this, we are prioritizing existing GA tasks over this.
CC @mch2 @anasalkouz