[Segment Replication] Primary promotion on shard failing during node removal in RoutingNodes#failShard
Coming from https://github.com/opensearch-project/OpenSearch/issues/3988, where RoutingNodes#failShard was identified as another workflow in which the master eagerly promotes a replica as part of the node removal workflow. The failShard method also handles cluster state updates (e.g. assigned shards).
The work is broken down into the following sub-tasks:
- [x] Identify the cause of the two different failover mechanisms, i.e. PrimaryShardAllocator & RoutingNodes#failShard
- [x] Code scan to identify cause of separate failover handling
- [x] Test if returning empty from activeReplicaWithHighestVersion doesn't cause shard failure (red cluster) and shard promotion works (with minor delay)
- [x] Identify if RoutingNodes#failShard can be removed altogether.
- [ ] Write a unit test with 1 primary and 2 replicas on different checkpoints. Fail the primary and check that a replica is promoted irrespective of its checkpoint state.
- [ ] ~~Add transport calls to active replicas to fetch shard state via AsyncShardFetch. This can be achieved by abstracting logic added in https://github.com/opensearch-project/OpenSearch/pull/4041~~ Need to add sync calls to obtain the shard metadata from active shards
- [ ] Update RoutingNodes#failShard to use the ReplicationCheckpoint info to select the replica with the highest checkpoint (see the sketch after this list)
- [ ] Ensure the unit tests written initially pass; add more unit tests
- [ ] Milestone 1. Primary promotion with segrep happy path works
- [ ] Wait for https://github.com/opensearch-project/OpenSearch/issues/3989 or add stub to prevent failure during primary promotions
- [ ] Write thorough integration tests around replica promotion as it's a core change
- [ ] Fix new bugs if any
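For the failShard sub-task above, here is a minimal sketch of what the selection could look like once checkpoint data is available. The helper class, the `checkpoints` map (populated by the sync calls mentioned in the sub-tasks), and the fallback parameter are illustrative assumptions, and the comparison order (primary term, then segment infos version) is an assumption rather than a settled design:

```java
// Sketch only: pick the active replica with the highest ReplicationCheckpoint,
// falling back to the existing node-version based choice when no checkpoint data was fetched.
import java.util.Comparator;
import java.util.List;
import java.util.Map;

import org.opensearch.cluster.routing.ShardRouting;
import org.opensearch.indices.replication.checkpoint.ReplicationCheckpoint;

final class ReplicaPromotionSketch {

    // Hypothetical helper; "checkpoints" would be filled by sync calls to the active replicas.
    static ShardRouting selectReplicaToPromote(
        List<ShardRouting> activeReplicas,
        Map<ShardRouting, ReplicationCheckpoint> checkpoints,
        ShardRouting fallbackByNodeVersion
    ) {
        // Assumed ordering: higher primary term wins, then higher segment infos version.
        Comparator<ReplicationCheckpoint> byCheckpoint = Comparator
            .comparingLong(ReplicationCheckpoint::getPrimaryTerm)
            .thenComparingLong(ReplicationCheckpoint::getSegmentInfosVersion);

        return activeReplicas.stream()
            .filter(checkpoints::containsKey)
            .max(Comparator.comparing(checkpoints::get, byCheckpoint))
            .orElse(fallbackByNodeVersion); // best effort: keep current behaviour if nothing was fetched
    }
}
```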
The RoutingNodes#failShard method cannot be removed because:
- It is used by the master to immediately promote one of the active shards (from cluster state) to avoid shard failure.
- The method is followed up with cluster balancing actions, which need an active primary.
Removing the failShard method would require a lot of core-level changes in shard allocation, which is not intended as part of this issue.
There are three options for handling RoutingNodes#failShard so that it takes ReplicationCheckpoint into account:
- Include ReplicationCheckpoint in ClusterState. During failover, the cluster-manager can simply choose the shard copy with the highest ReplicationCheckpoint. This is not correct, as ReplicationCheckpoint is updated on index refreshes and is not a good fit for ClusterState. It also looks like a much bigger change and is not necessary, since checkpoints are only used during failover to identify the furthest-ahead replica.
- Pull the active shards' ReplicationCheckpoint synchronously. This can be problematic for big clusters (with high replica & shard counts), where some delay in shard promotion may be observed.
- [Proposed initially on this issue] Use AsyncShardFetch to fetch shard data asynchronously. This is problematic because the cluster-manager will not be able to promote an active shard copy immediately, yet there are assertions on primary nodes post-reroute as part of the cluster shard balancing actions.
I am planning to move forward with option 2 above.
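As a rough illustration of option 2, below is a sketch of a bounded, best-effort synchronous fetch of checkpoint data from the active replicas before promotion. The fetcher class, the node-id keyed map, and `fetchCheckpointAsync` are hypothetical stand-ins for a real transport call; the point is only that the wait is capped by a timeout so a slow or unreachable replica cannot stall promotion:

```java
// Sketch only (hypothetical types): best-effort checkpoint fetch bounded by a timeout.
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class CheckpointFetchSketch {

    // Returns whatever checkpoint versions could be collected within the time budget.
    static Map<String, Long> fetchCheckpoints(List<String> replicaNodeIds, long timeoutMillis) {
        Map<String, CompletableFuture<Long>> pending = new HashMap<>();
        for (String nodeId : replicaNodeIds) {
            pending.put(nodeId, fetchCheckpointAsync(nodeId)); // fire all requests up front
        }

        Map<String, Long> results = new HashMap<>();
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        for (Map.Entry<String, CompletableFuture<Long>> entry : pending.entrySet()) {
            long remainingNanos = deadline - System.nanoTime();
            try {
                // Wait only for the remaining budget; replicas that miss the deadline are skipped.
                results.put(entry.getKey(), entry.getValue().get(Math.max(0, remainingNanos), TimeUnit.NANOSECONDS));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break; // stop waiting, use whatever was collected so far
            } catch (TimeoutException | ExecutionException e) {
                // Best effort: a missing answer just means this replica is not considered.
            }
        }
        return results;
    }

    private static CompletableFuture<Long> fetchCheckpointAsync(String nodeId) {
        // Placeholder for a transport request returning the replica's latest segment infos version.
        return CompletableFuture.completedFuture(0L);
    }
}
```

Keeping the timeout small bounds the promotion delay on large clusters, at the cost of more often falling back to the current node-version based selection.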
CC @Bukhtawar @mch2 @andrross
I think 2 is the best option given we want this as a best effort. I also wouldn't be worried about it delaying shard promotion right now; we can set timeouts to reduce that impact and measure the total time.
During a standup discussion with @mch2 and @kartg, we decided to take up this issue as part of the next minor release and to focus on the existing open bugs in https://github.com/opensearch-project/OpenSearch/issues/3969, which have higher priority. This work is basically an optimization that reduces segment file copies among replicas.
CC @CEHENKLE @anasalkouz @mch2
Discussed this during team standup, where we identified that we need data on segment replication performance when the furthest-ahead replica is not chosen. This is also to evaluate the trade-off we would get from implementing this core change.
CC @mch2 @anasalkouz @Bukhtawar
@dreamer-89 @mch2 this is tagged for 2.5. Can we make it ?
Thank you @saratvemulapalli for bringing this up. This work will not make it into the 2.5.0 release, so I am removing the tag.
From previous discussion, this is an optimization task that tries to select the replica with the highest checkpoint (to ensure minimal file copy operations from the newly selected primary & prevent segment conflicts). We also don't have data on how bad this I/O can get if we do not select the replica with the highest replication checkpoint. Segment conflicts are avoided today by bumping the SegGen on the selected primary.
Even with approach 2 above (a sync call to replicas to fetch the highest replication checkpoint), this solution will be best effort and can't guarantee selection of the furthest-ahead replica, which leaves room for segment conflicts. Based on this, we are prioritizing existing GA tasks over this.
CC @mch2 @anasalkouz