[Segment Replication] Synchronize checkpoint updates with failover.

Open mch2 opened this issue 3 years ago • 0 comments

With https://github.com/opensearch-project/OpenSearch/pull/4135 and #3989, basic failover support is added for shards with segment replication enabled.

However, this change does not consider what happens to ongoing or incoming copy events during failover.

Replicas should remain swappable backups that recover quickly, so I do not think we should wait for file copy to complete for an ongoing replication. The replica should cancel the event and begin its failover steps (commit & rewire its engine). However, if a replica has an ongoing copy event that is in the finalize step, meaning all segments for a new checkpoint have arrived and the only remaining step is to wire them into its directory reader, I think we can let it complete and then continue?
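To make the proposal concrete, here is a minimal sketch of the decision rule in Java. Every name in it (`FailoverCopyPolicy`, `CopyStage`, `Action`, `onPromotion`) is hypothetical and chosen for illustration; none of them is taken from the OpenSearch codebase, and the real replication target may track its stages differently.

```java
// Hypothetical sketch of the proposed rule. These types do not exist in
// OpenSearch; they only model the three cases described above.
final class FailoverCopyPolicy {

    /** Assumed stages of a replica's in-flight segment copy event. */
    enum CopyStage {
        NONE,            // no copy event is running
        FETCHING_FILES,  // segment files are still being copied
        FINALIZING       // all segments arrived; only the reader wiring remains
    }

    /** What the replica does with the copy event before starting failover. */
    enum Action {
        PROCEED,              // nothing to clean up, start failover steps
        CANCEL_THEN_PROCEED,  // cancel the copy, then commit and rewire the engine
        FINISH_THEN_PROCEED   // let finalize complete, then commit and rewire
    }

    private FailoverCopyPolicy() {}

    /** Chooses how to treat the current copy event when the replica is promoted. */
    static Action onPromotion(CopyStage stage) {
        switch (stage) {
            case FINALIZING:
                // Finalize is assumed to be short, so waiting for it is cheap.
                return Action.FINISH_THEN_PROCEED;
            case FETCHING_FILES:
                // File copy can take arbitrarily long; do not block failover on it.
                return Action.CANCEL_THEN_PROCEED;
            default:
                return Action.PROCEED;
        }
    }
}
```

The sketch only encodes the decision. It leaves open how incoming checkpoints are rejected once promotion has started, and it assumes the finalize step cannot itself stall.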

mch2, Aug 04 '22 23:08