cross-cluster-replication
[BUG] Stop Replication causes many failed recoveries on the follower OpenSearch while writes are underway on the leader index
Describe the bug
When we stop an ongoing replication while active indexing is happening on the leader, the follower OpenSearch process logs many "failed recovery" exceptions.
To Reproduce
- start replication
- start continuous indexing in a loop on the replicated (leader) index
- stop replication
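The steps above can be sketched as a small script. This is only an illustration, not the exact procedure used for this report: the cluster URLs, the connection alias `leader`, and the document body are assumptions, and the `_plugins/_replication` paths are those of the OpenSearch replication plugin (other distributions may use a different prefix). A cluster with security enabled would likely also need a `use_roles` section in the start request, which is omitted here.

```python
import json
import threading
import time
import urllib.request

# Assumed endpoints and names; adjust to the clusters under test.
LEADER_URL = "http://localhost:9200"
FOLLOWER_URL = "http://localhost:9201"
INDEX = "grab3"            # index name taken from the log below
LEADER_ALIAS = "leader"    # hypothetical connection alias, assumed to be configured already


def start_replication_request(follower_index, leader_alias, leader_index):
    """Build the request that starts replication on the follower."""
    body = {"leader_alias": leader_alias, "leader_index": leader_index}
    return ("PUT", f"/_plugins/_replication/{follower_index}/_start", body)


def index_doc_request(leader_index, n):
    """Build a request that indexes one small document on the leader."""
    return ("POST", f"/{leader_index}/_doc", {"n": n})


def stop_replication_request(follower_index):
    """Build the request that stops replication on the follower."""
    return ("POST", f"/_plugins/_replication/{follower_index}/_stop", {})


def send(base_url, request):
    """Send a (method, path, body) tuple as JSON and return the HTTP status."""
    method, path, body = request
    req = urllib.request.Request(
        base_url + path,
        data=json.dumps(body).encode("utf-8"),
        method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status


def reproduce(seconds_before_stop=30, seconds_after_stop=30):
    """Start replication, index continuously on the leader, then stop replication."""
    send(FOLLOWER_URL, start_replication_request(INDEX, LEADER_ALIAS, INDEX))

    done = threading.Event()

    def write_loop():
        n = 0
        while not done.is_set():
            send(LEADER_URL, index_doc_request(INDEX, n))
            n += 1

    writer = threading.Thread(target=write_loop, daemon=True)
    writer.start()
    time.sleep(seconds_before_stop)

    # Stop replication while the writer thread is still indexing on the leader.
    send(FOLLOWER_URL, stop_replication_request(INDEX))

    time.sleep(seconds_after_stop)
    done.set()
    writer.join()


if __name__ == "__main__":
    reproduce()
```

The request builders are kept separate from `send` so the sequence can be inspected without a running cluster. After the run, the follower logs can be checked for `failed recovery` warnings.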
Additional context
[2021-10-26T05:48:56,418][WARN ][o.e.i.c.IndicesClusterStateService] [9e1fd8ef64269939f1dda9fe6ab88a1f] [grab3][2] marking and sending shard failed due to [failed recovery]
RecoveryFailedException[[grab3][2]: Recovery failed on {9e1fd8ef64269939f1dda9fe6ab88a1f}{qfPUCCDURWizi5yTknKaLg}{CKSwuUFIToa3YGazS4jCYQ}{1.2.3.4}{1.2.3.4:9300}{dimr}{dp_version=20210401, distributed_snapshot_deletion_enabled=false, cold_enabled=false, adv_sec_enabled=true, zone=us-east-1d, cross_cluster_transport_address=a:b:c:d, shard_indexing_pressure_enabled=true, di_number=0}]; nested: IndexShardRecoveryException[failed recovery]; nested: IllegalStateException[Maximum sequence number [11663] from last commit does not match global checkpoint [9863]];
at org.elasticsearch.index.shard.IndexShard.lambda$executeRecovery$21(IndexShard.java:2746)
at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:71)
at org.elasticsearch.index.shard.StoreRecovery.lambda$recoveryListener$6(StoreRecovery.java:364)
at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:71)
at org.elasticsearch.action.ActionListener.completeWith(ActionListener.java:328)
at org.elasticsearch.index.shard.StoreRecovery.recoverFromStore(StoreRecovery.java:96)
at org.elasticsearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:1945)
at org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:73)
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:752)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: [grab3/flbKJOheTEe-KKXjeFoKnw][[grab3][2]] IndexShardRecoveryException[failed recovery]; nested: IllegalStateException[Maximum sequence number [11663] from last commit does not match global checkpoint [9863]];
... 11 more
Caused by: java.lang.IllegalStateException: Maximum sequence number [11663] from last commit does not match global checkpoint [9863]
at org.elasticsearch.index.engine.ReadOnlyEngine.ensureMaxSeqNoEqualsToGlobalCheckpoint(ReadOnlyEngine.java:153)
at org.elasticsearch.index.engine.ReadOnlyEngine.<init>(ReadOnlyEngine.java:112)
at org.elasticsearch.index.engine.NoOpEngine.<init>(NoOpEngine.java:57)
at org.elasticsearch.index.shard.IndexShard.innerOpenEngineAndTranslog(IndexShard.java:1705)
at org.elasticsearch.index.shard.IndexShard.openEngineAndRecoverFromTranslog(IndexShard.java:1671)
at org.elasticsearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:437)
at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:98)
at org.elasticsearch.action.ActionListener.completeWith(ActionListener.java:325)
... 8 more
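For readers unfamiliar with the final `IllegalStateException`: the trace shows it being raised from `ReadOnlyEngine.ensureMaxSeqNoEqualsToGlobalCheckpoint` while `NoOpEngine` is being constructed. A simplified model of that comparison, using the two values from the log, is sketched below. It is an assumption-laden illustration in Python, not the engine's code; the real Java check may include further conditions that are not modeled here.

```python
def ensure_max_seq_no_equals_global_checkpoint(max_seq_no, global_checkpoint):
    """Simplified model: opening the engine fails when the two values differ."""
    if max_seq_no != global_checkpoint:
        raise RuntimeError(
            f"Maximum sequence number [{max_seq_no}] from last commit "
            f"does not match global checkpoint [{global_checkpoint}]"
        )


def check(max_seq_no, global_checkpoint):
    """Return None on success, or the failure message as a string."""
    try:
        ensure_max_seq_no_equals_global_checkpoint(max_seq_no, global_checkpoint)
    except RuntimeError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    # Values from the log above: the commit is ahead of the global checkpoint.
    print(check(11663, 9863))
    # Equal values pass the check.
    print(check(9863, 9863))
```

In the reported case the last commit contains operations up to sequence number 11663 while the global checkpoint is still 9863, so the comparison fails on every recovery attempt.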