
[BUG] Stop Replication causes a lot of failed recoveries in the follower OpenSearch while writes are underway on the leader index

gbbafna opened this issue on Oct 21, 2021 • 2 comments

Describe the bug

When we stop an ongoing replication while active indexing is happening on the leader, the follower OpenSearch process logs a large number of "failed recovery" exceptions.

To Reproduce

  1. Start replication.
  2. Start continuous indexing in a loop on the replicated index.
  3. Stop replication (see the sketch below).
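
A minimal sketch of these steps against a local two-cluster setup, using the plugin's `_plugins/_replication/<follower-index>/_start` and `_stop` REST endpoints. The hosts, the `leader-cluster` connection alias, and the `grab3` index name are assumptions to adapt to your environment, and the remote-cluster connection is assumed to be configured already:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Repro sketch: start replication, index continuously on the leader,
// then stop replication while writes are still in flight. Hosts, the
// "leader-cluster" connection alias, and the "grab3" index name are
// assumptions, not part of the original report.
public class StopReplicationRepro {

    static final HttpClient CLIENT = HttpClient.newHttpClient();
    static final String FOLLOWER = "http://localhost:9200";
    static final String LEADER = "http://localhost:9201";

    static String post(String base, String path, String json) throws Exception {
        HttpRequest req = HttpRequest.newBuilder(URI.create(base + path))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(json))
                .build();
        return CLIENT.send(req, HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) throws Exception {
        // 1. Start replication of the leader index onto the follower.
        post(FOLLOWER, "/_plugins/_replication/grab3/_start",
                "{\"leader_alias\":\"leader-cluster\",\"leader_index\":\"grab3\"}");

        // 2. Index continuously on the leader in a background loop.
        Thread indexer = new Thread(() -> {
            for (int i = 0; ; i++) {
                try {
                    post(LEADER, "/grab3/_doc", "{\"n\":" + i + "}");
                } catch (Exception e) {
                    return;
                }
            }
        });
        indexer.setDaemon(true);
        indexer.start();

        // 3. Stop replication while writes are still underway; the follower
        //    then starts logging repeated [failed recovery] warnings.
        Thread.sleep(10_000);
        post(FOLLOWER, "/_plugins/_replication/grab3/_stop", "{}");
    }
}
```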

Additional context

[2021-10-26T05:48:56,418][WARN ][o.e.i.c.IndicesClusterStateService] [9e1fd8ef64269939f1dda9fe6ab88a1f] [grab3][2] marking and sending shard failed due to [failed recovery]
RecoveryFailedException[[grab3][2]: Recovery failed on {9e1fd8ef64269939f1dda9fe6ab88a1f}{qfPUCCDURWizi5yTknKaLg}{CKSwuUFIToa3YGazS4jCYQ}{1.2.3.4}{1.2.3.4:9300}{dimr}{dp_version=20210401, distributed_snapshot_deletion_enabled=false, cold_enabled=false, adv_sec_enabled=true, zone=us-east-1d, cross_cluster_transport_address=a:b:c:d, shard_indexing_pressure_enabled=true, di_number=0}]; nested: IndexShardRecoveryException[failed recovery]; nested: IllegalStateException[Maximum sequence number [11663] from last commit does not match global checkpoint [9863]];
        at org.elasticsearch.index.shard.IndexShard.lambda$executeRecovery$21(IndexShard.java:2746)
        at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:71)
        at org.elasticsearch.index.shard.StoreRecovery.lambda$recoveryListener$6(StoreRecovery.java:364)
        at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:71)
        at org.elasticsearch.action.ActionListener.completeWith(ActionListener.java:328)
        at org.elasticsearch.index.shard.StoreRecovery.recoverFromStore(StoreRecovery.java:96)
        at org.elasticsearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:1945)
        at org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:73)
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:752)
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: [grab3/flbKJOheTEe-KKXjeFoKnw][[grab3][2]] IndexShardRecoveryException[failed recovery]; nested: IllegalStateException[Maximum sequence number [11663] from last commit does not match global checkpoint [9863]];
        ... 11 more
Caused by: java.lang.IllegalStateException: Maximum sequence number [11663] from last commit does not match global checkpoint [9863]
        at org.elasticsearch.index.engine.ReadOnlyEngine.ensureMaxSeqNoEqualsToGlobalCheckpoint(ReadOnlyEngine.java:153)
        at org.elasticsearch.index.engine.ReadOnlyEngine.<init>(ReadOnlyEngine.java:112)
        at org.elasticsearch.index.engine.NoOpEngine.<init>(NoOpEngine.java:57)
        at org.elasticsearch.index.shard.IndexShard.innerOpenEngineAndTranslog(IndexShard.java:1705)
        at org.elasticsearch.index.shard.IndexShard.openEngineAndRecoverFromTranslog(IndexShard.java:1671)
        at org.elasticsearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:437)
        at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:98)
        at org.elasticsearch.action.ActionListener.completeWith(ActionListener.java:325)
        ... 8 more
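
For context, the recovery keeps failing in ReadOnlyEngine.ensureMaxSeqNoEqualsToGlobalCheckpoint: when the shard is reopened as a NoOpEngine after replication stops, the engine requires the maximum sequence number of the last Lucene commit to equal the persisted global checkpoint. A paraphrased sketch of that invariant (illustrative, not the upstream source):

```java
// Paraphrase of the check that fails in the trace above
// (ReadOnlyEngine.ensureMaxSeqNoEqualsToGlobalCheckpoint); names and
// structure are illustrative, not the upstream source.
class SeqNoInvariant {

    static void ensureMaxSeqNoEqualsToGlobalCheckpoint(long maxSeqNo, long globalCheckpoint) {
        if (maxSeqNo != globalCheckpoint) {
            throw new IllegalStateException("Maximum sequence number [" + maxSeqNo
                    + "] from last commit does not match global checkpoint ["
                    + globalCheckpoint + "]");
        }
    }

    public static void main(String[] args) {
        // Values from the log above: the last commit (11663) is ahead of the
        // persisted global checkpoint (9863), so the engine refuses to open
        // and every recovery attempt of the shard fails.
        ensureMaxSeqNoEqualsToGlobalCheckpoint(11663L, 9863L);
    }
}
```

Stopping replication while the leader is still indexing appears to leave the follower's last commit ahead of its persisted global checkpoint, which matches the repeated [failed recovery] warnings above.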

gbbafna • Oct 21 '21

followCluster.log

gbbafna • Oct 21 '21