[BUG] Shard fails to re-assign after a rolling restart
Describe the bug
During a rolling restart of our OpenSearch cluster, some replica shards fail to re-assign to available nodes. The logs indicate that the destination node rejects the data because a "stale metadata checkpoint" is received from the primary shard. This suggests that the primary's state is changing during the recovery process, leading to a replication failure.
Allocation gives up after 5 retries and the shard stays unassigned. The log message explicitly states "shard has exceeded the maximum number of retries [5] on failed allocation attempts", and the root cause is a ReplicationFailedException caused by a stale checkpoint.
shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2025-09-03T23:50:03.463Z], failed_attempts[5], failed_nodes[[Qm1RnXJQQYqSrlqcBq-X6Q]], delayed=false, details[failed shard on node [Qm1RnXJQQYqSrlqcBq-X6Q]: failed recovery, failure RecoveryFailedException[[logstash-2025.08.22][5]: Recovery failed from {prod-eu2-opensearch-logs-g4nt}{jh9ILZaaQvOZGcrO3MiFwA}{CY6nBxCVSeStO8Tv-oTAuQ}{10.202.0.19}{10.202.0.19:9300}{dimr}{shard_indexing_pressure_enabled=true} into {prod-eu2-opensearch-logs-lntz}{Qm1RnXJQQYqSrlqcBq-X6Q}{6FOVN8scQTWmeIvfSoG8pQ}{10.202.0.17}{10.202.0.17:9300}{dimr}{shard_indexing_pressure_enabled=true} ([logstash-2025.08.22][5]: Recovery failed from {prod-eu2-opensearch-logs-g4nt}{jh9ILZaaQvOZGcrO3MiFwA}{CY6nBxCVSeStO8Tv-oTAuQ}{10.202.0.19}{10.202.0.19:9300}{dimr}{shard_indexing_pressure_enabled=true} into {prod-eu2-opensearch-logs-lntz}{Qm1RnXJQQYqSrlqcBq-X6Q}{6FOVN8scQTWmeIvfSoG8pQ}{10.202.0.17}{10.202.0.17:9300}{dimr}{shard_indexing_pressure_enabled=true})]; nested: RecoveryFailedException[[logstash-2025.08.22][5]: Recovery failed from {prod-eu2-opensearch-logs-g4nt}{jh9ILZaaQvOZGcrO3MiFwA}{CY6nBxCVSeStO8Tv-oTAuQ}{10.202.0.19}{10.202.0.19:9300}{dimr}{shard_indexing_pressure_enabled=true} into {prod-eu2-opensearch-logs-lntz}{Qm1RnXJQQYqSrlqcBq-X6Q}{6FOVN8scQTWmeIvfSoG8pQ}{10.202.0.17}{10.202.0.17:9300}{dimr}{shard_indexing_pressure_enabled=true}]; nested: RemoteTransportException[[prod-eu2-opensearch-logs-g4nt][10.202.0.19:9300][internal:index/shard/recovery/start_recovery]]; nested: RemoteTransportException[[prod-eu2-opensearch-logs-lntz][10.202.0.17:9300][internal:index/shard/replication/segments_sync]]; nested: ReplicationFailedException[Segment Replication failed]; nested: ReplicationFailedException[Rejecting stale metadata checkpoint [ReplicationCheckpoint{shardId=[logstash-2025.08.22][5], primaryTerm=3, segmentsGen=171, version=13559, size=32294083233, codec=ZSTD912, timestamp=0}] since initial checkpoint [ReplicationCheckpoint{shardId=[logstash-2025.08.22][5], primaryTerm=3, segmentsGen=1066, version=14449, size=32294083233, codec=ZSTD101, timestamp=1756943403278888666}] is ahead of it]; ], allocation_status[no_attempt]]]
This appears to be a bug where the primary and replica shards get out of sync during recovery: the primary's state moves on while an older copy of the data is still being sent, and the new replica correctly rejects the stale checkpoint.
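While a replica is stuck like this, the checkpoint lag between primary and replica can be inspected with the segment replication cat API (included here only as a suggestion for gathering more data; the index name is the one from the log above):

# Per-replica segment replication status, including checkpoint lag
GET _cat/segment_replication?v

# Detailed view for the affected index
GET _cat/segment_replication/logstash-2025.08.22?v&detailed=true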
Related component
Other
To Reproduce
- Disable shard allocation (a sketch of the settings calls is shown after this list).
- Restart an OpenSearch node.
- Enable shard allocation.
- The cluster never returns to green: the affected shards remain unassigned, blocking the subsequent steps of the rolling restart.
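For reference, steps 1 and 3 are the standard cluster-settings toggles used during a rolling restart; a minimal sketch (not our exact automation) looks like this:

# Step 1: disable replica shard allocation before stopping the node
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.enable": "primaries"
  }
}

# ... restart the node and wait for it to rejoin ...

# Step 3: re-enable allocation
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.enable": null
  }
}

# Step 4: wait for green -- this is where the cluster gets stuck
GET _cluster/health?wait_for_status=green&timeout=60s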
Expected behavior
The shard should successfully re-assign to the new node, completing the recovery process, and the cluster should transition back to a green status.
Additional Details
Environment
- OpenSearch Version: 3.2.0
- JVM Version: OpenJDK Runtime Environment Temurin-24.0.2+12 (build 24.0.2+12)
- OS: Ubuntu 22.04
Here is the exception from the logs:
[2025-09-04T20:17:29,251][WARN ][o.o.i.c.IndicesClusterStateService] [opensearch-logs-rgn5] [logstash-2025.07.26][0] marking and sending shard failed due to [failed recovery]
org.opensearch.indices.recovery.RecoveryFailedException: [logstash-2025.07.26][0]: Recovery failed from {opensearch-logs-g4nt}{jh9ILZaaQvOZGcrO3MiFwA}{CY6nBxCVSeStO8Tv-oTAuQ}{10.202.0.19}{10.202.0.19:9300}{dimr}{shard_indexing_pressure_enabled=true} into {opensearch-logs-rgn5}{2E9EHCzkSYaf4BhUL4IkDw}{BEmSmiNiROqAgW57jHV8Pw}{10.202.0.106}{10.202.0.106:9300}{dimr}{shard_indexing_pressure_enabled=true} ([logstash-2025.07.26][0]: Recovery failed from {opensearch-logs-g4nt}{jh9ILZaaQvOZGcrO3MiFwA}{CY6nBxCVSeStO8Tv-oTAuQ}{10.202.0.19}{10.202.0.19:9300}{dimr}{shard_indexing_pressure_enabled=true} into {opensearch-logs-rgn5}{2E9EHCzkSYaf4BhUL4IkDw}{BEmSmiNiROqAgW57jHV8Pw}{10.202.0.106}{10.202.0.106:9300}{dimr}{shard_indexing_pressure_enabled=true})
at org.opensearch.indices.recovery.RecoveryTarget.notifyListener(RecoveryTarget.java:141) [opensearch-3.2.0.jar:3.2.0]
at org.opensearch.indices.replication.common.ReplicationTarget.fail(ReplicationTarget.java:180) [opensearch-3.2.0.jar:3.2.0]
at org.opensearch.indices.replication.common.ReplicationCollection.fail(ReplicationCollection.java:212) [opensearch-3.2.0.jar:3.2.0]
at org.opensearch.indices.recovery.PeerRecoveryTargetService$RecoveryResponseHandler.onException(PeerRecoveryTargetService.java:759) [opensearch-3.2.0.jar:3.2.0]
at org.opensearch.indices.recovery.PeerRecoveryTargetService$RecoveryResponseHandler.handleException(PeerRecoveryTargetService.java:689) [opensearch-3.2.0.jar:3.2.0]
at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1607) [opensearch-3.2.0.jar:3.2.0]
at org.opensearch.transport.NativeMessageHandler.lambda$handleException$0(NativeMessageHandler.java:495) [opensearch-3.2.0.jar:3.2.0]
at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:916) [opensearch-3.2.0.jar:3.2.0]
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1095) [?:?]
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:619) [?:?]
at java.base/java.lang.Thread.run(Thread.java:1447) [?:?]
Caused by: org.opensearch.indices.recovery.RecoveryFailedException: [logstash-2025.07.26][0]: Recovery failed from {opensearch-logs-g4nt}{jh9ILZaaQvOZGcrO3MiFwA}{CY6nBxCVSeStO8Tv-oTAuQ}{10.202.0.19}{10.202.0.19:9300}{dimr}{shard_indexing_pressure_enabled=true} into {opensearch-logs-rgn5}{2E9EHCzkSYaf4BhUL4IkDw}{BEmSmiNiROqAgW57jHV8Pw}{10.202.0.106}{10.202.0.106:9300}{dimr}{shard_indexing_pressure_enabled=true}
... 8 more
Caused by: org.opensearch.transport.RemoteTransportException: [opensearch-logs-g4nt][10.202.0.19:9300][internal:index/shard/recovery/start_recovery]
Caused by: org.opensearch.transport.RemoteTransportException: [opensearch-logs-rgn5][10.202.0.106:9300][internal:index/shard/replication/segments_sync]
Caused by: org.opensearch.indices.replication.common.ReplicationFailedException: Segment Replication failed
at org.opensearch.indices.replication.SegmentReplicator$2.onFailure(SegmentReplicator.java:349) ~[opensearch-3.2.0.jar:3.2.0]
at org.opensearch.indices.replication.AbstractSegmentReplicationTarget.lambda$startReplication$1(AbstractSegmentReplicationTarget.java:168) ~[opensearch-3.2.0.jar:3.2.0]
at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) ~[opensearch-core-3.2.0.jar:3.2.0]
at org.opensearch.common.util.concurrent.ListenableFuture$1.doRun(ListenableFuture.java:126) ~[opensearch-3.2.0.jar:3.2.0]
at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[opensearch-3.2.0.jar:3.2.0]
at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:341) ~[opensearch-3.2.0.jar:3.2.0]
at org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:120) ~[opensearch-3.2.0.jar:3.2.0]
at org.opensearch.common.util.concurrent.ListenableFuture.lambda$done$0(ListenableFuture.java:112) ~[opensearch-3.2.0.jar:3.2.0]
at java.util.ArrayList.forEach(ArrayList.java:1604) ~[?:?]
at org.opensearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:112) ~[opensearch-3.2.0.jar:3.2.0]
at org.opensearch.common.util.concurrent.BaseFuture.set(BaseFuture.java:160) ~[opensearch-3.2.0.jar:3.2.0]
at org.opensearch.common.util.concurrent.ListenableFuture.onResponse(ListenableFuture.java:141) ~[opensearch-3.2.0.jar:3.2.0]
at org.opensearch.action.StepListener.innerOnResponse(StepListener.java:79) ~[opensearch-3.2.0.jar:3.2.0]
at org.opensearch.core.action.NotifyOnceListener.onResponse(NotifyOnceListener.java:58) ~[opensearch-core-3.2.0.jar:3.2.0]
at org.opensearch.action.ActionListenerResponseHandler.handleResponse(ActionListenerResponseHandler.java:70) ~[opensearch-3.2.0.jar:3.2.0]
at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1587) ~[opensearch-3.2.0.jar:3.2.0]
at org.opensearch.transport.NativeMessageHandler.doHandleResponse(NativeMessageHandler.java:468) ~[opensearch-3.2.0.jar:3.2.0]
at org.opensearch.transport.NativeMessageHandler.lambda$handleResponse$0(NativeMessageHandler.java:462) ~[opensearch-3.2.0.jar:3.2.0]
at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:916) ~[opensearch-3.2.0.jar:3.2.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1095) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:619) ~[?:?]
at java.lang.Thread.run(Thread.java:1447) ~[?:?]
Caused by: org.opensearch.indices.replication.common.ReplicationFailedException: Rejecting stale metadata checkpoint [ReplicationCheckpoint{shardId=[logstash-2025.07.26][0], primaryTerm=6, segmentsGen=107, version=11546, size=18157580954, codec=ZSTD912, timestamp=0}] since initial checkpoint [ReplicationCheckpoint{shardId=[logstash-2025.07.26][0], primaryTerm=6, segmentsGen=111, version=11548, size=18157580954, codec=ZSTD101, timestamp=1757017048994600029}] is ahead of it
at org.opensearch.indices.replication.AbstractSegmentReplicationTarget.lambda$startReplication$1(AbstractSegmentReplicationTarget.java:168) ~[opensearch-3.2.0.jar:3.2.0]
at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) ~[opensearch-core-3.2.0.jar:3.2.0]
at org.opensearch.common.util.concurrent.ListenableFuture$1.doRun(ListenableFuture.java:126) ~[opensearch-3.2.0.jar:3.2.0]
at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[opensearch-3.2.0.jar:3.2.0]
at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:341) ~[opensearch-3.2.0.jar:3.2.0]
at org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:120) ~[opensearch-3.2.0.jar:3.2.0]
at org.opensearch.common.util.concurrent.ListenableFuture.lambda$done$0(ListenableFuture.java:112) ~[opensearch-3.2.0.jar:3.2.0]
at java.util.ArrayList.forEach(ArrayList.java:1604) ~[?:?]
at org.opensearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:112) ~[opensearch-3.2.0.jar:3.2.0]
at org.opensearch.common.util.concurrent.BaseFuture.set(BaseFuture.java:160) ~[opensearch-3.2.0.jar:3.2.0]
at org.opensearch.common.util.concurrent.ListenableFuture.onResponse(ListenableFuture.java:141) ~[opensearch-3.2.0.jar:3.2.0]
at org.opensearch.action.StepListener.innerOnResponse(StepListener.java:79) ~[opensearch-3.2.0.jar:3.2.0]
at org.opensearch.core.action.NotifyOnceListener.onResponse(NotifyOnceListener.java:58) ~[opensearch-core-3.2.0.jar:3.2.0]
at org.opensearch.action.ActionListenerResponseHandler.handleResponse(ActionListenerResponseHandler.java:70) ~[opensearch-3.2.0.jar:3.2.0]
at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1587) ~[opensearch-3.2.0.jar:3.2.0]
at org.opensearch.transport.NativeMessageHandler.doHandleResponse(NativeMessageHandler.java:468) ~[opensearch-3.2.0.jar:3.2.0]
at org.opensearch.transport.NativeMessageHandler.lambda$handleResponse$0(NativeMessageHandler.java:462) ~[opensearch-3.2.0.jar:3.2.0]
at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:916) ~[opensearch-3.2.0.jar:3.2.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1095) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:619) ~[?:?]
at java.lang.Thread.run(Thread.java:1447) ~[?:?]
We have this exact issue as well: OpenSearch & OpenSearch Dashboards 3.2.0, deployed via the opensearch-k8s-operator chart. I initially thought the cause was that we had segment replication enabled by default in opensearch.yml, but after disabling segment replication and removing any shard allocation awareness attributes, I am still experiencing this issue.
The only fix I've found is to run POST _cluster/reroute?retry_failed=true.
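For completeness, that workaround plus the check I do afterwards is roughly:

# Clear the max_retry block and let allocation try again
POST _cluster/reroute?retry_failed=true

# Then wait for the replicas to come back
GET _cluster/health?wait_for_status=green&timeout=120s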
From GET _cluster/allocation/explain
{
"index": ".ds-prod-example-index-log-000168",
"shard": 0,
"primary": false,
"current_state": "unassigned",
"unassigned_info": {
"reason": "ALLOCATION_FAILED",
"at": "2025-09-15T03:45:07.220Z",
"failed_allocation_attempts": 5,
"details": "failed shard on node [x6D98Y6kT5WtqWiMVo-NHw]: failed recovery, failure RecoveryFailedException[[.ds-prod-example-index-log-000168][0]: Recovery failed from {prod-dc06-nodes-2}{qIsA2LgnTfmH0uDuaOYa3A}{H2y6NeTrQyKRT1XH6Iz2bg}{prod-dc06-nodes-2}{10.196.7.52:9300}{d}{zone=dc06, shard_indexing_pressure_enabled=true} into {prod-dc05-nodes-3}{x6D98Y6kT5WtqWiMVo-NHw}{MjijdrZ7R8mBlZJhSAVSEQ}{prod-dc05-nodes-3}{10.196.4.189:9300}{d}{zone=dc05, shard_indexing_pressure_enabled=true} ([.ds-prod-example-index-log-000168][0]: Recovery failed from {prod-dc06-nodes-2}{qIsA2LgnTfmH0uDuaOYa3A}{H2y6NeTrQyKRT1XH6Iz2bg}{prod-dc06-nodes-2}{10.196.7.52:9300}{d}{zone=dc06, shard_indexing_pressure_enabled=true} into {prod-dc05-nodes-3}{x6D98Y6kT5WtqWiMVo-NHw}{MjijdrZ7R8mBlZJhSAVSEQ}{prod-dc05-nodes-3}{10.196.4.189:9300}{d}{zone=dc05, shard_indexing_pressure_enabled=true})]; nested: RecoveryFailedException[[.ds-prod-example-index-log-000168][0]: Recovery failed from {prod-dc06-nodes-2}{qIsA2LgnTfmH0uDuaOYa3A}{H2y6NeTrQyKRT1XH6Iz2bg}{prod-dc06-nodes-2}{10.196.7.52:9300}{d}{zone=dc06, shard_indexing_pressure_enabled=true} into {prod-dc05-nodes-3}{x6D98Y6kT5WtqWiMVo-NHw}{MjijdrZ7R8mBlZJhSAVSEQ}{prod-dc05-nodes-3}{10.196.4.189:9300}{d}{zone=dc05, shard_indexing_pressure_enabled=true}]; nested: RemoteTransportException[[prod-dc06-nodes-2][10.196.7.52:9300][internal:index/shard/recovery/start_recovery]]; nested: RemoteTransportException[[prod-dc05-nodes-3][10.196.4.189:9300][internal:index/shard/replication/segments_sync]]; nested: ReplicationFailedException[Segment Replication failed]; nested: ReplicationFailedException[Rejecting stale metadata checkpoint [ReplicationCheckpoint{shardId=[.ds-prod-example-index-log-000168][0], primaryTerm=13, segmentsGen=362, version=10105, size=44528425663, codec=Lucene101, timestamp=1757601148237536190}] since initial checkpoint [ReplicationCheckpoint{shardId=[.ds-prod-example-index-log-000168][0], primaryTerm=13, segmentsGen=371, version=10110, size=44528425663, codec=Lucene101, timestamp=1757907907197675804}] is ahead of it]; ",
"last_allocation_status": "no_attempt"
},
"can_allocate": "no",
"allocate_explanation": "cannot allocate because allocation is not permitted to any of the nodes",
"node_allocation_decisions": [
{
"node_id": "-4VZHgecSyKJc8GxpIHG8g",
"node_name": "prod-dc05-nodes-1",
"transport_address": "10.196.0.21:9300",
"node_attributes": {
"zone": "dc05",
"shard_indexing_pressure_enabled": "true"
},
"node_decision": "no",
"deciders": [
{
"decider": "max_retry",
"decision": "NO",
"explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2025-09-15T03:45:07.220Z], failed_attempts[5], failed_nodes[[x6D98Y6kT5WtqWiMVo-NHw]], delayed=false, details[failed shard on node [x6D98Y6kT5WtqWiMVo-NHw]: failed recovery, failure RecoveryFailedException[[.ds-prod-example-index-log-000168][0]: Recovery failed from {prod-dc06-nodes-2}{qIsA2LgnTfmH0uDuaOYa3A}{H2y6NeTrQyKRT1XH6Iz2bg}{prod-dc06-nodes-2}{10.196.7.52:9300}{d}{zone=dc06, shard_indexing_pressure_enabled=true} into {prod-dc05-nodes-3}{x6D98Y6kT5WtqWiMVo-NHw}{MjijdrZ7R8mBlZJhSAVSEQ}{prod-dc05-nodes-3}{10.196.4.189:9300}{d}{zone=dc05, shard_indexing_pressure_enabled=true} ([.ds-prod-example-index-log-000168][0]: Recovery failed from {prod-dc06-nodes-2}{qIsA2LgnTfmH0uDuaOYa3A}{H2y6NeTrQyKRT1XH6Iz2bg}{prod-dc06-nodes-2}{10.196.7.52:9300}{d}{zone=dc06, shard_indexing_pressure_enabled=true} into {prod-dc05-nodes-3}{x6D98Y6kT5WtqWiMVo-NHw}{MjijdrZ7R8mBlZJhSAVSEQ}{prod-dc05-nodes-3}{10.196.4.189:9300}{d}{zone=dc05, shard_indexing_pressure_enabled=true})]; nested: RecoveryFailedException[[.ds-prod-example-index-log-000168][0]: Recovery failed from {prod-dc06-nodes-2}{qIsA2LgnTfmH0uDuaOYa3A}{H2y6NeTrQyKRT1XH6Iz2bg}{prod-dc06-nodes-2}{10.196.7.52:9300}{d}{zone=dc06, shard_indexing_pressure_enabled=true} into {prod-dc05-nodes-3}{x6D98Y6kT5WtqWiMVo-NHw}{MjijdrZ7R8mBlZJhSAVSEQ}{prod-dc05-nodes-3}{10.196.4.189:9300}{d}{zone=dc05, shard_indexing_pressure_enabled=true}]; nested: RemoteTransportException[[prod-dc06-nodes-2][10.196.7.52:9300][internal:index/shard/recovery/start_recovery]]; nested: RemoteTransportException[[prod-dc05-nodes-3][10.196.4.189:9300][internal:index/shard/replication/segments_sync]]; nested: ReplicationFailedException[Segment Replication failed]; nested: ReplicationFailedException[Rejecting stale metadata checkpoint [ReplicationCheckpoint{shardId=[.ds-prod-example-index-log-000168][0], primaryTerm=13, segmentsGen=362, version=10105, size=44528425663, codec=Lucene101, timestamp=1757601148237536190}] since initial checkpoint [ReplicationCheckpoint{shardId=[.ds-prod-example-index-log-000168][0], primaryTerm=13, segmentsGen=371, version=10110, size=44528425663, codec=Lucene101, timestamp=1757907907197675804}] is ahead of it]; ], allocation_status[no_attempt]]]"
}
]
},
{
"node_id": "1IbgAt0fRP-S4MCkhGwfVA",
"node_name": "prod-dc07-nodes-2",
"transport_address": "10.196.5.239:9300",
"node_attributes": {
"zone": "dc07",
"shard_indexing_pressure_enabled": "true"
},
"node_decision": "no",
"deciders": [
{
"decider": "max_retry",
"decision": "NO",
"explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2025-09-15T03:45:07.220Z], failed_attempts[5], failed_nodes[[x6D98Y6kT5WtqWiMVo-NHw]], delayed=false, details[failed shard on node [x6D98Y6kT5WtqWiMVo-NHw]: failed recovery, failure RecoveryFailedException[[.ds-prod-example-index-log-000168][0]: Recovery failed from {prod-dc06-nodes-2}{qIsA2LgnTfmH0uDuaOYa3A}{H2y6NeTrQyKRT1XH6Iz2bg}{prod-dc06-nodes-2}{10.196.7.52:9300}{d}{zone=dc06, shard_indexing_pressure_enabled=true} into {prod-dc05-nodes-3}{x6D98Y6kT5WtqWiMVo-NHw}{MjijdrZ7R8mBlZJhSAVSEQ}{prod-dc05-nodes-3}{10.196.4.189:9300}{d}{zone=dc05, shard_indexing_pressure_enabled=true} ([.ds-prod-example-index-log-000168][0]: Recovery failed from {prod-dc06-nodes-2}{qIsA2LgnTfmH0uDuaOYa3A}{H2y6NeTrQyKRT1XH6Iz2bg}{prod-dc06-nodes-2}{10.196.7.52:9300}{d}{zone=dc06, shard_indexing_pressure_enabled=true} into {prod-dc05-nodes-3}{x6D98Y6kT5WtqWiMVo-NHw}{MjijdrZ7R8mBlZJhSAVSEQ}{prod-dc05-nodes-3}{10.196.4.189:9300}{d}{zone=dc05, shard_indexing_pressure_enabled=true})]; nested: RecoveryFailedException[[.ds-prod-example-index-log-000168][0]: Recovery failed from {prod-dc06-nodes-2}{qIsA2LgnTfmH0uDuaOYa3A}{H2y6NeTrQyKRT1XH6Iz2bg}{prod-dc06-nodes-2}{10.196.7.52:9300}{d}{zone=dc06, shard_indexing_pressure_enabled=true} into {prod-dc05-nodes-3}{x6D98Y6kT5WtqWiMVo-NHw}{MjijdrZ7R8mBlZJhSAVSEQ}{prod-dc05-nodes-3}{10.196.4.189:9300}{d}{zone=dc05, shard_indexing_pressure_enabled=true}]; nested: RemoteTransportException[[prod-dc06-nodes-2][10.196.7.52:9300][internal:index/shard/recovery/start_recovery]]; nested: RemoteTransportException[[prod-dc05-nodes-3][10.196.4.189:9300][internal:index/shard/replication/segments_sync]]; nested: ReplicationFailedException[Segment Replication failed]; nested: ReplicationFailedException[Rejecting stale metadata checkpoint [ReplicationCheckpoint{shardId=[.ds-prod-example-index-log-000168][0], primaryTerm=13, segmentsGen=362, version=10105, size=44528425663, codec=Lucene101, timestamp=1757601148237536190}] since initial checkpoint [ReplicationCheckpoint{shardId=[.ds-prod-example-index-log-000168][0], primaryTerm=13, segmentsGen=371, version=10110, size=44528425663, codec=Lucene101, timestamp=1757907907197675804}] is ahead of it]; ], allocation_status[no_attempt]]]"
}
]
},
{
"node_id": "4yoJ3wSsQOeBnT7NiJFLaQ",
"node_name": "prod-dc06-nodes-0",
"transport_address": "10.196.1.236:9300",
"node_attributes": {
"zone": "dc06",
"shard_indexing_pressure_enabled": "true"
},
"node_decision": "no",
"deciders": [
{
"decider": "max_retry",
"decision": "NO",
"explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2025-09-15T03:45:07.220Z], failed_attempts[5], failed_nodes[[x6D98Y6kT5WtqWiMVo-NHw]], delayed=false, details[failed shard on node [x6D98Y6kT5WtqWiMVo-NHw]: failed recovery, failure RecoveryFailedException[[.ds-prod-example-index-log-000168][0]: Recovery failed from {prod-dc06-nodes-2}{qIsA2LgnTfmH0uDuaOYa3A}{H2y6NeTrQyKRT1XH6Iz2bg}{prod-dc06-nodes-2}{10.196.7.52:9300}{d}{zone=dc06, shard_indexing_pressure_enabled=true} into {prod-dc05-nodes-3}{x6D98Y6kT5WtqWiMVo-NHw}{MjijdrZ7R8mBlZJhSAVSEQ}{prod-dc05-nodes-3}{10.196.4.189:9300}{d}{zone=dc05, shard_indexing_pressure_enabled=true} ([.ds-prod-example-index-log-000168][0]: Recovery failed from {prod-dc06-nodes-2}{qIsA2LgnTfmH0uDuaOYa3A}{H2y6NeTrQyKRT1XH6Iz2bg}{prod-dc06-nodes-2}{10.196.7.52:9300}{d}{zone=dc06, shard_indexing_pressure_enabled=true} into {prod-dc05-nodes-3}{x6D98Y6kT5WtqWiMVo-NHw}{MjijdrZ7R8mBlZJhSAVSEQ}{prod-dc05-nodes-3}{10.196.4.189:9300}{d}{zone=dc05, shard_indexing_pressure_enabled=true})]; nested: RecoveryFailedException[[.ds-prod-example-index-log-000168][0]: Recovery failed from {prod-dc06-nodes-2}{qIsA2LgnTfmH0uDuaOYa3A}{H2y6NeTrQyKRT1XH6Iz2bg}{prod-dc06-nodes-2}{10.196.7.52:9300}{d}{zone=dc06, shard_indexing_pressure_enabled=true} into {prod-dc05-nodes-3}{x6D98Y6kT5WtqWiMVo-NHw}{MjijdrZ7R8mBlZJhSAVSEQ}{prod-dc05-nodes-3}{10.196.4.189:9300}{d}{zone=dc05, shard_indexing_pressure_enabled=true}]; nested: RemoteTransportException[[prod-dc06-nodes-2][10.196.7.52:9300][internal:index/shard/recovery/start_recovery]]; nested: RemoteTransportException[[prod-dc05-nodes-3][10.196.4.189:9300][internal:index/shard/replication/segments_sync]]; nested: ReplicationFailedException[Segment Replication failed]; nested: ReplicationFailedException[Rejecting stale metadata checkpoint [ReplicationCheckpoint{shardId=[.ds-prod-example-index-log-000168][0], primaryTerm=13, segmentsGen=362, version=10105, size=44528425663, codec=Lucene101, timestamp=1757601148237536190}] since initial checkpoint [ReplicationCheckpoint{shardId=[.ds-prod-example-index-log-000168][0], primaryTerm=13, segmentsGen=371, version=10110, size=44528425663, codec=Lucene101, timestamp=1757907907197675804}] is ahead of it]; ], allocation_status[no_attempt]]]"
}
]
},
{
"node_id": "Xwayap-FSlWXMJGBnAhE6g",
"node_name": "prod-dc06-nodes-3",
"transport_address": "10.196.7.142:9300",
"node_attributes": {
"zone": "dc06",
"shard_indexing_pressure_enabled": "true"
},
"node_decision": "no",
"deciders": [
{
"decider": "max_retry",
"decision": "NO",
"explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2025-09-15T03:45:07.220Z], failed_attempts[5], failed_nodes[[x6D98Y6kT5WtqWiMVo-NHw]], delayed=false, details[failed shard on node [x6D98Y6kT5WtqWiMVo-NHw]: failed recovery, failure RecoveryFailedException[[.ds-prod-example-index-log-000168][0]: Recovery failed from {prod-dc06-nodes-2}{qIsA2LgnTfmH0uDuaOYa3A}{H2y6NeTrQyKRT1XH6Iz2bg}{prod-dc06-nodes-2}{10.196.7.52:9300}{d}{zone=dc06, shard_indexing_pressure_enabled=true} into {prod-dc05-nodes-3}{x6D98Y6kT5WtqWiMVo-NHw}{MjijdrZ7R8mBlZJhSAVSEQ}{prod-dc05-nodes-3}{10.196.4.189:9300}{d}{zone=dc05, shard_indexing_pressure_enabled=true} ([.ds-prod-example-index-log-000168][0]: Recovery failed from {prod-dc06-nodes-2}{qIsA2LgnTfmH0uDuaOYa3A}{H2y6NeTrQyKRT1XH6Iz2bg}{prod-dc06-nodes-2}{10.196.7.52:9300}{d}{zone=dc06, shard_indexing_pressure_enabled=true} into {prod-dc05-nodes-3}{x6D98Y6kT5WtqWiMVo-NHw}{MjijdrZ7R8mBlZJhSAVSEQ}{prod-dc05-nodes-3}{10.196.4.189:9300}{d}{zone=dc05, shard_indexing_pressure_enabled=true})]; nested: RecoveryFailedException[[.ds-prod-example-index-log-000168][0]: Recovery failed from {prod-dc06-nodes-2}{qIsA2LgnTfmH0uDuaOYa3A}{H2y6NeTrQyKRT1XH6Iz2bg}{prod-dc06-nodes-2}{10.196.7.52:9300}{d}{zone=dc06, shard_indexing_pressure_enabled=true} into {prod-dc05-nodes-3}{x6D98Y6kT5WtqWiMVo-NHw}{MjijdrZ7R8mBlZJhSAVSEQ}{prod-dc05-nodes-3}{10.196.4.189:9300}{d}{zone=dc05, shard_indexing_pressure_enabled=true}]; nested: RemoteTransportException[[prod-dc06-nodes-2][10.196.7.52:9300][internal:index/shard/recovery/start_recovery]]; nested: RemoteTransportException[[prod-dc05-nodes-3][10.196.4.189:9300][internal:index/shard/replication/segments_sync]]; nested: ReplicationFailedException[Segment Replication failed]; nested: ReplicationFailedException[Rejecting stale metadata checkpoint [ReplicationCheckpoint{shardId=[.ds-prod-example-index-log-000168][0], primaryTerm=13, segmentsGen=362, version=10105, size=44528425663, codec=Lucene101, timestamp=1757601148237536190}] since initial checkpoint [ReplicationCheckpoint{shardId=[.ds-prod-example-index-log-000168][0], primaryTerm=13, segmentsGen=371, version=10110, size=44528425663, codec=Lucene101, timestamp=1757907907197675804}] is ahead of it]; ], allocation_status[no_attempt]]]"
}
]
},
{
"node_id": "aJWWhXYwS9mVnYyga4-dJg",
"node_name": "prod-dc05-nodes-0",
"transport_address": "10.196.4.230:9300",
"node_attributes": {
"zone": "dc05",
"shard_indexing_pressure_enabled": "true"
},
"node_decision": "no",
"deciders": [
{
"decider": "max_retry",
"decision": "NO",
"explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2025-09-15T03:45:07.220Z], failed_attempts[5], failed_nodes[[x6D98Y6kT5WtqWiMVo-NHw]], delayed=false, details[failed shard on node [x6D98Y6kT5WtqWiMVo-NHw]: failed recovery, failure RecoveryFailedException[[.ds-prod-example-index-log-000168][0]: Recovery failed from {prod-dc06-nodes-2}{qIsA2LgnTfmH0uDuaOYa3A}{H2y6NeTrQyKRT1XH6Iz2bg}{prod-dc06-nodes-2}{10.196.7.52:9300}{d}{zone=dc06, shard_indexing_pressure_enabled=true} into {prod-dc05-nodes-3}{x6D98Y6kT5WtqWiMVo-NHw}{MjijdrZ7R8mBlZJhSAVSEQ}{prod-dc05-nodes-3}{10.196.4.189:9300}{d}{zone=dc05, shard_indexing_pressure_enabled=true} ([.ds-prod-example-index-log-000168][0]: Recovery failed from {prod-dc06-nodes-2}{qIsA2LgnTfmH0uDuaOYa3A}{H2y6NeTrQyKRT1XH6Iz2bg}{prod-dc06-nodes-2}{10.196.7.52:9300}{d}{zone=dc06, shard_indexing_pressure_enabled=true} into {prod-dc05-nodes-3}{x6D98Y6kT5WtqWiMVo-NHw}{MjijdrZ7R8mBlZJhSAVSEQ}{prod-dc05-nodes-3}{10.196.4.189:9300}{d}{zone=dc05, shard_indexing_pressure_enabled=true})]; nested: RecoveryFailedException[[.ds-prod-example-index-log-000168][0]: Recovery failed from {prod-dc06-nodes-2}{qIsA2LgnTfmH0uDuaOYa3A}{H2y6NeTrQyKRT1XH6Iz2bg}{prod-dc06-nodes-2}{10.196.7.52:9300}{d}{zone=dc06, shard_indexing_pressure_enabled=true} into {prod-dc05-nodes-3}{x6D98Y6kT5WtqWiMVo-NHw}{MjijdrZ7R8mBlZJhSAVSEQ}{prod-dc05-nodes-3}{10.196.4.189:9300}{d}{zone=dc05, shard_indexing_pressure_enabled=true}]; nested: RemoteTransportException[[prod-dc06-nodes-2][10.196.7.52:9300][internal:index/shard/recovery/start_recovery]]; nested: RemoteTransportException[[prod-dc05-nodes-3][10.196.4.189:9300][internal:index/shard/replication/segments_sync]]; nested: ReplicationFailedException[Segment Replication failed]; nested: ReplicationFailedException[Rejecting stale metadata checkpoint [ReplicationCheckpoint{shardId=[.ds-prod-example-index-log-000168][0], primaryTerm=13, segmentsGen=362, version=10105, size=44528425663, codec=Lucene101, timestamp=1757601148237536190}] since initial checkpoint [ReplicationCheckpoint{shardId=[.ds-prod-example-index-log-000168][0], primaryTerm=13, segmentsGen=371, version=10110, size=44528425663, codec=Lucene101, timestamp=1757907907197675804}] is ahead of it]; ], allocation_status[no_attempt]]]"
}
]
},
{
"node_id": "jV3xc6GxQfWtFbZsPdsEcA",
"node_name": "prod-dc07-nodes-3",
"transport_address": "10.196.5.119:9300",
"node_attributes": {
"zone": "dc07",
"shard_indexing_pressure_enabled": "true"
},
"node_decision": "no",
"deciders": [
{
"decider": "max_retry",
"decision": "NO",
"explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2025-09-15T03:45:07.220Z], failed_attempts[5], failed_nodes[[x6D98Y6kT5WtqWiMVo-NHw]], delayed=false, details[failed shard on node [x6D98Y6kT5WtqWiMVo-NHw]: failed recovery, failure RecoveryFailedException[[.ds-prod-example-index-log-000168][0]: Recovery failed from {prod-dc06-nodes-2}{qIsA2LgnTfmH0uDuaOYa3A}{H2y6NeTrQyKRT1XH6Iz2bg}{prod-dc06-nodes-2}{10.196.7.52:9300}{d}{zone=dc06, shard_indexing_pressure_enabled=true} into {prod-dc05-nodes-3}{x6D98Y6kT5WtqWiMVo-NHw}{MjijdrZ7R8mBlZJhSAVSEQ}{prod-dc05-nodes-3}{10.196.4.189:9300}{d}{zone=dc05, shard_indexing_pressure_enabled=true} ([.ds-prod-example-index-log-000168][0]: Recovery failed from {prod-dc06-nodes-2}{qIsA2LgnTfmH0uDuaOYa3A}{H2y6NeTrQyKRT1XH6Iz2bg}{prod-dc06-nodes-2}{10.196.7.52:9300}{d}{zone=dc06, shard_indexing_pressure_enabled=true} into {prod-dc05-nodes-3}{x6D98Y6kT5WtqWiMVo-NHw}{MjijdrZ7R8mBlZJhSAVSEQ}{prod-dc05-nodes-3}{10.196.4.189:9300}{d}{zone=dc05, shard_indexing_pressure_enabled=true})]; nested: RecoveryFailedException[[.ds-prod-example-index-log-000168][0]: Recovery failed from {prod-dc06-nodes-2}{qIsA2LgnTfmH0uDuaOYa3A}{H2y6NeTrQyKRT1XH6Iz2bg}{prod-dc06-nodes-2}{10.196.7.52:9300}{d}{zone=dc06, shard_indexing_pressure_enabled=true} into {prod-dc05-nodes-3}{x6D98Y6kT5WtqWiMVo-NHw}{MjijdrZ7R8mBlZJhSAVSEQ}{prod-dc05-nodes-3}{10.196.4.189:9300}{d}{zone=dc05, shard_indexing_pressure_enabled=true}]; nested: RemoteTransportException[[prod-dc06-nodes-2][10.196.7.52:9300][internal:index/shard/recovery/start_recovery]]; nested: RemoteTransportException[[prod-dc05-nodes-3][10.196.4.189:9300][internal:index/shard/replication/segments_sync]]; nested: ReplicationFailedException[Segment Replication failed]; nested: ReplicationFailedException[Rejecting stale metadata checkpoint [ReplicationCheckpoint{shardId=[.ds-prod-example-index-log-000168][0], primaryTerm=13, segmentsGen=362, version=10105, size=44528425663, codec=Lucene101, timestamp=1757601148237536190}] since initial checkpoint [ReplicationCheckpoint{shardId=[.ds-prod-example-index-log-000168][0], primaryTerm=13, segmentsGen=371, version=10110, size=44528425663, codec=Lucene101, timestamp=1757907907197675804}] is ahead of it]; ], allocation_status[no_attempt]]]"
}
]
},
{
"node_id": "jWwx8zSRQIuK6dyhRSI-2w",
"node_name": "prod-dc07-nodes-0",
"transport_address": "10.196.3.244:9300",
"node_attributes": {
"zone": "dc07",
"shard_indexing_pressure_enabled": "true"
},
"node_decision": "no",
"deciders": [
{
"decider": "max_retry",
"decision": "NO",
"explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2025-09-15T03:45:07.220Z], failed_attempts[5], failed_nodes[[x6D98Y6kT5WtqWiMVo-NHw]], delayed=false, details[failed shard on node [x6D98Y6kT5WtqWiMVo-NHw]: failed recovery, failure RecoveryFailedException[[.ds-prod-example-index-log-000168][0]: Recovery failed from {prod-dc06-nodes-2}{qIsA2LgnTfmH0uDuaOYa3A}{H2y6NeTrQyKRT1XH6Iz2bg}{prod-dc06-nodes-2}{10.196.7.52:9300}{d}{zone=dc06, shard_indexing_pressure_enabled=true} into {prod-dc05-nodes-3}{x6D98Y6kT5WtqWiMVo-NHw}{MjijdrZ7R8mBlZJhSAVSEQ}{prod-dc05-nodes-3}{10.196.4.189:9300}{d}{zone=dc05, shard_indexing_pressure_enabled=true} ([.ds-prod-example-index-log-000168][0]: Recovery failed from {prod-dc06-nodes-2}{qIsA2LgnTfmH0uDuaOYa3A}{H2y6NeTrQyKRT1XH6Iz2bg}{prod-dc06-nodes-2}{10.196.7.52:9300}{d}{zone=dc06, shard_indexing_pressure_enabled=true} into {prod-dc05-nodes-3}{x6D98Y6kT5WtqWiMVo-NHw}{MjijdrZ7R8mBlZJhSAVSEQ}{prod-dc05-nodes-3}{10.196.4.189:9300}{d}{zone=dc05, shard_indexing_pressure_enabled=true})]; nested: RecoveryFailedException[[.ds-prod-example-index-log-000168][0]: Recovery failed from {prod-dc06-nodes-2}{qIsA2LgnTfmH0uDuaOYa3A}{H2y6NeTrQyKRT1XH6Iz2bg}{prod-dc06-nodes-2}{10.196.7.52:9300}{d}{zone=dc06, shard_indexing_pressure_enabled=true} into {prod-dc05-nodes-3}{x6D98Y6kT5WtqWiMVo-NHw}{MjijdrZ7R8mBlZJhSAVSEQ}{prod-dc05-nodes-3}{10.196.4.189:9300}{d}{zone=dc05, shard_indexing_pressure_enabled=true}]; nested: RemoteTransportException[[prod-dc06-nodes-2][10.196.7.52:9300][internal:index/shard/recovery/start_recovery]]; nested: RemoteTransportException[[prod-dc05-nodes-3][10.196.4.189:9300][internal:index/shard/replication/segments_sync]]; nested: ReplicationFailedException[Segment Replication failed]; nested: ReplicationFailedException[Rejecting stale metadata checkpoint [ReplicationCheckpoint{shardId=[.ds-prod-example-index-log-000168][0], primaryTerm=13, segmentsGen=362, version=10105, size=44528425663, codec=Lucene101, timestamp=1757601148237536190}] since initial checkpoint [ReplicationCheckpoint{shardId=[.ds-prod-example-index-log-000168][0], primaryTerm=13, segmentsGen=371, version=10110, size=44528425663, codec=Lucene101, timestamp=1757907907197675804}] is ahead of it]; ], allocation_status[no_attempt]]]"
}
]
},
{
"node_id": "jbE5-elNQsGn8PzTJS-eEg",
"node_name": "prod-dc06-nodes-1",
"transport_address": "10.196.8.86:9300",
"node_attributes": {
"zone": "dc06",
"shard_indexing_pressure_enabled": "true"
},
"node_decision": "no",
"deciders": [
{
"decider": "max_retry",
"decision": "NO",
"explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2025-09-15T03:45:07.220Z], failed_attempts[5], failed_nodes[[x6D98Y6kT5WtqWiMVo-NHw]], delayed=false, details[failed shard on node [x6D98Y6kT5WtqWiMVo-NHw]: failed recovery, failure RecoveryFailedException[[.ds-prod-example-index-log-000168][0]: Recovery failed from {prod-dc06-nodes-2}{qIsA2LgnTfmH0uDuaOYa3A}{H2y6NeTrQyKRT1XH6Iz2bg}{prod-dc06-nodes-2}{10.196.7.52:9300}{d}{zone=dc06, shard_indexing_pressure_enabled=true} into {prod-dc05-nodes-3}{x6D98Y6kT5WtqWiMVo-NHw}{MjijdrZ7R8mBlZJhSAVSEQ}{prod-dc05-nodes-3}{10.196.4.189:9300}{d}{zone=dc05, shard_indexing_pressure_enabled=true} ([.ds-prod-example-index-log-000168][0]: Recovery failed from {prod-dc06-nodes-2}{qIsA2LgnTfmH0uDuaOYa3A}{H2y6NeTrQyKRT1XH6Iz2bg}{prod-dc06-nodes-2}{10.196.7.52:9300}{d}{zone=dc06, shard_indexing_pressure_enabled=true} into {prod-dc05-nodes-3}{x6D98Y6kT5WtqWiMVo-NHw}{MjijdrZ7R8mBlZJhSAVSEQ}{prod-dc05-nodes-3}{10.196.4.189:9300}{d}{zone=dc05, shard_indexing_pressure_enabled=true})]; nested: RecoveryFailedException[[.ds-prod-example-index-log-000168][0]: Recovery failed from {prod-dc06-nodes-2}{qIsA2LgnTfmH0uDuaOYa3A}{H2y6NeTrQyKRT1XH6Iz2bg}{prod-dc06-nodes-2}{10.196.7.52:9300}{d}{zone=dc06, shard_indexing_pressure_enabled=true} into {prod-dc05-nodes-3}{x6D98Y6kT5WtqWiMVo-NHw}{MjijdrZ7R8mBlZJhSAVSEQ}{prod-dc05-nodes-3}{10.196.4.189:9300}{d}{zone=dc05, shard_indexing_pressure_enabled=true}]; nested: RemoteTransportException[[prod-dc06-nodes-2][10.196.7.52:9300][internal:index/shard/recovery/start_recovery]]; nested: RemoteTransportException[[prod-dc05-nodes-3][10.196.4.189:9300][internal:index/shard/replication/segments_sync]]; nested: ReplicationFailedException[Segment Replication failed]; nested: ReplicationFailedException[Rejecting stale metadata checkpoint [ReplicationCheckpoint{shardId=[.ds-prod-example-index-log-000168][0], primaryTerm=13, segmentsGen=362, version=10105, size=44528425663, codec=Lucene101, timestamp=1757601148237536190}] since initial checkpoint [ReplicationCheckpoint{shardId=[.ds-prod-example-index-log-000168][0], primaryTerm=13, segmentsGen=371, version=10110, size=44528425663, codec=Lucene101, timestamp=1757907907197675804}] is ahead of it]; ], allocation_status[no_attempt]]]"
}
]
},
{
"node_id": "pyoSCpbwRcm-O4ks7hpZKw",
"node_name": "prod-dc07-nodes-1",
"transport_address": "10.196.2.90:9300",
"node_attributes": {
"zone": "dc07",
"shard_indexing_pressure_enabled": "true"
},
"node_decision": "no",
"deciders": [
{
"decider": "max_retry",
"decision": "NO",
"explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2025-09-15T03:45:07.220Z], failed_attempts[5], failed_nodes[[x6D98Y6kT5WtqWiMVo-NHw]], delayed=false, details[failed shard on node [x6D98Y6kT5WtqWiMVo-NHw]: failed recovery, failure RecoveryFailedException[[.ds-prod-example-index-log-000168][0]: Recovery failed from {prod-dc06-nodes-2}{qIsA2LgnTfmH0uDuaOYa3A}{H2y6NeTrQyKRT1XH6Iz2bg}{prod-dc06-nodes-2}{10.196.7.52:9300}{d}{zone=dc06, shard_indexing_pressure_enabled=true} into {prod-dc05-nodes-3}{x6D98Y6kT5WtqWiMVo-NHw}{MjijdrZ7R8mBlZJhSAVSEQ}{prod-dc05-nodes-3}{10.196.4.189:9300}{d}{zone=dc05, shard_indexing_pressure_enabled=true} ([.ds-prod-example-index-log-000168][0]: Recovery failed from {prod-dc06-nodes-2}{qIsA2LgnTfmH0uDuaOYa3A}{H2y6NeTrQyKRT1XH6Iz2bg}{prod-dc06-nodes-2}{10.196.7.52:9300}{d}{zone=dc06, shard_indexing_pressure_enabled=true} into {prod-dc05-nodes-3}{x6D98Y6kT5WtqWiMVo-NHw}{MjijdrZ7R8mBlZJhSAVSEQ}{prod-dc05-nodes-3}{10.196.4.189:9300}{d}{zone=dc05, shard_indexing_pressure_enabled=true})]; nested: RecoveryFailedException[[.ds-prod-example-index-log-000168][0]: Recovery failed from {prod-dc06-nodes-2}{qIsA2LgnTfmH0uDuaOYa3A}{H2y6NeTrQyKRT1XH6Iz2bg}{prod-dc06-nodes-2}{10.196.7.52:9300}{d}{zone=dc06, shard_indexing_pressure_enabled=true} into {prod-dc05-nodes-3}{x6D98Y6kT5WtqWiMVo-NHw}{MjijdrZ7R8mBlZJhSAVSEQ}{prod-dc05-nodes-3}{10.196.4.189:9300}{d}{zone=dc05, shard_indexing_pressure_enabled=true}]; nested: RemoteTransportException[[prod-dc06-nodes-2][10.196.7.52:9300][internal:index/shard/recovery/start_recovery]]; nested: RemoteTransportException[[prod-dc05-nodes-3][10.196.4.189:9300][internal:index/shard/replication/segments_sync]]; nested: ReplicationFailedException[Segment Replication failed]; nested: ReplicationFailedException[Rejecting stale metadata checkpoint [ReplicationCheckpoint{shardId=[.ds-prod-example-index-log-000168][0], primaryTerm=13, segmentsGen=362, version=10105, size=44528425663, codec=Lucene101, timestamp=1757601148237536190}] since initial checkpoint [ReplicationCheckpoint{shardId=[.ds-prod-example-index-log-000168][0], primaryTerm=13, segmentsGen=371, version=10110, size=44528425663, codec=Lucene101, timestamp=1757907907197675804}] is ahead of it]; ], allocation_status[no_attempt]]]"
}
]
},
{
"node_id": "qIsA2LgnTfmH0uDuaOYa3A",
"node_name": "prod-dc06-nodes-2",
"transport_address": "10.196.7.52:9300",
"node_attributes": {
"zone": "dc06",
"shard_indexing_pressure_enabled": "true"
},
"node_decision": "no",
"deciders": [
{
"decider": "max_retry",
"decision": "NO",
"explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2025-09-15T03:45:07.220Z], failed_attempts[5], failed_nodes[[x6D98Y6kT5WtqWiMVo-NHw]], delayed=false, details[failed shard on node [x6D98Y6kT5WtqWiMVo-NHw]: failed recovery, failure RecoveryFailedException[[.ds-prod-example-index-log-000168][0]: Recovery failed from {prod-dc06-nodes-2}{qIsA2LgnTfmH0uDuaOYa3A}{H2y6NeTrQyKRT1XH6Iz2bg}{prod-dc06-nodes-2}{10.196.7.52:9300}{d}{zone=dc06, shard_indexing_pressure_enabled=true} into {prod-dc05-nodes-3}{x6D98Y6kT5WtqWiMVo-NHw}{MjijdrZ7R8mBlZJhSAVSEQ}{prod-dc05-nodes-3}{10.196.4.189:9300}{d}{zone=dc05, shard_indexing_pressure_enabled=true} ([.ds-prod-example-index-log-000168][0]: Recovery failed from {prod-dc06-nodes-2}{qIsA2LgnTfmH0uDuaOYa3A}{H2y6NeTrQyKRT1XH6Iz2bg}{prod-dc06-nodes-2}{10.196.7.52:9300}{d}{zone=dc06, shard_indexing_pressure_enabled=true} into {prod-dc05-nodes-3}{x6D98Y6kT5WtqWiMVo-NHw}{MjijdrZ7R8mBlZJhSAVSEQ}{prod-dc05-nodes-3}{10.196.4.189:9300}{d}{zone=dc05, shard_indexing_pressure_enabled=true})]; nested: RecoveryFailedException[[.ds-prod-example-index-log-000168][0]: Recovery failed from {prod-dc06-nodes-2}{qIsA2LgnTfmH0uDuaOYa3A}{H2y6NeTrQyKRT1XH6Iz2bg}{prod-dc06-nodes-2}{10.196.7.52:9300}{d}{zone=dc06, shard_indexing_pressure_enabled=true} into {prod-dc05-nodes-3}{x6D98Y6kT5WtqWiMVo-NHw}{MjijdrZ7R8mBlZJhSAVSEQ}{prod-dc05-nodes-3}{10.196.4.189:9300}{d}{zone=dc05, shard_indexing_pressure_enabled=true}]; nested: RemoteTransportException[[prod-dc06-nodes-2][10.196.7.52:9300][internal:index/shard/recovery/start_recovery]]; nested: RemoteTransportException[[prod-dc05-nodes-3][10.196.4.189:9300][internal:index/shard/replication/segments_sync]]; nested: ReplicationFailedException[Segment Replication failed]; nested: ReplicationFailedException[Rejecting stale metadata checkpoint [ReplicationCheckpoint{shardId=[.ds-prod-example-index-log-000168][0], primaryTerm=13, segmentsGen=362, version=10105, size=44528425663, codec=Lucene101, timestamp=1757601148237536190}] since initial checkpoint [ReplicationCheckpoint{shardId=[.ds-prod-example-index-log-000168][0], primaryTerm=13, segmentsGen=371, version=10110, size=44528425663, codec=Lucene101, timestamp=1757907907197675804}] is ahead of it]; ], allocation_status[no_attempt]]]"
},
{
"decider": "same_shard",
"decision": "NO",
"explanation": "a copy of this shard is already allocated to this node [[.ds-prod-example-index-log-000168][0], node[qIsA2LgnTfmH0uDuaOYa3A], [P], s[STARTED], a[id=pxd7Wo5QRAmaE6bVg6MpmA]]"
}
]
},
{
"node_id": "vapazD8aTY-_6wUEpFivuQ",
"node_name": "prod-dc05-nodes-2",
"transport_address": "10.196.6.151:9300",
"node_attributes": {
"zone": "dc05",
"shard_indexing_pressure_enabled": "true"
},
"node_decision": "no",
"deciders": [
{
"decider": "max_retry",
"decision": "NO",
"explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2025-09-15T03:45:07.220Z], failed_attempts[5], failed_nodes[[x6D98Y6kT5WtqWiMVo-NHw]], delayed=false, details[failed shard on node [x6D98Y6kT5WtqWiMVo-NHw]: failed recovery, failure RecoveryFailedException[[.ds-prod-example-index-log-000168][0]: Recovery failed from {prod-dc06-nodes-2}{qIsA2LgnTfmH0uDuaOYa3A}{H2y6NeTrQyKRT1XH6Iz2bg}{prod-dc06-nodes-2}{10.196.7.52:9300}{d}{zone=dc06, shard_indexing_pressure_enabled=true} into {prod-dc05-nodes-3}{x6D98Y6kT5WtqWiMVo-NHw}{MjijdrZ7R8mBlZJhSAVSEQ}{prod-dc05-nodes-3}{10.196.4.189:9300}{d}{zone=dc05, shard_indexing_pressure_enabled=true} ([.ds-prod-example-index-log-000168][0]: Recovery failed from {prod-dc06-nodes-2}{qIsA2LgnTfmH0uDuaOYa3A}{H2y6NeTrQyKRT1XH6Iz2bg}{prod-dc06-nodes-2}{10.196.7.52:9300}{d}{zone=dc06, shard_indexing_pressure_enabled=true} into {prod-dc05-nodes-3}{x6D98Y6kT5WtqWiMVo-NHw}{MjijdrZ7R8mBlZJhSAVSEQ}{prod-dc05-nodes-3}{10.196.4.189:9300}{d}{zone=dc05, shard_indexing_pressure_enabled=true})]; nested: RecoveryFailedException[[.ds-prod-example-index-log-000168][0]: Recovery failed from {prod-dc06-nodes-2}{qIsA2LgnTfmH0uDuaOYa3A}{H2y6NeTrQyKRT1XH6Iz2bg}{prod-dc06-nodes-2}{10.196.7.52:9300}{d}{zone=dc06, shard_indexing_pressure_enabled=true} into {prod-dc05-nodes-3}{x6D98Y6kT5WtqWiMVo-NHw}{MjijdrZ7R8mBlZJhSAVSEQ}{prod-dc05-nodes-3}{10.196.4.189:9300}{d}{zone=dc05, shard_indexing_pressure_enabled=true}]; nested: RemoteTransportException[[prod-dc06-nodes-2][10.196.7.52:9300][internal:index/shard/recovery/start_recovery]]; nested: RemoteTransportException[[prod-dc05-nodes-3][10.196.4.189:9300][internal:index/shard/replication/segments_sync]]; nested: ReplicationFailedException[Segment Replication failed]; nested: ReplicationFailedException[Rejecting stale metadata checkpoint [ReplicationCheckpoint{shardId=[.ds-prod-example-index-log-000168][0], primaryTerm=13, segmentsGen=362, version=10105, size=44528425663, codec=Lucene101, timestamp=1757601148237536190}] since initial checkpoint [ReplicationCheckpoint{shardId=[.ds-prod-example-index-log-000168][0], primaryTerm=13, segmentsGen=371, version=10110, size=44528425663, codec=Lucene101, timestamp=1757907907197675804}] is ahead of it]; ], allocation_status[no_attempt]]]"
}
]
},
{
"node_id": "x6D98Y6kT5WtqWiMVo-NHw",
"node_name": "prod-dc05-nodes-3",
"transport_address": "10.196.4.189:9300",
"node_attributes": {
"zone": "dc05",
"shard_indexing_pressure_enabled": "true"
},
"node_decision": "no",
"deciders": [
{
"decider": "max_retry",
"decision": "NO",
"explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2025-09-15T03:45:07.220Z], failed_attempts[5], failed_nodes[[x6D98Y6kT5WtqWiMVo-NHw]], delayed=false, details[failed shard on node [x6D98Y6kT5WtqWiMVo-NHw]: failed recovery, failure RecoveryFailedException[[.ds-prod-example-index-log-000168][0]: Recovery failed from {prod-dc06-nodes-2}{qIsA2LgnTfmH0uDuaOYa3A}{H2y6NeTrQyKRT1XH6Iz2bg}{prod-dc06-nodes-2}{10.196.7.52:9300}{d}{zone=dc06, shard_indexing_pressure_enabled=true} into {prod-dc05-nodes-3}{x6D98Y6kT5WtqWiMVo-NHw}{MjijdrZ7R8mBlZJhSAVSEQ}{prod-dc05-nodes-3}{10.196.4.189:9300}{d}{zone=dc05, shard_indexing_pressure_enabled=true} ([.ds-prod-example-index-log-000168][0]: Recovery failed from {prod-dc06-nodes-2}{qIsA2LgnTfmH0uDuaOYa3A}{H2y6NeTrQyKRT1XH6Iz2bg}{prod-dc06-nodes-2}{10.196.7.52:9300}{d}{zone=dc06, shard_indexing_pressure_enabled=true} into {prod-dc05-nodes-3}{x6D98Y6kT5WtqWiMVo-NHw}{MjijdrZ7R8mBlZJhSAVSEQ}{prod-dc05-nodes-3}{10.196.4.189:9300}{d}{zone=dc05, shard_indexing_pressure_enabled=true})]; nested: RecoveryFailedException[[.ds-prod-example-index-log-000168][0]: Recovery failed from {prod-dc06-nodes-2}{qIsA2LgnTfmH0uDuaOYa3A}{H2y6NeTrQyKRT1XH6Iz2bg}{prod-dc06-nodes-2}{10.196.7.52:9300}{d}{zone=dc06, shard_indexing_pressure_enabled=true} into {prod-dc05-nodes-3}{x6D98Y6kT5WtqWiMVo-NHw}{MjijdrZ7R8mBlZJhSAVSEQ}{prod-dc05-nodes-3}{10.196.4.189:9300}{d}{zone=dc05, shard_indexing_pressure_enabled=true}]; nested: RemoteTransportException[[prod-dc06-nodes-2][10.196.7.52:9300][internal:index/shard/recovery/start_recovery]]; nested: RemoteTransportException[[prod-dc05-nodes-3][10.196.4.189:9300][internal:index/shard/replication/segments_sync]]; nested: ReplicationFailedException[Segment Replication failed]; nested: ReplicationFailedException[Rejecting stale metadata checkpoint [ReplicationCheckpoint{shardId=[.ds-prod-example-index-log-000168][0], primaryTerm=13, segmentsGen=362, version=10105, size=44528425663, codec=Lucene101, timestamp=1757601148237536190}] since initial checkpoint [ReplicationCheckpoint{shardId=[.ds-prod-example-index-log-000168][0], primaryTerm=13, segmentsGen=371, version=10110, size=44528425663, codec=Lucene101, timestamp=1757907907197675804}] is ahead of it]; ], allocation_status[no_attempt]]]"
}
]
}
]
}
We are experiencing exactly the same issue as @shamil. Running version 3.2.0 with segment replication.
@shamil, did you find a fix/workaround? Do you have segment replication configured?
I did some testing; it only happens with segment replication enabled. (We had also set segrep.pressure.enabled: true, but that may be unrelated.)
Rolling back those two changes fixes the issue for us.
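If it helps anyone else confirm which existing indices still use segment replication, the per-index setting can be read back like this (assuming the standard index.replication.type setting; adjust the index pattern to taste):

# Shows SEGMENT or DOCUMENT per index; include_defaults also covers
# indices that simply inherit the cluster-wide default
GET _all/_settings/index.replication.type?include_defaults=true&flat_settings=true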
We enabled this cluster-wide for all indices, even system indices. We have disabled segment replication for indices created going forward, but indices that already exist still have segment replication enabled, and as far as I know the replication type can't be switched to document replication after an index has been created.
Our system indices have segment replication enabled as well. Is there a way to reindex these system indices so that we can disable segment replication at least for them?
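One possible (untested here) workaround is to reindex into a new index created with document replication and then switch consumers over. Index names below are placeholders, and whether this is safe for a given system index depends on the plugin that owns it:

# Create a replacement index that uses document replication
PUT my-index-docrep
{
  "settings": {
    "index.replication.type": "DOCUMENT"
  }
}

# Copy the documents across
POST _reindex
{
  "source": { "index": "my-index" },
  "dest": { "index": "my-index-docrep" }
}

# Once verified, delete the old index and point an alias with the old
# name (or whatever reads the index) at the new one.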
Just wanted to confirm the issue still exists in 3.3.1.
Same issue after upgrading 2.17 -> 2.19. With 2.17, I never encountered this error.