
[BUG] Shard fails to re-assign after a rolling restart

Open · shamil opened this issue 4 months ago • 8 comments

Describe the bug

During a rolling restart of our OpenSearch cluster, some replica shards fail to re-assign to available nodes. The logs indicate that the destination node rejects the data because a "stale metadata checkpoint" is received from the primary shard. This suggests that the primary's state is changing during the recovery process, leading to a replication failure.

The shard fails to assign after 5 retries and the cluster gives up. The log message explicitly states: "shard has exceeded the maximum number of retries [5] on failed allocation attempts". The root cause is a ReplicationFailedException due to a stale checkpoint.

shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2025-09-03T23:50:03.463Z], failed_attempts[5], failed_nodes[[Qm1RnXJQQYqSrlqcBq-X6Q]], delayed=false, details[failed shard on node [Qm1RnXJQQYqSrlqcBq-X6Q]: failed recovery, failure RecoveryFailedException[[logstash-2025.08.22][5]: Recovery failed from {prod-eu2-opensearch-logs-g4nt}{jh9ILZaaQvOZGcrO3MiFwA}{CY6nBxCVSeStO8Tv-oTAuQ}{10.202.0.19}{10.202.0.19:9300}{dimr}{shard_indexing_pressure_enabled=true} into {prod-eu2-opensearch-logs-lntz}{Qm1RnXJQQYqSrlqcBq-X6Q}{6FOVN8scQTWmeIvfSoG8pQ}{10.202.0.17}{10.202.0.17:9300}{dimr}{shard_indexing_pressure_enabled=true} ([logstash-2025.08.22][5]: Recovery failed from {prod-eu2-opensearch-logs-g4nt}{jh9ILZaaQvOZGcrO3MiFwA}{CY6nBxCVSeStO8Tv-oTAuQ}{10.202.0.19}{10.202.0.19:9300}{dimr}{shard_indexing_pressure_enabled=true} into {prod-eu2-opensearch-logs-lntz}{Qm1RnXJQQYqSrlqcBq-X6Q}{6FOVN8scQTWmeIvfSoG8pQ}{10.202.0.17}{10.202.0.17:9300}{dimr}{shard_indexing_pressure_enabled=true})]; nested: RecoveryFailedException[[logstash-2025.08.22][5]: Recovery failed from {prod-eu2-opensearch-logs-g4nt}{jh9ILZaaQvOZGcrO3MiFwA}{CY6nBxCVSeStO8Tv-oTAuQ}{10.202.0.19}{10.202.0.19:9300}{dimr}{shard_indexing_pressure_enabled=true} into {prod-eu2-opensearch-logs-lntz}{Qm1RnXJQQYqSrlqcBq-X6Q}{6FOVN8scQTWmeIvfSoG8pQ}{10.202.0.17}{10.202.0.17:9300}{dimr}{shard_indexing_pressure_enabled=true}]; nested: RemoteTransportException[[prod-eu2-opensearch-logs-g4nt][10.202.0.19:9300][internal:index/shard/recovery/start_recovery]]; nested: RemoteTransportException[[prod-eu2-opensearch-logs-lntz][10.202.0.17:9300][internal:index/shard/replication/segments_sync]]; nested: ReplicationFailedException[Segment Replication failed]; nested: ReplicationFailedException[Rejecting stale metadata checkpoint [ReplicationCheckpoint{shardId=[logstash-2025.08.22][5], primaryTerm=3, segmentsGen=171, version=13559, size=32294083233, codec=ZSTD912, timestamp=0}] since initial checkpoint [ReplicationCheckpoint{shardId=[logstash-2025.08.22][5], primaryTerm=3, segmentsGen=1066, version=14449, size=32294083233, codec=ZSTD101, timestamp=1756943403278888666}] is ahead of it]; ], allocation_status[no_attempt]]]

This appears to be a bug where the primary and replica shards get out of sync during the recovery process. The primary's checkpoint advances while it is still sending an older copy of the data, which the new replica correctly rejects.

Related component

Other

To Reproduce

  1. Disable shard allocation.
  2. Restart an OpenSearch node.
  3. Re-enable shard allocation.
  4. Observe that the cluster never becomes green: the shards remain unassigned, preventing the subsequent steps of the rolling restart (see the example API calls after this list).
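
For reference, a minimal sketch of the allocation toggling typically used for a rolling restart; the exact values below ("primaries" for the disable step, null to restore the default) are assumptions about how the restart was performed, not details from the report:

# Before stopping a node: restrict allocation so replicas are not shuffled around
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.enable": "primaries"
  }
}

# After the node rejoins: re-enable allocation and wait for the cluster to go green
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.enable": null
  }
}

GET _cluster/health?wait_for_status=green&timeout=60s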

Expected behavior

The shard should successfully re-assign to the new node, completing the recovery process, and the cluster should transition back to a green status.

Additional Details

Environment

  • OpenSearch Version: 3.2.0
  • JVM Version: OpenJDK Runtime Environment Temurin-24.0.2+12 (build 24.0.2+12)
  • OS: Ubuntu 22.04

shamil · Sep 04 '25 00:09

Here is the exception from the logs

[2025-09-04T20:17:29,251][WARN ][o.o.i.c.IndicesClusterStateService] [opensearch-logs-rgn5] [logstash-2025.07.26][0] marking and sending shard failed due to [failed recovery]
org.opensearch.indices.recovery.RecoveryFailedException: [logstash-2025.07.26][0]: Recovery failed from {opensearch-logs-g4nt}{jh9ILZaaQvOZGcrO3MiFwA}{CY6nBxCVSeStO8Tv-oTAuQ}{10.202.0.19}{10.202.0.19:9300}{dimr}{shard_indexing_pressure_enabled=true} into {opensearch-logs-rgn5}{2E9EHCzkSYaf4BhUL4IkDw}{BEmSmiNiROqAgW57jHV8Pw}{10.202.0.106}{10.202.0.106:9300}{dimr}{shard_indexing_pressure_enabled=true} ([logstash-2025.07.26][0]: Recovery failed from {opensearch-logs-g4nt}{jh9ILZaaQvOZGcrO3MiFwA}{CY6nBxCVSeStO8Tv-oTAuQ}{10.202.0.19}{10.202.0.19:9300}{dimr}{shard_indexing_pressure_enabled=true} into {opensearch-logs-rgn5}{2E9EHCzkSYaf4BhUL4IkDw}{BEmSmiNiROqAgW57jHV8Pw}{10.202.0.106}{10.202.0.106:9300}{dimr}{shard_indexing_pressure_enabled=true})
	at org.opensearch.indices.recovery.RecoveryTarget.notifyListener(RecoveryTarget.java:141) [opensearch-3.2.0.jar:3.2.0]
	at org.opensearch.indices.replication.common.ReplicationTarget.fail(ReplicationTarget.java:180) [opensearch-3.2.0.jar:3.2.0]
	at org.opensearch.indices.replication.common.ReplicationCollection.fail(ReplicationCollection.java:212) [opensearch-3.2.0.jar:3.2.0]
	at org.opensearch.indices.recovery.PeerRecoveryTargetService$RecoveryResponseHandler.onException(PeerRecoveryTargetService.java:759) [opensearch-3.2.0.jar:3.2.0]
	at org.opensearch.indices.recovery.PeerRecoveryTargetService$RecoveryResponseHandler.handleException(PeerRecoveryTargetService.java:689) [opensearch-3.2.0.jar:3.2.0]
	at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1607) [opensearch-3.2.0.jar:3.2.0]
	at org.opensearch.transport.NativeMessageHandler.lambda$handleException$0(NativeMessageHandler.java:495) [opensearch-3.2.0.jar:3.2.0]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:916) [opensearch-3.2.0.jar:3.2.0]
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1095) [?:?]
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:619) [?:?]
	at java.base/java.lang.Thread.run(Thread.java:1447) [?:?]
Caused by: org.opensearch.indices.recovery.RecoveryFailedException: [logstash-2025.07.26][0]: Recovery failed from {opensearch-logs-g4nt}{jh9ILZaaQvOZGcrO3MiFwA}{CY6nBxCVSeStO8Tv-oTAuQ}{10.202.0.19}{10.202.0.19:9300}{dimr}{shard_indexing_pressure_enabled=true} into {opensearch-logs-rgn5}{2E9EHCzkSYaf4BhUL4IkDw}{BEmSmiNiROqAgW57jHV8Pw}{10.202.0.106}{10.202.0.106:9300}{dimr}{shard_indexing_pressure_enabled=true}
	... 8 more
Caused by: org.opensearch.transport.RemoteTransportException: [opensearch-logs-g4nt][10.202.0.19:9300][internal:index/shard/recovery/start_recovery]
Caused by: org.opensearch.transport.RemoteTransportException: [opensearch-logs-rgn5][10.202.0.106:9300][internal:index/shard/replication/segments_sync]
Caused by: org.opensearch.indices.replication.common.ReplicationFailedException: Segment Replication failed
	at org.opensearch.indices.replication.SegmentReplicator$2.onFailure(SegmentReplicator.java:349) ~[opensearch-3.2.0.jar:3.2.0]
	at org.opensearch.indices.replication.AbstractSegmentReplicationTarget.lambda$startReplication$1(AbstractSegmentReplicationTarget.java:168) ~[opensearch-3.2.0.jar:3.2.0]
	at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) ~[opensearch-core-3.2.0.jar:3.2.0]
	at org.opensearch.common.util.concurrent.ListenableFuture$1.doRun(ListenableFuture.java:126) ~[opensearch-3.2.0.jar:3.2.0]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[opensearch-3.2.0.jar:3.2.0]
	at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:341) ~[opensearch-3.2.0.jar:3.2.0]
	at org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:120) ~[opensearch-3.2.0.jar:3.2.0]
	at org.opensearch.common.util.concurrent.ListenableFuture.lambda$done$0(ListenableFuture.java:112) ~[opensearch-3.2.0.jar:3.2.0]
	at java.util.ArrayList.forEach(ArrayList.java:1604) ~[?:?]
	at org.opensearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:112) ~[opensearch-3.2.0.jar:3.2.0]
	at org.opensearch.common.util.concurrent.BaseFuture.set(BaseFuture.java:160) ~[opensearch-3.2.0.jar:3.2.0]
	at org.opensearch.common.util.concurrent.ListenableFuture.onResponse(ListenableFuture.java:141) ~[opensearch-3.2.0.jar:3.2.0]
	at org.opensearch.action.StepListener.innerOnResponse(StepListener.java:79) ~[opensearch-3.2.0.jar:3.2.0]
	at org.opensearch.core.action.NotifyOnceListener.onResponse(NotifyOnceListener.java:58) ~[opensearch-core-3.2.0.jar:3.2.0]
	at org.opensearch.action.ActionListenerResponseHandler.handleResponse(ActionListenerResponseHandler.java:70) ~[opensearch-3.2.0.jar:3.2.0]
	at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1587) ~[opensearch-3.2.0.jar:3.2.0]
	at org.opensearch.transport.NativeMessageHandler.doHandleResponse(NativeMessageHandler.java:468) ~[opensearch-3.2.0.jar:3.2.0]
	at org.opensearch.transport.NativeMessageHandler.lambda$handleResponse$0(NativeMessageHandler.java:462) ~[opensearch-3.2.0.jar:3.2.0]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:916) ~[opensearch-3.2.0.jar:3.2.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1095) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:619) ~[?:?]
	at java.lang.Thread.run(Thread.java:1447) ~[?:?]
Caused by: org.opensearch.indices.replication.common.ReplicationFailedException: Rejecting stale metadata checkpoint [ReplicationCheckpoint{shardId=[logstash-2025.07.26][0], primaryTerm=6, segmentsGen=107, version=11546, size=18157580954, codec=ZSTD912, timestamp=0}] since initial checkpoint [ReplicationCheckpoint{shardId=[logstash-2025.07.26][0], primaryTerm=6, segmentsGen=111, version=11548, size=18157580954, codec=ZSTD101, timestamp=1757017048994600029}] is ahead of it
	at org.opensearch.indices.replication.AbstractSegmentReplicationTarget.lambda$startReplication$1(AbstractSegmentReplicationTarget.java:168) ~[opensearch-3.2.0.jar:3.2.0]
	at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) ~[opensearch-core-3.2.0.jar:3.2.0]
	at org.opensearch.common.util.concurrent.ListenableFuture$1.doRun(ListenableFuture.java:126) ~[opensearch-3.2.0.jar:3.2.0]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[opensearch-3.2.0.jar:3.2.0]
	at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:341) ~[opensearch-3.2.0.jar:3.2.0]
	at org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:120) ~[opensearch-3.2.0.jar:3.2.0]
	at org.opensearch.common.util.concurrent.ListenableFuture.lambda$done$0(ListenableFuture.java:112) ~[opensearch-3.2.0.jar:3.2.0]
	at java.util.ArrayList.forEach(ArrayList.java:1604) ~[?:?]
	at org.opensearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:112) ~[opensearch-3.2.0.jar:3.2.0]
	at org.opensearch.common.util.concurrent.BaseFuture.set(BaseFuture.java:160) ~[opensearch-3.2.0.jar:3.2.0]
	at org.opensearch.common.util.concurrent.ListenableFuture.onResponse(ListenableFuture.java:141) ~[opensearch-3.2.0.jar:3.2.0]
	at org.opensearch.action.StepListener.innerOnResponse(StepListener.java:79) ~[opensearch-3.2.0.jar:3.2.0]
	at org.opensearch.core.action.NotifyOnceListener.onResponse(NotifyOnceListener.java:58) ~[opensearch-core-3.2.0.jar:3.2.0]
	at org.opensearch.action.ActionListenerResponseHandler.handleResponse(ActionListenerResponseHandler.java:70) ~[opensearch-3.2.0.jar:3.2.0]
	at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1587) ~[opensearch-3.2.0.jar:3.2.0]
	at org.opensearch.transport.NativeMessageHandler.doHandleResponse(NativeMessageHandler.java:468) ~[opensearch-3.2.0.jar:3.2.0]
	at org.opensearch.transport.NativeMessageHandler.lambda$handleResponse$0(NativeMessageHandler.java:462) ~[opensearch-3.2.0.jar:3.2.0]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:916) ~[opensearch-3.2.0.jar:3.2.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1095) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:619) ~[?:?]
	at java.lang.Thread.run(Thread.java:1447) ~[?:?]

shamil · Sep 04 '25 20:09

We have this exact issue as well. OpenSearch and OpenSearch Dashboards 3.2.0, deployed via the opensearch-k8s-operator chart. I initially thought the issue was that we had segment replication enabled by default in opensearch.yml, but after disabling segment replication and removing any shard allocation awareness attributes, I am still experiencing this issue.

The only fix I've found is to run POST _cluster/reroute?retry_failed=true.
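
A minimal sketch of that workaround plus a follow-up check; the _cat columns used to confirm the replicas come back are illustrative, not part of the original comment:

POST _cluster/reroute?retry_failed=true

# Verify the previously failed replicas are assigned again
GET _cat/shards?v&h=index,shard,prirep,state,unassigned.reason
GET _cluster/health?level=indices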

From GET _cluster/allocation/explain

{
  "index": ".ds-prod-example-index-log-000168",
  "shard": 0,
  "primary": false,
  "current_state": "unassigned",
  "unassigned_info": {
    "reason": "ALLOCATION_FAILED",
    "at": "2025-09-15T03:45:07.220Z",
    "failed_allocation_attempts": 5,
    "details": "failed shard on node [x6D98Y6kT5WtqWiMVo-NHw]: failed recovery, failure RecoveryFailedException[[.ds-prod-example-index-log-000168][0]: Recovery failed from {prod-dc06-nodes-2}{qIsA2LgnTfmH0uDuaOYa3A}{H2y6NeTrQyKRT1XH6Iz2bg}{prod-dc06-nodes-2}{10.196.7.52:9300}{d}{zone=dc06, shard_indexing_pressure_enabled=true} into {prod-dc05-nodes-3}{x6D98Y6kT5WtqWiMVo-NHw}{MjijdrZ7R8mBlZJhSAVSEQ}{prod-dc05-nodes-3}{10.196.4.189:9300}{d}{zone=dc05, shard_indexing_pressure_enabled=true} ([.ds-prod-example-index-log-000168][0]: Recovery failed from {prod-dc06-nodes-2}{qIsA2LgnTfmH0uDuaOYa3A}{H2y6NeTrQyKRT1XH6Iz2bg}{prod-dc06-nodes-2}{10.196.7.52:9300}{d}{zone=dc06, shard_indexing_pressure_enabled=true} into {prod-dc05-nodes-3}{x6D98Y6kT5WtqWiMVo-NHw}{MjijdrZ7R8mBlZJhSAVSEQ}{prod-dc05-nodes-3}{10.196.4.189:9300}{d}{zone=dc05, shard_indexing_pressure_enabled=true})]; nested: RecoveryFailedException[[.ds-prod-example-index-log-000168][0]: Recovery failed from {prod-dc06-nodes-2}{qIsA2LgnTfmH0uDuaOYa3A}{H2y6NeTrQyKRT1XH6Iz2bg}{prod-dc06-nodes-2}{10.196.7.52:9300}{d}{zone=dc06, shard_indexing_pressure_enabled=true} into {prod-dc05-nodes-3}{x6D98Y6kT5WtqWiMVo-NHw}{MjijdrZ7R8mBlZJhSAVSEQ}{prod-dc05-nodes-3}{10.196.4.189:9300}{d}{zone=dc05, shard_indexing_pressure_enabled=true}]; nested: RemoteTransportException[[prod-dc06-nodes-2][10.196.7.52:9300][internal:index/shard/recovery/start_recovery]]; nested: RemoteTransportException[[prod-dc05-nodes-3][10.196.4.189:9300][internal:index/shard/replication/segments_sync]]; nested: ReplicationFailedException[Segment Replication failed]; nested: ReplicationFailedException[Rejecting stale metadata checkpoint [ReplicationCheckpoint{shardId=[.ds-prod-example-index-log-000168][0], primaryTerm=13, segmentsGen=362, version=10105, size=44528425663, codec=Lucene101, timestamp=1757601148237536190}] since initial checkpoint [ReplicationCheckpoint{shardId=[.ds-prod-example-index-log-000168][0], primaryTerm=13, segmentsGen=371, version=10110, size=44528425663, codec=Lucene101, timestamp=1757907907197675804}] is ahead of it]; ",
    "last_allocation_status": "no_attempt"
  },
  "can_allocate": "no",
  "allocate_explanation": "cannot allocate because allocation is not permitted to any of the nodes",
  "node_allocation_decisions": [
    {
      "node_id": "-4VZHgecSyKJc8GxpIHG8g",
      "node_name": "prod-dc05-nodes-1",
      "transport_address": "10.196.0.21:9300",
      "node_attributes": {
        "zone": "dc05",
        "shard_indexing_pressure_enabled": "true"
      },
      "node_decision": "no",
      "deciders": [
        {
          "decider": "max_retry",
          "decision": "NO",
          "explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2025-09-15T03:45:07.220Z], failed_attempts[5], failed_nodes[[x6D98Y6kT5WtqWiMVo-NHw]], delayed=false, details[failed shard on node [x6D98Y6kT5WtqWiMVo-NHw]: failed recovery, failure RecoveryFailedException[[.ds-prod-example-index-log-000168][0]: Recovery failed from {prod-dc06-nodes-2}{qIsA2LgnTfmH0uDuaOYa3A}{H2y6NeTrQyKRT1XH6Iz2bg}{prod-dc06-nodes-2}{10.196.7.52:9300}{d}{zone=dc06, shard_indexing_pressure_enabled=true} into {prod-dc05-nodes-3}{x6D98Y6kT5WtqWiMVo-NHw}{MjijdrZ7R8mBlZJhSAVSEQ}{prod-dc05-nodes-3}{10.196.4.189:9300}{d}{zone=dc05, shard_indexing_pressure_enabled=true} ([.ds-prod-example-index-log-000168][0]: Recovery failed from {prod-dc06-nodes-2}{qIsA2LgnTfmH0uDuaOYa3A}{H2y6NeTrQyKRT1XH6Iz2bg}{prod-dc06-nodes-2}{10.196.7.52:9300}{d}{zone=dc06, shard_indexing_pressure_enabled=true} into {prod-dc05-nodes-3}{x6D98Y6kT5WtqWiMVo-NHw}{MjijdrZ7R8mBlZJhSAVSEQ}{prod-dc05-nodes-3}{10.196.4.189:9300}{d}{zone=dc05, shard_indexing_pressure_enabled=true})]; nested: RecoveryFailedException[[.ds-prod-example-index-log-000168][0]: Recovery failed from {prod-dc06-nodes-2}{qIsA2LgnTfmH0uDuaOYa3A}{H2y6NeTrQyKRT1XH6Iz2bg}{prod-dc06-nodes-2}{10.196.7.52:9300}{d}{zone=dc06, shard_indexing_pressure_enabled=true} into {prod-dc05-nodes-3}{x6D98Y6kT5WtqWiMVo-NHw}{MjijdrZ7R8mBlZJhSAVSEQ}{prod-dc05-nodes-3}{10.196.4.189:9300}{d}{zone=dc05, shard_indexing_pressure_enabled=true}]; nested: RemoteTransportException[[prod-dc06-nodes-2][10.196.7.52:9300][internal:index/shard/recovery/start_recovery]]; nested: RemoteTransportException[[prod-dc05-nodes-3][10.196.4.189:9300][internal:index/shard/replication/segments_sync]]; nested: ReplicationFailedException[Segment Replication failed]; nested: ReplicationFailedException[Rejecting stale metadata checkpoint [ReplicationCheckpoint{shardId=[.ds-prod-example-index-log-000168][0], primaryTerm=13, segmentsGen=362, version=10105, size=44528425663, codec=Lucene101, timestamp=1757601148237536190}] since initial checkpoint [ReplicationCheckpoint{shardId=[.ds-prod-example-index-log-000168][0], primaryTerm=13, segmentsGen=371, version=10110, size=44528425663, codec=Lucene101, timestamp=1757907907197675804}] is ahead of it]; ], allocation_status[no_attempt]]]"
        }
      ]
    },
    {
      "node_id": "1IbgAt0fRP-S4MCkhGwfVA",
      "node_name": "prod-dc07-nodes-2",
      "transport_address": "10.196.5.239:9300",
      "node_attributes": {
        "zone": "dc07",
        "shard_indexing_pressure_enabled": "true"
      },
      "node_decision": "no",
      "deciders": [
        {
          "decider": "max_retry",
          "decision": "NO",
          "explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2025-09-15T03:45:07.220Z], failed_attempts[5], failed_nodes[[x6D98Y6kT5WtqWiMVo-NHw]], delayed=false, details[failed shard on node [x6D98Y6kT5WtqWiMVo-NHw]: failed recovery, failure RecoveryFailedException[[.ds-prod-example-index-log-000168][0]: Recovery failed from {prod-dc06-nodes-2}{qIsA2LgnTfmH0uDuaOYa3A}{H2y6NeTrQyKRT1XH6Iz2bg}{prod-dc06-nodes-2}{10.196.7.52:9300}{d}{zone=dc06, shard_indexing_pressure_enabled=true} into {prod-dc05-nodes-3}{x6D98Y6kT5WtqWiMVo-NHw}{MjijdrZ7R8mBlZJhSAVSEQ}{prod-dc05-nodes-3}{10.196.4.189:9300}{d}{zone=dc05, shard_indexing_pressure_enabled=true} ([.ds-prod-example-index-log-000168][0]: Recovery failed from {prod-dc06-nodes-2}{qIsA2LgnTfmH0uDuaOYa3A}{H2y6NeTrQyKRT1XH6Iz2bg}{prod-dc06-nodes-2}{10.196.7.52:9300}{d}{zone=dc06, shard_indexing_pressure_enabled=true} into {prod-dc05-nodes-3}{x6D98Y6kT5WtqWiMVo-NHw}{MjijdrZ7R8mBlZJhSAVSEQ}{prod-dc05-nodes-3}{10.196.4.189:9300}{d}{zone=dc05, shard_indexing_pressure_enabled=true})]; nested: RecoveryFailedException[[.ds-prod-example-index-log-000168][0]: Recovery failed from {prod-dc06-nodes-2}{qIsA2LgnTfmH0uDuaOYa3A}{H2y6NeTrQyKRT1XH6Iz2bg}{prod-dc06-nodes-2}{10.196.7.52:9300}{d}{zone=dc06, shard_indexing_pressure_enabled=true} into {prod-dc05-nodes-3}{x6D98Y6kT5WtqWiMVo-NHw}{MjijdrZ7R8mBlZJhSAVSEQ}{prod-dc05-nodes-3}{10.196.4.189:9300}{d}{zone=dc05, shard_indexing_pressure_enabled=true}]; nested: RemoteTransportException[[prod-dc06-nodes-2][10.196.7.52:9300][internal:index/shard/recovery/start_recovery]]; nested: RemoteTransportException[[prod-dc05-nodes-3][10.196.4.189:9300][internal:index/shard/replication/segments_sync]]; nested: ReplicationFailedException[Segment Replication failed]; nested: ReplicationFailedException[Rejecting stale metadata checkpoint [ReplicationCheckpoint{shardId=[.ds-prod-example-index-log-000168][0], primaryTerm=13, segmentsGen=362, version=10105, size=44528425663, codec=Lucene101, timestamp=1757601148237536190}] since initial checkpoint [ReplicationCheckpoint{shardId=[.ds-prod-example-index-log-000168][0], primaryTerm=13, segmentsGen=371, version=10110, size=44528425663, codec=Lucene101, timestamp=1757907907197675804}] is ahead of it]; ], allocation_status[no_attempt]]]"
        }
      ]
    },
    {
      "node_id": "4yoJ3wSsQOeBnT7NiJFLaQ",
      "node_name": "prod-dc06-nodes-0",
      "transport_address": "10.196.1.236:9300",
      "node_attributes": {
        "zone": "dc06",
        "shard_indexing_pressure_enabled": "true"
      },
      "node_decision": "no",
      "deciders": [
        {
          "decider": "max_retry",
          "decision": "NO",
          "explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2025-09-15T03:45:07.220Z], failed_attempts[5], failed_nodes[[x6D98Y6kT5WtqWiMVo-NHw]], delayed=false, details[failed shard on node [x6D98Y6kT5WtqWiMVo-NHw]: failed recovery, failure RecoveryFailedException[[.ds-prod-example-index-log-000168][0]: Recovery failed from {prod-dc06-nodes-2}{qIsA2LgnTfmH0uDuaOYa3A}{H2y6NeTrQyKRT1XH6Iz2bg}{prod-dc06-nodes-2}{10.196.7.52:9300}{d}{zone=dc06, shard_indexing_pressure_enabled=true} into {prod-dc05-nodes-3}{x6D98Y6kT5WtqWiMVo-NHw}{MjijdrZ7R8mBlZJhSAVSEQ}{prod-dc05-nodes-3}{10.196.4.189:9300}{d}{zone=dc05, shard_indexing_pressure_enabled=true} ([.ds-prod-example-index-log-000168][0]: Recovery failed from {prod-dc06-nodes-2}{qIsA2LgnTfmH0uDuaOYa3A}{H2y6NeTrQyKRT1XH6Iz2bg}{prod-dc06-nodes-2}{10.196.7.52:9300}{d}{zone=dc06, shard_indexing_pressure_enabled=true} into {prod-dc05-nodes-3}{x6D98Y6kT5WtqWiMVo-NHw}{MjijdrZ7R8mBlZJhSAVSEQ}{prod-dc05-nodes-3}{10.196.4.189:9300}{d}{zone=dc05, shard_indexing_pressure_enabled=true})]; nested: RecoveryFailedException[[.ds-prod-example-index-log-000168][0]: Recovery failed from {prod-dc06-nodes-2}{qIsA2LgnTfmH0uDuaOYa3A}{H2y6NeTrQyKRT1XH6Iz2bg}{prod-dc06-nodes-2}{10.196.7.52:9300}{d}{zone=dc06, shard_indexing_pressure_enabled=true} into {prod-dc05-nodes-3}{x6D98Y6kT5WtqWiMVo-NHw}{MjijdrZ7R8mBlZJhSAVSEQ}{prod-dc05-nodes-3}{10.196.4.189:9300}{d}{zone=dc05, shard_indexing_pressure_enabled=true}]; nested: RemoteTransportException[[prod-dc06-nodes-2][10.196.7.52:9300][internal:index/shard/recovery/start_recovery]]; nested: RemoteTransportException[[prod-dc05-nodes-3][10.196.4.189:9300][internal:index/shard/replication/segments_sync]]; nested: ReplicationFailedException[Segment Replication failed]; nested: ReplicationFailedException[Rejecting stale metadata checkpoint [ReplicationCheckpoint{shardId=[.ds-prod-example-index-log-000168][0], primaryTerm=13, segmentsGen=362, version=10105, size=44528425663, codec=Lucene101, timestamp=1757601148237536190}] since initial checkpoint [ReplicationCheckpoint{shardId=[.ds-prod-example-index-log-000168][0], primaryTerm=13, segmentsGen=371, version=10110, size=44528425663, codec=Lucene101, timestamp=1757907907197675804}] is ahead of it]; ], allocation_status[no_attempt]]]"
        }
      ]
    },
    {
      "node_id": "Xwayap-FSlWXMJGBnAhE6g",
      "node_name": "prod-dc06-nodes-3",
      "transport_address": "10.196.7.142:9300",
      "node_attributes": {
        "zone": "dc06",
        "shard_indexing_pressure_enabled": "true"
      },
      "node_decision": "no",
      "deciders": [
        {
          "decider": "max_retry",
          "decision": "NO",
          "explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2025-09-15T03:45:07.220Z], failed_attempts[5], failed_nodes[[x6D98Y6kT5WtqWiMVo-NHw]], delayed=false, details[failed shard on node [x6D98Y6kT5WtqWiMVo-NHw]: failed recovery, failure RecoveryFailedException[[.ds-prod-example-index-log-000168][0]: Recovery failed from {prod-dc06-nodes-2}{qIsA2LgnTfmH0uDuaOYa3A}{H2y6NeTrQyKRT1XH6Iz2bg}{prod-dc06-nodes-2}{10.196.7.52:9300}{d}{zone=dc06, shard_indexing_pressure_enabled=true} into {prod-dc05-nodes-3}{x6D98Y6kT5WtqWiMVo-NHw}{MjijdrZ7R8mBlZJhSAVSEQ}{prod-dc05-nodes-3}{10.196.4.189:9300}{d}{zone=dc05, shard_indexing_pressure_enabled=true} ([.ds-prod-example-index-log-000168][0]: Recovery failed from {prod-dc06-nodes-2}{qIsA2LgnTfmH0uDuaOYa3A}{H2y6NeTrQyKRT1XH6Iz2bg}{prod-dc06-nodes-2}{10.196.7.52:9300}{d}{zone=dc06, shard_indexing_pressure_enabled=true} into {prod-dc05-nodes-3}{x6D98Y6kT5WtqWiMVo-NHw}{MjijdrZ7R8mBlZJhSAVSEQ}{prod-dc05-nodes-3}{10.196.4.189:9300}{d}{zone=dc05, shard_indexing_pressure_enabled=true})]; nested: RecoveryFailedException[[.ds-prod-example-index-log-000168][0]: Recovery failed from {prod-dc06-nodes-2}{qIsA2LgnTfmH0uDuaOYa3A}{H2y6NeTrQyKRT1XH6Iz2bg}{prod-dc06-nodes-2}{10.196.7.52:9300}{d}{zone=dc06, shard_indexing_pressure_enabled=true} into {prod-dc05-nodes-3}{x6D98Y6kT5WtqWiMVo-NHw}{MjijdrZ7R8mBlZJhSAVSEQ}{prod-dc05-nodes-3}{10.196.4.189:9300}{d}{zone=dc05, shard_indexing_pressure_enabled=true}]; nested: RemoteTransportException[[prod-dc06-nodes-2][10.196.7.52:9300][internal:index/shard/recovery/start_recovery]]; nested: RemoteTransportException[[prod-dc05-nodes-3][10.196.4.189:9300][internal:index/shard/replication/segments_sync]]; nested: ReplicationFailedException[Segment Replication failed]; nested: ReplicationFailedException[Rejecting stale metadata checkpoint [ReplicationCheckpoint{shardId=[.ds-prod-example-index-log-000168][0], primaryTerm=13, segmentsGen=362, version=10105, size=44528425663, codec=Lucene101, timestamp=1757601148237536190}] since initial checkpoint [ReplicationCheckpoint{shardId=[.ds-prod-example-index-log-000168][0], primaryTerm=13, segmentsGen=371, version=10110, size=44528425663, codec=Lucene101, timestamp=1757907907197675804}] is ahead of it]; ], allocation_status[no_attempt]]]"
        }
      ]
    },
    {
      "node_id": "aJWWhXYwS9mVnYyga4-dJg",
      "node_name": "prod-dc05-nodes-0",
      "transport_address": "10.196.4.230:9300",
      "node_attributes": {
        "zone": "dc05",
        "shard_indexing_pressure_enabled": "true"
      },
      "node_decision": "no",
      "deciders": [
        {
          "decider": "max_retry",
          "decision": "NO",
          "explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2025-09-15T03:45:07.220Z], failed_attempts[5], failed_nodes[[x6D98Y6kT5WtqWiMVo-NHw]], delayed=false, details[failed shard on node [x6D98Y6kT5WtqWiMVo-NHw]: failed recovery, failure RecoveryFailedException[[.ds-prod-example-index-log-000168][0]: Recovery failed from {prod-dc06-nodes-2}{qIsA2LgnTfmH0uDuaOYa3A}{H2y6NeTrQyKRT1XH6Iz2bg}{prod-dc06-nodes-2}{10.196.7.52:9300}{d}{zone=dc06, shard_indexing_pressure_enabled=true} into {prod-dc05-nodes-3}{x6D98Y6kT5WtqWiMVo-NHw}{MjijdrZ7R8mBlZJhSAVSEQ}{prod-dc05-nodes-3}{10.196.4.189:9300}{d}{zone=dc05, shard_indexing_pressure_enabled=true} ([.ds-prod-example-index-log-000168][0]: Recovery failed from {prod-dc06-nodes-2}{qIsA2LgnTfmH0uDuaOYa3A}{H2y6NeTrQyKRT1XH6Iz2bg}{prod-dc06-nodes-2}{10.196.7.52:9300}{d}{zone=dc06, shard_indexing_pressure_enabled=true} into {prod-dc05-nodes-3}{x6D98Y6kT5WtqWiMVo-NHw}{MjijdrZ7R8mBlZJhSAVSEQ}{prod-dc05-nodes-3}{10.196.4.189:9300}{d}{zone=dc05, shard_indexing_pressure_enabled=true})]; nested: RecoveryFailedException[[.ds-prod-example-index-log-000168][0]: Recovery failed from {prod-dc06-nodes-2}{qIsA2LgnTfmH0uDuaOYa3A}{H2y6NeTrQyKRT1XH6Iz2bg}{prod-dc06-nodes-2}{10.196.7.52:9300}{d}{zone=dc06, shard_indexing_pressure_enabled=true} into {prod-dc05-nodes-3}{x6D98Y6kT5WtqWiMVo-NHw}{MjijdrZ7R8mBlZJhSAVSEQ}{prod-dc05-nodes-3}{10.196.4.189:9300}{d}{zone=dc05, shard_indexing_pressure_enabled=true}]; nested: RemoteTransportException[[prod-dc06-nodes-2][10.196.7.52:9300][internal:index/shard/recovery/start_recovery]]; nested: RemoteTransportException[[prod-dc05-nodes-3][10.196.4.189:9300][internal:index/shard/replication/segments_sync]]; nested: ReplicationFailedException[Segment Replication failed]; nested: ReplicationFailedException[Rejecting stale metadata checkpoint [ReplicationCheckpoint{shardId=[.ds-prod-example-index-log-000168][0], primaryTerm=13, segmentsGen=362, version=10105, size=44528425663, codec=Lucene101, timestamp=1757601148237536190}] since initial checkpoint [ReplicationCheckpoint{shardId=[.ds-prod-example-index-log-000168][0], primaryTerm=13, segmentsGen=371, version=10110, size=44528425663, codec=Lucene101, timestamp=1757907907197675804}] is ahead of it]; ], allocation_status[no_attempt]]]"
        }
      ]
    },
    {
      "node_id": "jV3xc6GxQfWtFbZsPdsEcA",
      "node_name": "prod-dc07-nodes-3",
      "transport_address": "10.196.5.119:9300",
      "node_attributes": {
        "zone": "dc07",
        "shard_indexing_pressure_enabled": "true"
      },
      "node_decision": "no",
      "deciders": [
        {
          "decider": "max_retry",
          "decision": "NO",
          "explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2025-09-15T03:45:07.220Z], failed_attempts[5], failed_nodes[[x6D98Y6kT5WtqWiMVo-NHw]], delayed=false, details[failed shard on node [x6D98Y6kT5WtqWiMVo-NHw]: failed recovery, failure RecoveryFailedException[[.ds-prod-example-index-log-000168][0]: Recovery failed from {prod-dc06-nodes-2}{qIsA2LgnTfmH0uDuaOYa3A}{H2y6NeTrQyKRT1XH6Iz2bg}{prod-dc06-nodes-2}{10.196.7.52:9300}{d}{zone=dc06, shard_indexing_pressure_enabled=true} into {prod-dc05-nodes-3}{x6D98Y6kT5WtqWiMVo-NHw}{MjijdrZ7R8mBlZJhSAVSEQ}{prod-dc05-nodes-3}{10.196.4.189:9300}{d}{zone=dc05, shard_indexing_pressure_enabled=true} ([.ds-prod-example-index-log-000168][0]: Recovery failed from {prod-dc06-nodes-2}{qIsA2LgnTfmH0uDuaOYa3A}{H2y6NeTrQyKRT1XH6Iz2bg}{prod-dc06-nodes-2}{10.196.7.52:9300}{d}{zone=dc06, shard_indexing_pressure_enabled=true} into {prod-dc05-nodes-3}{x6D98Y6kT5WtqWiMVo-NHw}{MjijdrZ7R8mBlZJhSAVSEQ}{prod-dc05-nodes-3}{10.196.4.189:9300}{d}{zone=dc05, shard_indexing_pressure_enabled=true})]; nested: RecoveryFailedException[[.ds-prod-example-index-log-000168][0]: Recovery failed from {prod-dc06-nodes-2}{qIsA2LgnTfmH0uDuaOYa3A}{H2y6NeTrQyKRT1XH6Iz2bg}{prod-dc06-nodes-2}{10.196.7.52:9300}{d}{zone=dc06, shard_indexing_pressure_enabled=true} into {prod-dc05-nodes-3}{x6D98Y6kT5WtqWiMVo-NHw}{MjijdrZ7R8mBlZJhSAVSEQ}{prod-dc05-nodes-3}{10.196.4.189:9300}{d}{zone=dc05, shard_indexing_pressure_enabled=true}]; nested: RemoteTransportException[[prod-dc06-nodes-2][10.196.7.52:9300][internal:index/shard/recovery/start_recovery]]; nested: RemoteTransportException[[prod-dc05-nodes-3][10.196.4.189:9300][internal:index/shard/replication/segments_sync]]; nested: ReplicationFailedException[Segment Replication failed]; nested: ReplicationFailedException[Rejecting stale metadata checkpoint [ReplicationCheckpoint{shardId=[.ds-prod-example-index-log-000168][0], primaryTerm=13, segmentsGen=362, version=10105, size=44528425663, codec=Lucene101, timestamp=1757601148237536190}] since initial checkpoint [ReplicationCheckpoint{shardId=[.ds-prod-example-index-log-000168][0], primaryTerm=13, segmentsGen=371, version=10110, size=44528425663, codec=Lucene101, timestamp=1757907907197675804}] is ahead of it]; ], allocation_status[no_attempt]]]"
        }
      ]
    },
    {
      "node_id": "jWwx8zSRQIuK6dyhRSI-2w",
      "node_name": "prod-dc07-nodes-0",
      "transport_address": "10.196.3.244:9300",
      "node_attributes": {
        "zone": "dc07",
        "shard_indexing_pressure_enabled": "true"
      },
      "node_decision": "no",
      "deciders": [
        {
          "decider": "max_retry",
          "decision": "NO",
          "explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2025-09-15T03:45:07.220Z], failed_attempts[5], failed_nodes[[x6D98Y6kT5WtqWiMVo-NHw]], delayed=false, details[failed shard on node [x6D98Y6kT5WtqWiMVo-NHw]: failed recovery, failure RecoveryFailedException[[.ds-prod-example-index-log-000168][0]: Recovery failed from {prod-dc06-nodes-2}{qIsA2LgnTfmH0uDuaOYa3A}{H2y6NeTrQyKRT1XH6Iz2bg}{prod-dc06-nodes-2}{10.196.7.52:9300}{d}{zone=dc06, shard_indexing_pressure_enabled=true} into {prod-dc05-nodes-3}{x6D98Y6kT5WtqWiMVo-NHw}{MjijdrZ7R8mBlZJhSAVSEQ}{prod-dc05-nodes-3}{10.196.4.189:9300}{d}{zone=dc05, shard_indexing_pressure_enabled=true} ([.ds-prod-example-index-log-000168][0]: Recovery failed from {prod-dc06-nodes-2}{qIsA2LgnTfmH0uDuaOYa3A}{H2y6NeTrQyKRT1XH6Iz2bg}{prod-dc06-nodes-2}{10.196.7.52:9300}{d}{zone=dc06, shard_indexing_pressure_enabled=true} into {prod-dc05-nodes-3}{x6D98Y6kT5WtqWiMVo-NHw}{MjijdrZ7R8mBlZJhSAVSEQ}{prod-dc05-nodes-3}{10.196.4.189:9300}{d}{zone=dc05, shard_indexing_pressure_enabled=true})]; nested: RecoveryFailedException[[.ds-prod-example-index-log-000168][0]: Recovery failed from {prod-dc06-nodes-2}{qIsA2LgnTfmH0uDuaOYa3A}{H2y6NeTrQyKRT1XH6Iz2bg}{prod-dc06-nodes-2}{10.196.7.52:9300}{d}{zone=dc06, shard_indexing_pressure_enabled=true} into {prod-dc05-nodes-3}{x6D98Y6kT5WtqWiMVo-NHw}{MjijdrZ7R8mBlZJhSAVSEQ}{prod-dc05-nodes-3}{10.196.4.189:9300}{d}{zone=dc05, shard_indexing_pressure_enabled=true}]; nested: RemoteTransportException[[prod-dc06-nodes-2][10.196.7.52:9300][internal:index/shard/recovery/start_recovery]]; nested: RemoteTransportException[[prod-dc05-nodes-3][10.196.4.189:9300][internal:index/shard/replication/segments_sync]]; nested: ReplicationFailedException[Segment Replication failed]; nested: ReplicationFailedException[Rejecting stale metadata checkpoint [ReplicationCheckpoint{shardId=[.ds-prod-example-index-log-000168][0], primaryTerm=13, segmentsGen=362, version=10105, size=44528425663, codec=Lucene101, timestamp=1757601148237536190}] since initial checkpoint [ReplicationCheckpoint{shardId=[.ds-prod-example-index-log-000168][0], primaryTerm=13, segmentsGen=371, version=10110, size=44528425663, codec=Lucene101, timestamp=1757907907197675804}] is ahead of it]; ], allocation_status[no_attempt]]]"
        }
      ]
    },
    {
      "node_id": "jbE5-elNQsGn8PzTJS-eEg",
      "node_name": "prod-dc06-nodes-1",
      "transport_address": "10.196.8.86:9300",
      "node_attributes": {
        "zone": "dc06",
        "shard_indexing_pressure_enabled": "true"
      },
      "node_decision": "no",
      "deciders": [
        {
          "decider": "max_retry",
          "decision": "NO",
          "explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2025-09-15T03:45:07.220Z], failed_attempts[5], failed_nodes[[x6D98Y6kT5WtqWiMVo-NHw]], delayed=false, details[failed shard on node [x6D98Y6kT5WtqWiMVo-NHw]: failed recovery, failure RecoveryFailedException[[.ds-prod-example-index-log-000168][0]: Recovery failed from {prod-dc06-nodes-2}{qIsA2LgnTfmH0uDuaOYa3A}{H2y6NeTrQyKRT1XH6Iz2bg}{prod-dc06-nodes-2}{10.196.7.52:9300}{d}{zone=dc06, shard_indexing_pressure_enabled=true} into {prod-dc05-nodes-3}{x6D98Y6kT5WtqWiMVo-NHw}{MjijdrZ7R8mBlZJhSAVSEQ}{prod-dc05-nodes-3}{10.196.4.189:9300}{d}{zone=dc05, shard_indexing_pressure_enabled=true} ([.ds-prod-example-index-log-000168][0]: Recovery failed from {prod-dc06-nodes-2}{qIsA2LgnTfmH0uDuaOYa3A}{H2y6NeTrQyKRT1XH6Iz2bg}{prod-dc06-nodes-2}{10.196.7.52:9300}{d}{zone=dc06, shard_indexing_pressure_enabled=true} into {prod-dc05-nodes-3}{x6D98Y6kT5WtqWiMVo-NHw}{MjijdrZ7R8mBlZJhSAVSEQ}{prod-dc05-nodes-3}{10.196.4.189:9300}{d}{zone=dc05, shard_indexing_pressure_enabled=true})]; nested: RecoveryFailedException[[.ds-prod-example-index-log-000168][0]: Recovery failed from {prod-dc06-nodes-2}{qIsA2LgnTfmH0uDuaOYa3A}{H2y6NeTrQyKRT1XH6Iz2bg}{prod-dc06-nodes-2}{10.196.7.52:9300}{d}{zone=dc06, shard_indexing_pressure_enabled=true} into {prod-dc05-nodes-3}{x6D98Y6kT5WtqWiMVo-NHw}{MjijdrZ7R8mBlZJhSAVSEQ}{prod-dc05-nodes-3}{10.196.4.189:9300}{d}{zone=dc05, shard_indexing_pressure_enabled=true}]; nested: RemoteTransportException[[prod-dc06-nodes-2][10.196.7.52:9300][internal:index/shard/recovery/start_recovery]]; nested: RemoteTransportException[[prod-dc05-nodes-3][10.196.4.189:9300][internal:index/shard/replication/segments_sync]]; nested: ReplicationFailedException[Segment Replication failed]; nested: ReplicationFailedException[Rejecting stale metadata checkpoint [ReplicationCheckpoint{shardId=[.ds-prod-example-index-log-000168][0], primaryTerm=13, segmentsGen=362, version=10105, size=44528425663, codec=Lucene101, timestamp=1757601148237536190}] since initial checkpoint [ReplicationCheckpoint{shardId=[.ds-prod-example-index-log-000168][0], primaryTerm=13, segmentsGen=371, version=10110, size=44528425663, codec=Lucene101, timestamp=1757907907197675804}] is ahead of it]; ], allocation_status[no_attempt]]]"
        }
      ]
    },
    {
      "node_id": "pyoSCpbwRcm-O4ks7hpZKw",
      "node_name": "prod-dc07-nodes-1",
      "transport_address": "10.196.2.90:9300",
      "node_attributes": {
        "zone": "dc07",
        "shard_indexing_pressure_enabled": "true"
      },
      "node_decision": "no",
      "deciders": [
        {
          "decider": "max_retry",
          "decision": "NO",
          "explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2025-09-15T03:45:07.220Z], failed_attempts[5], failed_nodes[[x6D98Y6kT5WtqWiMVo-NHw]], delayed=false, details[failed shard on node [x6D98Y6kT5WtqWiMVo-NHw]: failed recovery, failure RecoveryFailedException[[.ds-prod-example-index-log-000168][0]: Recovery failed from {prod-dc06-nodes-2}{qIsA2LgnTfmH0uDuaOYa3A}{H2y6NeTrQyKRT1XH6Iz2bg}{prod-dc06-nodes-2}{10.196.7.52:9300}{d}{zone=dc06, shard_indexing_pressure_enabled=true} into {prod-dc05-nodes-3}{x6D98Y6kT5WtqWiMVo-NHw}{MjijdrZ7R8mBlZJhSAVSEQ}{prod-dc05-nodes-3}{10.196.4.189:9300}{d}{zone=dc05, shard_indexing_pressure_enabled=true} ([.ds-prod-example-index-log-000168][0]: Recovery failed from {prod-dc06-nodes-2}{qIsA2LgnTfmH0uDuaOYa3A}{H2y6NeTrQyKRT1XH6Iz2bg}{prod-dc06-nodes-2}{10.196.7.52:9300}{d}{zone=dc06, shard_indexing_pressure_enabled=true} into {prod-dc05-nodes-3}{x6D98Y6kT5WtqWiMVo-NHw}{MjijdrZ7R8mBlZJhSAVSEQ}{prod-dc05-nodes-3}{10.196.4.189:9300}{d}{zone=dc05, shard_indexing_pressure_enabled=true})]; nested: RecoveryFailedException[[.ds-prod-example-index-log-000168][0]: Recovery failed from {prod-dc06-nodes-2}{qIsA2LgnTfmH0uDuaOYa3A}{H2y6NeTrQyKRT1XH6Iz2bg}{prod-dc06-nodes-2}{10.196.7.52:9300}{d}{zone=dc06, shard_indexing_pressure_enabled=true} into {prod-dc05-nodes-3}{x6D98Y6kT5WtqWiMVo-NHw}{MjijdrZ7R8mBlZJhSAVSEQ}{prod-dc05-nodes-3}{10.196.4.189:9300}{d}{zone=dc05, shard_indexing_pressure_enabled=true}]; nested: RemoteTransportException[[prod-dc06-nodes-2][10.196.7.52:9300][internal:index/shard/recovery/start_recovery]]; nested: RemoteTransportException[[prod-dc05-nodes-3][10.196.4.189:9300][internal:index/shard/replication/segments_sync]]; nested: ReplicationFailedException[Segment Replication failed]; nested: ReplicationFailedException[Rejecting stale metadata checkpoint [ReplicationCheckpoint{shardId=[.ds-prod-example-index-log-000168][0], primaryTerm=13, segmentsGen=362, version=10105, size=44528425663, codec=Lucene101, timestamp=1757601148237536190}] since initial checkpoint [ReplicationCheckpoint{shardId=[.ds-prod-example-index-log-000168][0], primaryTerm=13, segmentsGen=371, version=10110, size=44528425663, codec=Lucene101, timestamp=1757907907197675804}] is ahead of it]; ], allocation_status[no_attempt]]]"
        }
      ]
    },
    {
      "node_id": "qIsA2LgnTfmH0uDuaOYa3A",
      "node_name": "prod-dc06-nodes-2",
      "transport_address": "10.196.7.52:9300",
      "node_attributes": {
        "zone": "dc06",
        "shard_indexing_pressure_enabled": "true"
      },
      "node_decision": "no",
      "deciders": [
        {
          "decider": "max_retry",
          "decision": "NO",
          "explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2025-09-15T03:45:07.220Z], failed_attempts[5], failed_nodes[[x6D98Y6kT5WtqWiMVo-NHw]], delayed=false, details[failed shard on node [x6D98Y6kT5WtqWiMVo-NHw]: failed recovery, failure RecoveryFailedException[[.ds-prod-example-index-log-000168][0]: Recovery failed from {prod-dc06-nodes-2}{qIsA2LgnTfmH0uDuaOYa3A}{H2y6NeTrQyKRT1XH6Iz2bg}{prod-dc06-nodes-2}{10.196.7.52:9300}{d}{zone=dc06, shard_indexing_pressure_enabled=true} into {prod-dc05-nodes-3}{x6D98Y6kT5WtqWiMVo-NHw}{MjijdrZ7R8mBlZJhSAVSEQ}{prod-dc05-nodes-3}{10.196.4.189:9300}{d}{zone=dc05, shard_indexing_pressure_enabled=true} ([.ds-prod-example-index-log-000168][0]: Recovery failed from {prod-dc06-nodes-2}{qIsA2LgnTfmH0uDuaOYa3A}{H2y6NeTrQyKRT1XH6Iz2bg}{prod-dc06-nodes-2}{10.196.7.52:9300}{d}{zone=dc06, shard_indexing_pressure_enabled=true} into {prod-dc05-nodes-3}{x6D98Y6kT5WtqWiMVo-NHw}{MjijdrZ7R8mBlZJhSAVSEQ}{prod-dc05-nodes-3}{10.196.4.189:9300}{d}{zone=dc05, shard_indexing_pressure_enabled=true})]; nested: RecoveryFailedException[[.ds-prod-example-index-log-000168][0]: Recovery failed from {prod-dc06-nodes-2}{qIsA2LgnTfmH0uDuaOYa3A}{H2y6NeTrQyKRT1XH6Iz2bg}{prod-dc06-nodes-2}{10.196.7.52:9300}{d}{zone=dc06, shard_indexing_pressure_enabled=true} into {prod-dc05-nodes-3}{x6D98Y6kT5WtqWiMVo-NHw}{MjijdrZ7R8mBlZJhSAVSEQ}{prod-dc05-nodes-3}{10.196.4.189:9300}{d}{zone=dc05, shard_indexing_pressure_enabled=true}]; nested: RemoteTransportException[[prod-dc06-nodes-2][10.196.7.52:9300][internal:index/shard/recovery/start_recovery]]; nested: RemoteTransportException[[prod-dc05-nodes-3][10.196.4.189:9300][internal:index/shard/replication/segments_sync]]; nested: ReplicationFailedException[Segment Replication failed]; nested: ReplicationFailedException[Rejecting stale metadata checkpoint [ReplicationCheckpoint{shardId=[.ds-prod-example-index-log-000168][0], primaryTerm=13, segmentsGen=362, version=10105, size=44528425663, codec=Lucene101, timestamp=1757601148237536190}] since initial checkpoint [ReplicationCheckpoint{shardId=[.ds-prod-example-index-log-000168][0], primaryTerm=13, segmentsGen=371, version=10110, size=44528425663, codec=Lucene101, timestamp=1757907907197675804}] is ahead of it]; ], allocation_status[no_attempt]]]"
        },
        {
          "decider": "same_shard",
          "decision": "NO",
          "explanation": "a copy of this shard is already allocated to this node [[.ds-prod-example-index-log-000168][0], node[qIsA2LgnTfmH0uDuaOYa3A], [P], s[STARTED], a[id=pxd7Wo5QRAmaE6bVg6MpmA]]"
        }
      ]
    },
    {
      "node_id": "vapazD8aTY-_6wUEpFivuQ",
      "node_name": "prod-dc05-nodes-2",
      "transport_address": "10.196.6.151:9300",
      "node_attributes": {
        "zone": "dc05",
        "shard_indexing_pressure_enabled": "true"
      },
      "node_decision": "no",
      "deciders": [
        {
          "decider": "max_retry",
          "decision": "NO",
          "explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2025-09-15T03:45:07.220Z], failed_attempts[5], failed_nodes[[x6D98Y6kT5WtqWiMVo-NHw]], delayed=false, details[failed shard on node [x6D98Y6kT5WtqWiMVo-NHw]: failed recovery, failure RecoveryFailedException[[.ds-prod-example-index-log-000168][0]: Recovery failed from {prod-dc06-nodes-2}{qIsA2LgnTfmH0uDuaOYa3A}{H2y6NeTrQyKRT1XH6Iz2bg}{prod-dc06-nodes-2}{10.196.7.52:9300}{d}{zone=dc06, shard_indexing_pressure_enabled=true} into {prod-dc05-nodes-3}{x6D98Y6kT5WtqWiMVo-NHw}{MjijdrZ7R8mBlZJhSAVSEQ}{prod-dc05-nodes-3}{10.196.4.189:9300}{d}{zone=dc05, shard_indexing_pressure_enabled=true} ([.ds-prod-example-index-log-000168][0]: Recovery failed from {prod-dc06-nodes-2}{qIsA2LgnTfmH0uDuaOYa3A}{H2y6NeTrQyKRT1XH6Iz2bg}{prod-dc06-nodes-2}{10.196.7.52:9300}{d}{zone=dc06, shard_indexing_pressure_enabled=true} into {prod-dc05-nodes-3}{x6D98Y6kT5WtqWiMVo-NHw}{MjijdrZ7R8mBlZJhSAVSEQ}{prod-dc05-nodes-3}{10.196.4.189:9300}{d}{zone=dc05, shard_indexing_pressure_enabled=true})]; nested: RecoveryFailedException[[.ds-prod-example-index-log-000168][0]: Recovery failed from {prod-dc06-nodes-2}{qIsA2LgnTfmH0uDuaOYa3A}{H2y6NeTrQyKRT1XH6Iz2bg}{prod-dc06-nodes-2}{10.196.7.52:9300}{d}{zone=dc06, shard_indexing_pressure_enabled=true} into {prod-dc05-nodes-3}{x6D98Y6kT5WtqWiMVo-NHw}{MjijdrZ7R8mBlZJhSAVSEQ}{prod-dc05-nodes-3}{10.196.4.189:9300}{d}{zone=dc05, shard_indexing_pressure_enabled=true}]; nested: RemoteTransportException[[prod-dc06-nodes-2][10.196.7.52:9300][internal:index/shard/recovery/start_recovery]]; nested: RemoteTransportException[[prod-dc05-nodes-3][10.196.4.189:9300][internal:index/shard/replication/segments_sync]]; nested: ReplicationFailedException[Segment Replication failed]; nested: ReplicationFailedException[Rejecting stale metadata checkpoint [ReplicationCheckpoint{shardId=[.ds-prod-example-index-log-000168][0], primaryTerm=13, segmentsGen=362, version=10105, size=44528425663, codec=Lucene101, timestamp=1757601148237536190}] since initial checkpoint [ReplicationCheckpoint{shardId=[.ds-prod-example-index-log-000168][0], primaryTerm=13, segmentsGen=371, version=10110, size=44528425663, codec=Lucene101, timestamp=1757907907197675804}] is ahead of it]; ], allocation_status[no_attempt]]]"
        }
      ]
    },
    {
      "node_id": "x6D98Y6kT5WtqWiMVo-NHw",
      "node_name": "prod-dc05-nodes-3",
      "transport_address": "10.196.4.189:9300",
      "node_attributes": {
        "zone": "dc05",
        "shard_indexing_pressure_enabled": "true"
      },
      "node_decision": "no",
      "deciders": [
        {
          "decider": "max_retry",
          "decision": "NO",
          "explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2025-09-15T03:45:07.220Z], failed_attempts[5], failed_nodes[[x6D98Y6kT5WtqWiMVo-NHw]], delayed=false, details[failed shard on node [x6D98Y6kT5WtqWiMVo-NHw]: failed recovery, failure RecoveryFailedException[[.ds-prod-example-index-log-000168][0]: Recovery failed from {prod-dc06-nodes-2}{qIsA2LgnTfmH0uDuaOYa3A}{H2y6NeTrQyKRT1XH6Iz2bg}{prod-dc06-nodes-2}{10.196.7.52:9300}{d}{zone=dc06, shard_indexing_pressure_enabled=true} into {prod-dc05-nodes-3}{x6D98Y6kT5WtqWiMVo-NHw}{MjijdrZ7R8mBlZJhSAVSEQ}{prod-dc05-nodes-3}{10.196.4.189:9300}{d}{zone=dc05, shard_indexing_pressure_enabled=true} ([.ds-prod-example-index-log-000168][0]: Recovery failed from {prod-dc06-nodes-2}{qIsA2LgnTfmH0uDuaOYa3A}{H2y6NeTrQyKRT1XH6Iz2bg}{prod-dc06-nodes-2}{10.196.7.52:9300}{d}{zone=dc06, shard_indexing_pressure_enabled=true} into {prod-dc05-nodes-3}{x6D98Y6kT5WtqWiMVo-NHw}{MjijdrZ7R8mBlZJhSAVSEQ}{prod-dc05-nodes-3}{10.196.4.189:9300}{d}{zone=dc05, shard_indexing_pressure_enabled=true})]; nested: RecoveryFailedException[[.ds-prod-example-index-log-000168][0]: Recovery failed from {prod-dc06-nodes-2}{qIsA2LgnTfmH0uDuaOYa3A}{H2y6NeTrQyKRT1XH6Iz2bg}{prod-dc06-nodes-2}{10.196.7.52:9300}{d}{zone=dc06, shard_indexing_pressure_enabled=true} into {prod-dc05-nodes-3}{x6D98Y6kT5WtqWiMVo-NHw}{MjijdrZ7R8mBlZJhSAVSEQ}{prod-dc05-nodes-3}{10.196.4.189:9300}{d}{zone=dc05, shard_indexing_pressure_enabled=true}]; nested: RemoteTransportException[[prod-dc06-nodes-2][10.196.7.52:9300][internal:index/shard/recovery/start_recovery]]; nested: RemoteTransportException[[prod-dc05-nodes-3][10.196.4.189:9300][internal:index/shard/replication/segments_sync]]; nested: ReplicationFailedException[Segment Replication failed]; nested: ReplicationFailedException[Rejecting stale metadata checkpoint [ReplicationCheckpoint{shardId=[.ds-prod-example-index-log-000168][0], primaryTerm=13, segmentsGen=362, version=10105, size=44528425663, codec=Lucene101, timestamp=1757601148237536190}] since initial checkpoint [ReplicationCheckpoint{shardId=[.ds-prod-example-index-log-000168][0], primaryTerm=13, segmentsGen=371, version=10110, size=44528425663, codec=Lucene101, timestamp=1757907907197675804}] is ahead of it]; ], allocation_status[no_attempt]]]"
        }
      ]
    }
  ]
}

vinylen · Sep 15 '25 05:09

We are experiencing exactly the same issue as @shamil. Running version 3.2.0 with segment replication.

@shamil, did you find a fix/workaround? Do you have segment replication configured?

thomas315 · Sep 22 '25 09:09

I did some testing; it only happens with segment replication enabled. (We had also set segrep.pressure.enabled: true, but that may be unrelated.) Rolling back those two changes fixes the issue for us (a sketch of the two settings is below).
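
For context, the two settings being rolled back presumably look something like this in opensearch.yml; cluster.indices.replication.strategy is an assumption about how segment replication was made the cluster-wide default, while segrep.pressure.enabled is quoted from the comment above:

# Make SEGMENT replication the default for newly created indices (assumed key)
cluster.indices.replication.strategy: SEGMENT
# Segment replication backpressure, as mentioned above
segrep.pressure.enabled: true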

thomas315 · Sep 22 '25 10:09

Catch All Triage

krisfreedain · Sep 22 '25 16:09

We enabled this cluster-wide for all indices, even system indices. We have disabled segment replication for all indices going forward, but already-created indices still have segment replication enabled, and as far as I know the replication type can't be changed to document replication after the index has been created?

Our system indices have segment replication enabled as well. Is there a way to reindex these system indices so that we can disable segment replication for at least the system indices?
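
For a regular (non-system) index, the usual route around a static setting is to create a new index with the desired replication type and reindex into it; a minimal sketch with placeholder index names follows, and whether any of this is safe for system indices is exactly the open question here:

PUT my-index-docrep
{
  "settings": {
    "index.replication.type": "DOCUMENT"
  }
}

POST _reindex
{
  "source": { "index": "my-index-segrep" },
  "dest": { "index": "my-index-docrep" }
}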

vinylen · Sep 30 '25 09:09

Just wanted to confirm that the issue still exists in 3.3.1.

thomas315 · Oct 23 '25 13:10

Same issue after upgrading 2.17 -> 2.19. With 2.17, I never encountered this error.

tmanninger · Dec 01 '25 10:12