
Fix peer recovery activity tracking

Open maxlepikhin opened this pull request 3 weeks ago • 2 comments

Description

Prevent peer/segment recovery timeouts from being reset by mere status polling. Previously, every call to ReplicationCollection#get() bumped the recovery’s lastAccessTime, so a hung recovery could sit in INITIALIZING indefinitely. This change removes that implicit update and explicitly touches the timestamp only when meaningful work happens (start, file transfer, clean‑files, translog, finalize, etc.), ensuring indices.recovery.recovery_activity_timeout actually fails stuck recoveries.
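
For orientation, here is a minimal, self-contained sketch of the mechanism being changed, assuming simplified stand-in classes (ToyRecoveryTarget, ToyReplicationCollection) rather than the real ReplicationCollection/RecoveryTarget types:

```java
import java.util.concurrent.atomic.AtomicLong;

// Toy model of the activity-timeout mechanism described above; the names are
// illustrative stand-ins, not the actual OpenSearch classes.
class ToyRecoveryTarget {
    private final AtomicLong lastAccessNanos = new AtomicLong(System.nanoTime());

    // Called only when meaningful recovery work happens (start, file chunk, translog, finalize).
    void setLastAccessTime() {
        lastAccessNanos.set(System.nanoTime());
    }

    long nanosSinceLastAccess() {
        return System.nanoTime() - lastAccessNanos.get();
    }
}

class ToyReplicationCollection {
    private final ToyRecoveryTarget target = new ToyRecoveryTarget();

    // After the fix: a plain lookup no longer refreshes the timestamp. Before the fix, the
    // ref constructor effectively called target.setLastAccessTime() here, so even status
    // polling kept a hung recovery "alive" past recovery_activity_timeout.
    ToyRecoveryTarget get() {
        return target;
    }
}

public class ActivityTimeoutSketch {
    public static void main(String[] args) throws InterruptedException {
        ToyReplicationCollection collection = new ToyReplicationCollection();
        Thread.sleep(50);

        ToyRecoveryTarget target = collection.get();   // polling: idle timer keeps running
        System.out.println("idle nanos after poll: " + target.nanosSinceLastAccess());

        target.setLastAccessTime();                    // explicit touch on real work
        System.out.println("idle nanos after work: " + target.nanosSinceLastAccess());
    }
}
```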

Related Issues

Resolves #20177

Check List

  • [x] Functionality includes testing. (./gradlew :server:test --tests org.opensearch.recovery.ReplicationCollectionTests)
  • [ ] API changes companion pull request created, if applicable.
  • [ ] Public documentation issue/PR created, if applicable.

Summary by CodeRabbit

  • Bug Fixes

    • Ensure recovery and replication targets record last-access times when actively used, and prevent frequent polling from inadvertently extending timeouts.
    • Stop updating last-access time on simple collection reference creation to avoid misleading activity metrics.
  • Tests

    • Added tests validating last-access updates and that polling does not reset recovery/replication timeouts.


maxlepikhin avatar Dec 06 '25 22:12 maxlepikhin

Walkthrough

Added explicit setLastAccessTime() calls at multiple recovery and replication entry points, removed an implicit touch from ReplicationRef constructor, adjusted tests (last-access polling behavior and lazy blob stream creation), and updated the changelog.

Changes

• Peer recovery access-time updates (server/src/main/java/org/opensearch/indices/recovery/PeerRecoveryTargetService.java): Inserted recoveryTarget.setLastAccessTime() in doRecovery and in request handlers (HandoffPrimaryContextRequestHandler, TranslogOperationsRequestHandler) so targets are touched when recovery RPCs are processed.
• Replication target touches (server/src/main/java/org/opensearch/indices/replication/SegmentReplicator.java, server/src/main/java/org/opensearch/indices/replication/common/ReplicationTarget.java): Call setLastAccessTime() after obtaining replication targets, in handleFileChunk, and when createOrFinishListener yields a listener, to update last-access timestamps on replication interactions.
• Removed implicit ref touch (server/src/main/java/org/opensearch/indices/replication/common/ReplicationCollection.java): Removed status.setLastAccessTime() from the ReplicationCollection.ReplicationRef constructor so last-access is updated only at explicit entry points.
• Tests: last-access behavior and timeout polling (server/src/test/java/org/opensearch/recovery/ReplicationCollectionTests.java): Added testRecoveryTimeoutNotResetByPolling() and updated testLastAccessTimeUpdate() to assert that plain get() does not update lastAccessTime and that explicit setLastAccessTime() does.
• Tests: lazy blob stream creation (server/src/test/java/org/opensearch/cluster/routing/remote/RemoteRoutingTableServiceTests.java): Replaced static blobContainer.readBlob(...) stubs with lazy thenAnswer(...) answers that produce the input stream at invocation time in two tests.
• Changelog (CHANGELOG.md): Added entry: Fixed hung peer recovery permanently blocks replica allocation ([#20177]).

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Attention points:
    • Concurrency/ownership effects of added setLastAccessTime() calls.
    • Behavioral impact from removing the implicit touch in ReplicationRef.
    • Timing-sensitive tests in ReplicationCollectionTests.java for flakiness (see the deterministic-clock sketch after this list).
    • Ensure lazy thenAnswer(...) test changes preserve semantics.
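
On the flakiness point above, one common way to keep such assertions deterministic is to drive "now" from an injectable clock rather than real sleeps. The sketch below is a hypothetical illustration of that pattern, not the PR's actual test code:

```java
import java.util.function.LongSupplier;

// Hypothetical deterministic-clock pattern for timeout tests; names are illustrative.
class FakeClock implements LongSupplier {
    private long nanos = 0;

    @Override
    public long getAsLong() {
        return nanos;
    }

    void advanceMillis(long millis) {
        nanos += millis * 1_000_000L;
    }
}

class TimedTarget {
    private final LongSupplier clock;
    private long lastAccessNanos;

    TimedTarget(LongSupplier clock) {
        this.clock = clock;
        this.lastAccessNanos = clock.getAsLong();
    }

    void setLastAccessTime() {
        lastAccessNanos = clock.getAsLong();
    }

    boolean isIdleLongerThanMillis(long millis) {
        return clock.getAsLong() - lastAccessNanos > millis * 1_000_000L;
    }
}

public class DeterministicTimeoutSketch {
    public static void main(String[] args) {
        // Run with -ea to enable the assertions.
        FakeClock clock = new FakeClock();
        TimedTarget target = new TimedTarget(clock);

        clock.advanceMillis(500);                          // simulate idle time without sleeping
        assert target.isIdleLongerThanMillis(100) : "polling alone should not reset the timer";

        target.setLastAccessTime();                        // explicit touch, as on real recovery work
        assert !target.isIdleLongerThanMillis(100);
        System.out.println("deterministic timeout assertions passed");
    }
}
```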

Suggested reviewers

  • mch2
  • dbwiddis
  • msfroh

Poem

🐰 I hop to every recovery gate,
I tap the clock—record the state.
When chunks arrive or listeners call,
I mark the time to guard them all.
A tiny hop, a timestamp small.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
• Docstring Coverage ⚠️ Warning: Docstring coverage is 14.29%, which is insufficient; the required threshold is 80.00%. Resolution: you can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (4 passed)
• Title check ✅ Passed: The title 'Fix peer recovery activity tracking' clearly and concisely describes the main change of the PR, which is to fix how peer recovery activity is tracked and prevent hung recovery timeouts from being reset by polling.
• Description check ✅ Passed: The PR description fully covers the change objectives, links to the related issue #20177, explains the fix, and includes test confirmation, though API/documentation changes are marked as not applicable.
• Linked Issues check ✅ Passed: The PR addresses the primary objective from #20177 by preventing recovery timeouts from being reset by polling and explicitly updating lastAccessTime only during meaningful work, enabling indices.recovery.recovery_activity_timeout to fail stuck recoveries.
• Out of Scope Changes check ✅ Passed: The PR includes changes to RemoteRoutingTableServiceTests for blob stream handling and CHANGELOG updates, which are narrowly scoped to support the main recovery timeout fix without introducing unrelated functionality.

📜 Recent review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 87f83735eda460794e84ce3e08ae1da481c5e64a and 1d46b8be1e20d8b67cf9f5e9a8535e49d69cd574.

📒 Files selected for processing (6)
  • CHANGELOG.md (1 hunks)
  • server/src/main/java/org/opensearch/indices/recovery/PeerRecoveryTargetService.java (3 hunks)
  • server/src/main/java/org/opensearch/indices/replication/SegmentReplicator.java (1 hunks)
  • server/src/main/java/org/opensearch/indices/replication/common/ReplicationCollection.java (0 hunks)
  • server/src/main/java/org/opensearch/indices/replication/common/ReplicationTarget.java (2 hunks)
  • server/src/test/java/org/opensearch/cluster/routing/remote/RemoteRoutingTableServiceTests.java (2 hunks)
💤 Files with no reviewable changes (1)
  • server/src/main/java/org/opensearch/indices/replication/common/ReplicationCollection.java
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
  • GitHub Check: gradle-check
  • GitHub Check: detect-breaking-change
  • GitHub Check: precommit (25, macos-15-intel)
  • GitHub Check: precommit (25, macos-15)
  • GitHub Check: precommit (25, ubuntu-24.04-arm)
  • GitHub Check: precommit (21, windows-2025, true)
  • GitHub Check: precommit (25, ubuntu-latest)
  • GitHub Check: precommit (21, ubuntu-24.04-arm)
  • GitHub Check: precommit (25, windows-latest)
  • GitHub Check: precommit (21, macos-15-intel)
  • GitHub Check: precommit (21, windows-latest)
  • GitHub Check: precommit (21, ubuntu-latest)
  • GitHub Check: precommit (21, macos-15)
  • GitHub Check: Analyze (java)
  • GitHub Check: assemble (21, ubuntu-latest)
  • GitHub Check: assemble (25, ubuntu-24.04-arm)
  • GitHub Check: assemble (25, windows-latest)
  • GitHub Check: assemble (25, ubuntu-latest)
  • GitHub Check: assemble (21, ubuntu-24.04-arm)
  • GitHub Check: assemble (21, windows-latest)
🔇 Additional comments (9)
CHANGELOG.md (1)

18-18: LGTM! Changelog entry correctly documents the fix.

The entry appropriately documents the resolution of issue #20177 regarding hung peer recovery blocking replica allocation.

server/src/test/java/org/opensearch/cluster/routing/remote/RemoteRoutingTableServiceTests.java (2)

576-582: LGTM! Test fix properly creates fresh streams.

The change from thenReturn to thenAnswer ensures a fresh InputStream is created on each readBlob invocation, preventing checksum/footer corruption from stream reuse.
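
For reference, a minimal Mockito illustration of the thenReturn vs. thenAnswer difference; BlobReader here is a hypothetical stand-in for the real BlobContainer API, so the point is the stubbing style rather than the OpenSearch interfaces:

```java
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.when;

import java.io.ByteArrayInputStream;
import java.io.InputStream;

public class LazyStubSketch {
    interface BlobReader {
        InputStream readBlob(String name);
    }

    public static void main(String[] args) throws Exception {
        BlobReader reader = mock(BlobReader.class);

        // Eager stub (the old style): the same InputStream instance is returned on every call,
        // so a second read sees an already-consumed stream.
        // when(reader.readBlob("index")).thenReturn(new ByteArrayInputStream("payload".getBytes()));

        // Lazy stub (the fixed style): a fresh stream is created at each invocation,
        // matching how a real blob store behaves.
        when(reader.readBlob("index")).thenAnswer(invocation -> new ByteArrayInputStream("payload".getBytes()));

        try (InputStream first = reader.readBlob("index"); InputStream second = reader.readBlob("index")) {
            System.out.println(first.readAllBytes().length);   // 7
            System.out.println(second.readAllBytes().length);  // 7 again, because the stream is fresh
        }
    }
}
```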


614-616: LGTM! Consistent test fix for stream creation.

This change applies the same fix as in testGetAsyncIndexRoutingReadAction, ensuring fresh streams are created on each invocation.

server/src/main/java/org/opensearch/indices/replication/SegmentReplicator.java (1)

332-332: LGTM! Correct placement of lastAccessTime update.

The setLastAccessTime() call is appropriately placed right before replication starts, ensuring the activity timeout tracks meaningful work rather than status polling.

server/src/main/java/org/opensearch/indices/recovery/PeerRecoveryTargetService.java (3)

240-240: LGTM! Explicit timestamp update needed in doRecovery.

This explicit setLastAccessTime() is necessary since doRecovery does not call createOrFinishListener. The recovery initiation qualifies as meaningful work that should reset the activity timeout.


433-435: LGTM! Explicit timestamp update needed for primary context handoff.

This explicit setLastAccessTime() is necessary since HandoffPrimaryContextRequestHandler does not use createOrFinishListener. The primary context handoff represents meaningful recovery work.


469-469: LGTM! Timestamp update needed for retry scenarios.

While createOrFinishListener (line 449) already updates lastAccessTime, this explicit call at line 469 is necessary for the retry logic. When performTranslogOps is invoked recursively due to mapping exceptions (lines 481-490), a fresh recoveryRef is obtained and the timestamp must be updated again to track the continued activity.
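
A rough, self-contained model of the retry shape described above; the names (TranslogRetrySketch, MappingNotReadyException, performTranslogOps) are illustrative only, and the real handler retries asynchronously once the mapping update arrives rather than inline:

```java
public class TranslogRetrySketch {
    static class Target {
        long lastAccessNanos = System.nanoTime();
        int touches = 0;

        void setLastAccessTime() {
            lastAccessNanos = System.nanoTime();
            touches++;
        }
    }

    static class MappingNotReadyException extends RuntimeException {}

    // On each (re-)entry the target is touched again, so a retry after a mapping update
    // counts as activity and keeps the recovery from timing out mid-retry.
    static void performTranslogOps(Target target, int attempt) {
        target.setLastAccessTime();
        try {
            applyOperations(attempt);
            System.out.println("applied translog ops on attempt " + attempt + ", touches=" + target.touches);
        } catch (MappingNotReadyException e) {
            performTranslogOps(target, attempt + 1);
        }
    }

    static void applyOperations(int attempt) {
        if (attempt == 0) {
            throw new MappingNotReadyException();   // first attempt requires a mapping update
        }
    }

    public static void main(String[] args) {
        performTranslogOps(new Target(), 0);        // prints touches=2: initial call plus one retry
    }
}
```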

server/src/main/java/org/opensearch/indices/replication/common/ReplicationTarget.java (2)

227-229: LGTM! Conditional timestamp update prevents timeout extension on no-ops.

The conditional update ensures lastAccessTime is only modified when a non-null listener indicates the request will actually be processed, not for duplicate replays or no-ops. This aligns with the PR objective to track meaningful work.


249-252: LGTM! Timestamp update correctly placed after null check.

The setLastAccessTime() call is now properly placed after the null check, ensuring the timestamp is updated only when the file chunk will actually be processed. This addresses the inconsistency flagged in previous reviews and aligns with the pattern established in createOrFinishListener.
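
A compact toy model of the "touch only when the request will actually be processed" rule discussed in these two comments; the names mirror the discussion but are not the real ReplicationTarget API:

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Consumer;

public class ConditionalTouchSketch {
    static class ToyTarget {
        private long lastAccessNanos = System.nanoTime();
        private final Set<Long> completedRequests = new HashSet<>();

        void markCompleted(long requestSeqNo) {
            completedRequests.add(requestSeqNo);
        }

        long lastAccessNanos() {
            return lastAccessNanos;
        }

        private void setLastAccessTime() {
            lastAccessNanos = System.nanoTime();
        }

        // Returns null for duplicate replays / already-finished requests; the timestamp is
        // refreshed only on the non-null path, so no-ops cannot extend the activity timeout.
        Consumer<String> createOrFinishListener(long requestSeqNo) {
            if (completedRequests.contains(requestSeqNo)) {
                return null;
            }
            setLastAccessTime();
            return chunk -> System.out.println("processing " + chunk + " for req " + requestSeqNo);
        }
    }

    public static void main(String[] args) {
        // Run with -ea to enable the assertions.
        ToyTarget target = new ToyTarget();
        target.markCompleted(1L);

        long before = target.lastAccessNanos();
        assert target.createOrFinishListener(1L) == null;              // duplicate: timer untouched
        assert target.lastAccessNanos() == before;

        Consumer<String> listener = target.createOrFinishListener(2L); // fresh request: timer refreshed
        assert listener != null && target.lastAccessNanos() >= before;
        listener.accept("file-chunk-0");
    }
}
```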


Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai[bot] avatar Dec 06 '25 22:12 coderabbitai[bot]

:x: Gradle check result for 4263de5090a91541ae44427114276246e17a22c4: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Dec 06 '25 23:12 github-actions[bot]

@mch2 Can you take a look at this?

andrross avatar Dec 15 '25 21:12 andrross

:x: Gradle check result for 4263de5090a91541ae44427114276246e17a22c4: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Dec 15 '25 22:12 github-actions[bot]

:x: Gradle check result for dfe25597fa173811fd5e6ffadcfbc966634527df: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Dec 15 '25 23:12 github-actions[bot]

:x: Gradle check result for 71bd51f8aa351dcf59985434dcf7fa811f99ab39: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Dec 16 '25 00:12 github-actions[bot]

:x: Gradle check result for c57ea79578eac9f6d77b3e3cef7b595740ca5c20: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Dec 16 '25 02:12 github-actions[bot]

@maxlepikhin You've got a code formatting error:

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':server:spotlessJavaCheck'.
> The following files had format violations:
      src/test/java/org/opensearch/cluster/routing/remote/RemoteRoutingTableServiceTests.java
          @@ -573,11 +573,13 @@
           ············compressor,
           ············Version.CURRENT
           ········);
          -········when(blobContainer.readBlob(indexName)).thenAnswer(invocation·->·remoteIndexRoutingTable.indexRoutingTableFormat.serialize(
          -············clusterState.getRoutingTable().getIndicesRouting().get(indexName),
          -············uploadedFileName,
          -············compressor
          -········).streamInput());
          +········when(blobContainer.readBlob(indexName)).thenAnswer(
          +············invocation·->·remoteIndexRoutingTable.indexRoutingTableFormat.serialize(
          +················clusterState.getRoutingTable().getIndicesRouting().get(indexName),
          +················uploadedFileName,
          +················compressor
          +············).streamInput()
          +········);
           ········TestCapturingListener<IndexRoutingTable>·listener·=·new·TestCapturingListener<>();
           ········CountDownLatch·latch·=·new·CountDownLatch(1);
           
  Run './gradlew spotlessApply' to fix all violations.

andrross avatar Dec 16 '25 23:12 andrross

@maxlepikhin You've got a code formatting error:

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':server:spotlessJavaCheck'.
> The following files had format violations:
      src/test/java/org/opensearch/cluster/routing/remote/RemoteRoutingTableServiceTests.java
          @@ -573,11 +573,13 @@
           ············compressor,
           ············Version.CURRENT
           ········);
          -········when(blobContainer.readBlob(indexName)).thenAnswer(invocation·->·remoteIndexRoutingTable.indexRoutingTableFormat.serialize(
          -············clusterState.getRoutingTable().getIndicesRouting().get(indexName),
          -············uploadedFileName,
          -············compressor
          -········).streamInput());
          +········when(blobContainer.readBlob(indexName)).thenAnswer(
          +············invocation·->·remoteIndexRoutingTable.indexRoutingTableFormat.serialize(
          +················clusterState.getRoutingTable().getIndicesRouting().get(indexName),
          +················uploadedFileName,
          +················compressor
          +············).streamInput()
          +········);
           ········TestCapturingListener<IndexRoutingTable>·listener·=·new·TestCapturingListener<>();
           ········CountDownLatch·latch·=·new·CountDownLatch(1);
           
  Run './gradlew spotlessApply' to fix all violations.

Thanks, updated.

maxlepikhin avatar Dec 16 '25 23:12 maxlepikhin

:white_check_mark: Gradle check result for b2fc04ca35d96301b3d135a5c7471c8f3551f183: SUCCESS

github-actions[bot] avatar Dec 17 '25 01:12 github-actions[bot]

Codecov Report

:x: Patch coverage is 88.88889% with 1 line in your changes missing coverage. Please review.
:white_check_mark: Project coverage is 73.30%. Comparing base (d47931e) to head (87f8373).
:warning: Report is 36 commits behind head on main.

Files with missing lines:
• .../indices/replication/common/ReplicationTarget.java: Patch 66.66%, 0 Missing and 1 partial :warning:
Additional details and impacted files
@@            Coverage Diff            @@
##               main   #20178   +/-   ##
=========================================
  Coverage     73.30%   73.30%           
- Complexity    71732    71770   +38     
=========================================
  Files          5793     5793           
  Lines        328056   328117   +61     
  Branches      47245    47257   +12     
=========================================
+ Hits         240476   240532   +56     
- Misses        68264    68316   +52     
+ Partials      19316    19269   -47     

:umbrella: View full report in Codecov by Sentry.

codecov[bot] avatar Dec 17 '25 01:12 codecov[bot]

:x: Gradle check result for 87f83735eda460794e84ce3e08ae1da481c5e64a: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Dec 17 '25 21:12 github-actions[bot]

:white_check_mark: Gradle check result for 87f83735eda460794e84ce3e08ae1da481c5e64a: SUCCESS

github-actions[bot] avatar Dec 17 '25 23:12 github-actions[bot]

I rebased and added a CHANGELOG entry

andrross avatar Dec 18 '25 19:12 andrross

:x: Gradle check result for 1d46b8be1e20d8b67cf9f5e9a8535e49d69cd574: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Dec 18 '25 19:12 github-actions[bot]

I rebased and added a CHANGELOG entry

@andrross Looks like the CI failed. Is it the DCO/ownership check or something else?

maxlepikhin avatar Dec 18 '25 20:12 maxlepikhin

@maxlepikhin DCO is fine, looks like some sort of connection failure on test setup. I'll retry the tests.

Did you answer @mch2's question above?

Your logic makes sense, though I'm curious what is polling/resetting the last access time here. The internal monitor is calling get on ConcurrentHashMap here - so that shouldn't be resetting the timestamp.

api calls to cat recovery also wouldn't reset this time as its fetching and reading recoverystate.

Did you confirm that this fixes the problem you discovered?

andrross avatar Dec 18 '25 20:12 andrross

Your logic makes sense, though I'm curious what is polling/resetting the last access time here. The internal monitor is calling get on ConcurrentHashMap here - so that shouldn't be resetting the timestamp.

api calls to cat recovery also wouldn't reset this time as its fetching and reading recoverystate.

@mch2 Here is Codex' reply (I asked it to find concrete examples of the call sites). It makes sense to me.

• You’re right that the monitor and cat recovery don’t mutate it. The resets were coming from the ReplicationRef constructor calling setLastAccessTime(), so every lookup bumped the clock, even if no work followed. Examples:

  - PeerRecoveryTargetService handlers (e.g., file chunk at server/src/main/java/org/opensearch/indices/recovery/PeerRecoveryTargetService.java:578-582, translog ops at :447-459) did getSafe(...) before checking for duplicates; that lookup alone refreshed the timestamp.
  - Segment replication runner (server/src/main/java/org/opensearch/indices/replication/SegmentReplicator.java:322-333) bumped it just by fetching the ref.

  We removed that implicit bump and now update lastAccessTime only when we actually handle the request (createOrFinishListener when it returns non-null, plus the few non-transport paths like doRecovery). Polling via get/getSafe no longer resets the timer, which fixes the observed timeout issue.

maxlepikhin avatar Dec 18 '25 21:12 maxlepikhin

:x: Gradle check result for 1d46b8be1e20d8b67cf9f5e9a8535e49d69cd574: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Dec 18 '25 22:12 github-actions[bot]

@maxlepikhin DCO is fine, looks like some sort of connection failure on test setup. I'll retry the tests.

Did you answer @mch2's question above?

Your logic makes sense, though I'm curious what is polling/resetting the last access time here. The internal monitor is calling get on ConcurrentHashMap here - so that shouldn't be resetting the timestamp. api calls to cat recovery also wouldn't reset this time as its fetching and reading recoverystate.

Did you confirm that this fixes the problem you discovered?

  1. Replied to @mch2, missed that comment, apologies.
  2. Regarding a repro: I spent some time trying to get a cluster in minikube into that condition with a network policy or iptables, but was unsuccessful, so I have no repro to verify the fix against.

maxlepikhin avatar Dec 18 '25 22:12 maxlepikhin