Fix peer recovery activity tracking
Description
Prevent peer/segment recovery timeouts from being reset by mere status polling. Previously, every call to ReplicationCollection#get() bumped the recovery’s lastAccessTime, so a hung recovery could sit in INITIALIZING indefinitely. This change removes that implicit update and explicitly touches the timestamp only when meaningful work happens (start, file transfer, clean‑files, translog, finalize, etc.), ensuring indices.recovery.recovery_activity_timeout actually fails stuck recoveries.
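For illustration, a minimal sketch of the intended behavior, using hypothetical stand-in classes rather than the actual OpenSearch `ReplicationCollection`/`RecoveryTarget` types: a plain lookup leaves the activity clock alone, an explicit touch happens only where real work is done, and a monitor can compare idle time against the activity timeout.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class RecoveryTargetSketch {
    private volatile long lastAccessTimeMillis = System.currentTimeMillis();

    void setLastAccessTime() {                     // explicit "touch", called only on real work
        lastAccessTimeMillis = System.currentTimeMillis();
    }

    long lastAccessTimeMillis() {
        return lastAccessTimeMillis;
    }
}

class RecoveryCollectionSketch {
    private final Map<Long, RecoveryTargetSketch> targets = new ConcurrentHashMap<>();

    RecoveryTargetSketch get(long id) {
        return targets.get(id);                    // plain lookup: no implicit timestamp bump
    }

    void onFileChunk(long id, byte[] chunk) {
        RecoveryTargetSketch target = targets.get(id);
        if (target != null) {
            target.setLastAccessTime();            // meaningful work: refresh the activity clock
            // ... persist the chunk ...
        }
    }

    boolean isStuck(long id, long activityTimeoutMillis) {
        RecoveryTargetSketch target = targets.get(id);
        return target != null
            && System.currentTimeMillis() - target.lastAccessTimeMillis() > activityTimeoutMillis;
    }
}
```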
Related Issues
Resolves #20177
Check List
- [x] Functionality includes testing. (./gradlew :server:test --tests org.opensearch.recovery.ReplicationCollectionTests)
- [ ] API changes companion pull request created, if applicable.
- [ ] Public documentation issue/PR created, if applicable.
Summary by CodeRabbit

- Bug Fixes
  - Ensure recovery and replication targets record last-access times when actively used, and prevent frequent polling from inadvertently extending timeouts.
  - Stop updating last-access time on simple collection reference creation to avoid misleading activity metrics.
- Tests
  - Added tests validating last-access updates and that polling does not reset recovery/replication timeouts.
Walkthrough
Added explicit setLastAccessTime() calls at multiple recovery and replication entry points, removed an implicit touch from ReplicationRef constructor, adjusted tests (last-access polling behavior and lazy blob stream creation), and updated the changelog.
Changes
| Cohort / File(s) | Summary |
|---|---|
| Peer recovery access-time updates<br>`server/src/main/java/org/opensearch/indices/recovery/PeerRecoveryTargetService.java` | Inserted `recoveryTarget.setLastAccessTime()` in `doRecovery` and in the request handlers (`HandoffPrimaryContextRequestHandler`, `TranslogOperationsRequestHandler`) so targets are touched when recovery RPCs are processed. |
| Replication target touches<br>`server/src/main/java/org/opensearch/indices/replication/SegmentReplicator.java`, `server/src/main/java/org/opensearch/indices/replication/common/ReplicationTarget.java` | Call `setLastAccessTime()` after obtaining replication targets, in `handleFileChunk`, and when `createOrFinishListener` yields a listener, to update last-access timestamps on replication interactions. |
| Removed implicit ref touch<br>`server/src/main/java/org/opensearch/indices/replication/common/ReplicationCollection.java` | Removed `status.setLastAccessTime()` from the `ReplicationCollection.ReplicationRef` constructor so last-access is updated only at explicit entry points. |
| Tests: last-access behavior and timeout polling<br>`server/src/test/java/org/opensearch/recovery/ReplicationCollectionTests.java` | Added `testRecoveryTimeoutNotResetByPolling()` and updated `testLastAccessTimeUpdate()` to assert that a plain `get()` does not update `lastAccessTime` and that an explicit `setLastAccessTime()` does. |
| Tests: lazy blob stream creation<br>`server/src/test/java/org/opensearch/cluster/routing/remote/RemoteRoutingTableServiceTests.java` | Replaced static `blobContainer.readBlob(...)` stubs with lazy `thenAnswer(...)` calls that produce the input stream at invocation time in two tests (see the Mockito sketch below this table). |
| Changelog<br>`CHANGELOG.md` | Added entry: Fixed hung peer recovery permanently blocks replica allocation ([#20177]). |
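The lazy stubbing change in `RemoteRoutingTableServiceTests` is easiest to see with a small Mockito sketch; the `BlobReader` interface and names below are hypothetical stand-ins, not the real `BlobContainer` API.

```java
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.when;

import java.io.ByteArrayInputStream;
import java.io.InputStream;

interface BlobReader {
    InputStream readBlob(String name);
}

class LazyBlobStubSketch {
    static BlobReader lazyStub(byte[] payload) {
        BlobReader reader = mock(BlobReader.class);
        // thenReturn(stream) would hand back the same, possibly already-consumed stream on every
        // call; thenAnswer(...) builds a fresh InputStream each time readBlob(...) is invoked.
        when(reader.readBlob("index-routing")).thenAnswer(invocation -> new ByteArrayInputStream(payload));
        return reader;
    }
}
```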
Estimated code review effort
🎯 3 (Moderate) | ⏱️ ~20 minutes
- Attention points:
  - Concurrency/ownership effects of added `setLastAccessTime()` calls.
  - Behavioral impact from removing the implicit touch in `ReplicationRef`.
  - Timing-sensitive tests in `ReplicationCollectionTests.java` for flakiness.
  - Ensure lazy `thenAnswer(...)` test changes preserve semantics.
Suggested reviewers
- mch2
- dbwiddis
- msfroh
Poem
🐰 I hop to every recovery gate,
I tap the clock—record the state.
When chunks arrive or listeners call,
I mark the time to guard them all.
A tiny hop, a timestamp small.
Pre-merge checks and finishing touches
❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 14.29% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'Fix peer recovery activity tracking' clearly and concisely describes the main change of the PR, which is to fix how peer recovery activity is tracked and prevent hung recoveries' activity timeouts from being reset by polling. |
| Description check | ✅ Passed | The PR description fully covers the change objectives, links to the related issue #20177, explains the fix, and includes test confirmation, though API/documentation changes are marked as not applicable. |
| Linked Issues check | ✅ Passed | The PR successfully addresses the primary objective from #20177 by preventing recovery timeouts from being reset by polling and explicitly updating lastAccessTime only during meaningful work, enabling indices.recovery.recovery_activity_timeout to fail stuck recoveries. |
| Out of Scope Changes check | ✅ Passed | The PR includes changes to RemoteRoutingTableServiceTests for blob stream handling and CHANGELOG updates, which are narrowly scoped to support the main recovery timeout fix without introducing unrelated functionality. |
📜 Recent review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📥 Commits
Reviewing files that changed from the base of the PR and between 87f83735eda460794e84ce3e08ae1da481c5e64a and 1d46b8be1e20d8b67cf9f5e9a8535e49d69cd574.
📒 Files selected for processing (6)
- CHANGELOG.md (1 hunks)
- server/src/main/java/org/opensearch/indices/recovery/PeerRecoveryTargetService.java (3 hunks)
- server/src/main/java/org/opensearch/indices/replication/SegmentReplicator.java (1 hunks)
- server/src/main/java/org/opensearch/indices/replication/common/ReplicationCollection.java (0 hunks)
- server/src/main/java/org/opensearch/indices/replication/common/ReplicationTarget.java (2 hunks)
- server/src/test/java/org/opensearch/cluster/routing/remote/RemoteRoutingTableServiceTests.java (2 hunks)
💤 Files with no reviewable changes (1)
- server/src/main/java/org/opensearch/indices/replication/common/ReplicationCollection.java
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
- GitHub Check: gradle-check
- GitHub Check: detect-breaking-change
- GitHub Check: precommit (25, macos-15-intel)
- GitHub Check: precommit (25, macos-15)
- GitHub Check: precommit (25, ubuntu-24.04-arm)
- GitHub Check: precommit (21, windows-2025, true)
- GitHub Check: precommit (25, ubuntu-latest)
- GitHub Check: precommit (21, ubuntu-24.04-arm)
- GitHub Check: precommit (25, windows-latest)
- GitHub Check: precommit (21, macos-15-intel)
- GitHub Check: precommit (21, windows-latest)
- GitHub Check: precommit (21, ubuntu-latest)
- GitHub Check: precommit (21, macos-15)
- GitHub Check: Analyze (java)
- GitHub Check: assemble (21, ubuntu-latest)
- GitHub Check: assemble (25, ubuntu-24.04-arm)
- GitHub Check: assemble (25, windows-latest)
- GitHub Check: assemble (25, ubuntu-latest)
- GitHub Check: assemble (21, ubuntu-24.04-arm)
- GitHub Check: assemble (21, windows-latest)
🔇 Additional comments (9)
CHANGELOG.md (1)

18-18: LGTM! Changelog entry correctly documents the fix.

The entry appropriately documents the resolution of issue #20177 regarding hung peer recovery blocking replica allocation.

server/src/test/java/org/opensearch/cluster/routing/remote/RemoteRoutingTableServiceTests.java (2)

576-582: LGTM! Test fix properly creates fresh streams.

The change from `thenReturn` to `thenAnswer` ensures a fresh `InputStream` is created on each `readBlob` invocation, preventing checksum/footer corruption from stream reuse.

614-616: LGTM! Consistent test fix for stream creation.

This change applies the same fix as in `testGetAsyncIndexRoutingReadAction`, ensuring fresh streams are created on each invocation.

server/src/main/java/org/opensearch/indices/replication/SegmentReplicator.java (1)

332-332: LGTM! Correct placement of lastAccessTime update.

The `setLastAccessTime()` call is appropriately placed right before replication starts, ensuring the activity timeout tracks meaningful work rather than status polling.

server/src/main/java/org/opensearch/indices/recovery/PeerRecoveryTargetService.java (3)

240-240: LGTM! Explicit timestamp update needed in doRecovery.

This explicit `setLastAccessTime()` is necessary since `doRecovery` does not call `createOrFinishListener`. The recovery initiation qualifies as meaningful work that should reset the activity timeout.

433-435: LGTM! Explicit timestamp update needed for primary context handoff.

This explicit `setLastAccessTime()` is necessary since `HandoffPrimaryContextRequestHandler` does not use `createOrFinishListener`. The primary context handoff represents meaningful recovery work.

469-469: LGTM! Timestamp update needed for retry scenarios.

While `createOrFinishListener` (line 449) already updates `lastAccessTime`, this explicit call at line 469 is necessary for the retry logic. When `performTranslogOps` is invoked recursively due to mapping exceptions (lines 481-490), a fresh `recoveryRef` is obtained and the timestamp must be updated again to track the continued activity.

server/src/main/java/org/opensearch/indices/replication/common/ReplicationTarget.java (2)

227-229: LGTM! Conditional timestamp update prevents timeout extension on no-ops.

The conditional update ensures `lastAccessTime` is only modified when a non-null listener indicates the request will actually be processed, not for duplicate replays or no-ops. This aligns with the PR objective to track meaningful work.

249-252: LGTM! Timestamp update correctly placed after null check.

The `setLastAccessTime()` call is now properly placed after the null check, ensuring the timestamp is updated only when the file chunk will actually be processed. This addresses the inconsistency flagged in previous reviews and aligns with the pattern established in `createOrFinishListener`.
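To make the `ReplicationTarget` comments above concrete, here is a hedged sketch of the guard pattern with hypothetical names (not the real `ReplicationTarget` API): the timestamp is refreshed only when a non-null listener indicates the request will actually be processed, so duplicate replays and no-ops do not extend the timeout.

```java
import java.util.function.Consumer;

class ReplicationTargetSketch {
    private volatile long lastAccessTimeMillis = System.currentTimeMillis();

    void setLastAccessTime() {
        lastAccessTimeMillis = System.currentTimeMillis();
    }

    // Returns null when the chunk is a duplicate replay or a no-op that can be ignored.
    private Consumer<Void> createOrFinishListener(long chunkId) {
        if (alreadyReceived(chunkId)) {
            return null;                       // nothing to do: do NOT touch the timestamp
        }
        return ignored -> markReceived(chunkId);
    }

    void handleFileChunk(long chunkId, byte[] data) {
        Consumer<Void> listener = createOrFinishListener(chunkId);
        if (listener == null) {
            return;                            // duplicate / no-op: the activity timeout keeps ticking
        }
        setLastAccessTime();                   // real work: reset the activity clock
        writeChunk(chunkId, data, listener);
    }

    // Placeholder bookkeeping so the sketch compiles; a real target tracks received chunks.
    private boolean alreadyReceived(long chunkId) { return false; }
    private void markReceived(long chunkId) { }
    private void writeChunk(long chunkId, byte[] data, Consumer<Void> done) { done.accept(null); }
}
```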
:x: Gradle check result for 4263de5090a91541ae44427114276246e17a22c4: FAILURE
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
@mch2 Can you take a look at this?
:x: Gradle check result for 4263de5090a91541ae44427114276246e17a22c4: FAILURE
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
:x: Gradle check result for dfe25597fa173811fd5e6ffadcfbc966634527df: FAILURE
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
:x: Gradle check result for 71bd51f8aa351dcf59985434dcf7fa811f99ab39: FAILURE
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
:x: Gradle check result for c57ea79578eac9f6d77b3e3cef7b595740ca5c20: FAILURE
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
@maxlepikhin You've got a code formatting error:
FAILURE: Build failed with an exception.
* What went wrong:
Execution failed for task ':server:spotlessJavaCheck'.
> The following files had format violations:
src/test/java/org/opensearch/cluster/routing/remote/RemoteRoutingTableServiceTests.java
@@ -573,11 +573,13 @@
············compressor,
············Version.CURRENT
········);
-········when(blobContainer.readBlob(indexName)).thenAnswer(invocation·->·remoteIndexRoutingTable.indexRoutingTableFormat.serialize(
-············clusterState.getRoutingTable().getIndicesRouting().get(indexName),
-············uploadedFileName,
-············compressor
-········).streamInput());
+········when(blobContainer.readBlob(indexName)).thenAnswer(
+············invocation·->·remoteIndexRoutingTable.indexRoutingTableFormat.serialize(
+················clusterState.getRoutingTable().getIndicesRouting().get(indexName),
+················uploadedFileName,
+················compressor
+············).streamInput()
+········);
········TestCapturingListener<IndexRoutingTable>·listener·=·new·TestCapturingListener<>();
········CountDownLatch·latch·=·new·CountDownLatch(1);
Run './gradlew spotlessApply' to fix all violations.
Thanks, updated.
:white_check_mark: Gradle check result for b2fc04ca35d96301b3d135a5c7471c8f3551f183: SUCCESS
Codecov Report
:x: Patch coverage is 88.88889% with 1 line in your changes missing coverage. Please review.
:white_check_mark: Project coverage is 73.30%. Comparing base (d47931e) to head (87f8373).
:warning: Report is 36 commits behind head on main.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| .../indices/replication/common/ReplicationTarget.java | 66.66% | 0 Missing and 1 partial :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## main #20178 +/- ##
=========================================
Coverage 73.30% 73.30%
- Complexity 71732 71770 +38
=========================================
Files 5793 5793
Lines 328056 328117 +61
Branches 47245 47257 +12
=========================================
+ Hits 240476 240532 +56
- Misses 68264 68316 +52
+ Partials 19316 19269 -47
:umbrella: View full report in Codecov by Sentry.
:x: Gradle check result for 87f83735eda460794e84ce3e08ae1da481c5e64a: FAILURE
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
:white_check_mark: Gradle check result for 87f83735eda460794e84ce3e08ae1da481c5e64a: SUCCESS
I rebased and added a CHANGELOG entry
:x: Gradle check result for 1d46b8be1e20d8b67cf9f5e9a8535e49d69cd574: FAILURE
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
@andrross looks like the CI failed, ownership = DCO or something else?
@maxlepikhin DCO is fine, looks like some sort of connection failure on test setup. I'll retry the tests.
Did you answer @mch2's question above?
> Your logic makes sense, though I'm curious what is polling/resetting the last access time here. The internal monitor is calling get on ConcurrentHashMap here - so that shouldn't be resetting the timestamp.
> api calls to cat recovery also wouldn't reset this time as its fetching and reading recoverystate.
> Did you confirm that this fixes the problem you discovered?
> Your logic makes sense, though I'm curious what is polling/resetting the last access time here. The internal monitor is calling `get` on ConcurrentHashMap here - so that shouldn't be resetting the timestamp.
> api calls to cat recovery also wouldn't reset this time as its fetching and reading recoverystate.
@mch2 Here is Codex' reply (I asked it to find concrete examples of the call sites). It makes sense to me.
• You’re right that the monitor and cat recovery don’t mutate it. The resets were coming from the ReplicationRef constructor calling setLastAccessTime(), so every lookup bumped the clock, even if no work followed. Examples:
  - PeerRecoveryTargetService handlers (e.g., file chunk at server/src/main/java/org/opensearch/indices/recovery/PeerRecoveryTargetService.java:578-582, translog ops at :447-459) did getSafe(...) before checking for duplicates; that lookup alone refreshed the timestamp.
  - Segment replication runner (server/src/main/java/org/opensearch/indices/replication/SegmentReplicator.java:322-333) bumped it just by fetching the ref.

We removed that implicit bump and now update lastAccessTime only when we actually handle the request (createOrFinishListener when it returns non-null, plus the few non-transport paths like doRecovery). Polling via get/getSafe no longer resets the timer, which fixes the observed timeout issue.
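For illustration, a self-contained JUnit-style sketch of the behavior the new tests assert, using hypothetical types rather than the actual `ReplicationCollectionTests` code: polling via `get()` leaves the last-access clock untouched, while an explicit `setLastAccessTime()` advances it.

```java
import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertTrue;

import org.junit.Test;

public class PollingDoesNotResetTimeoutSketchTests {

    static final class TargetSketch {
        private volatile long lastAccessNanos = System.nanoTime();
        void setLastAccessTime() { lastAccessNanos = System.nanoTime(); }
        long lastAccessNanos() { return lastAccessNanos; }
    }

    static final class CollectionSketch {
        private final TargetSketch target = new TargetSketch();
        TargetSketch get() { return target; }    // lookup only, no timestamp update
    }

    @Test
    public void testPollingDoesNotResetLastAccessTime() throws Exception {
        CollectionSketch collection = new CollectionSketch();
        long before = collection.get().lastAccessNanos();

        Thread.sleep(10);
        collection.get();                        // "polling" the collection
        assertEquals(before, collection.get().lastAccessNanos());

        collection.get().setLastAccessTime();    // explicit touch = meaningful work
        assertTrue(collection.get().lastAccessNanos() > before);
    }
}
```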
:x: Gradle check result for 1d46b8be1e20d8b67cf9f5e9a8535e49d69cd574: FAILURE
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
> @maxlepikhin DCO is fine, looks like some sort of connection failure on test setup. I'll retry the tests.
> Did you answer @mch2's question above?
> Your logic makes sense, though I'm curious what is polling/resetting the last access time here. The internal monitor is calling get on ConcurrentHashMap here - so that shouldn't be resetting the timestamp. api calls to cat recovery also wouldn't reset this time as its fetching and reading recoverystate.
> Did you confirm that this fixes the problem you discovered?
- Replied to @mch2, missed that comment, apologies.
- Re repro: I spent some time trying to get a cluster in minikube into that condition with a network policy or iptables, but was unsuccessful, so there is no repro to test the fix against.