[BUG] Remote-backed warm indices download excessive translog generations during index cold start recovery
Describe the bug
Description
Remote-backed warm indices with segment replication experience significantly slower index cold start recovery, downloading more translog generations than necessary for recovery from remote storage.
Environment
- OpenSearch Version: 3.2
- Configuration:
- Remote store: Enabled
- Segment replication: Enabled
- Composite directory: Enabled
- `index.translog.durability`: request
- `index.translog.flush_threshold_period`: 5m (flush on idle)
Related component
Storage:Remote
To Reproduce
1. Setup:
   - Create a remote-backed index with segment replication enabled (see the settings sketch after these steps)
   - Configure request-level durability (`index.translog.durability = "request"`)
2. Index Data:
   - Index ~100 GB of data with request-level durability
   - This generates hundreds of translog generations (one per request with request-level durability)
3. Stop Indexing:
   - Stop all indexing operations
   - Index enters "warm" state (idle, no active writes)
   - Wait for the idle flush to occur (configured at 5 minutes in the test, though timing may vary)
   - Verify the idle flush completed: check that the local translog directory contains only the current (empty) translog generation
     - All historical translog files should be cleaned up locally
     - Only one active translog generation remains (with no operations)
4. Simulate Cold Start Recovery:
   - Stop the OpenSearch process on all nodes
   - Delete all local data directories on all nodes
   - Restart OpenSearch on all nodes
   - Cluster reforms and begins cold start recovery from the remote store
     - Remote store contains: Segments + Translog + Metadata
     - Nodes have: No local data (to simulate index cold start from remote)
5. Observe Recovery:
   - Monitor primary shard recovery (downloads segments + translog from the remote store)
   - Monitor the translog download specifically
   - Note: the translog download issue only affects primary shards
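For reference, a minimal sketch of the index settings used in the Setup step, written with the OpenSearch `Settings` builder. This is illustrative only: the shard/replica counts are assumptions, and the remote-store repository wiring (normally supplied by cluster-level remote store configuration) is omitted.

```java
import org.opensearch.common.settings.Settings;

// Illustrative settings for the reproduction; shard/replica counts are assumptions.
Settings warmIndexSettings = Settings.builder()
    .put("index.number_of_shards", 1)
    .put("index.number_of_replicas", 1)
    // Segment replication, as in the reported environment.
    .put("index.replication.type", "SEGMENT")
    // Request-level durability: every request syncs the translog, which is what
    // produces the hundreds of translog generations described above.
    .put("index.translog.durability", "request")
    .build();
```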
Expected behavior
After idle flush completes (before cluster restart):
- All pending indexing operations should be flushed to Lucene segments
- Local translog directory should be cleaned up:
- Historical translog files deleted locally
- Only current empty translog generation remains with 0 pending operations
- This indicates all operations are now in segments, no translog replay needed
During primary shard cold start recovery:
- Download Lucene segments from remote store
- Segments contain all indexed operations
- Download only the current translog generation from remote store
- Expected: Download only the empty/current generation
- Rationale: All operations already in segments, no replay needed
- Or at most: Download last 1-2 generations if any operations not yet in segments
- Most time spent downloading segments (necessary)
- Minimal time on translog (1-2 generations only)
Additional Details
Plugins: repository-azure plugin
Host/Environment:
- OS: Windows
- Version: 11
Additional context
Actual Behavior Observed
During primary shard cold start recovery:
- Download Lucene segments from remote store
- Download multiple old translog generations (incorrect)
- Example: Downloads ~100 generations when only 1 is needed
- Downloads old generations even though:
- These files don't exist locally (were cleaned up)
- All operations already in segments (no replay needed)
Key observation:
- Local cleanup worked correctly (files deleted)
- Segments contain all operations (flush worked correctly)
- But remote translog transfer metadata still references older generation (stale metadata)
- Recovery trusts the remote metadata and as a result downloads unnecessary translog generations
For actively indexing (hot) indices, this issue doesn't manifest: every indexing request calls ensureSynced(), which uploads fresh TranslogTransferMetadata and automatically corrects any staleness introduced when metadata is uploaded in rollTranslogGeneration() before trimUnreferencedReaders() executes. For warm/idle indices, once indexing stops there are no subsequent ensureSynced() calls to update the remote metadata after trimUnreferencedReaders() removes old generations during the idle flush, so the staleness persists indefinitely.
@sachinpkale @andrross
FYI @ankitkala
@ankitkala
I have root-caused the issue we are observing to the sequence in which flush operations perform translog cleanup and remote metadata upload.
The Core Issue
The current implementation performs operations in this order:
1. `InternalEngine.flush()` calls `translogManager.rollTranslogGeneration()`
   - Which internally calls `translog.rollGeneration()`
   - The `prepareAndUpload()` method creates a `TranslogCheckpointTransferSnapshot`, which captures the current `minTranslogGeneration` at the time of upload - before subsequent cleanup operations advance it
   - Then calls `translog.trimUnreferencedReaders()` - this performs local cleanup but happens after the metadata upload
2. `InternalEngine.flush()` then calls `translogManager.trimUnreferencedReaders()`
   - This performs additional cleanup and advances `minTranslogGeneration`
   - But no metadata upload occurs at this point
Result: Remote metadata contains minTranslogGeneration from before the final cleanup, making it stale.
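To make the ordering concrete, here is a condensed sketch of the flush path described above. It is not the actual `InternalEngine` code; the bodies are collapsed into comments covering only the steps relevant to this issue, and the `TranslogManager` parameter is just a way to keep the sketch self-contained.

```java
import java.io.IOException;
import org.opensearch.index.translog.TranslogManager;

// Condensed view of the flush ordering described above (not the real implementation).
void flushSketch(TranslogManager translogManager) throws IOException {
    // Step 1: roll the translog generation.
    //   rollGeneration() reaches prepareAndUpload(), which builds a
    //   TranslogCheckpointTransferSnapshot and uploads TranslogTransferMetadata
    //   whose minTranslogGeneration is captured at this moment.
    //   rollGeneration() then calls translog.trimUnreferencedReaders(), but the
    //   metadata has already been uploaded by then.
    translogManager.rollTranslogGeneration();

    // ... the Lucene commit happens in between, allowing minSeqNoToKeep to advance ...

    // Step 2: trim again after the commit.
    //   This deletes the now-unreferenced local generations and advances
    //   minTranslogGeneration, but no metadata upload happens here.
    translogManager.trimUnreferencedReaders();

    // Net effect: the remote metadata still records the minTranslogGeneration
    // from step 1 until some later upload replaces it.
}
```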
Why This Only Affects Warm/Idle Indices
For actively indexing indices:
- Continuous indexing operations call `ensureSynced()`
- Each `ensureSynced()` uploads fresh metadata with the current `minTranslogGeneration`
- This overwrites the stale metadata from the previous flush

For warm/idle/closed indices:
- Indexing stops
- Idle flush triggers or the index is closed
- Metadata is uploaded with the `minTranslogGeneration` from before the final cleanup
- Cleanup advances `minTranslogGeneration` significantly
- No subsequent uploads correct the metadata
- Stale metadata persists indefinitely in the remote store
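A rough sketch of the contrast, using the method names referenced in this thread (the surrounding wiring is simplified and the helper below is only illustrative):

```java
import java.io.IOException;
import org.opensearch.index.translog.Translog;

// Hot index: every indexing request ends by syncing the translog; ensureSynced()
// re-uploads metadata reflecting the current minTranslogGeneration, wiping out
// the staleness left behind by the previous flush.
void afterIndexingRequest(Translog translog, Translog.Location location) throws IOException {
    translog.ensureSynced(location); // fresh TranslogTransferMetadata uploaded here
}

// Warm index: after the final (idle/close) flush there are no further
// ensureSynced() calls, so the metadata uploaded during rollTranslogGeneration()
// remains the last one in the remote store - permanently stale.
```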
Code References
1. InternalEngine.flush() - calls rollTranslogGeneration() and then trimUnreferencedReaders()
2. RemoteFsTranslog.upload() - where the translog snapshot upload happens
Thanks for the analysis @skhiani. @gbbafna syncing metadata post translog cleanup should fix this right? Do you see any concerns?
@ankitkala @gbbafna
Update on this
I've been able to resolve this issue by signaling RemoteFsTranslog when idle flush or shard close flush is happening, and then triggering an additional metadata upload during trimUnreferencedReaders().
Verification Results
I've verified that:
- Metadata upload on idle/close flush: the additional metadata upload is happening correctly during `onIdleFlush()` and shard-close flush operations
- Cold start recovery improvement: during index cold start recovery, only the latest required translog file is now being downloaded
- Performance impact: for my setup, this has significantly reduced cold start latencies, bringing time spent in translog download down from multiple seconds to just a few milliseconds
Approach
1. Flush state tracking - track when an idle flush or shard close flush is in progress
2. Target generation tracking - track the minimum generation that should be reflected in remote metadata
3. Conditional metadata upload - trigger an additional metadata upload during `trimUnreferencedReaders()` when flush operations detect stale remote metadata
Code Changes in RemoteFsTranslog
1. New state tracking fields
+ private final AtomicBoolean inIdleFlush = new AtomicBoolean(false);
+ private final AtomicBoolean inShardCloseFlush = new AtomicBoolean(false);
+ private final AtomicLong targetMetadataMinGeneration = new AtomicLong(-1);
2. Update syncNeeded() to detect pending metadata updates
public boolean syncNeeded() {
try (ReleasableLock lock = readLock.acquire()) {
return current.syncNeeded()
|| (maxRemoteTranslogGenerationUploaded + 1 < this.currentFileGeneration() && current.totalOperations() == 0)
|| (current.getLastSyncedCheckpoint().globalCheckpoint > globalCheckpointSynced)
+ || isTranslogRemoteMetadataUpdatePending();
}
}
3. Modified trimUnreferencedReaders() to trigger rollGeneration and thereby metadata upload
public void trimUnreferencedReaders() throws IOException {
trimUnreferencedReaders(false);
+ if (isTranslogRemoteMetadataUpdateRequired()) {
+ long minLocalReferencedTranslogGeneration = getMinFileGeneration();
+
+ if (targetMetadataMinGeneration.compareAndSet(-1, minLocalReferencedTranslogGeneration)) {
+ rollGeneration();
+ }
+ }
}
4. Updated onUploadComplete() to reset target generation
public void onUploadComplete(TransferSnapshot transferSnapshot) throws IOException {
...
...
logger.debug(
"Successfully uploaded translog for primary term = {}, generation = {}, maxSeqNo = {}, minRemoteGenReferenced = {}",
primaryTerm,
generation,
maxSeqNo,
minRemoteGenReferenced
);
+ long targetRemoteReferencedGen = targetMetadataMinGeneration.get();
+ if (targetRemoteReferencedGen != -1) {
+ logger.debug("onUploadComplete: minRemoteGenReferenced {} targetMetadataMinGeneration {}",
+ minRemoteGenReferenced, targetRemoteReferencedGen);
+
+ if (minRemoteGenReferenced >= targetRemoteReferencedGen) {
+ logger.debug("targetMetadataMinGeneration being set to -1");
+ targetMetadataMinGeneration.set(-1);
+ }
+ }
}
5. Flush lifecycle hooks
@Override
public void beforeIdleFlush() {
logger.debug("Triggered RemotefsTranslog beforeIdleFlush");
inIdleFlush.set(true);
}
@Override
public void afterIdleFlush() {
logger.debug("Triggered RemotefsTranslog afterIdleFlush");
inIdleFlush.set(false);
}
@Override
public void beforeShardCloseFlush() {
logger.debug("Triggered RemotefsTranslog beforeShardCloseFlush");
inShardCloseFlush.set(true);
}
@Override
public void afterShardCloseFlush() {
logger.debug("Triggered RemotefsTranslog afterShardCloseFlush");
inShardCloseFlush.set(false);
}
6. Helper methods
private boolean isTranslogRemoteMetadataUpdatePending() {
if (inIdleFlush.get() || inShardCloseFlush.get()) {
return targetMetadataMinGeneration.get() != -1;
}
return false;
}
private boolean isTranslogRemoteMetadataUpdateRequired() {
if (inIdleFlush.get() || inShardCloseFlush.get()) {
long minLocalReferencedTranslogGeneration = getMinFileGeneration();
logger.debug("isTranslogRemoteMetadataUpdateRequired: (minRemoteGenReferenced+1) {}," +
"minLocalReferencedTranslogGeneration {}",
minRemoteGenReferenced+1, minLocalReferencedTranslogGeneration);
return (minRemoteGenReferenced + 1 < minLocalReferencedTranslogGeneration);
}
return false;
}
How It Works
- Flush Detection: `beforeIdleFlush()` and `beforeShardCloseFlush()` set flags when flush operations begin
- Staleness Check: during `trimUnreferencedReaders()`, check if remote metadata is stale (references older generations than the local minimum)
- Trigger Upload: if stale, set `targetMetadataMinGeneration` and call `rollGeneration()` to trigger a metadata upload
- Completion: `onUploadComplete()` verifies the target generation was reached and resets the flag
Request for Feedback
I'd like to raise a formal PR for this fix. Before doing so, I wanted to check whether there are any edge cases or scenarios this approach might not handle properly.
Thanks for reviewing!
@skhiani Feel free to raise the PR for the fix
Hi @skhiani , have you explored setting cluster.remote_store.translog.max_readers value to a lower number than the default value of 100 ? This would flush the shard implicitly , decreasing the cold start recovery time.
> Hi @skhiani, have you explored setting `cluster.remote_store.translog.max_readers` to a lower number than the default value of 100? This would flush the shard implicitly, decreasing the cold start recovery time.
Hi @gbbafna,
Thank you for the suggestion!
I did evaluate using `cluster.remote_store.translog.max_readers` as a potential solution (the current default is 1000), but it doesn't adequately address the issue. Lowering max_readers limits the maximum stale generation gap to ~100 but doesn't eliminate the root cause.
Current Setting Constraints
Looking at the current code:
- Default: `1000` translog readers
- Minimum: `100` (enforced by `MIN_CLUSTER_REMOTE_MAX_TRANSLOG_READERS`)
- Can be disabled: `-1`
The minimum value of 100 was introduced in PR #14027 to allow disabling reader-based flushing (via -1) while preventing overly aggressive flush behavior that could impact performance.
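For completeness, this is roughly how the setting can be lowered from test or tooling code (the `client` variable is assumed to be a node/transport client available in context; the same change can be made via the `_cluster/settings` REST API):

```java
import org.opensearch.action.admin.cluster.settings.ClusterUpdateSettingsRequest;
import org.opensearch.common.settings.Settings;

// Lower the reader threshold so a flush is forced once a shard accumulates this
// many translog generations.
ClusterUpdateSettingsRequest request = new ClusterUpdateSettingsRequest();
request.persistentSettings(
    Settings.builder().put("cluster.remote_store.translog.max_readers", 100)
);
client.admin().cluster().updateSettings(request).actionGet();
```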
Why This Doesn't Adequately Address the Issue
With default (max_readers = 1000):
- No reader-based flushes triggered, but other flush conditions (translog size, periodic timers) cause intermediate flushes naturally
- After an idle flush / close flush, a gap exists between what the remote metadata references and what actually exists locally
- Result: Remote metadata can reference 100+ stale translog generations that don't exist locally
With minimum (max_readers = 100):
- More frequent flushes due to reader count threshold (in addition to other triggers)
- More intermediate flushes mean less time between the last flush and idle flush
- The stale generation gap is limited to approximately 100 generations maximum by the setting
- Result: Still downloading far more translog generations than necessary during cold start recovery
Thanks @skhiani for trying this out. What about reducing this value to a lower value (say 10) for warm indices, would that help?
I am not entirely sure about adding hooks for the inactive shard flush and would prefer to explore alternatives, if any. Inactive and active shard flush should differ only on a trigger basis and not on a functional basis.
> Thanks @skhiani for trying this out. What about reducing this value to a lower value (say 10) for warm indices, would that help?
@gbbafna, I tested reducing `cluster.remote_store.translog.max_readers=10` (also removing the `MIN_CLUSTER_REMOTE_MAX_TRANSLOG_READERS=100` limit), but unfortunately this doesn't solve the stale metadata issue.
More Flushes Don't Improve Cleanup
Reducing max_readers triggers more frequent flushes, but cleanup is constrained by when minSeqNoToKeep can advance.
1. A flush is triggered when `readers.size` reaches the threshold
2. During the flush, ongoing writes continue creating new translog generations
3. The segment commit completes → `minSeqNoToKeep` advances
4. `trimUnreferencedReaders()` can finally delete old generations based on the new `minSeqNoToKeep`
5. But by this time, `readers.size` has already grown beyond the threshold
Why cleanup is limited:
minSeqNoToKeep advances after segment commit completes. This creates an unavoidable window (steps 2-4) where:
- New generations accumulate from ongoing writes
- Old generations cannot yet be deleted (waiting for `minSeqNoToKeep` to advance)
- Reader count grows faster than cleanup can occur
> I am not entirely sure about adding hooks for the inactive shard flush and would prefer to explore alternatives, if any. Inactive and active shard flush should differ only on a trigger basis and not on a functional basis.
I understand the concern about uniform flush behavior. However, as I mentioned above, this issue is specific to warm/idle indices - hot indices with active indexing naturally correct the staleness through subsequent ensureSynced() calls, while warm indices have no correction mechanism once indexing stops.
Alternative Approach: Update Metadata After Trim
An alternative would be to upload fresh metadata after trimUnreferencedReaders() completes.
How it works:
- After each trim advances `minRemoteGenReferenced`, upload fresh metadata reflecting the new state
- This would work uniformly for all scenarios (active indexing, idle, close)
- Even during active indexing, each trim would reduce the gap incrementally
Implementation Challenge:
Metadata upload is currently tightly coupled with rollTranslogGeneration(). Would need to either:
- Add `TranslogTransferManager.uploadMetadataOnly()` (new API), or
- Call `rollGeneration()` after trim (creates empty generations)
Note: I haven't fully explored this approach yet and there may be side effects to consider.
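To make the alternative concrete, here is a rough sketch of how it might sit inside `RemoteFsTranslog.trimUnreferencedReaders()`. Note that `uploadMetadataOnly()` is the hypothetical new `TranslogTransferManager` API mentioned above (it does not exist today), and the field/variable names are simplified.

```java
// Hypothetical variant: after local cleanup, push a metadata-only update so the
// remote minimum referenced generation tracks the local one.
public void trimUnreferencedReaders() throws IOException {
    trimUnreferencedReaders(false);

    long minLocalGen = getMinFileGeneration();
    if (minRemoteGenReferenced + 1 < minLocalGen) {
        // uploadMetadataOnly() is the proposed (non-existent) API discussed above:
        // it would upload fresh TranslogTransferMetadata without re-uploading any
        // translog/checkpoint files.
        translogTransferManager.uploadMetadataOnly(primaryTermSupplier.getAsLong(), currentFileGeneration(), minLocalGen);
    }
}
```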
My Recommendation
The idle/close flush approach directly solves the problem - it targets the exact scenario where the issue manifests (warm indices) using well-tested existing infrastructure.
The alternative approach could maintain uniform flush behavior and progressively reduce the number of stale translog generations in remote metadata (by updating metadata after each trim), though it would still have the node-crash limitation (stale metadata if a crash occurs before the post-trim metadata upload) and would require new APIs with potential side effects.