[BUG] Remote-backed warm indices download excessive translog generations during index cold start recovery
Describe the bug
Description
Remote-backed warm indices with segment replication experience significantly slower index cold start recovery, downloading more translog generations than necessary for recovery from remote storage.
Environment
- OpenSearch Version: 3.2
- Configuration:
- Remote store: Enabled
- Segment replication: Enabled
- Composite directory: Enabled
- `index.translog.durability`: request
- `index.translog.flush_threshold_period`: 5m (flush on idle)
Related component
Storage:Remote
To Reproduce
1. Setup:
   - Create a remote-backed index with segment replication enabled (see the settings sketch after these steps)
   - Configure request-level durability (`index.translog.durability = "request"`)
2. Index Data:
   - Index ~100 GB of data with request-level durability
   - This generates hundreds of translog generations (one per request with request-level durability)
3. Stop Indexing:
   - Stop all indexing operations
   - Index enters "warm" state (idle, no active writes)
   - Wait for the idle flush to occur (configured at 5 minutes in the test, though timing may vary)
   - Verify the idle flush completed: check that the local translog directory contains only the current (empty) translog generation
     - All historical translog files should be cleaned up locally
     - Only one active translog generation remains (with no operations)
4. Simulate Cold Start Recovery:
   - Stop the OpenSearch process on all nodes
   - Delete all local data directories on all nodes
   - Restart OpenSearch on all nodes
   - Cluster reforms and begins cold start recovery from the remote store
     - Remote store contains: Segments + Translog + Metadata
     - Nodes have: No local data (to simulate index cold start from remote)
5. Observe Recovery:
   - Monitor primary shard recovery (downloads segments + translog from the remote store)
   - Monitor the translog download specifically
   - Note: the translog download issue only affects primary shards
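For reference, a minimal sketch of the index settings used in the Setup step, written with the OpenSearch `Settings` builder. This is illustrative only: the shard/replica counts are assumptions, and the remote-store repository wiring (normally supplied by cluster-level remote store configuration) is omitted.

```java
import org.opensearch.common.settings.Settings;

// Illustrative settings for the reproduction; shard/replica counts are assumptions.
Settings warmIndexSettings = Settings.builder()
    .put("index.number_of_shards", 1)
    .put("index.number_of_replicas", 1)
    // Segment replication, as in the reported environment.
    .put("index.replication.type", "SEGMENT")
    // Request-level durability: every request syncs the translog, which is what
    // produces the hundreds of translog generations described above.
    .put("index.translog.durability", "request")
    .build();
```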
Expected behavior
After idle flush completes (before cluster restart):
- All pending indexing operations should be flushed to Lucene segments
- Local translog directory should be cleaned up:
- Historical translog files deleted locally
- Only current empty translog generation remains with 0 pending operations
- This indicates all operations are now in segments, no translog replay needed
During primary shard cold start recovery:
- Download Lucene segments from remote store
- Segments contain all indexed operations
- Download only the current translog generation from remote store
- Expected: Download only the empty/current generation
- Rationale: All operations already in segments, no replay needed
- Or at most: Download last 1-2 generations if any operations not yet in segments
- Most time spent downloading segments (necessary)
- Minimal time on translog (1-2 generations only)
Additional Details
Plugins: repository-azure plugin
Host/Environment:
- OS: Windows
- Version: 11
Additional context
Actual Behavior Observed
During primary shard cold start recovery:
- Download Lucene segments from remote store
- Download multiple old translog generations (incorrect)
- Example: Downloads ~100 generations when only 1 is needed
- Downloads old generations even though:
- These files don't exist locally (were cleaned up)
- All operations already in segments (no replay needed)
Key observation:
- Local cleanup worked correctly (files deleted)
- Segments contain all operations (flush worked correctly)
- But remote translog transfer metadata still references older generation (stale metadata)
- Recovery trusts the remote metadata and as a result downloads unnecessary translog generations
For actively indexing (hot) indices, this issue doesn't manifest: every indexing request calls ensureSynced(), which uploads fresh TranslogTransferMetadata and automatically corrects any staleness introduced when metadata is uploaded in rollTranslogGeneration() before trimUnreferencedReaders() executes. For warm/idle indices, once indexing stops there are no subsequent ensureSynced() calls to update the remote metadata after trimUnreferencedReaders() removes old generations during the idle flush, so the staleness persists indefinitely.
@sachinpkale @andrross
FYI @ankitkala
@ankitkala
I have root-caused the issue we are observing to the sequence in which flush operations perform translog cleanup and remote metadata upload.
The Core Issue
The current implementation performs operations in this order:
1. `InternalEngine.flush()` calls `translogManager.rollTranslogGeneration()`
   - Which internally calls `translog.rollGeneration()`
   - The `prepareAndUpload()` method creates a `TranslogCheckpointTransferSnapshot`, which captures the current `minTranslogGeneration` at the time of upload - before subsequent cleanup operations advance it
   - Then calls `translog.trimUnreferencedReaders()` - this performs local cleanup but happens after the metadata upload
2. `InternalEngine.flush()` then calls `translogManager.trimUnreferencedReaders()`
   - This performs additional cleanup and advances `minTranslogGeneration`
   - But no metadata upload occurs at this point
Result: Remote metadata contains minTranslogGeneration from before the final cleanup, making it stale.
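To make the ordering concrete, here is a condensed sketch of the flush path described above. It is not the actual `InternalEngine` code; the bodies are collapsed into comments covering only the steps relevant to this issue, and the `TranslogManager` parameter is just a way to keep the sketch self-contained.

```java
import java.io.IOException;
import org.opensearch.index.translog.TranslogManager;

// Condensed view of the flush ordering described above (not the real implementation).
void flushSketch(TranslogManager translogManager) throws IOException {
    // Step 1: roll the translog generation.
    //   rollGeneration() reaches prepareAndUpload(), which builds a
    //   TranslogCheckpointTransferSnapshot and uploads TranslogTransferMetadata
    //   whose minTranslogGeneration is captured at this moment.
    //   rollGeneration() then calls translog.trimUnreferencedReaders(), but the
    //   metadata has already been uploaded by then.
    translogManager.rollTranslogGeneration();

    // ... the Lucene commit happens in between, allowing minSeqNoToKeep to advance ...

    // Step 2: trim again after the commit.
    //   This deletes the now-unreferenced local generations and advances
    //   minTranslogGeneration, but no metadata upload happens here.
    translogManager.trimUnreferencedReaders();

    // Net effect: the remote metadata still records the minTranslogGeneration
    // from step 1 until some later upload replaces it.
}
```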
Why This Only Affects Warm/Idle Indices
For actively indexing indices:
- Continuous indexing operations call `ensureSynced()`
- Each `ensureSynced()` uploads fresh metadata with the current `minTranslogGeneration`
- This overwrites the stale metadata from the previous flush

For warm/idle/closed indices:
- Indexing stops
- Idle flush triggers or the index is closed
- Metadata is uploaded with the `minTranslogGeneration` from before the final cleanup
- Cleanup advances `minTranslogGeneration` significantly
- No subsequent uploads correct the metadata
- Stale metadata persists indefinitely in the remote store
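A rough sketch of the contrast, using the method names referenced in this thread (the surrounding wiring is simplified and the helper below is only illustrative):

```java
import java.io.IOException;
import org.opensearch.index.translog.Translog;

// Hot index: every indexing request ends by syncing the translog; ensureSynced()
// re-uploads metadata reflecting the current minTranslogGeneration, wiping out
// the staleness left behind by the previous flush.
void afterIndexingRequest(Translog translog, Translog.Location location) throws IOException {
    translog.ensureSynced(location); // fresh TranslogTransferMetadata uploaded here
}

// Warm index: after the final (idle/close) flush there are no further
// ensureSynced() calls, so the metadata uploaded during rollTranslogGeneration()
// remains the last one in the remote store - permanently stale.
```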
Code References
1. InternalEngine.flush() - calls rollTranslogGeneration() and then trimUnreferencedReaders()
2. RemoteFsTranslog.upload() - where the translog snapshot upload happens
Thanks for the analysis @skhiani. @gbbafna syncing metadata post translog cleanup should fix this right? Do you see any concerns?
@ankitkala @gbbafna
Update on this
I've been able to resolve this issue by signaling RemoteFsTranslog when idle flush or shard close flush is happening, and then triggering an additional metadata upload during trimUnreferencedReaders().
Verification Results
I've verified that:
- Metadata upload on idle/close flush: the additional metadata upload is happening correctly during `onIdleFlush()` and shard-close flush operations
- Cold start recovery improvement: during index cold start recovery, only the latest required translog file is now being downloaded
- Performance impact: for my setup, this has significantly reduced cold start latencies, bringing time spent in translog download down from multiple seconds to just a few milliseconds
Approach
1. Flush state tracking - track when an idle flush or shard close flush is in progress
2. Target generation tracking - track the minimum generation that should be reflected in remote metadata
3. Conditional metadata upload - trigger an additional metadata upload during `trimUnreferencedReaders()` when flush operations detect stale remote metadata
Code Changes in RemoteFsTranslog
1. New state tracking fields
+ private final AtomicBoolean inIdleFlush = new AtomicBoolean(false);
+ private final AtomicBoolean inShardCloseFlush = new AtomicBoolean(false);
+ private final AtomicLong targetMetadataMinGeneration = new AtomicLong(-1);
2. Update syncNeeded() to detect pending metadata updates
public boolean syncNeeded() {
try (ReleasableLock lock = readLock.acquire()) {
return current.syncNeeded()
|| (maxRemoteTranslogGenerationUploaded + 1 < this.currentFileGeneration() && current.totalOperations() == 0)
|| (current.getLastSyncedCheckpoint().globalCheckpoint > globalCheckpointSynced)
+ || isTranslogRemoteMetadataUpdatePending();
}
}
3. Modified trimUnreferencedReaders() to trigger rollGeneration and thereby metadata upload
public void trimUnreferencedReaders() throws IOException {
trimUnreferencedReaders(false);
+ if (isTranslogRemoteMetadataUpdateRequired()) {
+ long minLocalReferencedTranslogGeneration = getMinFileGeneration();
+
+ if (targetMetadataMinGeneration.compareAndSet(-1, minLocalReferencedTranslogGeneration)) {
+ rollGeneration();
+ }
+ }
}
4. Updated onUploadComplete() to reset target generation
public void onUploadComplete(TransferSnapshot transferSnapshot) throws IOException {
...
...
logger.debug(
"Successfully uploaded translog for primary term = {}, generation = {}, maxSeqNo = {}, minRemoteGenReferenced = {}",
primaryTerm,
generation,
maxSeqNo,
minRemoteGenReferenced
);
+ long targetRemoteReferencedGen = targetMetadataMinGeneration.get();
+ if (targetRemoteReferencedGen != -1) {
+ logger.debug("onUploadComplete: minRemoteGenReferenced {} targetMetadataMinGeneration {}",
+ minRemoteGenReferenced, targetRemoteReferencedGen);
+
+ if (minRemoteGenReferenced >= targetRemoteReferencedGen) {
+ logger.debug("targetMetadataMinGeneration being set to -1");
+ targetMetadataMinGeneration.set(-1);
+ }
+ }
}
5. Flush lifecycle hooks
@Override
public void beforeIdleFlush() {
logger.debug("Triggered RemotefsTranslog beforeIdleFlush");
inIdleFlush.set(true);
}
@Override
public void afterIdleFlush() {
logger.debug("Triggered RemotefsTranslog afterIdleFlush");
inIdleFlush.set(false);
}
@Override
public void beforeShardCloseFlush() {
logger.debug("Triggered RemotefsTranslog beforeShardCloseFlush");
inShardCloseFlush.set(true);
}
@Override
public void afterShardCloseFlush() {
logger.debug("Triggered RemotefsTranslog afterShardCloseFlush");
inShardCloseFlush.set(false);
}
6. Helper methods
private boolean isTranslogRemoteMetadataUpdatePending() {
if (inIdleFlush.get() || inShardCloseFlush.get()) {
return targetMetadataMinGeneration.get() != -1;
}
return false;
}
private boolean isTranslogRemoteMetadataUpdateRequired() {
if (inIdleFlush.get() || inShardCloseFlush.get()) {
long minLocalReferencedTranslogGeneration = getMinFileGeneration();
logger.debug("isTranslogRemoteMetadataUpdateRequired: (minRemoteGenReferenced+1) {}," +
"minLocalReferencedTranslogGeneration {}",
minRemoteGenReferenced+1, minLocalReferencedTranslogGeneration);
return (minRemoteGenReferenced + 1 < minLocalReferencedTranslogGeneration);
}
return false;
}
How It Works
- Flush Detection: `beforeIdleFlush()` and `beforeShardCloseFlush()` set flags when flush operations begin
- Staleness Check: during `trimUnreferencedReaders()`, check if remote metadata is stale (references older generations than the local minimum)
- Trigger Upload: if stale, set `targetMetadataMinGeneration` and call `rollGeneration()` to trigger a metadata upload
- Completion: `onUploadComplete()` verifies the target generation was reached and resets the flag
Request for Feedback
I'd like to raise a formal PR for this fix. Before doing so, I wanted to check whether there are any edge cases or scenarios this approach might not handle properly.
Thanks for reviewing!
@skhiani Feel free to raise the PR for the fix
Hi @skhiani , have you explored setting cluster.remote_store.translog.max_readers value to a lower number than the default value of 100 ? This would flush the shard implicitly , decreasing the cold start recovery time.
> Hi @skhiani, have you explored setting `cluster.remote_store.translog.max_readers` to a lower number than the default value of 100? This would flush the shard implicitly, decreasing the cold start recovery time.
Hi @gbbafna,
Thank you for the suggestion!
I did evaluate using `cluster.remote_store.translog.max_readers` as a potential solution (the current default is 1000), but it doesn't adequately address the issue. Lowering max_readers limits the maximum stale generation gap to ~100 but doesn't eliminate the root cause.
Current Setting Constraints
Looking at the current code:
- Default: `1000` translog readers
- Minimum: `100` (enforced by `MIN_CLUSTER_REMOTE_MAX_TRANSLOG_READERS`)
- Can be disabled: `-1`
The minimum value of 100 was introduced in PR #14027 to allow disabling reader-based flushing (via -1) while preventing overly aggressive flush behavior that could impact performance.
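For completeness, this is roughly how the setting can be lowered from test or tooling code (the `client` variable is assumed to be a node/transport client available in context; the same change can be made via the `_cluster/settings` REST API):

```java
import org.opensearch.action.admin.cluster.settings.ClusterUpdateSettingsRequest;
import org.opensearch.common.settings.Settings;

// Lower the reader threshold so a flush is forced once a shard accumulates this
// many translog generations.
ClusterUpdateSettingsRequest request = new ClusterUpdateSettingsRequest();
request.persistentSettings(
    Settings.builder().put("cluster.remote_store.translog.max_readers", 100)
);
client.admin().cluster().updateSettings(request).actionGet();
```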
Why This Doesn't Adequately Address the Issue
With default (max_readers = 1000):
- No reader-based flushes triggered, but other flush conditions (translog size, periodic timers) cause intermediate flushes naturally
- After an idle flush / close flush, a gap exists between what the remote metadata references and what actually exists locally
- Result: Remote metadata can reference 100+ stale translog generations that don't exist locally
With minimum (max_readers = 100):
- More frequent flushes due to reader count threshold (in addition to other triggers)
- More intermediate flushes mean less time between the last flush and idle flush
- The stale generation gap is limited to approximately 100 generations maximum by the setting
- Result: Still downloading far more translog generations than necessary during cold start recovery
Thanks @skhiani for trying this out. What about reducing this value to a lower value (say 10) for warm indices, would that help?
I am not entirely sure about adding hooks for the inactive shard flush and would prefer to explore alternatives, if any. Inactive and active shard flush should differ only on a trigger basis and not on a functional basis.
> Thanks @skhiani for trying this out. What about reducing this value to a lower value (say 10) for warm indices, would that help?
@gbbafna, I tested reducing `cluster.remote_store.translog.max_readers=10` (also removing the `MIN_CLUSTER_REMOTE_MAX_TRANSLOG_READERS=100` limit), but unfortunately this doesn't solve the stale metadata issue.
More Flushes Don't Improve Cleanup
Reducing max_readers triggers more frequent flushes, but cleanup is constrained by when minSeqNoToKeep can advance.
1. A flush is triggered when `readers.size` reaches the threshold
2. During the flush, ongoing writes continue creating new translog generations
3. The segment commit completes → `minSeqNoToKeep` advances
4. `trimUnreferencedReaders()` can finally delete old generations based on the new `minSeqNoToKeep`
5. But by this time, `readers.size` has already grown beyond the threshold
Why cleanup is limited:
minSeqNoToKeep advances after segment commit completes. This creates an unavoidable window (steps 2-4) where:
- New generations accumulate from ongoing writes
- Old generations cannot yet be deleted (waiting for `minSeqNoToKeep` to advance)
- Reader count grows faster than cleanup can occur
> I am not entirely sure about adding hooks for the inactive shard flush and would prefer to explore alternatives, if any. Inactive and active shard flush should differ only on a trigger basis and not on a functional basis.
I understand the concern about uniform flush behavior. However, as I mentioned above, this issue is specific to warm/idle indices - hot indices with active indexing naturally correct the staleness through subsequent ensureSynced() calls, while warm indices have no correction mechanism once indexing stops.
Alternative Approach: Update Metadata After Trim
An alternative would be to upload fresh metadata after trimUnreferencedReaders() completes.
How it works:
- After each trim advances `minRemoteGenReferenced`, upload fresh metadata reflecting the new state
- This would work uniformly for all scenarios (active indexing, idle, close)
- Even during active indexing, each trim would reduce the gap incrementally
Implementation Challenge:
Metadata upload is currently tightly coupled with rollTranslogGeneration(). Would need to either:
- Add `TranslogTransferManager.uploadMetadataOnly()` (new API), or
- Call `rollGeneration()` after trim (creates empty generations)
Note: I haven't fully explored this approach yet and there may be side effects to consider.
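To make the alternative concrete, here is a rough sketch of how it might sit inside `RemoteFsTranslog.trimUnreferencedReaders()`. Note that `uploadMetadataOnly()` is the hypothetical new `TranslogTransferManager` API mentioned above (it does not exist today), and the field/variable names are simplified.

```java
// Hypothetical variant: after local cleanup, push a metadata-only update so the
// remote minimum referenced generation tracks the local one.
public void trimUnreferencedReaders() throws IOException {
    trimUnreferencedReaders(false);

    long minLocalGen = getMinFileGeneration();
    if (minRemoteGenReferenced + 1 < minLocalGen) {
        // uploadMetadataOnly() is the proposed (non-existent) API discussed above:
        // it would upload fresh TranslogTransferMetadata without re-uploading any
        // translog/checkpoint files.
        translogTransferManager.uploadMetadataOnly(primaryTermSupplier.getAsLong(), currentFileGeneration(), minLocalGen);
    }
}
```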
My Recommendation
The idle/close flush approach directly solves the problem - it targets the exact scenario where the issue manifests (warm indices) using well-tested existing infrastructure.
The alternative approach could maintain uniform flush behavior and progressively reduce the number of stale translog generations in remote metadata (by updating metadata after each trim), though it would still have the node-crash limitation (stale metadata if a crash occurs before the post-trim metadata upload) and would require new APIs with potential side effects.