[BUG] Remote-backed warm indices download excessive translog generations during index cold start recovery

skhiani opened this issue 3 weeks ago • 4 comments

Describe the bug

Description

Remote-backed warm indices with segment replication experience significantly slower index cold start recovery, downloading more translog generations than necessary for recovery from remote storage.

Environment

  • OpenSearch Version: 3.2
  • Configuration:
    • Remote store: Enabled
    • Segment replication: Enabled
    • Composite directory: Enabled
    • index.translog.durability: request
    • index.translog.flush_threshold_period: 5m (flush on idle)
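
For reference, here is a minimal sketch of the translog-related settings above using the OpenSearch Settings builder; the setting keys are copied verbatim from this report, and the remote store, segment replication, and composite directory setup (which are cluster/node-level) are intentionally not shown:

import org.opensearch.common.settings.Settings;

final class WarmIndexTranslogSettings {
    // Sketch only: the translog settings listed above, with keys copied from this report.
    // Remote store, segment replication, and composite directory configuration are
    // assumed to be set up separately and are not shown here.
    static Settings translogSettings() {
        return Settings.builder()
            .put("index.translog.durability", "request")         // sync translog on every request
            .put("index.translog.flush_threshold_period", "5m")  // idle flush after 5 minutes
            .build();
    }
}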

Related component

Storage:Remote

To Reproduce

  1. Setup:

    • Create remote-backed index with segment replication enabled
    • Configure request-level durability (index.translog.durability = "request")
  2. Index Data:

    • Index ~100 GB of data with request-level durability
    • This generates hundreds of translog generations (one per request with request-level durability)
  3. Stop Indexing:

    • Stop all indexing operations
    • Index enters "warm" state (idle, no active writes)
    • Wait for idle flush to occur (configured at 5 minutes in test, though timing may vary)
    • Verify idle flush completed: Check that local translog directory contains only the current (empty) translog generation
      • All historical translog files should be cleaned up locally
      • Only one active translog generation remains (with no operations)
  4. Simulate Cold Start Recovery:

    • Stop OpenSearch process on all nodes
    • Delete all local data directories on all nodes
    • Restart OpenSearch on all nodes
    • Cluster reforms and begins cold start recovery from remote store
      • Remote store contains: Segments + Translog + Metadata
      • Nodes have: No local data (to simulate index cold start from remote)
  5. Observe Recovery:

    • Monitor primary shard recovery (downloads segments + translog from remote store)
    • Monitor translog download specifically
    • Note: Translog download issue only affects primary shards

Expected behavior

After idle flush completes (before cluster restart):

  • All pending indexing operations should be flushed to Lucene segments
  • Local translog directory should be cleaned up:
    • Historical translog files deleted locally
    • Only current empty translog generation remains with 0 pending operations
  • This indicates all operations are now in segments, no translog replay needed

During primary shard cold start recovery:

  • Download Lucene segments from remote store
    • Segments contain all indexed operations
  • Download only the current translog generation from remote store
    • Expected: Download only the empty/current generation
    • Rationale: All operations already in segments, no replay needed
    • Or at most: Download last 1-2 generations if any operations not yet in segments
    • Most time spent downloading segments (necessary)
    • Minimal time on translog (1-2 generations only)

Additional Details

Plugins: repository-azure plugin

Host/Environment:

  • OS: Windows
  • Version: 11

Additional context

Actual Behavior Observed

During primary shard cold start recovery:

  • Download Lucene segments from remote store
  • Download multiple old translog generations (incorrect)
    • Example: Downloads ~100 generations when only 1 is needed
    • Downloads old generations even though:
      • These files don't exist locally (were cleaned up)
      • All operations already in segments (no replay needed)

Key observation:

  • Local cleanup worked correctly (files deleted)
  • Segments contain all operations (flush worked correctly)
  • But remote translog transfer metadata still references older generation (stale metadata)
  • Recovery trusts the remote metadata and as a result downloads unnecessary translog generations

For actively indexing (hot) indices, this issue doesn't manifest: every indexing request calls ensureSynced(), which uploads fresh TranslogTransferMetadata and automatically corrects any staleness introduced when metadata is uploaded in rollTranslogGeneration() before trimUnreferencedReaders() executes. For warm/idle indices, once indexing stops there are no subsequent ensureSynced() calls to update the remote metadata after trimUnreferencedReaders() removes old generations during the idle flush, so the staleness persists indefinitely.

skhiani avatar Dec 08 '25 18:12 skhiani

@sachinpkale @andrross

skhiani avatar Dec 08 '25 18:12 skhiani

FYI @ankitkala

andrross avatar Dec 08 '25 21:12 andrross

FYI @ankitkala

@ankitkala

I have root-caused the issue we are observing: it comes down to the order in which flush operations perform translog cleanup and remote metadata upload.

The Core Issue

The current implementation performs operations in this order:

  1. InternalEngine.flush() calls translogManager.rollTranslogGeneration()

    • Which internally calls translog.rollGeneration()

      • The prepareAndUpload() method creates a TranslogCheckpointTransferSnapshot, which captures the current minTranslogGeneration at the time of upload, before subsequent cleanup operations advance it.
    • Then calls translog.trimUnreferencedReaders()

      • This performs local cleanup but happens after metadata upload
  2. InternalEngine.flush() then calls translogManager.trimUnreferencedReaders()

    • This performs additional cleanup and advances minTranslogGeneration
    • But no metadata upload occurs at this point

Result: Remote metadata contains minTranslogGeneration from before the final cleanup, making it stale.
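
To make the ordering concrete, here is a condensed sketch of the flow described above. This is illustrative only (the stand-in interface below is not OpenSearch's real TranslogManager API); see the code references below for the actual methods.

// Condensed ordering sketch; the interface is a stand-in for illustration only.
interface TranslogOps {
    void rollTranslogGeneration();   // uploads TranslogCheckpointTransferSnapshot + TranslogTransferMetadata
    void trimUnreferencedReaders();  // local cleanup, no metadata upload
}

final class FlushOrderingSketch {
    private final TranslogOps translog;

    FlushOrderingSketch(TranslogOps translog) {
        this.translog = translog;
    }

    void flush() {
        // Step 1: roll the generation. The metadata upload inside here captures
        // minTranslogGeneration *before* the cleanup below advances it.
        translog.rollTranslogGeneration();

        // ... Lucene commit happens as part of the flush, letting minSeqNoToKeep advance ...

        // Step 2: trim after the commit. Old generations are deleted locally and
        // minTranslogGeneration advances, but no metadata upload occurs here, so the
        // remote metadata still references the pre-cleanup generation (stale).
        translog.trimUnreferencedReaders();
    }
}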

Why This Only Affects Warm/Idle Indices

For actively indexing indices:

  • Continuous indexing operations call ensureSynced()
  • Each ensureSynced() uploads fresh metadata with the current minTranslogGeneration
  • This overwrites the stale metadata from the previous flush

For warm/idle/closed indices:

  • Indexing stops
  • Idle flush triggers or index is closed
    • Metadata uploaded with minTranslogGeneration before final cleanup
    • Cleanup advances minTranslogGeneration significantly
    • No subsequent uploads to correct the metadata
  • Stale metadata persists indefinitely in remote store

Code References

1. InternalEngine.flush() - calls rollTranslogGeneration() and then trimUnreferencedReaders()

2. InternalTranslogManager.rollTranslogGeneration() - calls translog.rollGeneration() and then translog.trimUnreferencedReaders()

3. RemoteFsTranslog.upload() - where translog snapshot upload happens

skhiani avatar Dec 09 '25 19:12 skhiani

Thanks for the analysis @skhiani. @gbbafna syncing metadata post translog cleanup should fix this right? Do you see any concerns?

ankitkala avatar Dec 10 '25 03:12 ankitkala

Thanks for the analysis @skhiani. @gbbafna syncing metadata post translog cleanup should fix this right? Do you see any concerns?

@ankitkala @gbbafna

Update on this

I've been able to resolve this issue by signaling RemoteFsTranslog when idle flush or shard close flush is happening, and then triggering an additional metadata upload during trimUnreferencedReaders().

Verification Results

I've verified that:

  • Metadata upload on idle/close flush: The additional metadata upload is happening correctly during onIdleFlush() and shard close flush() operations
  • Cold start recovery improvement: During index cold start recovery, only the latest required translog file is now being downloaded
  • Performance impact: For my setup, I see this has significantly reduced cold start latencies, bringing down time spent in translog download from multiple seconds to just a few milliseconds

Approach

  1. Flush state tracking - Track when idle flush or shard close flush is in progress
  2. Target generation tracking - Track the minimum generation that should be reflected in remote metadata
  3. Conditional metadata upload - Trigger an additional metadata upload during trimUnreferencedReaders() when flush operations detect stale remote metadata

Code Changes in RemoteFsTranslog

1. New state tracking fields
+ private final AtomicBoolean inIdleFlush = new AtomicBoolean(false);
+ private final AtomicBoolean inShardCloseFlush = new AtomicBoolean(false);
+ private final AtomicLong targetMetadataMinGeneration = new AtomicLong(-1);
2. Update syncNeeded() to detect pending metadata updates
public boolean syncNeeded() {
    try (ReleasableLock lock = readLock.acquire()) {
        return current.syncNeeded()
            || (maxRemoteTranslogGenerationUploaded + 1 < this.currentFileGeneration() && current.totalOperations() == 0)
            || (current.getLastSyncedCheckpoint().globalCheckpoint > globalCheckpointSynced)
+            || isTranslogRemoteMetadataUpdatePending();
    }
}
3. Modified trimUnreferencedReaders() to trigger rollGeneration and thereby metadata upload
public void trimUnreferencedReaders() throws IOException {
    trimUnreferencedReaders(false);

+    if (isTranslogRemoteMetadataUpdateRequired()) {
+        long minLocalReferencedTranslogGeneration = getMinFileGeneration();
+
+        if (targetMetadataMinGeneration.compareAndSet(-1, minLocalReferencedTranslogGeneration)) {
+            rollGeneration();
+        }
+    }
}
4. Updated onUploadComplete() to reset target generation
public void onUploadComplete(TransferSnapshot transferSnapshot) throws IOException {
  ...
  ...
  
  logger.debug(
      "Successfully uploaded translog for primary term = {}, generation = {}, maxSeqNo = {}, minRemoteGenReferenced = {}",
      primaryTerm,
      generation,
      maxSeqNo,
      minRemoteGenReferenced
  );

+    long targetRemoteReferencedGen = targetMetadataMinGeneration.get();
+    if (targetRemoteReferencedGen != -1) {
+        logger.debug("onUploadComplete: minRemoteGenReferenced {} targetMetadataMinGeneration {}",
+                    minRemoteGenReferenced, targetRemoteReferencedGen);
+
+        if (minRemoteGenReferenced >= targetRemoteReferencedGen) {
+            logger.debug("targetMetadataMinGeneration being set to -1");
+            targetMetadataMinGeneration.set(-1);
+        }
+    }
}
5. Flush lifecycle hooks
@Override
public void beforeIdleFlush() {
    logger.debug("Triggered RemotefsTranslog beforeIdleFlush");
    inIdleFlush.set(true);
}

@Override
public void afterIdleFlush() {
    logger.debug("Triggered RemotefsTranslog afterIdleFlush");
    inIdleFlush.set(false);
}

@Override
public void beforeShardCloseFlush() {
    logger.debug("Triggered RemotefsTranslog beforeShardCloseFlush");
    inShardCloseFlush.set(true);
}

@Override
public void afterShardCloseFlush() {
    logger.debug("Triggered RemotefsTranslog afterShardCloseFlush");
    inShardCloseFlush.set(false);
}
6. Helper methods
private boolean isTranslogRemoteMetadataUpdatePending() {
    if (inIdleFlush.get() || inShardCloseFlush.get()) {
        return targetMetadataMinGeneration.get() != -1;
    }
    return false;
}

private boolean isTranslogRemoteMetadataUpdateRequired() {
    if (inIdleFlush.get() || inShardCloseFlush.get()) {
        long minLocalReferencedTranslogGeneration = getMinFileGeneration();

        logger.debug("isTranslogRemoteMetadataUpdateRequired: (minRemoteGenReferenced+1) {}," +
                     "minLocalReferencedTranslogGeneration {}",
                    minRemoteGenReferenced+1, minLocalReferencedTranslogGeneration);
        return (minRemoteGenReferenced + 1 < minLocalReferencedTranslogGeneration);
    }
    return false;
}

How It Works

  1. Flush Detection: beforeIdleFlush() and beforeShardCloseFlush() set flags when flush operations begin
  2. Staleness Check: During trimUnreferencedReaders(), check if remote metadata is stale (references older generations than local minimum)
  3. Trigger Upload: If stale, set targetMetadataMinGeneration and call rollGeneration() to trigger metadata upload
  4. Completion: onUploadComplete() verifies the target generation was reached and resets the flag
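
For context, here is a rough sketch of the caller side that these hooks assume. The thread does not show where beforeIdleFlush()/afterIdleFlush() are invoked from, so the wiring below is an assumption for illustration, not part of the posted change.

import org.opensearch.index.translog.RemoteFsTranslog;

// Hypothetical wiring sketch for the lifecycle hooks above; the actual call sites in
// the proposed change are not shown in this thread, so this wiring is assumed.
final class IdleFlushWiringSketch {
    static void runIdleFlush(RemoteFsTranslog translog, Runnable doFlush) {
        translog.beforeIdleFlush();    // mark the idle-flush window
        try {
            doFlush.run();             // regular flush path: roll generation, commit, trim
        } finally {
            translog.afterIdleFlush(); // clear the flag even if the flush fails
        }
    }
}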

Request for Feedback

I'd like to raise a formal PR for this fix. Before doing so, I wanted to ask whether there are any edge cases or scenarios this approach might not handle properly.

Thanks for reviewing!

skhiani avatar Dec 11 '25 12:12 skhiani

@skhiani Feel free to raise the PR for fix

ankitkala avatar Dec 13 '25 06:12 ankitkala

Hi @skhiani, have you explored setting the cluster.remote_store.translog.max_readers value to a lower number than the default value of 100? This would flush the shard implicitly, decreasing the cold start recovery time.

gbbafna avatar Dec 17 '25 11:12 gbbafna

Hi @skhiani, have you explored setting the cluster.remote_store.translog.max_readers value to a lower number than the default value of 100? This would flush the shard implicitly, decreasing the cold start recovery time.

Hi @gbbafna,

Thank you for the suggestion!

I did evaluate using cluster.remote_store.translog.max_readers as a potential solution (the current default is 1000), but it doesn't adequately address the issue. Lowering max_readers limits the maximum stale generation gap to ~100 but doesn't eliminate the root cause.

Current Setting Constraints

Looking at the current code:

  • Default: 1000 translog readers
  • Minimum: 100 (enforced by MIN_CLUSTER_REMOTE_MAX_TRANSLOG_READERS)
  • Can be disabled: -1

The minimum value of 100 was introduced in PR #14027 to allow disabling reader-based flushing (via -1) while preventing overly aggressive flush behavior that could impact performance.
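
For reference, a sketch of how one might lower this dynamic setting to try the suggestion above, using the node client's admin API (import paths can differ between OpenSearch versions, so treat this as illustrative):

import org.opensearch.client.Client;
import org.opensearch.common.settings.Settings;

// Sketch: lower the dynamic cluster setting discussed above. The setting key and the
// 100 / -1 values are the ones quoted in this thread; import paths may differ by version.
final class MaxTranslogReadersSketch {
    static void setMaxTranslogReaders(Client client, int maxReaders) {
        client.admin().cluster().prepareUpdateSettings()
            .setPersistentSettings(
                Settings.builder().put("cluster.remote_store.translog.max_readers", maxReaders).build()
            )
            .get();
    }
}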

Why This Doesn't Adequately Address the Issue

With default (max_readers = 1000):

  • No reader-based flushes triggered, but other flush conditions (translog size, periodic timers) cause intermediate flushes naturally
  • After idle flush/close flush, a gap exists between what remote metadata references and what actually exists locally
  • Result: Remote metadata can reference 100+ stale translog generations that don't exist locally

With minimum (max_readers = 100):

  • More frequent flushes due to reader count threshold (in addition to other triggers)
  • More intermediate flushes mean less time between the last flush and idle flush
  • The stale generation gap is limited to approximately 100 generations maximum by the setting
  • Result: Still downloading far more translog generations than necessary during cold start recovery

skhiani avatar Dec 18 '25 11:12 skhiani

Thanks @skhiani for trying this out. What about reducing this value to a lower value (say 10) for warm, would that help?

I am not entirely sure about adding hooks for inactive shard flush and would prefer to explore alternatives if any. Inactive and active shard flush should differ only on trigger basis and not on functional basis.

gbbafna avatar Dec 19 '25 10:12 gbbafna

Thanks @skhiani for trying this out. What about reducing this value to a lower value (say 10) for warm, would that help?

@gbbafna , I tested reducing cluster.remote_store.translog.max_readers=10 (also removing MIN_CLUSTER_REMOTE_MAX_TRANSLOG_READERS=100 limit), but unfortunately this doesn't solve the stale metadata issue.

More Flushes Don't Improve Cleanup

Reducing max_readers triggers more frequent flushes, but cleanup is constrained by when minSeqNoToKeep can advance.

  1. Flush triggered when readers.size reaches threshold
  2. During flush, ongoing writes continue creating new translog generations
  3. Segment commit completes → minSeqNoToKeep advances
  4. trimUnreferencedReaders() can finally delete old generations based on new minSeqNoToKeep
  5. But by this time, readers.size has already grown beyond threshold

Why cleanup is limited:

minSeqNoToKeep advances after segment commit completes. This creates an unavoidable window (steps 2-4) where:

  • New generations accumulate from ongoing writes
  • Old generations cannot yet be deleted (waiting for minSeqNoToKeep to advance)
  • Reader count grows faster than cleanup can occur

I am not entirely sure about adding hooks for inactive shard flush and would prefer to explore alternatives if any. Inactive and active shard flush should differ only on trigger basis and not on functional basis.

I understand the concern about uniform flush behavior. However, as I mentioned above, this issue is specific to warm/idle indices - hot indices with active indexing naturally correct the staleness through subsequent ensureSynced() calls, while warm indices have no correction mechanism once indexing stops.

Alternative Approach: Update Metadata After Trim

An alternative would be to upload fresh metadata after trimUnreferencedReaders() completes.

How it works:

  • After each trim advances minRemoteGenReferenced, upload fresh metadata reflecting the new state
  • This would work uniformly for all scenarios (active indexing, idle, close)
  • Even during active indexing, each trim would reduce the gap incrementally

Implementation Challenge: Metadata upload is currently tightly coupled with rollTranslogGeneration(). Would need to either:

  • Add TranslogTransferManager.uploadMetadataOnly() (new API), or
  • Call rollGeneration() after trim (creates empty generations)

Note: I haven't fully explored this approach yet and there may be side effects to consider.
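
For illustration, here is a rough sketch of what the first option could look like. uploadMetadataOnly() does not exist today (it is the new API proposed above), and buildTransferMetadata() is an assumed helper; neither reflects the real TranslogTransferManager interface.

// Hypothetical sketch of option 1 above: a metadata-only upload after trim.
// uploadMetadataOnly() is the proposed (not yet existing) API, and
// buildTransferMetadata() is an assumed helper shown only for illustration.
public void trimUnreferencedReaders() throws IOException {
    trimUnreferencedReaders(false);

    long minLocalGen = getMinFileGeneration();
    if (minRemoteGenReferenced + 1 < minLocalGen) {
        // Reflect the post-trim state in remote metadata without rolling a new
        // (empty) translog generation.
        TranslogTransferMetadata freshMetadata = buildTransferMetadata(minLocalGen); // assumed helper
        translogTransferManager.uploadMetadataOnly(freshMetadata);                   // proposed API
    }
}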

My Recommendation

The idle/close flush approach directly solves the problem: it targets the exact scenario where the issue manifests (warm indices) using well-tested existing infrastructure.

The alternative approach could maintain uniform flush behavior and progressively reduce the number of stale translog generations in remote metadata (by updating metadata after each trim). However, it would still have the node-crash limitation (stale metadata if a crash occurs after the trim but before the metadata upload), and it would require new APIs with potential side effects.

skhiani avatar Dec 26 '25 10:12 skhiani