Multi-Writer Prevention: Conditional Upload Flow & Logic

Open x-INFiN1TY-x opened this issue 7 months ago • 2 comments

Related Issues/RFCs:

  • Implements the preferred solution ("Versioned & mutable metadata file with Conditional writes") from [RFC] Leverage Conditional APIs for Multi-Writer Detection in Remote Store Clusters (#17763).
  • Addresses data consistency concerns discussed in #6736 and #6737.

Problem Statement

OpenSearch clusters using remote-backed storage are susceptible to data inconsistencies when more than one node acting as primary for the same shard concurrently uploads segment metadata, particularly during network partitions or primary failovers. A stale (previously active) primary might overwrite metadata written by the newly promoted one. This risk undermines cluster safety and complicates automated recovery flows.

Current multi-writer detection mechanisms are not robust enough to handle this reliably.


Solution Overview

This PR introduces ETag-based conditional writes to the remote segment metadata upload process. ETags (opaque version identifiers that object stores such as S3, GCS, or Azure Blob Storage return for the current state of an object) allow OpenSearch to safely coordinate access to a shared file. This mechanism ensures that only the current primary shard can write metadata, while stale primaries detect the mismatch and fence themselves.

Key enhancements:

  1. ETag-Based Conditional Writes: Primary shards attach the known ETag to each metadata upload using the If-Match condition. If the ETag doesn't match the current version in the remote store, the write is rejected (HTTP 412 Precondition Failed).

  2. Fixed Metadata Filename: To enable ETag-based coordination, segment metadata is now always written to a fixed filename (e.g., "segment_metadata") instead of legacy dynamic filenames.

  3. Stale Primary Self-Fencing:

    • On promotion, the new primary performs a non-conditional (forced) metadata upload after clearing its local ETag knowledge. This updates the remote file and its ETag.
    • If the old primary tries to write using a stale ETag, the write fails. This triggers a controlled failShard() operation, fencing off the stale node.
  4. ETag Lifecycle Managed at Shard Level: IndexShard now caches the ETag for its segment metadata file and updates it based on the success/failure of remote operations.

This design shifts writer validation from OpenSearch to the remote store’s atomic conditional-write operations, improving correctness and simplifying state coordination.
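The fencing sequence described above can be illustrated with a minimal, self-contained sketch. It is not code from this PR: ToyConditionalStore, WriteResult, and StalePrimaryDemo are hypothetical names, the store is an in-memory stand-in for an object store, and the "etag-N" values are invented for illustration. It only models the If-Match rule (a non-null expected ETag must equal the stored one, otherwise the write is rejected with 412).

```java
import java.util.HashMap;
import java.util.Map;

/** Demo of a stale primary being rejected (hypothetical, not an OpenSearch class). */
final class StalePrimaryDemo {
    public static void main(String[] args) {
        ToyConditionalStore store = new ToyConditionalStore();
        String file = "segment_metadata";

        // Old primary uploads and remembers the returned ETag.
        String oldPrimaryEtag = store.put(file, "old-primary-v1", null).etag;

        // New primary is promoted and claims the file with an unconditional write.
        String newPrimaryEtag = store.put(file, "new-primary-v1", null).etag;

        // Old primary retries with its stale ETag and is rejected.
        int staleStatus = store.put(file, "old-primary-v2", oldPrimaryEtag).status;
        System.out.println("stale write status: " + staleStatus);

        // New primary keeps writing conditionally with the ETag it holds.
        int freshStatus = store.put(file, "new-primary-v2", newPrimaryEtag).status;
        System.out.println("fresh write status: " + freshStatus);
    }
}

/** Toy in-memory stand-in for a remote store with If-Match semantics (assumption). */
final class ToyConditionalStore {

    /** Result of a write: an HTTP-like status and the ETag now stored (null when rejected). */
    static final class WriteResult {
        final int status;
        final String etag;

        WriteResult(int status, String etag) {
            this.status = status;
            this.etag = etag;
        }
    }

    private final Map<String, String> contents = new HashMap<>();
    private final Map<String, String> etags = new HashMap<>();
    private long version = 0;

    /**
     * Writes a blob. A null expectedEtag means an unconditional write; otherwise the
     * write succeeds only if expectedEtag equals the ETag currently stored.
     */
    synchronized WriteResult put(String name, String data, String expectedEtag) {
        if (expectedEtag != null && !expectedEtag.equals(etags.get(name))) {
            return new WriteResult(412, null); // Precondition Failed: caller is stale
        }
        String newEtag = "etag-" + (++version);
        contents.put(name, data);
        etags.put(name, newEtag);
        return new WriteResult(200, newEtag);
    }

    /** Returns the content last written under this name, or null if none. */
    synchronized String read(String name) {
        return contents.get(name);
    }
}
```

In this toy model the stale write returns 412 and leaves the new primary's content untouched, while the new primary's conditional write returns 200.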


Key Implementation Details

IndexShard

  • Introduces a MetadataETagCache per shard to hold the latest known ETag.

  • Provides methods:

    • getMetadataETag()
    • updateMetadataETag()
    • clearMetadataETag()
  • On primary promotion, invokes initiateNonConditionalRemoteMetadataUpload():

    • Clears cached ETag to trigger an unconditional upload.
    • Performs an overwrite that establishes a new ETag and “claims” primary ownership.
    • Handles transient errors gracefully, relying on future refreshes to retry.
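As a rough illustration of the promotion path, the sketch below models the cache and the ownership claim. The method names mirror the ones listed above, but the bodies are assumptions, not the PR's implementation: MetadataETagCacheSketch and PromotionSketch are hypothetical classes, the uploader is reduced to a synchronous Function that takes the ETag to match (null meaning unconditional) and returns the new ETag, and leaving the cache empty after a transient failure is a guess at how a later refresh would retry.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Function;

/** Hypothetical per-shard holder for the last known metadata ETag. */
final class MetadataETagCacheSketch {
    private final AtomicReference<String> etag = new AtomicReference<>();

    String getMetadataETag() { return etag.get(); }

    void updateMetadataETag(String newEtag) { etag.set(newEtag); }

    void clearMetadataETag() { etag.set(null); }

    public static void main(String[] args) {
        MetadataETagCacheSketch cache = new MetadataETagCacheSketch();
        cache.updateMetadataETag("etag-from-before-promotion");
        boolean claimed = PromotionSketch.initiateNonConditionalRemoteMetadataUpload(
            cache, ifMatch -> "etag-after-claim");
        System.out.println(claimed + " " + cache.getMetadataETag());
    }
}

/** Hypothetical promotion hook; the uploader stands in for the real remote upload. */
final class PromotionSketch {
    /**
     * Clears the cached ETag so the upload carries no If-Match condition, then
     * records the ETag returned by the store. Returns false on a transient
     * failure and leaves the cache empty (assumption: a later refresh retries).
     */
    static boolean initiateNonConditionalRemoteMetadataUpload(
            MetadataETagCacheSketch cache, Function<String, String> uploader) {
        cache.clearMetadataETag();
        try {
            // The cache is empty here, so the uploader receives null (unconditional).
            String newEtag = uploader.apply(cache.getMetadataETag());
            cache.updateMetadataETag(newEtag);
            return true;
        } catch (RuntimeException transientFailure) {
            return false;
        }
    }
}
```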

RemoteStoreRefreshListener

  • During each metadata upload:

    • Retrieves the current ETag from the shard.
    • Invokes uploadMetadata(...) with the ETag and a structured ActionListener.
  • On success: Updates shard’s cached ETag.

  • On Precondition Failed: Treats this as a stale primary detection, clears ETag, and calls failShard() for fencing.

  • Logs other failures without failing the shard.
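The three outcomes above can be sketched as a small decision routine. This is a simplified, synchronous approximation under stated assumptions: the PR describes an ActionListener-based flow, whereas here RefreshUploadSketch, MetadataUploader, PreconditionFailedException, and Outcome are hypothetical names, and the shardFailed flag merely stands in for the call to failShard().

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical, synchronous model of the refresh-time metadata upload decision. */
final class RefreshUploadSketch {

    /** Hypothetical marker for an HTTP 412 reported by the remote store. */
    static final class PreconditionFailedException extends RuntimeException {
        PreconditionFailedException(String message) { super(message); }
    }

    /** Hypothetical upload hook: the If-Match ETag goes in, the new ETag comes out. */
    interface MetadataUploader {
        String uploadMetadata(String ifMatchEtag);
    }

    enum Outcome { UPLOADED, SHARD_FAILED, LOGGED }

    final AtomicReference<String> cachedEtag = new AtomicReference<>();
    boolean shardFailed = false; // stands in for IndexShard.failShard() having been called

    Outcome onRefresh(MetadataUploader uploader) {
        String etag = cachedEtag.get();
        try {
            // Success: remember the ETag the store returned for the new version.
            cachedEtag.set(uploader.uploadMetadata(etag));
            return Outcome.UPLOADED;
        } catch (PreconditionFailedException stale) {
            // Another writer changed the file: treat this shard as a stale primary.
            cachedEtag.set(null);
            shardFailed = true;
            return Outcome.SHARD_FAILED;
        } catch (RuntimeException other) {
            // Any other failure is logged; the shard and its cached ETag are left alone.
            System.err.println("metadata upload failed: " + other.getMessage());
            return Outcome.LOGGED;
        }
    }

    public static void main(String[] args) {
        RefreshUploadSketch listener = new RefreshUploadSketch();
        listener.cachedEtag.set("etag-1");
        System.out.println(listener.onRefresh(ifMatch -> "etag-2"));
        System.out.println(listener.onRefresh(ifMatch -> {
            throw new PreconditionFailedException("412 Precondition Failed");
        }));
    }
}
```

Only the precondition failure fences the shard, so a timeout or throttling error from the store does not take a healthy primary offline.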

RemoteSegmentStoreDirectory

  • Accepts a versionIdentifier (ETag) and enhanced ActionListener.

  • Constructs ConditionalWriteOptions based on the ETag:

    • If ETag is present → ifMatch
    • If ETag is null → unconditional upload
  • Always uses "segment_metadata" as the remote filename.
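The mapping from a nullable ETag to write options might look roughly like the following. WriteOptionsSketch is a hypothetical stand-in for ConditionalWriteOptions; its factory and accessor names are assumptions made for illustration and may not match the real class.

```java
import java.util.Objects;

/** Hypothetical stand-in for ConditionalWriteOptions; method names are illustrative. */
final class WriteOptionsSketch {
    /** Fixed remote filename taken from the PR description. */
    static final String METADATA_FILENAME = "segment_metadata";

    private final String ifMatchEtag; // null means "no condition"

    private WriteOptionsSketch(String ifMatchEtag) {
        this.ifMatchEtag = ifMatchEtag;
    }

    /** Conditional write: succeeds only if the remote ETag equals the given one. */
    static WriteOptionsSketch ifMatch(String etag) {
        return new WriteOptionsSketch(Objects.requireNonNull(etag, "etag"));
    }

    /** Unconditional write: overwrites whatever is currently stored. */
    static WriteOptionsSketch unconditional() {
        return new WriteOptionsSketch(null);
    }

    /** ETag present -> ifMatch; ETag null -> unconditional upload. */
    static WriteOptionsSketch fromVersionIdentifier(String versionIdentifier) {
        return versionIdentifier == null ? unconditional() : ifMatch(versionIdentifier);
    }

    boolean isConditional() {
        return ifMatchEtag != null;
    }

    String ifMatchEtag() {
        return ifMatchEtag;
    }

    public static void main(String[] args) {
        System.out.println(fromVersionIdentifier(null).isConditional());
        System.out.println(fromVersionIdentifier("some-etag").isConditional());
    }
}
```

Keeping the null check in one factory method means callers never decide between conditional and unconditional uploads themselves; they pass along whatever ETag the shard currently holds.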

RemoteDirectory & BlobStore

  • copyFrom() method now takes ConditionalWriteOptions.
  • Passes them through to the underlying blobContainer.writeBlobConditionally(...) for storage-provider-specific handling.

Testing

Unit tests in RemoteSegmentStoreDirectoryTests have been expanded to verify:

  • ETag propagation and conditional write correctness.
  • Proper fencing behavior on ETag mismatches.
  • Correct switching between conditional and unconditional uploads.

Check List

  • [x] New functionality has been documented.
  • [x] Public documentation issue/PR created
  • [x] Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.


Visualizing the Changes

  1. Sequence Diagram: Stale Primary Self-Fencing Mechanism. [Diagram image] Demonstrates how an outdated ETag causes a 412 failure, triggering the stale primary’s self-initiated failShard().

  2. Architecture: Core Components for Conditional Metadata Upload. [Diagram image] Shows interaction flow between IndexShard, RemoteStoreRefreshListener, RemoteSegmentStoreDirectory, RemoteDirectory, and BlobStore, including ETag usage.

  3. Flowchart: New Primary’s Metadata Ownership Claim. [Diagram image] Shows how a new primary clears its ETag cache, performs a non-conditional upload, and updates its local ETag before assuming control.

x-INFiN1TY-x avatar Jun 15 '25 15:06 x-INFiN1TY-x

Please note that this PR depends on the following prerequisite changes, which are currently under review. Until they are merged, the Gradle build will fail:

opensearch-project/OpenSearch #18064

opensearch-project/OpenSearch #18092

opensearch-project/OpenSearch #18093

x-INFiN1TY-x avatar Jun 15 '25 15:06 x-INFiN1TY-x

:x: Gradle check result for f1024f357008dc0f8ef311c6c78dd70cf9b29b4e: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Jun 15 '25 15:06 github-actions[bot]

:x: Gradle check result for 4ef0bcdcba46ba3e87a2b7533eeb37a048a43c47: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Jun 27 '25 07:06 github-actions[bot]

This PR is stalled because it has been open for 30 days with no activity.