Multi-Writer Prevention: Conditional Upload Flow & Logic
Related Issues/RFCs:
- Implements the preferred solution ("Versioned & mutable metadata file with Conditional writes") from [RFC] Leverage Conditional APIs for Multi-Writer Detection in Remote Store Clusters (#17763).
- Addresses data consistency concerns discussed in #6736 and #6737.
Problem Statement
OpenSearch clusters using remote-backed storage are susceptible to data inconsistencies when more than one node acting as primary for the same shard concurrently attempts to upload segment metadata, particularly during network partitions or primary failovers. A stale (previously active) primary might overwrite metadata written by the newly promoted one. This risk undermines cluster safety and complicates automation in recovery flows.
Current multi-writer detection mechanisms are not robust enough to handle this reliably.
Solution Overview
This PR introduces ETag-based conditional writes to the remote segment metadata upload process. ETags (version identifiers from cloud storage systems like S3, GCS, or Azure) allow OpenSearch to safely coordinate access to shared resources. This mechanism ensures only the correct primary shard can write metadata, while stale primaries self-detect and fence themselves.
Key enhancements:
- ETag-Based Conditional Writes: Primary shards attach the known ETag to each metadata upload using the `If-Match` condition. If the ETag doesn't match the current version in the remote store, the write is rejected (HTTP 412 Precondition Failed).
- Fixed Metadata Filename: To enable ETag-based coordination, segment metadata is now always written to a fixed filename (e.g., `"segment_metadata"`) instead of legacy dynamic filenames.
- Stale Primary Self-Fencing:
  - On promotion, the new primary performs a non-conditional (forced) metadata upload after clearing its local ETag knowledge. This updates the remote file and its ETag.
  - If the old primary tries to write using a stale ETag, the write fails. This triggers a controlled `failShard()` operation, fencing off the stale node.
- ETag Lifecycle Managed at Shard Level: `IndexShard` now caches the ETag for its segment metadata file and updates it based on the success/failure of remote operations.
This design shifts writer validation from OpenSearch into the remote store’s atomic operations, which improves correctness and simplifies state coordination.
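To make the fencing behavior concrete, here is a minimal sketch of `If-Match` semantics against an in-memory stand-in for the remote store. It is not the PR's code: `InMemoryConditionalStore`, its `write` signature, and the counter-based ETags are assumptions made purely for illustration (real providers compute ETags their own way).

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory stand-in for a blob store that honors If-Match on writes. */
class InMemoryConditionalStore {
    /** Models an HTTP 412 Precondition Failed response. */
    static class PreconditionFailedException extends RuntimeException {
        PreconditionFailedException(String message) {
            super(message);
        }
    }

    private final Map<String, String> contents = new HashMap<>();
    private final Map<String, String> etags = new HashMap<>();
    private long version = 0;

    /**
     * Writes a blob and returns its new ETag. A null expectedETag means an
     * unconditional (forced) write; otherwise the write succeeds only if
     * expectedETag equals the ETag currently stored for that name.
     */
    synchronized String write(String name, String data, String expectedETag) {
        String current = etags.get(name);
        if (expectedETag != null && !expectedETag.equals(current)) {
            throw new PreconditionFailedException("expected " + expectedETag + " but found " + current);
        }
        String newETag = "v" + (++version); // assumption: a counter is enough for this demo
        contents.put(name, data);
        etags.put(name, newETag);
        return newETag;
    }

    synchronized String read(String name) {
        return contents.get(name);
    }
}

class ConditionalWriteDemo {
    public static void main(String[] args) {
        InMemoryConditionalStore store = new InMemoryConditionalStore();
        // The old primary wrote earlier and remembers the ETag it got back.
        String oldPrimaryETag = store.write("segment_metadata", "gen-1", null);
        // The new primary claims ownership with a forced write, which changes the ETag.
        String newPrimaryETag = store.write("segment_metadata", "gen-2", null);
        try {
            store.write("segment_metadata", "stale", oldPrimaryETag);
        } catch (InMemoryConditionalStore.PreconditionFailedException e) {
            System.out.println("stale primary rejected: " + e.getMessage());
        }
        // The new primary keeps writing conditionally with its own ETag.
        store.write("segment_metadata", "gen-3", newPrimaryETag);
        System.out.println(store.read("segment_metadata"));
    }
}
```

In this sketch the check and the write happen under one lock, which mirrors the reason the design leans on the remote store: the comparison and the overwrite have to be a single atomic step.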
Key Implementation Details
IndexShard
- Introduces a `MetadataETagCache` per shard to hold the latest known ETag.
- Provides methods:
  - `getMetadataETag()`
  - `updateMetadataETag()`
  - `clearMetadataETag()`
- On primary promotion, invokes `initiateNonConditionalRemoteMetadataUpload()`:
  - Clears cached ETag to trigger an unconditional upload.
  - Performs an overwrite that establishes a new ETag and “claims” primary ownership.
  - Handles transient errors gracefully, relying on future refreshes to retry.
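The cache and the promotion step described above could look roughly like the following. This is a sketch under assumptions, not the PR's implementation: `ETagCacheSketch`, `PromotionSketch`, and the `Uploader` interface are hypothetical names, and using an `AtomicReference` for thread safety is a guess about how such a cache might be built.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical per-shard holder for the last known metadata ETag (null = unknown). */
class ETagCacheSketch {
    private final AtomicReference<String> etag = new AtomicReference<>();

    String getMetadataETag() {
        return etag.get();
    }

    void updateMetadataETag(String newETag) {
        etag.set(newETag);
    }

    void clearMetadataETag() {
        etag.set(null);
    }
}

class PromotionSketch {
    /** Hypothetical uploader: takes the expected ETag (null = unconditional), returns the new ETag. */
    interface Uploader {
        String upload(String expectedETag) throws Exception;
    }

    /**
     * Clears the cached ETag, performs a forced upload, and records the ETag it
     * produced. Returns false if the upload failed; the cache then stays empty,
     * so in this sketch a later attempt would again be unconditional.
     */
    static boolean claimOwnership(ETagCacheSketch cache, Uploader uploader) {
        cache.clearMetadataETag();
        try {
            String newETag = uploader.upload(cache.getMetadataETag()); // null after clearing
            cache.updateMetadataETag(newETag);
            return true;
        } catch (Exception e) {
            // Treated as transient here: no shard failure, nothing cached.
            return false;
        }
    }
}
```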
RemoteStoreRefreshListener
- During each metadata upload:
  - Retrieves the current ETag from the shard.
  - Invokes `uploadMetadata(...)` with the ETag and a structured `ActionListener`.
- On success: Updates shard’s cached ETag.
- On `Precondition Failed`: Treats this as a stale primary detection, clears ETag, and calls `failShard()` for fencing.
- Logs other failures without failing the shard.
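The three outcomes listed above can be sketched as a small callback handler. Everything here is illustrative: `UploadOutcomeSketch`, `ShardHooks`, and this `PreconditionFailedException` are hypothetical stand-ins, and the real listener works with OpenSearch's `ActionListener` and `IndexShard` rather than these types.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

class UploadOutcomeSketch {
    /** Hypothetical marker for an HTTP 412 returned by the remote store. */
    static class PreconditionFailedException extends Exception {
        PreconditionFailedException(String message) {
            super(message);
        }
    }

    /** Hypothetical hook standing in for the shard's failShard() operation. */
    interface ShardHooks {
        void failShard(String reason, Exception cause);
    }

    final AtomicReference<String> cachedETag = new AtomicReference<>();
    final List<String> log = new ArrayList<>(); // stands in for a real logger
    private final ShardHooks shard;

    UploadOutcomeSketch(ShardHooks shard) {
        this.shard = shard;
    }

    /** Success: remember the ETag returned for the uploaded metadata. */
    void onResponse(String newETag) {
        cachedETag.set(newETag);
    }

    /** Failure: a 412 means another writer took over; anything else is only logged. */
    void onFailure(Exception e) {
        if (e instanceof PreconditionFailedException) {
            cachedETag.set(null);
            shard.failShard("stale primary detected by conditional write", e);
        } else {
            log.add("metadata upload failed: " + e.getMessage());
        }
    }
}
```

Keeping the 412 branch separate from other errors is the point of the design: a network timeout says nothing about ownership, while a precondition failure does.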
RemoteSegmentStoreDirectory
- Accepts a `versionIdentifier` (ETag) and enhanced `ActionListener`.
- Constructs `ConditionalWriteOptions` based on the ETag:
  - If ETag is present → `ifMatch`
  - If ETag is null → unconditional upload
- Always uses `"segment_metadata"` as the remote filename.
RemoteDirectory & BlobStore
- `copyFrom()` method now takes `ConditionalWriteOptions`.
- Passes them through to the underlying `blobContainer.writeBlobConditionally(...)` for storage-provider-specific handling.
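The mapping from a nullable ETag to write options, which the directory layers then pass down unchanged, can be sketched as follows. `WriteCondition` is a hypothetical stand-in for `ConditionalWriteOptions`, whose actual fields and factory methods are not shown in this description; only the `"segment_metadata"` filename and the present/null rule come from the text above.

```java
import java.util.Objects;

class WriteConditionSketch {
    /** Fixed remote filename named in this PR description. */
    static final String METADATA_FILENAME = "segment_metadata";

    /** Hypothetical stand-in for ConditionalWriteOptions. */
    static final class WriteCondition {
        final String ifMatchETag; // null means "no condition"

        private WriteCondition(String ifMatchETag) {
            this.ifMatchETag = ifMatchETag;
        }

        static WriteCondition ifMatch(String etag) {
            return new WriteCondition(Objects.requireNonNull(etag, "etag"));
        }

        static WriteCondition none() {
            return new WriteCondition(null);
        }

        boolean isConditional() {
            return ifMatchETag != null;
        }
    }

    /** ETag present: If-Match write. ETag null: unconditional upload. */
    static WriteCondition fromVersionIdentifier(String versionIdentifier) {
        return versionIdentifier == null
            ? WriteCondition.none()
            : WriteCondition.ifMatch(versionIdentifier);
    }
}
```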
Testing
Unit tests in RemoteSegmentStoreDirectoryTests have been expanded to verify:
- ETag propagation and conditional write correctness.
- Proper fencing behavior on ETag mismatches.
- Correct switching between conditional and unconditional uploads.
Related Issues
- Implements part of RFC #17763
- Parent Meta Issue: [META] Implement Conditional APIs for Multi-Writer Detection
- Supersedes #18065
Check List
- [x] New functionality has been documented.
- [x] Public documentation issue/PR created
- [x] Commits are signed per the DCO using --signoff
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.
Visualizing the Changes
- Sequence Diagram: Stale Primary Self-Fencing Mechanism
  Demonstrates how an outdated ETag causes a 412 failure, triggering the stale primary’s self-initiated `failShard()`.
- Architecture: Core Components for Conditional Metadata Upload
  Shows interaction flow between `IndexShard`, `RemoteStoreRefreshListener`, `RemoteSegmentStoreDirectory`, `RemoteDirectory`, and `BlobStore`, including ETag usage.
- Flowchart: New Primary’s Metadata Ownership Claim
  Shows how a new primary clears its ETag cache, performs a non-conditional upload, and updates its local ETag before assuming control.
Please note that this PR depends on the following change, which is currently under review. Until it is merged, the Gradle build will fail:
opensearch-project/OpenSearch #18064
:x: Gradle check result for f1024f357008dc0f8ef311c6c78dd70cf9b29b4e: FAILURE
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
:x: Gradle check result for 4ef0bcdcba46ba3e87a2b7533eeb37a048a43c47: FAILURE
This PR is stalled because it has been open for 30 days with no activity.