OpenSearch
[Feature Request] Shard Level Snapshot Restore
Is your feature request related to a problem? Please describe
- During a snapshot restore, individual shards can fail to recover, leaving the index red.
- Although the index is red, the primaries that restored successfully can still accept writes and move ahead of the snapshot's point in time.
- The shard that failed recovery remains UNASSIGNED and rejects all writes.
- Today, a user who wants to recover from this state has no option other than to DELETE the index and restore it from the snapshot again.
- This causes data loss, because the shards that reached STARTED have already accepted writes made after the snapshot was taken.
Describe the solution you'd like
- During a snapshot restore, if only some of the shards fail, we should allow restoring individual shards.
- This would let the user trigger a snapshot restore on the same index again, and only the UNASSIGNED (failed) shards would restart recovery from scratch.
- This prevents data loss when successfully recovered shards have already accepted writes, and reduces the time and effort needed to recover.
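
To make the proposed behavior concrete, here is a minimal Python sketch of the selection step: given the routing state of each primary, pick only the UNASSIGNED shards for a re-restore and leave the STARTED ones untouched. `ShardCopy` and `plan_shard_restore` are hypothetical names used for illustration only; they are not part of OpenSearch, and the real change would live in the restore path of the snapshot code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShardCopy:
    """Hypothetical, simplified view of one shard copy's routing entry."""
    index: str
    shard: int
    primary: bool
    state: str  # e.g. "STARTED" or "UNASSIGNED"


def plan_shard_restore(shards, index):
    """Split the primaries of `index` into shards to restore and shards to keep.

    Only UNASSIGNED primaries are selected for restore; every other primary
    is kept as-is so writes it has already accepted are not discarded.
    """
    to_restore, to_keep = [], []
    for copy in shards:
        if copy.index != index or not copy.primary:
            continue  # ignore other indices and replica copies
        if copy.state == "UNASSIGNED":
            to_restore.append(copy.shard)
        else:
            to_keep.append(copy.shard)
    return sorted(to_restore), sorted(to_keep)


# Assumed example: an index named "logs" with 5 primaries, one of which failed.
shards = [ShardCopy("logs", i, True, "STARTED") for i in range(4)]
shards.append(ShardCopy("logs", 4, True, "UNASSIGNED"))

to_restore, to_keep = plan_shard_restore(shards, "logs")
print("restore:", to_restore, "keep:", to_keep)
```

In this sketch, a second restore request on the same index would touch only the shards in `to_restore`, which is the difference from today's delete-and-restore workaround.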
Related component
Storage:Snapshots
Describe alternatives you've considered
No response
Additional context
We recently saw this issue on a Remote Store enabled domain: during snapshot recovery, uploads to the remote store started failing for a single shard, which led to 1 out of 5 shards failing recovery.