[Feature Request] Shard Level Snapshot Restore

Open linuxpi opened this issue 1 year ago • 0 comments

Is your feature request related to a problem? Please describe

  • During a snapshot restore, individual shards can fail to recover, leaving the index red.
  • Although the index is red, the primaries that restored successfully can still accept writes and move ahead of the snapshot's point in time.
  • The shard that failed recovery stays UNASSIGNED and rejects all writes.
  • Today, if the user wants to recover from this state, the only option is to DELETE the index and restore it from the snapshot again.
  • This leads to data loss, because the shards that were STARTED had already begun accepting traffic that the snapshot does not contain.
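To make the cost of the current workaround concrete, here is a minimal sketch of the two REST calls it involves: deleting the index, then restoring it in full from the snapshot. The function only builds the request descriptions and sends nothing. The repository, snapshot, and index names are hypothetical placeholders, not values from this issue.

```python
import json


def full_restore_requests(repo, snapshot, index):
    """Build the (method, path, body) requests for today's workaround.

    Nothing is sent; the caller decides how to issue the requests.
    """
    return [
        # Step 1: drop the whole index, including the shards that were
        # STARTED and may hold writes newer than the snapshot.
        ("DELETE", f"/{index}", None),
        # Step 2: restore every shard of the index from the snapshot.
        (
            "POST",
            f"/_snapshot/{repo}/{snapshot}/_restore",
            json.dumps({"indices": index}),
        ),
    ]


# Hypothetical names, for illustration only.
requests_to_send = full_restore_requests("my-repo", "snap-1", "my-index")
for method, path, body in requests_to_send:
    print(method, path, body or "")
```

The DELETE in step 1 is where the data loss occurs: it discards every write the healthy shards accepted after the snapshot was taken.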

Describe the solution you'd like

  • During a snapshot restore, if only some of the shards have failed, we should allow restoring individual shards.
  • This would let the user trigger a snapshot restore on the same index again, and only the UNASSIGNED (failed) shards would restart recovery from scratch.
  • This prevents data loss when the successfully recovered shards have already accepted writes, and it reduces the time and effort needed to recover.
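As a rough illustration of the selection step described above, the sketch below picks out the primary shards that a shard-level restore would need to touch. It assumes the input is the parsed JSON from `GET _cat/shards/<index>?format=json`, where each row carries `index`, `shard`, `prirep`, and `state` as strings. The function name is hypothetical, and no shard-level restore API exists today; this only shows which shards such an API would target.

```python
def shards_needing_restore(cat_shards_rows, index):
    """Return the sorted primary shard numbers of `index` that are UNASSIGNED.

    `cat_shards_rows` is assumed to be the parsed output of
    `GET _cat/shards/<index>?format=json`.
    """
    failed = set()
    for row in cat_shards_rows:
        is_primary = row["prirep"] == "p"
        if row["index"] == index and is_primary and row["state"] == "UNASSIGNED":
            failed.add(int(row["shard"]))
    return sorted(failed)


# Hypothetical 5-shard index where only shard 3 failed recovery.
rows = [
    {"index": "my-index", "shard": "0", "prirep": "p", "state": "STARTED"},
    {"index": "my-index", "shard": "1", "prirep": "p", "state": "STARTED"},
    {"index": "my-index", "shard": "2", "prirep": "p", "state": "STARTED"},
    {"index": "my-index", "shard": "3", "prirep": "p", "state": "UNASSIGNED"},
    {"index": "my-index", "shard": "4", "prirep": "p", "state": "STARTED"},
]
print(shards_needing_restore(rows, "my-index"))
```

Shards in the STARTED state are left out of the result, so any writes they accepted after the snapshot point would be preserved.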

Related component

Storage:Snapshots

Describe alternatives you've considered

No response

Additional context

We recently saw this issue on a Remote Store enabled domain: during snapshot recovery, uploads to the remote store started failing for a single shard, which led to 1 out of 5 shards failing recovery.

linuxpi, Feb 08 '24 12:02