[Feature Request] Shard Level Snapshot Restore

Open linuxpi opened this issue 1 year ago • 0 comments

Is your feature request related to a problem? Please describe

During a snapshot restore, individual shards can fail to recover, leaving the index red.
Although the index is red, the primaries that restored successfully can still accept writes and move ahead of the snapshot's point in time.
Meanwhile, the shard that failed recovery remains UNASSIGNED and rejects all writes.
Today, a user who wants to recover from this state has no option other than to DELETE the index and restore it from the snapshot again.
This causes data loss, because the shards that were STARTED have already accepted writes made after the snapshot was taken.

Describe the solution you'd like

If only some shards fail during a snapshot restore, we should allow restoring those individual shards.
The user could then trigger a snapshot restore on the same index again, and only the UNASSIGNED (failed) shards would restart recovery from scratch.
This prevents data loss when successfully recovered shards have already accepted writes, and it reduces the time and effort needed to recover.
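
To make the intended behaviour concrete, here is a minimal Python sketch of the selection logic, not OpenSearch code. `ShardStatus`, `shards_needing_restore`, `writes_lost_by_full_restore`, and the write counts are all hypothetical names and made-up numbers used only for illustration. It models a five-shard index where one shard failed recovery, and compares what a shard-level restore would touch with what a delete-and-restore would discard.

```python
from dataclasses import dataclass


@dataclass
class ShardStatus:
    """Hypothetical view of one primary shard after a partially failed restore."""
    shard_id: int
    state: str                     # "STARTED" or "UNASSIGNED"
    writes_since_restore: int = 0  # writes accepted after the snapshot point in time


def shards_needing_restore(shards):
    """Ids of shards that failed recovery and would be restored again."""
    return sorted(s.shard_id for s in shards if s.state == "UNASSIGNED")


def writes_lost_by_full_restore(shards):
    """Writes discarded if the whole index is deleted and restored instead."""
    return sum(s.writes_since_restore for s in shards if s.state == "STARTED")


# Illustrative state: shard 2 failed recovery, the others kept accepting writes.
shards = [
    ShardStatus(0, "STARTED", writes_since_restore=120),
    ShardStatus(1, "STARTED", writes_since_restore=95),
    ShardStatus(2, "UNASSIGNED"),
    ShardStatus(3, "STARTED", writes_since_restore=110),
    ShardStatus(4, "STARTED", writes_since_restore=101),
]

to_restore = shards_needing_restore(shards)
lost = writes_lost_by_full_restore(shards)
print(f"shard-level restore would recover shards: {to_restore}")
print(f"delete-and-restore would discard {lost} writes")
```

Under this model a shard-level restore only touches the UNASSIGNED shard, while the STARTED shards and the writes they accepted are left alone.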

Related component

Storage:Snapshots

Describe alternatives you've considered

No response

Additional context

We recently saw this issue on a Remote Store enabled domain: during snapshot recovery, uploads to the remote store started failing for a single shard, which caused 1 out of 5 shards to fail recovery.

Feb 08 '24 12:02 linuxpi