Fault tolerance of the performant restore
The performant restore, to be released in FDB 6.3, assumes that no restore worker fails.
This assumption is reasonable because (1) in a strictly controlled cluster, process and hardware failures are rare, and a restore that finishes within a short time is unlikely to encounter one; (2) omitting failure handling reduces software complexity and lets us deliver the feature earlier; (3) if a failure does happen during a restore, the restore can simply be restarted.
This assumption is not ideal because (1) in a cluster that is not fully controlled by the DBA, hardware undergoes regular maintenance, and disallowing failures puts a strict constraint on the maintenance schedule; (2) it prevents the performant restore from running continuously to check the backup's sanity; (3) it makes the restore cluster harder for the DBA to operate.
We plan to add fault tolerance to the performant restore after it has reached the desired performance.
Based on the current architecture of the performant restore, adding fault tolerance should be feasible, though not trivial.
TODOs for supporting fault tolerance in the performant restore:
- [ ] Failure detection: use the destination FDB cluster's coordinators to monitor the restore master, and use the restore master to monitor the restore workers. This mirrors how FDB detects failures before FDB 6.3 (see the heartbeat sketch after this list).
- [ ] Progress checkpoint: the restore master checkpoints the progress of each version batch, and each restore applier records the key-ranges it has applied to the DB in each version batch. The checkpoint data is saved in the special system key space of the destination DB (see the checkpoint sketch after this list).
- [ ] Recovery: on recovery, the restore master needs to (1) pause or terminate any ongoing work on the restore workers; (2) determine the restore point from the checkpoints; (3) resume the remaining work in the in-flight version batches and in all version batches after them (see the recovery sketch after this list). Note: recovery should try to avoid triggering the same failure again in the next epoch.
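
As a rough illustration of the failure-detection item, here is a minimal Python sketch using the FDB client bindings. The heartbeat key prefix, the timeout value, and the function names are all invented for illustration; the real implementation would more likely reuse FDB's existing failure-monitoring machinery (coordinators monitoring the master, the master monitoring workers) than poll keys like this.

```python
import time
import fdb

fdb.api_version(630)
db = fdb.open()

# Hypothetical keyspace, for illustration only.
HEARTBEAT_PREFIX = b'\xff\x02/restoreWorker/heartbeat/'
HEARTBEAT_TIMEOUT = 10.0  # seconds a worker may stay silent before it is presumed failed

@fdb.transactional
def send_heartbeat(tr, worker_id):
    # Each restore worker periodically stamps its own heartbeat key.
    tr.options.set_access_system_keys()
    tr[HEARTBEAT_PREFIX + worker_id] = fdb.tuple.pack((time.time(),))

@fdb.transactional
def find_failed_workers(tr):
    # The restore master scans all heartbeat keys and reports workers
    # whose last heartbeat is older than the timeout.
    tr.options.set_access_system_keys()
    now = time.time()
    failed = []
    for kv in tr.get_range_startswith(HEARTBEAT_PREFIX):
        worker_id = kv.key[len(HEARTBEAT_PREFIX):]
        (last_seen,) = fdb.tuple.unpack(kv.value)
        if now - last_seen > HEARTBEAT_TIMEOUT:
            failed.append(worker_id)
    return failed
```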
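For the checkpoint item, here is a sketch of what transactionally writing progress into the destination DB's system key space could look like. The key layout (`restoreCheckpoint/master/...`, `restoreCheckpoint/applier/...`) and the state values are made up for illustration; only the general idea of master-level batch checkpoints plus per-applier applied-range checkpoints comes from the TODO above.

```python
import fdb

fdb.api_version(630)
db = fdb.open()

# Illustrative key layout (not the actual on-disk schema):
#   \xff\x02/restoreCheckpoint/master/<batchIndex>               -> batch state
#   \xff\x02/restoreCheckpoint/applier/<applierId>/<batchIndex>  -> applied key-ranges
CHECKPOINT_PREFIX = b'\xff\x02/restoreCheckpoint/'

@fdb.transactional
def checkpoint_version_batch(tr, batch_index, state):
    # The restore master records the status of a version batch,
    # e.g. b'loading', b'applying', or b'done'.
    tr.options.set_access_system_keys()
    tr[CHECKPOINT_PREFIX + b'master/' + fdb.tuple.pack((batch_index,))] = state

@fdb.transactional
def checkpoint_applied_range(tr, applier_id, batch_index, begin_key, end_key):
    # Each applier records a key-range it has fully applied for this batch,
    # so recovery can tell applied mutations from unapplied ones.
    tr.options.set_access_system_keys()
    key = CHECKPOINT_PREFIX + b'applier/' + fdb.tuple.pack((applier_id, batch_index, begin_key))
    tr[key] = fdb.tuple.pack((end_key,))
```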
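And for the recovery item, a sketch of step (2), determining the restore point, continuing the invented key layout from the previous block: the master scans its own checkpoints and resumes from the first version batch not marked done, with appliers skipping the key-ranges they already checkpointed.

```python
import fdb

fdb.api_version(630)
db = fdb.open()

CHECKPOINT_PREFIX = b'\xff\x02/restoreCheckpoint/'  # same illustrative layout as above
MASTER_PREFIX = CHECKPOINT_PREFIX + b'master/'

@fdb.transactional
def find_resume_batch(tr):
    # Scan the master's checkpoints in batch order and return the first
    # version batch that is not marked done; restore resumes from there.
    tr.options.set_access_system_keys()
    for kv in tr.get_range_startswith(MASTER_PREFIX):
        (batch_index,) = fdb.tuple.unpack(kv.key[len(MASTER_PREFIX):])
        if kv.value != b'done':
            return batch_index
    return None  # all checkpointed batches finished
```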