Epic: pageserver disaster recovery workflow
Motivation
We want to provide a backup method for recovering page content on the pageserver by applying the WAL for a specific timeline.
Prior work in #1169 focused on manual guides. We want to create an automated way to do the same.
DoD
There is an API (or maybe a button) in the console where admins can manually trigger a branch's rebuild from WAL. The work would mainly consist of pageserver changes though, and the initial goal is to get an API for the pageserver specified and implemented.
Implementation ideas
See #5248 and the discussion it contains.
### Tasks
- [x] make `initdb_lsn` reproducible: #2592
- [x] merge the disaster recovery RFC: #5248
- [x] implement pageserver side: #5912
- [x] fix regression: #6007
- [ ] create python script to do DR with the http endpoint and test it on staging
- [x] create safekeeper API definition for copying: #5770
- [ ] #6091
- [ ] https://github.com/neondatabase/neon/issues/6226
- [ ] if timeline_id != load_existing_initdb's timeline ID: copy over the initdb.tar.zst
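The condition in the last task can be sketched as follows. This is only an illustration on a local directory tree, not the pageserver's actual code: the function name `copy_initdb_if_needed`, the `root/<timeline_id>/initdb.tar.zst` layout, and the use of plain file copies are all assumptions made for the example.

```python
import shutil
from pathlib import Path

# Archive name taken from the task above.
INITDB_ARCHIVE = "initdb.tar.zst"


def copy_initdb_if_needed(root: Path, timeline_id: str, load_existing_initdb: str) -> bool:
    """Copy the initdb archive to the new timeline if it differs from the source.

    Hypothetical layout: <root>/<timeline_id>/initdb.tar.zst.
    Returns True if a copy was made, False if the IDs match.
    """
    if timeline_id == load_existing_initdb:
        # Same timeline: the archive is already where it needs to be.
        return False
    src = root / load_existing_initdb / INITDB_ARCHIVE
    dst = root / timeline_id / INITDB_ARCHIVE
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(src, dst)
    return True
```

In a real deployment the archive would presumably live in remote storage rather than on local disk, so the copy step would look different; only the ID comparison is taken from the task itself.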
Other related tasks and Epics
- https://github.com/neondatabase/neon/issues/2605
- The guide (not public): https://www.notion.so/neondatabase/Storage-Recovery-from-WAL-d92c0aac0ebf40df892b938045d7d720
- https://github.com/neondatabase/cloud/issues/8233
Notes from 6th Nov, storage team planning:
- Arpad agreed with Arseny on the next steps.
- The safekeeper copy API got positive feedback. We are looking to implement it soon, but this may be impeded by the offsite.
Notes from 27th of Nov:
- We're aiming to complete it this month on the pageserver side of things.

@arpad-m, please feel free to post your corrections / additions here.
With the merging of #5912, the pageserver side is feature-complete. However, there is still work left to do, such as addressing #6007 and creating a Python script to hit the API.
What is left is to add the Python script to our repo.
Status 2024-01-29:
- Python script is written, hasn't been tested in the field yet. Will be used today in staging.
- Change to delete initdbs has merged
- Doc is written (update to existing DR doc)
Update:
- We hit access control issues when testing. They may already be resolved; need to check.
- This week we will try again to validate the procedure.