storage controller: disaster recovery from loss of database
If we lose the storage controller database, we should still be able to recover, because all the tenant data is still present in S3.
We would have some time: pageserver emergency mode enables pageservers to run even if the storage controller isn't running.
One can discover tenant shards and their latest generations from S3 -- this is broadly what the scrubber already does.
So let's build two things:
- A mode for the scrubber that discovers tenants and their generations, and writes that to a file.
- A tool to import such a file into an empty storage controller database.
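As a rough illustration of the discovery step, here is a minimal Python sketch that takes a list of object keys (however they were listed from the bucket), finds the highest generation seen for each tenant shard, and writes the result to a JSON file. The key layout in the regex, the function names `latest_generations` and `dump_recovery_file`, and the JSON record fields are all assumptions made for this example; they are not taken from the scrubber or from this issue.

```python
import json
import re

# ASSUMED key layout (hypothetical, adjust to the real bucket layout):
#   tenants/<tenant_id>[-<shard_number:02x><shard_count:02x>]
#       /timelines/<timeline_id>/index_part.json-<generation:08x>
KEY_RE = re.compile(
    r"^tenants/(?P<tenant>[0-9a-f]{32})"
    r"(?:-(?P<num>[0-9a-f]{2})(?P<count>[0-9a-f]{2}))?"
    r"/timelines/[0-9a-f]{32}/index_part\.json-(?P<gen>[0-9a-f]{8})$"
)


def latest_generations(keys):
    """Map (tenant_id, shard_number, shard_count) -> highest generation seen."""
    latest = {}
    for key in keys:
        m = KEY_RE.match(key)
        if m is None:
            continue  # not an index object under the assumed layout
        # A missing shard suffix is treated as shard 0 of 0 (unsharded).
        shard = (m["tenant"], int(m["num"] or "0", 16), int(m["count"] or "0", 16))
        gen = int(m["gen"], 16)
        if gen > latest.get(shard, -1):
            latest[shard] = gen
    return latest


def dump_recovery_file(latest, path):
    """Write one JSON record per shard; the record shape is hypothetical."""
    records = [
        {"tenant_id": t, "shard_number": n, "shard_count": c, "generation": g}
        for (t, n, c), g in sorted(latest.items())
    ]
    with open(path, "w") as f:
        json.dump(records, f, indent=2)
    return records
```

An import tool could then read the same records back with `json.load` and insert one row per shard; the database schema is deliberately left out here because nothing in this issue specifies it.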
One challenge is figuring out which tenants in the bucket should be managed by the storage controller, vs. which ones are managed directly by the cloud control plane, or by some other storage controller. There isn't any information in S3 that indicates this. We could add it (e.g. in the IndexPart), but that's kind of invasive. It would be a better fit for a tenant-wide metadata file in S3, if and when we add such a thing.
As an interim measure, we could say "only sharded tenants are migrated to the storage controller" (this is the status quo), and provide a mode in the scrubber that reports shards + generations only for multi-sharded tenants.
The scrubber should also have a tenant ID filter, so that if we had a list of specific tenants that we wanted to recover, we could do that.
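The two filters described above (multi-sharded tenants only, and an explicit tenant ID list) could look roughly like the following Python sketch. The record shape (dicts with `tenant_id` and `shard_count` keys) and the function name `select_shards` are hypothetical, and the sketch assumes that a `shard_count` below 2 means the tenant is unsharded.

```python
def select_shards(records, multi_shard_only=False, tenant_ids=None):
    """Filter discovered shard records before writing them out.

    records: iterable of dicts with at least "tenant_id" and "shard_count"
             (a hypothetical shape chosen for this example).
    multi_shard_only: keep only shards of tenants with shard_count >= 2
             (assumption: 0 or 1 means unsharded).
    tenant_ids: optional iterable of tenant IDs to keep; None keeps all.
    """
    wanted = set(tenant_ids) if tenant_ids is not None else None
    selected = []
    for record in records:
        if multi_shard_only and record["shard_count"] < 2:
            continue
        if wanted is not None and record["tenant_id"] not in wanted:
            continue
        selected.append(record)
    return selected
```

Both filters compose, so a recovery run could restrict itself to a known list of sharded tenants in one pass.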