Implement time_travel_recover for Azure blob storage
The remote_storage code for Azure blob storage doesn't support the time_travel_recover function yet:
https://github.com/neondatabase/neon/blob/46f20308b0216a419a575c181bb1666f28b726fc/libs/remote_storage/src/azure_blob.rs#L505-L515
cc #5567
I've looked into this tonight, and the list API provides version data. I'm just a bit confused about deletion markers: apparently there are none, but there is a deleted-time/DeletedTime value. I don't know yet how that interacts with versioning and soft delete; I have to run some experiments.
According to the experiments, there is no indication that a version has been deleted: no deletion markers or anything like that. In other words, while we can restore an old version of a file, we don't know whether it was deleted in the meantime.
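To make this concrete, here is a minimal sketch of what the versioned listing gives us and what it cannot answer (the type is a hypothetical stand-in, not an actual azure_storage_blobs type):

```rust
use std::time::SystemTime;

/// Hypothetical stand-in for one entry of a versioned listing; only the fields
/// relevant to this discussion.
struct ListedVersion {
    key: String,
    /// Azure version IDs are creation timestamps of the version.
    created_at: SystemTime,
    /// The DeletedTime observed in the experiments: set on soft-deleted blobs,
    /// but per blob, not per version -- nothing says "the blob was deleted
    /// right after version X".
    deleted_time: Option<SystemTime>,
}

/// Pick the version that was current at `restore_point`. What the listing
/// alone cannot tell us: whether the blob had been deleted between that
/// version's creation and `restore_point`.
fn version_at(
    versions: &[ListedVersion], // all versions of one key, in any order
    restore_point: SystemTime,
) -> Option<&ListedVersion> {
    versions
        .iter()
        .filter(|v| v.created_at <= restore_point)
        .max_by_key(|v| v.created_at)
}
```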
I also looked into the point in time recovery feature. Apparently it is based on the change feed, but at least for customers there is no per-prefix index exposed, only a time-indexed structure. Maybe there are internal per-prefix indices; no clue. Assuming the worst case, however, that's a bit sad, because then the complexity of the restore operation grows with the size of the storage account, and it's not very scalable to recover tenants in huge regions.
There is a list of bugs/limitations of the point in time recovery feature here.
I think for the near future we should implement a wrapper over the builtin point in time recovery feature. Then, once it hits scaling problems we can think about building our own infrastructure that builds indices. Moving to paused as I will focus on different things this week.
Apparently the Azure SDK doesn't support it, so we'll have to do it on our own somehow.
Does this become simpler if we implement it by finding the index from a point in time, and then just undeleting all the layers that are referenced by the index? That way we don't have to worry about deletion times etc. for layers.
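A rough sketch of that idea; every helper below is a hypothetical placeholder for an individual remote-storage operation, not an existing neon or Azure SDK function:

```rust
use std::time::SystemTime;

/// Minimal stand-in for the parts of index_part.json that matter here.
struct IndexPart {
    layer_file_names: Vec<String>,
}

async fn fetch_version_at(_key: &str, _at: SystemTime) -> anyhow::Result<Vec<u8>> {
    unimplemented!("download the blob version that was current at the given time")
}

async fn undelete_layer(_layer_key: &str, _at: SystemTime) -> anyhow::Result<()> {
    unimplemented!("bring back the layer version that was current at the given time")
}

async fn restore_as_current(_key: &str, _bytes: &[u8]) -> anyhow::Result<()> {
    unimplemented!("write the bytes back as the current version of the key")
}

fn parse_index(_bytes: &[u8]) -> anyhow::Result<IndexPart> {
    unimplemented!("deserialize index_part.json")
}

/// Recover one timeline prefix by undeleting exactly the layers referenced by
/// the index that was current at `restore_point`, then re-instating that index.
async fn recover_via_index(restore_point: SystemTime) -> anyhow::Result<()> {
    let index_bytes = fetch_version_at("index_part.json", restore_point).await?;
    let index = parse_index(&index_bytes)?;

    // Only the layers named in the index need to come back; anything else under
    // the prefix is ignored, so per-layer deletion times never have to be known.
    for layer in &index.layer_file_names {
        undelete_layer(layer, restore_point).await?;
    }

    // One possible ordering: restore the index last, so a reader never sees an
    // index that points at layers which haven't been undeleted yet.
    restore_as_current("index_part.json", &index_bytes).await
}
```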
Wrote up some thoughts on this. I think manually writing out deletion markers is the best solution.
builtin time travel recovery
If we aim for the simplest solution, I think it makes most sense to add support for Azure's builtin time travel recovery of arbitrary prefixes to the SDK. They use the change feed.
The main issue with that approach is explained above: it's not scalable. The change feed contains all changes in the storage container, so the more tenants we have in it, the slower recovery gets, as we need to sift through all the irrelevant changes. The advantage is that it's correct, which for features like this should be the priority IMO.
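To make the complexity point concrete, a tiny sketch (the event type is a hypothetical stand-in for whatever the change feed records actually look like):

```rust
use std::time::SystemTime;

/// Hypothetical stand-in for a change feed record; not the actual SDK type.
struct ChangeFeedEvent {
    key: String, // full blob path
    event_time: SystemTime,
    kind: ChangeKind,
}

enum ChangeKind {
    BlobCreated,
    BlobDeleted,
}

/// The scalability problem in one function: to recover a single tenant prefix
/// we still have to scan every event the container produced in the time window
/// and throw most of them away.
fn relevant_events<'a>(
    all_events: &'a [ChangeFeedEvent], // the whole container's feed for the window
    prefix: &'a str,
    from: SystemTime,
    to: SystemTime,
) -> impl Iterator<Item = &'a ChangeFeedEvent> {
    all_events.iter().filter(move |e| {
        e.event_time >= from && e.event_time <= to && e.key.starts_with(prefix)
    })
}
```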
We could recognize that Azure blob storage is not scalable in general. For example, there are global network limits, while AWS only has such limits per sharded prefix, i.e. AWS automatically splits up very busy/large buckets in the background so that multiple non-scalable components share the load, while Azure always requires a single component to handle everything. We should try to confirm this with Microsoft.
I don't know when we will hit those network limits on our prod buckets, or whether the change feed has significantly lower scalability limits. Probably yes, though.
finding the index from a point in time, and then just undelete all the layers that are referenced by the index?
The same problem exists on the timeline level as well. Say there are timestamps A and B, where at A a timeline existed and at B that timeline was deleted. If we now recover to timestamp A, we will undelete the timeline. If we then recover to timestamp B, we will not know that the timeline had been deleted, because that information has been destroyed.
There are more weird scenarios one can construct. Say there are timestamps A, B, C, D, E. Timestamps A and B are from normal traffic where there are two different generations, while timestamps C, D, E come from recoveries. At timestamp C we first recover to A, which deletes the index part that existed at B. At timestamp D we recover to B, which recreates all index parts that existed at B. If you now, at timestamp E, wanted to time travel to timestamp C, you would actually still get the index parts from B.
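To spell that second scenario out, here's a toy simulation of a versioned blob where deletions leave no trace (purely illustrative, nothing to do with the real API):

```rust
use std::time::{Duration, SystemTime};

/// Toy model of a versioned blob where deletions leave no record: only the
/// versions that were written survive.
struct VersionedBlob {
    /// (version creation time, payload)
    versions: Vec<(SystemTime, &'static str)>,
}

impl VersionedBlob {
    /// Naive time travel: "the version that was current at `t`".
    fn version_at(&self, t: SystemTime) -> Option<&'static str> {
        self.versions
            .iter()
            .filter(|(created, _)| *created <= t)
            .max_by_key(|(created, _)| *created)
            .map(|(_, payload)| *payload)
    }
}

fn main() {
    let at = |s| SystemTime::UNIX_EPOCH + Duration::from_secs(s);
    // The five timestamps from the scenario above.
    let (_a, b, c, d, _e) = (at(1), at(2), at(3), at(4), at(5));

    // Index part of the second generation: written at B, deleted by the
    // recovery at C (leaving no trace), recreated by the recovery to B at D.
    let index_0002 = VersionedBlob {
        versions: vec![(b, "written at B"), (d, "recreated at D")],
    };

    // Right after the recovery at C the file was gone, so time travelling to C
    // should report "not present" -- but without any deletion record we get
    // B's content back instead.
    assert_eq!(index_0002.version_at(c), Some("written at B"));
}
```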
The good news is that these scenarios all involve time travel recovery. The only place I can think of where we can hit the "undelete" issue outside of that is with future layers, where we first delete a future layer, then write it again. Not sure how much of a problem missing that one is.
We could think of just writing out manual deletion markers whenever we do undeletions.
manual deletion markers
What if we had a deletions/ prefix next to the tenants/ one, where we put empty files before we do deletions? I.e. we would put deletions/tenants/5439b282a390368078777772773e10c4/timelines/dfc1e07a8bfd4d5f3be6d0c59b866586/index-part.json-0001.2025-03-13T13:16:21 before we delete tenants/5439b282a390368078777772773e10c4/timelines/dfc1e07a8bfd4d5f3be6d0c59b866586/index-part.json-0001 (a sketch of the naming scheme follows below).
We could obtain the deletion date by listing that prefix and parsing the paths.
We could either add an auto-deletion lifecycle policy for these markers, or delete the deletion markers the moment we create them (there are some ordering questions w.r.t. the actual deletion and which type of error case you prefer... maybe the scrubber could take care of it or something).
We could add these manual deletion markers either for all deletions or for a subset of them, or we could just backfill them whenever we do undeletions. The latter is the minimum needed for correctness, but if we want to catch cases we haven't thought of (as well as the future layers issue), we probably need to add markers for every layer file deletion.
This approach would be scalable and allows us to be generic (so we don't need to teach the low level time travel recovery code about our storage format).
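A minimal sketch of that naming scheme (the helper names are mine, and the exact timestamp format is just an illustrative choice, using the humantime crate):

```rust
use std::time::SystemTime;

/// Marker key to write under the deletions/ prefix before deleting `key`,
/// i.e. "deletions/<original key>.<deletion timestamp>".
fn deletion_marker_path(key: &str, deleted_at: SystemTime) -> String {
    format!(
        "deletions/{key}.{}",
        humantime::format_rfc3339_seconds(deleted_at)
    )
}

/// Recover the deletion date of `key` by listing the deletions/ prefix and
/// parsing the timestamp suffix back out of the marker paths.
fn parse_deletion_time(marker_path: &str, key: &str) -> Option<SystemTime> {
    let timestamp = marker_path
        .strip_prefix("deletions/")?
        .strip_prefix(key)?
        .strip_prefix('.')?;
    humantime::parse_rfc3339(timestamp).ok()
}
```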
I think we only need to cover the pageserver use case for now: it's totally fine to have extra files in the pageserver tenant file repo. I'll implement a version that undeletes all the files referenced by index_part at a point in the history.
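For the low-level "undelete" step, my working assumption is that an old version can be promoted by copying it over the base blob. A rough sketch; the copy helper is a hypothetical wrapper rather than an existing SDK call, and the versionid query parameter is an assumption about how a specific version is addressed:

```rust
/// Promote an old version of `key` to be the current blob again by copying it
/// over the base blob.
async fn promote_version(
    container_url: &str,
    key: &str,
    version_id: &str,
) -> anyhow::Result<()> {
    // Source URL addressing one specific version of the blob.
    let source = format!("{container_url}/{key}?versionid={version_id}");
    let destination = format!("{container_url}/{key}");
    copy_blob_from_url(&destination, &source).await
}

async fn copy_blob_from_url(_destination: &str, _source: &str) -> anyhow::Result<()> {
    unimplemented!("issue the storage service's copy-blob-from-URL operation")
}
```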