
S3 based recovery

Open arpad-m opened this issue 1 year ago • 5 comments

Adds a new time_travel_recover function to the RemoteStorage trait that provides time-travel-like functionality for S3 buckets, regardless of their content (it is not even pageserver specific). It takes a different approach from the one described in this post, which is more complicated.

It takes as input a prefix and a timestamp:

  • executes ListObjectVersions
  • deletes all object versions with timestamps after the target (sketched below)
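
A minimal sketch of this flow, assuming the aws-sdk-s3 1.x crate and a versioning-enabled bucket (the function name is made up, pagination and delete-marker handling are omitted, and this illustrates the idea rather than reproducing the PR's actual code):

use aws_sdk_s3::{primitives::DateTime, Client};

// Hypothetical helper: permanently delete every version under `prefix` that
// was written after `target`, so that the newest surviving version of each
// key is the one that was current at `target`.
async fn recover_prefix_to(
    client: &Client,
    bucket: &str,
    prefix: &str,
    target: DateTime,
) -> anyhow::Result<()> {
    // Real code must paginate via key_marker/version_id_marker and would also
    // have to remove delete markers that are newer than `target`.
    let listing = client
        .list_object_versions()
        .bucket(bucket)
        .prefix(prefix)
        .send()
        .await?;

    for version in listing.versions() {
        let (Some(key), Some(version_id)) = (version.key(), version.version_id()) else {
            continue;
        };
        let newer_than_target = version
            .last_modified()
            .map(|ts| ts.secs() > target.secs())
            .unwrap_or(false);
        if !newer_than_target {
            continue;
        }
        // Deleting a specific version id is irreversible; this is exactly
        // what makes the v1 approach unrecoverable.
        client
            .delete_object()
            .bucket(bucket)
            .key(key)
            .version_id(version_id)
            .send()
            .await?;
    }
    Ok(())
}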

The approach fulfills all the requirements laid out in #8233, but it has the shortcoming that it is an unrecoverable operation: the versions since the target timestamp are permanently deleted. In other words, you can use this functionality, but you will lose all history after the target. This is not a good property for a disaster recovery tool.

TODO:

  • Maybe implement a different approach: take two timestamps as input, a "start of request" timestamp, where the caller guarantees that no edits to recover from have been made after that date, and a "target" timestamp. The code would then list all versions, filter out any key that has versions with timestamps after the start, and, for the remaining keys, copy the old version back over the key. This is slightly more complicated, but it also fulfills the requirements and does no permanent deletion (a sketch of this variant follows below).
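
A rough sketch of that two-timestamp variant, under the same aws-sdk-s3 1.x assumption (single listing page, no delete-marker handling, illustrative names, and following the outline above rather than whatever was eventually merged); the key point is that the restore step is a CopyObject from an old version id back onto the same key, so no history is destroyed:

use std::collections::BTreeMap;

use aws_sdk_s3::{primitives::DateTime, types::ObjectVersion, Client};

// Hypothetical helper for the two-timestamp variant: `start` is the
// "start of request" bound, `target` is the point in time to restore to.
async fn restore_prefix_without_deleting(
    client: &Client,
    bucket: &str,
    prefix: &str,
    target: DateTime,
    start: DateTime,
) -> anyhow::Result<()> {
    // Single listing page only for brevity; real code has to paginate.
    let listing = client
        .list_object_versions()
        .bucket(bucket)
        .prefix(prefix)
        .send()
        .await?;

    // Group the versions by key so each key's history can be inspected.
    let mut by_key: BTreeMap<&str, Vec<&ObjectVersion>> = BTreeMap::new();
    for v in listing.versions() {
        if let Some(key) = v.key() {
            by_key.entry(key).or_default().push(v);
        }
    }

    for (key, versions) in by_key {
        let modified_after = |bound: &DateTime| {
            versions.iter().any(|v| {
                v.last_modified()
                    .map(|ts| ts.secs() > bound.secs())
                    .unwrap_or(false)
            })
        };
        // Per the outline above: the caller guarantees nothing needing
        // recovery was written after `start`, so skip keys touched after it.
        if modified_after(&start) {
            continue;
        }
        // Nothing changed since `target`: the key is already in the right state.
        if !modified_after(&target) {
            continue;
        }
        // Find the newest version that is not newer than `target` ...
        let Some(restore_from) = versions
            .iter()
            .filter(|v| {
                v.last_modified()
                    .map(|ts| ts.secs() <= target.secs())
                    .unwrap_or(false)
            })
            .max_by_key(|v| v.last_modified().map(|ts| ts.secs()).unwrap_or(i64::MIN))
        else {
            // The key did not exist at `target`.
            continue;
        };
        // ... and copy it back on top of the key. This creates a new latest
        // version with the old contents, so no history is destroyed.
        // (Real code must URL-encode the key inside `copy_source`.)
        client
            .copy_object()
            .bucket(bucket)
            .key(key)
            .copy_source(format!(
                "{bucket}/{key}?versionId={}",
                restore_from.version_id().unwrap_or_default()
            ))
            .send()
            .await?;
    }
    Ok(())
}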

Part of https://github.com/neondatabase/cloud/issues/8233

arpad-m avatar Dec 15 '23 21:12 arpad-m

2310 tests run: 2218 passed, 0 failed, 92 skipped (full report)


Flaky tests (3)

Postgres 16

  • test_secondary_mode_eviction: release
  • test_statvfs_pressure_min_avail_bytes: debug

Postgres 15

  • test_crafted_wal_end[last_wal_record_crossing_segment]: release

Code coverage (full report)

  • functions: 54.4% (10746 of 19771 functions)
  • lines: 81.1% (60991 of 75193 lines)

The comment gets automatically updated with the latest test results
624ff1812f68dee20d11637b3f45e270f470eaff at 2024-01-25T15:38:43.099Z :recycle:

github-actions[bot] avatar Dec 15 '23 22:12 github-actions[bot]

first CI failure debug:

thread 's3_time_travel_recovery_works' panicked at libs/remote_storage/tests/test_real_s3.rs:215:5:
assertion `left == right` failed
  left: {RemotePath("test/path1"), RemotePath("test/path2")}
 right: {RemotePath("test/path2"), RemotePath("test/path3")}

release:

 thread 's3_time_travel_recovery_works' panicked at libs/remote_storage/tests/test_real_s3.rs:215:5:
assertion `left == right` failed
  left: {RemotePath("test/path2"), RemotePath("test/path1")}
 right: {RemotePath("test/path3"), RemotePath("test/path2")}

second CI failure debug:

at t2: {RemotePath("test/path2"), RemotePath("test/path3")}
after recovery to t2: {}
thread 's3_time_travel_recovery_works' panicked at libs/remote_storage/tests/test_real_s3.rs:211:5:
assertion `left == right` failed
  left: {RemotePath("test/path2"), RemotePath("test/path3")}
 right: {}

same goes for release:

 at t2: {RemotePath("test/path2"), RemotePath("test/path3")}
after recovery to t2: {}
thread 's3_time_travel_recovery_works' panicked at libs/remote_storage/tests/test_real_s3.rs:211:5:
assertion `left == right` failed
  left: {RemotePath("test/path2"), RemotePath("test/path3")}
 right: {}
stack backtrace:

In the last commit I have changed nothing that would make it fail earlier: I only increased the time to wait.

arpad-m avatar Dec 19 '23 13:12 arpad-m

I made it work locally, but it still errors when I run it in the CI...

arpad-m avatar Dec 20 '23 02:12 arpad-m

The approach fulfills all the requirements laid out in #8233, but it has the shortcoming that it is an unrecoverable operation: the versions since the target timestamp are permanently deleted.

Deleting everything since the timestamp is too broad, especially for layer files, which are totally inert if not referenced by an index: in DR there is no utility in deleting these; it is always sufficient to restore an index that does not reference them.

There are two main classes of issue the tool addresses:

  • Incorrectly deleted layers, where the index is okay but the layer is gone. For this type of issue, we only need to delete deletion markers.
  • Corrupt metadata, where we want to roll back the index to an earlier version and make sure that the older layers referenced by that index are accessible.

Layers and other objects should probably be treated differently:

  • For layers, delete only the deletion markers that are more recent than the target timestamp (a sketch of this follows after the list)
  • For indices, roll back to a particular version of the object, but take copies of more recent versions for safety - e.g. by doing a CopyObject to some "recovery_trash/" prefix (either remotely in S3 or locally on the box where the tool is being run).
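
For the first bullet, a minimal sketch of removing only the recent delete markers, again assuming aws-sdk-s3 1.x and ignoring pagination (a hypothetical helper, not the tool's code); removing a delete marker makes the previous data version current again without touching any layer bytes:

use aws_sdk_s3::{primitives::DateTime, Client};

// Hypothetical helper: undo deletions of layer files under `prefix` by
// removing delete markers newer than `target`; data versions are untouched.
async fn undelete_layers_after(
    client: &Client,
    bucket: &str,
    prefix: &str,
    target: DateTime,
) -> anyhow::Result<()> {
    let listing = client
        .list_object_versions()
        .bucket(bucket)
        .prefix(prefix)
        .send()
        .await?;

    for marker in listing.delete_markers() {
        let (Some(key), Some(version_id)) = (marker.key(), marker.version_id()) else {
            continue;
        };
        let newer_than_target = marker
            .last_modified()
            .map(|ts| ts.secs() > target.secs())
            .unwrap_or(false);
        if !newer_than_target {
            continue;
        }
        // Deleting the delete marker itself makes the previous data version
        // current again, i.e. the incorrectly deleted layer reappears.
        client
            .delete_object()
            .bucket(bucket)
            .key(key)
            .version_id(version_id)
            .send()
            .await?;
    }
    Ok(())
}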

Some level of index-awareness would be helpful in the tool: e.g. when restoring to a particular index, walk the layers in the index and check that they are all present in S3; that would be a quick way to detect corruptions involving a bad link to a layer.
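
A small sketch of that index-aware presence check, again assuming the aws-sdk-s3 1.x crate, with the layer keys passed in as plain strings (parsing the real index format is out of scope here and the helper name is made up); one HeadObject per referenced layer is enough to spot dangling references:

use aws_sdk_s3::Client;

// Hypothetical check: given the layer object keys referenced by the index
// being restored, verify that each one exists in S3 and report the missing ones.
async fn check_layers_present(
    client: &Client,
    bucket: &str,
    layer_keys: &[String],
) -> anyhow::Result<Vec<String>> {
    let mut missing = Vec::new();
    for key in layer_keys {
        // HEAD the object; a NotFound means the index references a layer
        // that is not (or no longer) present.
        match client.head_object().bucket(bucket).key(key).send().await {
            Ok(_) => {}
            Err(err)
                if err
                    .as_service_error()
                    .map(|e| e.is_not_found())
                    .unwrap_or(false) =>
            {
                missing.push(key.clone());
            }
            Err(other) => return Err(other.into()),
        }
    }
    Ok(missing)
}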

jcsp avatar Dec 22 '23 16:12 jcsp

especially for layer files, which are totally inert if not referenced by an index: in DR there is no utility in deleting these; it is always sufficient to restore an index that does not reference them.

There is this paragraph in the epic: https://github.com/neondatabase/cloud/issues/8233

The outcome of a successful workflow execution is that the ListObjects output for the restored tenant is the same as it actually was at the point in time to which we restored, and each object has the same contents as it had at the point in time to which we restored.

If we don't do this, we introduce some extra configuration state to the S3 bucket, namely "some layer files and other data present from the future". That extra state needs to be reasoned about. It's fine now, but will it always be fine in the future?

The other reason to just restore everything is that the operation is then generic and lets us use it for other DR scenarios as well. The pageserver is not the only component using S3, and we might want to DR other buckets or prefixes too.

The current state of this PR represents v1 of S3-based recovery and is not the intended end state, and it's good to see that you agree that (permanent) deletion is not good, because v2 would avoid it. I outlined the approach for v2 here. It would just be a change to the function's implementation, together with a change of its API.

arpad-m avatar Dec 22 '23 17:12 arpad-m

I have now implemented the new approach outlined in https://github.com/neondatabase/cloud/issues/8233#issuecomment-1858540063 and the unit test is green.

I will remove some debug prints next, but reviews are welcome.

arpad-m avatar Jan 15 '24 20:01 arpad-m

Hmmm so CI is apparently also getting these failures (link):

Error: dispatch failure

Caused by:
    0: other
    1: connection closed before message completed

Stack backtrace:
   0: <core::result::Result<T,F> as core::ops::try_trait::FromResidual<core::result::Result<core::convert::Infallible,E>>>::from_residual
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/result.rs:1963:27
      <remote_storage::s3_bucket::S3Bucket as remote_storage::RemoteStorage>::time_travel_recover::{{closure}}
             at ./src/s3_bucket.rs:654:20
   1: <core::pin::Pin<P> as core::future::future::Future>::poll
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/future/future.rs:125:9
      remote_storage::GenericRemoteStorage::time_travel_recover::{{closure}}
             at ./src/lib.rs:411:22

I'm not sure what is causing them, but I have seen them occasionally, sometimes even during the setup stage.

arpad-m avatar Jan 17 '24 12:01 arpad-m

I'm not sure what is causing them, but I have seen them occasionally, sometimes even during the setup stage.

I think these are normal "S3 being S3" errors. With the semi-recent scrubber run in prod, I ended up pushing up the retries (#6218, later isolated that in some PR), but these could happen in all the places that are currently not retried (the upload stream -- which could not be retried even if it were configured).

Unsure what the best answer is here. Introducing backoff::retry would balloon the test to a few MB of code...

koivunej avatar Jan 17 '24 12:01 koivunej

nextest has retries for flaky tests: https://nexte.st/book/retries.html. Running the azure and s3 tests with retries might work. (I did not check if they are already configured.)

However, what has changed? This PR introduces more listing, but it doesn't feel like our CI runners could be running these tests often enough to amount to an event like multiple pageserver restarts in a region.

koivunej avatar Jan 17 '24 14:01 koivunej

Hmmm it seems to fail even after 4 tries:

    RETRY 4/4 [         ] remote_storage::test_real_s3 s3_time_travel_recovery_works
  TRY 4 FAIL [   6.153s] remote_storage::test_real_s3 s3_time_travel_recovery_works

(with the same "dispatch failure" error)

arpad-m avatar Jan 18 '24 10:01 arpad-m

Merged, follow-ups are in #6478 and #6479.

arpad-m avatar Jan 25 '24 17:01 arpad-m