neon
WIP: Upload partial segments
Add support for backing up partial segments to remote storage. Disabled by default; can be enabled with `--partial-backup-enabled`.
Each safekeeper timeline has a background task subscribed to `commit_lsn` and `flush_lsn` updates. After the partial segment is updated (`flush_lsn` has changed), the segment will be uploaded to S3 in about 15 minutes.
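As a rough illustration of this flow, here is a minimal sketch of such a debounced upload loop; the task name, the `upload_partial_segment` helper and the plain `u64` LSN stand-in are assumptions for this sketch, not the PR's actual code:

```rust
use std::time::Duration;
use tokio::sync::watch;

// Stand-in for the safekeeper's LSN type (the real code uses its own Lsn struct).
type Lsn = u64;

/// How long the timeline has to stay quiet after a flush_lsn change before we upload.
const UPLOAD_DELAY: Duration = Duration::from_secs(15 * 60);

/// Placeholder for the actual upload: read the tail of the current segment
/// from disk and PUT it to remote storage under the .partial name.
async fn upload_partial_segment(flush_lsn: Lsn) -> anyhow::Result<()> {
    println!("uploading partial segment up to {flush_lsn:#x}");
    Ok(())
}

/// Hypothetical background loop: wake up on flush_lsn changes and upload the
/// partial segment once no new WAL has arrived for UPLOAD_DELAY.
async fn partial_backup_task(mut flush_lsn_rx: watch::Receiver<Lsn>) -> anyhow::Result<()> {
    let mut last_uploaded: Lsn = 0;
    loop {
        // Wait for the next flush_lsn update.
        flush_lsn_rx.changed().await?;

        // Debounce: keep resetting the timer while new WAL keeps arriving.
        loop {
            tokio::select! {
                _ = tokio::time::sleep(UPLOAD_DELAY) => break,
                res = flush_lsn_rx.changed() => res?,
            }
        }

        let flush_lsn = *flush_lsn_rx.borrow();
        if flush_lsn > last_uploaded {
            upload_partial_segment(flush_lsn).await?;
            last_uploaded = flush_lsn;
        }
    }
}
```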
The filename format for partial segments is `ZZZ_XXX_skYY.partial` (see the sketch below), where:
- `ZZZ` – the segment name, e.g. `000000010000000000000001`
- `XXX` – `commit_lsn` in hex format `{:016X}`, e.g. `00000000346BC568`
- `YY` – `safekeeper_id`, e.g. `1`
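A minimal sketch of assembling such a name (the helper name is hypothetical; only the format itself comes from the description above):

```rust
// Build a partial-segment object name of the form ZZZ_XXX_skYY.partial,
// e.g. "000000010000000000000001_00000000346BC568_sk1.partial".
fn partial_segment_name(segment_name: &str, commit_lsn: u64, safekeeper_id: u64) -> String {
    format!("{}_{:016X}_sk{}.partial", segment_name, commit_lsn, safekeeper_id)
}

fn main() {
    let name = partial_segment_name("000000010000000000000001", 0x346BC568, 1);
    assert_eq!(name, "000000010000000000000001_00000000346BC568_sk1.partial");
    println!("{name}");
}
```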
Each safekeeper keeps info about remote partial segments in its control file. The code updates the state in the control file before doing any S3 operations. This way the control file stores information about all potentially existing remote partial segments, and they can be cleaned up after a newer version has been uploaded.
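A hedged sketch of that ordering with a hypothetical `ControlFile` type (the real structures in the PR will differ):

```rust
// Hypothetical state-first upload flow: record the new object name in the
// control file *before* touching S3, so every object that might exist
// remotely is always listed locally and can be garbage-collected later.
struct ControlFile {
    /// Names of partial-segment objects that may exist in remote storage.
    uploaded_partial: Vec<String>,
}

impl ControlFile {
    fn persist(&self) -> anyhow::Result<()> {
        // Placeholder: fsync the serialized control file to disk.
        Ok(())
    }
}

async fn upload_new_partial(cf: &mut ControlFile, new_name: String) -> anyhow::Result<()> {
    // 1. Remember the object we are about to create, and persist that fact.
    cf.uploaded_partial.push(new_name.clone());
    cf.persist()?;

    // 2. Only now perform the S3 upload (placeholder).
    // remote_storage.upload(&new_name, ...).await?;

    // 3. Delete all older objects, then drop them from the control file.
    let stale: Vec<String> = cf
        .uploaded_partial
        .iter()
        .filter(|n| **n != new_name)
        .cloned()
        .collect();
    for name in &stale {
        // remote_storage.delete(name).await?; // placeholder
        let _ = name;
    }
    cf.uploaded_partial.retain(|n| *n == new_name);
    cf.persist()?;

    Ok(())
}
```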
Closes #6336
2772 tests run: 2628 passed, 0 failed, 144 skipped (full report)
Flaky tests (2)
Postgres 16
- `test_compute_pageserver_connection_stress`: debug
Postgres 14
- `test_get_tenant_size_with_multiple_branches`: debug
Code coverage* (full report)
- functions: 27.9% (6387 of 22856 functions)
- lines: 46.8% (44997 of 96051 lines)
* collected from Rust tests only
803ca680da59bf2ca7ff306fe3f8e874332c38f0 at 2024-04-03T15:35:11.778Z :recycle:
This PR is not yet polished, but I want to have a pre-review for the main upload algo (the important part) before finishing the PR. In other words, the following steps:
- await the review for this draft version (the important parts)
- fix the feedback, polish, undraft
- quick review, merge, deploy to staging
So, waiting for review @arssher :) (not urgent)
Hmm. During yesterday's call I assumed that we always remove the previous version before uploading the next one (i.e. we only ever have at most 1 uploaded); now, looking at the code, I see this is not the case. In this case, a name clash (different file contents under the same file name) is indeed not good, i.e. gc looks unsafe with respect to it. While it is fixable, it is safer/simpler to just always have unique names by including `term` + `flush_lsn` there.
Having multiple segments offloaded is a bit more complicated, but it gives a kind of atomicity, i.e. we are never in a state where we don't have a partial offloaded at all (if one has ever been offloaded), which seems to be good. Though note that upload up to `flush_lsn` is not reliable anyway: 1) it might be from the wrong term history, 2) in theory there might be multiple uncommitted segments.
In this case, a name clash (different file contents under the same file name) is indeed not good, i.e. gc looks unsafe with respect to it. While it is fixable, it is safer/simpler to just always have unique names by including `term` + `flush_lsn` there.
It is simple, but the filename becomes quite long; I'm afraid we'll hit the S3 filename limit eventually. WDYT?
Though note that upload up to `flush_lsn` is not reliable anyway: 1) it might be from the wrong term history
Is this bad at all? For inactive timelines the partial segment in S3 will eventually match the partial segment on disk, even for uncommitted WAL.
- in theory there might be multiple uncommitted segments.
True, but the number of such timelines should be very close to zero. I'd not fix this in this PR, and if it becomes an issue this can be fixed later.
Other than that, any more ideas for the first version of "partial segments upload"?
It is simple, but the filename becomes quite long; I'm afraid we'll hit the S3 filename limit eventually. WDYT?
Should be ok, the limit is 1024 in both S3 and Azure Blob Storage; 2^64 in decimal is about 10^19, so there is plenty of space. So I'd at least have `term` + `flush_lsn` there, and if we reupload on `commit_lsn` change, include `commit_lsn` as well.
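For a rough sense of the length budget (the exact extended format is hypothetical here, not something decided in this thread):

```rust
// Rough length budget for a name like
// <24-char segment>_<term:016X>_<flush_lsn:016X>_<commit_lsn:016X>_sk<id>.partial
fn main() {
    let segment = "000000010000000000000001"; // 24 chars
    let hex_fields = 3 * (1 + 16);             // '_' + 16 hex chars each for term, flush_lsn, commit_lsn
    let suffix = "_sk1.partial".len();         // safekeeper id + extension
    let total = segment.len() + hex_fields + suffix;
    println!("worst-case-ish name length: {total} chars"); // ~87, far below the 1024 key limit
    assert!(total < 1024);
}
```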
Is this bad at all? For inactive timelines the partial segment in S3 will eventually match the partial segment on disk, even for uncommitted WAL.
No, it's expected indeed.
True, but the number of such timelines should be very close to zero. I'd not fix this in this PR, and if it becomes an issue this can be fixed later.
Yeah, definitely not very important. We could also simply hold off the partial upload if the full one is not done yet.
Other than that, any more ideas for the first version of "partial segments upload"?
Approach LGTM
After looking at S3 lifecycle rules (https://docs.aws.amazon.com/AmazonS3/latest/userguide/lifecycle-configuration-examples.html), I realized that it might be useful to distinguish full WAL segments from partial segments.
I think I'm going to add a special tag to partial segments, so we can select them in a lifecycle rule.
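As an illustration only (the project goes through its own remote_storage abstraction, and the tag name here is made up), attaching such a tag with the raw aws-sdk-s3 crate could look roughly like this; a lifecycle rule with a matching tag filter can then target only partial segments:

```rust
use aws_sdk_s3::{primitives::ByteStream, Client};

// Upload a partial segment with a tag such as "kind=walpartial", so that a
// lifecycle rule with a tag filter applies only to partial segments.
async fn upload_tagged_partial(
    client: &Client,
    bucket: &str,
    key: &str,
    bytes: Vec<u8>,
) -> anyhow::Result<()> {
    client
        .put_object()
        .bucket(bucket)
        .key(key)
        .body(ByteStream::from(bytes))
        // The tag-set is passed as URL query parameters.
        .tagging("kind=walpartial")
        .send()
        .await?;
    Ok(())
}
```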
Ok, ready for review now.
Note that the control file upgrade is not forward compatible, i.e. we cannot roll back safekeepers after the deploy. Also, I tried to write a test for this PR, but ran into problems with S3 inspection from the Python tests, so I decided to test this on staging instead.
LGTM. Let's put the PR description into the `wal_backup_partial.rs` head comment.
@bayandin do you know how to make `test_forward_compatibility` pass? The breakage in forward compatibility is expected.
There's no good way to do so for now. Is it possible not to break forward compatibility, maybe? 😄 Handling it via a label would make adding that label to all other PRs necessary until we release these changes, so we could miss some other unexpected breaking changes.
I'll think about what we can do here.
@bayandin do you know how to make `test_forward_compatibility` pass? The breakage in forward compatibility is expected.
The general answer for forward compatibility is to merge+release a change that knows how to read the new format, and then later merge+release a change that enables writing it -- this is what we do for pageserver stuff.
A bit of context: the control file is the file where safekeepers store all the data about timelines (everything but the WAL itself). For historic reasons, this is a binary file which is serialized directly from `struct TimelinePersistentState`. It means that any change to this structure (and its dependants) requires an upgrade.
The issue here is that control files in safekeepers have always worked that way, and an upgrade has always broken forward compatibility. It seems that the last control file upgrade was 18 months ago, so we never hit this issue before because `test_forward_compatibility` was probably created after that...
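A hedged illustration of why this breaks forward compatibility after a rollback; the version numbers, names and layout below are made up for the sketch and are not the real control file format:

```rust
// Hedged illustration: the control file is a versioned binary blob that is
// deserialized straight into a struct, so an old binary that only knows
// versions <= SUPPORTED_VERSION simply refuses anything newer.

const SUPPORTED_VERSION: u32 = 8; // what the *old* release understands (hypothetical)

struct RawControlFile {
    version: u32,
    payload: Vec<u8>, // binary-encoded state of that version
}

fn load_control_file(raw: &RawControlFile) -> Result<(), String> {
    if raw.version > SUPPORTED_VERSION {
        // The forward-compatibility break: after a rollback, the old
        // safekeeper hits a version it has no decoder for and cannot start.
        return Err(format!(
            "unsupported control file version {} (max supported {})",
            raw.version, SUPPORTED_VERSION
        ));
    }
    // ... decode the payload for a known version and, if needed, upgrade it ...
    let _ = &raw.payload;
    Ok(())
}

fn main() {
    let written_by_new_binary = RawControlFile { version: 9, payload: vec![] };
    println!("{:?}", load_control_file(&written_by_new_binary)); // Err(...)
}
```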
I dislike that a control file upgrade currently requires a breaking change, but I see only two fixes for this:
- merge+release twice for a single change, and write 2x more code to handle control file rollback
- overhaul control files completely and use a key-value serialization format for them (e.g. protobuf, json...)
Neither option is simple and I don't think I want to work on them right now. It also seems fine to allow forward incompatibility for safekeepers – if something goes wrong, issues can always be fixed without a rollback. Unfortunately, I see no good way to omit `test_forward_compatibility` in our current CI.
I'm not really sure what should be done next here...
The last time, in a similar situation, I just manually added an exclusion to the `test_compatibility` code.
@bayandin is it ok if I add `@pytest.mark.skip` to `test_forward_compatibility` for a week?
If we do this, we won't catch any forward compatibility failures from other PRs. Let me come up with a better solution for this.
Added `@pytest.mark.xfail` to `test_forward_compatibility` to unblock the merge. Will remove it after the next release.