
WIP: Upload partial segments

petuhovskiy opened this pull request 1 year ago

Add support for backing up partial segments to remote storage. The feature is disabled by default and can be enabled with --partial-backup-enabled.

Each safekeeper timeline has a background task subscribed to commit_lsn and flush_lsn updates. After the partial segment is updated (i.e. flush_lsn changes), the segment is uploaded to S3 within about 15 minutes.
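
A rough sketch of what such a task could look like (a minimal illustration only: `partial_backup_task`, `UPLOAD_DELAY`, and the stub `upload_partial_segment` are hypothetical names, not the actual safekeeper code):

```rust
use std::time::Duration;
use tokio::{sync::watch, time::sleep};

/// Debounce period before uploading; stands in for the "about 15 minutes"
/// mentioned above (hypothetical constant).
const UPLOAD_DELAY: Duration = Duration::from_secs(15 * 60);

/// Illustrative background task: wakes up when flush_lsn advances and
/// uploads the current partial segment after a quiet period.
async fn partial_backup_task(mut flush_lsn_rx: watch::Receiver<u64>) {
    loop {
        // Wait for the next flush_lsn update; exit when the sender is dropped
        // (timeline shutdown).
        if flush_lsn_rx.changed().await.is_err() {
            return;
        }

        // Give the segment time to accumulate more WAL before uploading.
        sleep(UPLOAD_DELAY).await;

        let flush_lsn = *flush_lsn_rx.borrow();
        upload_partial_segment(flush_lsn).await;
    }
}

/// Stub standing in for the real upload path.
async fn upload_partial_segment(flush_lsn: u64) {
    println!("would upload partial segment up to LSN {flush_lsn:X}");
}
```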

The filename format for partial segments is ZZZ_XXX_skYY.partial, where:

  • ZZZ – the segment name, like 000000010000000000000001
  • XXX – commit_lsn in hex format {:016X}, e.g. 00000000346BC568
  • YY – safekeeper_id, like 1
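
For illustration, a name in this format could be built roughly like this (the helper and its arguments are hypothetical; the naming scheme is also discussed further below, where adding term and flush_lsn comes up):

```rust
/// Hypothetical helper producing the ZZZ_XXX_skYY.partial name described above.
fn partial_segment_name(segment_name: &str, commit_lsn: u64, safekeeper_id: u64) -> String {
    format!("{segment_name}_{commit_lsn:016X}_sk{safekeeper_id}.partial")
}

fn main() {
    // Prints: 000000010000000000000001_00000000346BC568_sk1.partial
    println!("{}", partial_segment_name("000000010000000000000001", 0x346BC568, 1));
}
```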

Each safekeeper keeps information about remote partial segments in its control file. The code updates the state in the control file before performing any S3 operations, so the control file always describes all potentially existing remote partial segments, and older versions can be cleaned up after a newer one has been uploaded.
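
A minimal sketch of that ordering (struct, field, and function names are made up for illustration; the real state lives in the safekeeper control file):

```rust
/// Illustrative per-timeline record of partial segments that may exist remotely.
#[derive(Default)]
struct PartialBackupState {
    /// Every object name we may have created in S3, including the one about
    /// to be uploaded; kept so stale versions can always be cleaned up.
    uploaded_objects: Vec<String>,
}

/// Sketch of the flow: persist intent first, then touch S3, then garbage-collect
/// older versions. A crash between steps leaves the control file still listing
/// every object that might exist remotely.
async fn upload_new_partial(state: &mut PartialBackupState, new_name: String) {
    // 1. Record the new object in the control file *before* any S3 operation.
    state.uploaded_objects.push(new_name.clone());
    persist_control_file(state).await;

    // 2. Upload the new partial segment.
    s3_upload(&new_name).await;

    // 3. Delete older versions and drop them from the control file.
    let stale: Vec<String> = state
        .uploaded_objects
        .iter()
        .filter(|n| **n != new_name)
        .cloned()
        .collect();
    for name in &stale {
        s3_delete(name).await;
    }
    state.uploaded_objects.retain(|n| *n == new_name);
    persist_control_file(state).await;
}

// Stubs standing in for the real persistence and remote-storage calls.
async fn persist_control_file(_state: &PartialBackupState) {}
async fn s3_upload(_name: &str) {}
async fn s3_delete(_name: &str) {}
```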

Closes #6336

petuhovskiy avatar Jan 30 '24 13:01 petuhovskiy

2772 tests run: 2628 passed, 0 failed, 144 skipped (full report)


Flaky tests (2)

Postgres 16

  • test_compute_pageserver_connection_stress: debug

Postgres 14

  • test_get_tenant_size_with_multiple_branches: debug

Code coverage* (full report)

  • functions: 27.9% (6387 of 22856 functions)
  • lines: 46.8% (44997 of 96051 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results.
803ca680da59bf2ca7ff306fe3f8e874332c38f0 at 2024-04-03T15:35:11.778Z

github-actions[bot] avatar Jan 30 '24 14:01 github-actions[bot]

This PR is not yet polished, but I'd like a pre-review of the main upload algorithm (the important part) before finishing it. In other words, the plan is:

  1. await the review for this draft version (the important parts)
  2. fix the feedback, polish, undraft
  3. quick review, merge, deploy to staging

So, waiting for review @arssher :) (not urgent)

petuhovskiy avatar Jan 30 '24 20:01 petuhovskiy

Hmm. During yesterday's call I assumed that we always remove the previous version before uploading the next one (i.e. have at most one uploaded at any time); now, looking at the code, I see this is not the case. In that case a name clash (different file contents under the same file name) is indeed not good, i.e. GC looks unsafe with respect to it. While this is fixable, it is safer/simpler to just always have unique names by including term + flush_lsn there.

Having multiple segments offloaded is a bit more complicated, but it gives a kind of atomicity, i.e. we are never in a state where we have no partial segment offloaded at all (once one has ever been offloaded), which seems good. Though note that uploading up to flush_lsn is not fully reliable anyway: 1) it might be from the wrong term history; 2) in theory there might be multiple uncommitted segments.

arssher avatar Feb 09 '24 07:02 arssher

In that case a name clash (different file contents under the same file name) is indeed not good, i.e. GC looks unsafe with respect to it. While this is fixable, it is safer/simpler to just always have unique names by including term + flush_lsn there.

It is simple, but the filename becomes quite long; I'm afraid of eventually hitting the S3 key length limit. WDYT?

Though note that uploading up to flush_lsn is not fully reliable anyway: 1) it might be from the wrong term history

Is this actually bad? For inactive timelines the partial segment in S3 will eventually match the partial segment on disk, even for uncommitted WAL.

  2. in theory there might be multiple uncommitted segments.

True, but the number of such timelines should be very close to zero. I'd not fix this in this PR, and if it becomes an issue this can be fixed later.

Other than that, any more ideas for the first version of "partial segments upload"?

petuhovskiy avatar Feb 19 '24 12:02 petuhovskiy

It is simple, but the filename becomes quite long; I'm afraid of eventually hitting the S3 key length limit. WDYT?

Should be ok: the limit is 1024 characters in both S3 and Azure Blob Storage, and 2^64 in decimal is about 10^19, so there is enough space. I'd at least have term + flush_lsn there, and if we re-upload on commit_lsn changes, include commit_lsn as well.
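
For a rough sense of the length, a worst-case name under a hypothetical extended format (segment name, term in decimal, flush_lsn in hex, safekeeper id) stays far below the 1024-character key limit:

```rust
fn main() {
    // Hypothetical extended naming scheme; not necessarily what was merged.
    let name = format!(
        "{}_{}_{:016X}_sk{}.partial",
        "000000010000000000000001", // 24-char segment name
        u64::MAX,                   // worst-case term in decimal: 20 digits
        u64::MAX,                   // flush_lsn as 16 hex digits
        1u64,
    );
    // ~74 characters, far below the 1024-character limit.
    println!("{} ({} chars)", name, name.len());
}
```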

Is this actually bad? For inactive timelines the partial segment in S3 will eventually match the partial segment on disk, even for uncommitted WAL.

No, it's expected indeed.

True, but the number of such timelines should be very close to zero. I'd not fix this in this PR, and if it becomes an issue this can be fixed later.

Yeah, definitely not very important. We could also simply hold off the partial upload if the full upload is not done yet.

Other than that, any more ideas for the first version of "partial segments upload"?

Approach LGTM

arssher avatar Feb 19 '24 14:02 arssher

After looking at S3 lifecycle rules (https://docs.aws.amazon.com/AmazonS3/latest/userguide/lifecycle-configuration-examples.html), I realized that it might be useful to distinguish full WAL segments from partial segments.

I think I'm going to add a special tag to partial segments, so that we can select them in a lifecycle rule.
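
As a hedged sketch only (the project has its own remote_storage abstraction, so this just illustrates the idea with the plain AWS SDK, and the tag key/value are made up): the upload sets an object tag, and a lifecycle rule with a tag filter can then expire partial segments separately from full ones.

```rust
use aws_sdk_s3::{primitives::ByteStream, Client};

/// Illustrative upload of a partial segment with an object tag, so an S3
/// lifecycle rule with a tag filter can target these objects specifically.
async fn upload_tagged_partial(
    client: &Client,
    bucket: &str,
    key: &str,
    body: Vec<u8>,
) -> Result<(), aws_sdk_s3::Error> {
    client
        .put_object()
        .bucket(bucket)
        .key(key)
        .body(ByteStream::from(body))
        // URL-encoded tag set; hypothetical tag name.
        .tagging("partial=true")
        .send()
        .await?;
    Ok(())
}
```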

petuhovskiy avatar Mar 13 '24 23:03 petuhovskiy

Ok, ready for review now.

Note that the control file upgrade is not forward compatible, i.e. we cannot roll back safekeepers after the deploy. Also, I tried to write a test for this PR, but ran into problems with S3 inspection from Python tests, so I decided to test this on staging instead.

petuhovskiy avatar Mar 19 '24 13:03 petuhovskiy

LGTM. Let's put the PR description into the head comment of wal_backup_partial.rs.

arssher avatar Mar 25 '24 14:03 arssher

@bayandin do you know how to make test_forward_compatibility pass? The breakage in forward compatibility is expected

petuhovskiy avatar Mar 25 '24 17:03 petuhovskiy

@bayandin do you know how to make test_forward_compatibility pass? The breakage in forward compatibility is expected

There's no good way to do that for now. Is it possible not to break forward compatibility, maybe? 😄 Handling it via a label would mean adding that label to all other PRs until we release these changes, so we could miss some other unexpected breaking changes.

I'll think about what we can do here.

bayandin avatar Mar 25 '24 18:03 bayandin

@bayandin do you know how to make test_forward_compatibility pass? The breakage in forward compatibility is expected

The general answer for forward compatibility is to merge+release a change that knows how to read the new format, and then later merge+release a change that enables writing it -- this is what we do for pageserver stuff.

jcsp avatar Mar 26 '24 11:03 jcsp

A bit of context: the control file is where safekeepers store all the data about timelines (everything except the WAL itself). For historical reasons, it is a binary file serialized directly from struct TimelinePersistentState, which means that any change to this structure (or its dependents) requires an upgrade.
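
A toy illustration of why that is (field names are made up and the real on-disk encoding may differ; the point is that a positional binary format carries no field names, so an older binary cannot interpret bytes written from a newer struct definition):

```rust
use serde::{Deserialize, Serialize};

// Layout the currently released binary knows how to read...
#[derive(Serialize, Deserialize)]
struct StateV1 {
    commit_lsn: u64,
}

// ...layout the new binary writes after adding partial-backup state.
#[derive(Serialize, Deserialize)]
struct StateV2 {
    commit_lsn: u64,
    partial_backup_uploaded: Option<String>,
}

fn main() {
    let v1 = bincode::serialize(&StateV1 { commit_lsn: 42 }).unwrap();
    let v2 = bincode::serialize(&StateV2 {
        commit_lsn: 42,
        partial_backup_uploaded: Some("seg_sk1.partial".into()),
    })
    .unwrap();
    // The encodings differ; a binary built against StateV1 has no way to make
    // sense of the extra bytes, so the change requires a format upgrade that
    // older safekeepers cannot read.
    println!("v1: {} bytes, v2: {} bytes", v1.len(), v2.len());
}
```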

The issue here is that safekeeper control files have always worked this way, and an upgrade has always broken forward compatibility. It seems the last control file upgrade was 18 months ago, so we never hit this before; test_forward_compatibility was probably created after that...

I dislike that a control file upgrade currently requires a breaking change, but I see only two fixes for this:

  1. merge+release twice for a single change, and write twice as much code to handle control file rollback
  2. overhaul control files completely and use a key-value serialization format (e.g. protobuf, JSON, ...)

Neither option is simple, and I don't think I want to work on them right now. It also seems fine to allow forward incompatibility for safekeepers – if something goes wrong, issues can always be fixed without a rollback. Unfortunately, there's no good way to skip test_forward_compatibility in our current CI.

I'm not really sure what should be done next here...

petuhovskiy avatar Mar 26 '24 18:03 petuhovskiy

The last time, in a similar situation, I just manually added an exclusion to the test_compatibility code.

arssher avatar Mar 26 '24 19:03 arssher

@bayandin is it ok if I add @pytest.mark.skip to test_forward_compatibility for a week?

petuhovskiy avatar Mar 27 '24 13:03 petuhovskiy

@bayandin is it ok if I add @pytest.mark.skip to test_forward_compatibility for a week?

If we do this, we won't catch any forward compatibility failures from other PRs. Let me come up with a better solution for this.

bayandin avatar Mar 27 '24 13:03 bayandin

Added @pytest.mark.xfail to test_forward_compatibility to unblock the merge. Will remove it after the next release.

petuhovskiy avatar Apr 03 '24 13:04 petuhovskiy