neon
WIP: Upload partial segments
Add support for backing up partial segments to remote storage. Disabled by default; can be enabled with `--partial-backup-enabled`.
Each safekeeper timeline has a background task subscribed to `commit_lsn` and `flush_lsn` updates. After the partial segment is updated (`flush_lsn` has changed), the segment will be uploaded to S3 in about 15 minutes.
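As a rough illustration of this flow, here is a minimal sketch of such a debounced upload loop; the task name, the `upload_partial_segment` helper and the plain `u64` LSN stand-in are assumptions for this sketch, not the PR's actual code:

```rust
use std::time::Duration;
use tokio::sync::watch;

// Stand-in for the safekeeper's LSN type (the real code uses its own Lsn struct).
type Lsn = u64;

/// How long the timeline has to stay quiet after a flush_lsn change before we upload.
const UPLOAD_DELAY: Duration = Duration::from_secs(15 * 60);

/// Placeholder for the actual upload: read the tail of the current segment
/// from disk and PUT it to remote storage under the .partial name.
async fn upload_partial_segment(flush_lsn: Lsn) -> anyhow::Result<()> {
    println!("uploading partial segment up to {flush_lsn:#x}");
    Ok(())
}

/// Hypothetical background loop: wake up on flush_lsn changes and upload the
/// partial segment once no new WAL has arrived for UPLOAD_DELAY.
async fn partial_backup_task(mut flush_lsn_rx: watch::Receiver<Lsn>) -> anyhow::Result<()> {
    let mut last_uploaded: Lsn = 0;
    loop {
        // Wait for the next flush_lsn update.
        flush_lsn_rx.changed().await?;

        // Debounce: keep resetting the timer while new WAL keeps arriving.
        loop {
            tokio::select! {
                _ = tokio::time::sleep(UPLOAD_DELAY) => break,
                res = flush_lsn_rx.changed() => res?,
            }
        }

        let flush_lsn = *flush_lsn_rx.borrow();
        if flush_lsn > last_uploaded {
            upload_partial_segment(flush_lsn).await?;
            last_uploaded = flush_lsn;
        }
    }
}
```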
The filename format for partial segments is `ZZZ_XXX_skYY.partial` (see the sketch below), where:
- `ZZZ` – the segment name, e.g. `000000010000000000000001`
- `XXX` – `commit_lsn` in hex format `{:016X}`, e.g. `00000000346BC568`
- `YY` – `safekeeper_id`, e.g. `1`
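A minimal sketch of assembling such a name (the helper name is hypothetical; only the format itself comes from the description above):

```rust
// Build a partial-segment object name of the form ZZZ_XXX_skYY.partial,
// e.g. "000000010000000000000001_00000000346BC568_sk1.partial".
fn partial_segment_name(segment_name: &str, commit_lsn: u64, safekeeper_id: u64) -> String {
    format!("{}_{:016X}_sk{}.partial", segment_name, commit_lsn, safekeeper_id)
}

fn main() {
    let name = partial_segment_name("000000010000000000000001", 0x346BC568, 1);
    assert_eq!(name, "000000010000000000000001_00000000346BC568_sk1.partial");
    println!("{name}");
}
```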
Each safekeeper keeps info about remote partial segments in its control file. The code updates the state in the control file before doing any S3 operations. This way the control file stores information about all potentially existing remote partial segments, and they can be cleaned up after a newer version has been uploaded.
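A hedged sketch of that ordering with a hypothetical `ControlFile` type (the real structures in the PR will differ):

```rust
// Hypothetical state-first upload flow: record the new object name in the
// control file *before* touching S3, so every object that might exist
// remotely is always listed locally and can be garbage-collected later.
struct ControlFile {
    /// Names of partial-segment objects that may exist in remote storage.
    uploaded_partial: Vec<String>,
}

impl ControlFile {
    fn persist(&self) -> anyhow::Result<()> {
        // Placeholder: fsync the serialized control file to disk.
        Ok(())
    }
}

async fn upload_new_partial(cf: &mut ControlFile, new_name: String) -> anyhow::Result<()> {
    // 1. Remember the object we are about to create, and persist that fact.
    cf.uploaded_partial.push(new_name.clone());
    cf.persist()?;

    // 2. Only now perform the S3 upload (placeholder).
    // remote_storage.upload(&new_name, ...).await?;

    // 3. Delete all older objects, then drop them from the control file.
    let stale: Vec<String> = cf
        .uploaded_partial
        .iter()
        .filter(|n| **n != new_name)
        .cloned()
        .collect();
    for name in &stale {
        // remote_storage.delete(name).await?; // placeholder
        let _ = name;
    }
    cf.uploaded_partial.retain(|n| *n == new_name);
    cf.persist()?;

    Ok(())
}
```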
Closes #6336
2772 tests run: 2628 passed, 0 failed, 144 skipped (full report)
Flaky tests (2)
Postgres 16
- `test_compute_pageserver_connection_stress`: debug
Postgres 14
- `test_get_tenant_size_with_multiple_branches`: debug
Code coverage* (full report)
- functions: 27.9% (6387 of 22856 functions)
- lines: 46.8% (44997 of 96051 lines)
* collected from Rust tests only
803ca680da59bf2ca7ff306fe3f8e874332c38f0 at 2024-04-03T15:35:11.778Z :recycle:
This PR is not yet polished, but I want to have a pre-review for the main upload algo (the important part) before finishing the PR. In other words, the following steps:
- await the review for this draft version (the important parts)
- fix the feedback, polish, undraft
- quick review, merge, deploy to staging
So, waiting for review @arssher :) (not urgent)
Hmm. During yesterday's call I assumed that we always remove the previous version before uploading the next one (i.e. we only ever have at most 1 uploaded); now, looking at the code, I see this is not the case. In this case, a name clash (different file contents under the same file name) is indeed not good, i.e. gc looks unsafe with respect to it. While it is fixable, it is safer/simpler to just always have unique names by including `term` + `flush_lsn` there.
Having multiple segments offloaded is a bit more complicated, but it gives a kind of atomicity, i.e. we are never in a state where we don't have a partial offloaded at all (if one has ever been offloaded), which seems to be good. Though note that upload up to `flush_lsn` is not reliable anyway: 1) it might be from the wrong term history, 2) in theory there might be multiple uncommitted segments.
In this case, a name clash (different file contents under the same file name) is indeed not good, i.e. gc looks unsafe with respect to it. While it is fixable, it is safer/simpler to just always have unique names by including `term` + `flush_lsn` there.
It is simple, but the filename becomes quite long; I'm afraid we'll hit the S3 filename limit eventually. WDYT?
Though note that upload up to `flush_lsn` is not reliable anyway: 1) it might be from the wrong term history
Is this bad at all? For inactive timelines the partial segment in S3 will eventually match the partial segment on disk, even for uncommitted WAL.
- in theory there might be multiple uncommitted segments.
True, but the number of such timelines should be very close to zero. I'd not fix this in this PR, and if it becomes an issue this can be fixed later.
Other than that, any more ideas for the first version of "partial segments upload"?
It is simple, but the filename becomes quite long; I'm afraid we'll hit the S3 filename limit eventually. WDYT?
Should be ok, the limit is 1024 in both S3 and Azure Blob Storage; 2^64 in decimal is about 10^19, so there is plenty of space. So I'd at least have `term` + `flush_lsn` there, and if we reupload on `commit_lsn` change, include `commit_lsn` as well.
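For a rough sense of the length budget (the exact extended format is hypothetical here, not something decided in this thread):

```rust
// Rough length budget for a name like
// <24-char segment>_<term:016X>_<flush_lsn:016X>_<commit_lsn:016X>_sk<id>.partial
fn main() {
    let segment = "000000010000000000000001"; // 24 chars
    let hex_fields = 3 * (1 + 16);             // '_' + 16 hex chars each for term, flush_lsn, commit_lsn
    let suffix = "_sk1.partial".len();         // safekeeper id + extension
    let total = segment.len() + hex_fields + suffix;
    println!("worst-case-ish name length: {total} chars"); // ~87, far below the 1024 key limit
    assert!(total < 1024);
}
```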
Is this bad at all? For inactive timelines the partial segment in S3 will eventually match the partial segment on disk, even for uncommitted WAL.
No, it's expected indeed.
True, but the number of such timelines should be very close to zero. I'd not fix this in this PR, and if it becomes an issue this can be fixed later.
Yeah, definitely not very important. We could also simply hold off the partial upload if the full one is not done yet.
Other than that, any more ideas for the first version of "partial segments upload"?
Approach LGTM
After looking at S3 lifecycle rules (https://docs.aws.amazon.com/AmazonS3/latest/userguide/lifecycle-configuration-examples.html), I realized that it might be useful to distinguish full WAL segments from partial segments.
I think I'm going to add a special tag to partial segments, so we can select them in a lifecycle rule.
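As an illustration only (the project goes through its own remote_storage abstraction, and the tag name here is made up), attaching such a tag with the raw aws-sdk-s3 crate could look roughly like this; a lifecycle rule with a matching tag filter can then target only partial segments:

```rust
use aws_sdk_s3::{primitives::ByteStream, Client};

// Upload a partial segment with a tag such as "kind=walpartial", so that a
// lifecycle rule with a tag filter applies only to partial segments.
async fn upload_tagged_partial(
    client: &Client,
    bucket: &str,
    key: &str,
    bytes: Vec<u8>,
) -> anyhow::Result<()> {
    client
        .put_object()
        .bucket(bucket)
        .key(key)
        .body(ByteStream::from(bytes))
        // The tag-set is passed as URL query parameters.
        .tagging("kind=walpartial")
        .send()
        .await?;
    Ok(())
}
```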
Ok, ready for review now.
Note that the control file upgrade is not forward compatible, i.e. we cannot roll back safekeepers after the deploy. Also, I tried to write a test for this PR, but ran into problems with S3 inspection from the Python tests, so I decided to test this on staging instead.
LGTM. Let's put the PR description into the `wal_backup_partial.rs` head comment.
@bayandin do you know how to make `test_forward_compatibility` pass? The breakage in forward compatibility is expected.
There's no good way to do so for now. Is it possible not to break forward compatibility, maybe? 😄 Handling it via a label would make adding that label to all other PRs necessary until we release these changes, so we could miss some other unexpected breaking changes.
I'll think about what we can do here.
@bayandin do you know how to make `test_forward_compatibility` pass? The breakage in forward compatibility is expected.
The general answer for forward compatibility is to merge+release a change that knows how to read the new format, and then later merge+release a change that enables writing it -- this is what we do for pageserver stuff.
A bit of context: the control file is the file where safekeepers store all the data about timelines (everything but the WAL itself). For historic reasons, this is a binary file which is serialized directly from `struct TimelinePersistentState`. It means that any change to this structure (and its dependants) requires an upgrade.
The issue here is that control files in safekeepers have always worked that way, and an upgrade has always broken forward compatibility. It seems that the last control file upgrade was 18 months ago, so we never hit this issue before because `test_forward_compatibility` was probably created after that...
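A hedged illustration of why this breaks forward compatibility after a rollback; the version numbers, names and layout below are made up for the sketch and are not the real control file format:

```rust
// Hedged illustration: the control file is a versioned binary blob that is
// deserialized straight into a struct, so an old binary that only knows
// versions <= SUPPORTED_VERSION simply refuses anything newer.

const SUPPORTED_VERSION: u32 = 8; // what the *old* release understands (hypothetical)

struct RawControlFile {
    version: u32,
    payload: Vec<u8>, // binary-encoded state of that version
}

fn load_control_file(raw: &RawControlFile) -> Result<(), String> {
    if raw.version > SUPPORTED_VERSION {
        // The forward-compatibility break: after a rollback, the old
        // safekeeper hits a version it has no decoder for and cannot start.
        return Err(format!(
            "unsupported control file version {} (max supported {})",
            raw.version, SUPPORTED_VERSION
        ));
    }
    // ... decode the payload for a known version and, if needed, upgrade it ...
    let _ = &raw.payload;
    Ok(())
}

fn main() {
    let written_by_new_binary = RawControlFile { version: 9, payload: vec![] };
    println!("{:?}", load_control_file(&written_by_new_binary)); // Err(...)
}
```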
I dislike that a control file upgrade currently requires a breaking change, but I see only two fixes for this:
- merge+release twice for a single change, and write 2x more code to handle control file rollback
- overhaul control files completely and use a key-value serialization format for them (e.g. protobuf, json...)
Neither option is simple and I don't think I want to work on them right now. It also seems fine to allow forward incompatibility for safekeepers – if something goes wrong, issues can always be fixed without a rollback. Unfortunately, I see no good way to omit `test_forward_compatibility` in our current CI.
I'm not really sure what should be done next here...
The last time, in a similar situation, I just manually added an exclusion to the `test_compatibility` code.
@bayandin is it ok if I add `@pytest.mark.skip` to `test_forward_compatibility` for a week?
If we do this, we won't catch any forward compatibility failures from other PRs. Let me come up with a better solution for this.
Added `@pytest.mark.xfail` to `test_forward_compatibility` to unblock the merge. Will remove it after the next release.