
Epic: rebalancing aka migration of a tenant between pageserver nodes


Motivation

We want to be able to assign tenants and timelines to pageservers running on appropriate EC2 instances, so that tenants and their workloads are distributed fairly across instances. This helps achieve stable user query latencies, avoids noisy-neighbor issues, and, when necessary, lets us perform maintenance and upgrades on the nodes where the pageserver runs.

See #985 for a similar issue on safekeepers

DoD

Add the ability to move a tenant from one storage node to another. For now it is acceptable to require a compute node restart to do that.

Tasks

This may be trickier than moving between safekeepers: moving a tenant between pageservers is more involved. First we need to "attach" the tenant to the new pageserver by loading its data from S3; then subscribe to one of the safekeepers that serves that timeline; finally, restart the compute with the new pageserver address (here we assume that safekeepers use that connection info too). I think the first two steps should be handled by a single attach_tenant(tenant_id, s3_info, [safekeepers]) command, which should be asynchronous.
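
A minimal sketch of what such an asynchronous command could look like, assuming hypothetical helpers `download_tenant_from_s3` and `start_wal_replication` (none of these names are the real pageserver API):

```python
import threading

def download_tenant_from_s3(tenant_id, s3_info):
    # Placeholder: pull the tenant's layer files from S3 onto local disk.
    print(f"downloading {tenant_id} from s3://{s3_info['bucket']}/{s3_info['prefix']}")

def start_wal_replication(tenant_id, safekeeper):
    # Placeholder: subscribe to one safekeeper and stream WAL from the
    # last checkpointed LSN until the tenant catches up.
    print(f"replicating {tenant_id} from {safekeeper['host']}:{safekeeper['port']}")

def attach_tenant(tenant_id, s3_info, safekeepers):
    """Kick off the attach in the background and return immediately;
    the caller polls attach progress through a separate status call."""
    def run():
        download_tenant_from_s3(tenant_id, s3_info)       # step 1: load data from S3
        start_wal_replication(tenant_id, safekeepers[0])  # step 2: catch up on WAL
    threading.Thread(target=run, daemon=True).start()
```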

  • [x] #896
  • [x] #897
  • [x] #898
  • [x] #900
  • [x] #899
  • [x] zenithdb/console#252
  • [x] zenithdb/console#814
  • [x] #1560
  • [ ] #1555
  • [ ] #901
  • [ ] #2444
  • [x] set up a new pageserver with a bigger disk (ps-N+1) and migrate a tenant from ps-2
  • [ ] once this is released, release the S3 prefix on prod so that we only have one prefix on prod

Follow-ups after the v0 PR (#995): callmemaybe-related fixes can be postponed; the needed features are planned to be implemented on top of Storage coordination: https://github.com/zenithdb/zenith/issues/1180

Follow-ups (may be in separate Epics):

  • [x] #1171
  • [ ] support multiple pageservers in the CLI to simplify testing
  • [ ] races between branch creation and tenant migration
  • [ ] more extensive (randomized) tests with tenant migration (in e2e tests?)

kelvich avatar Nov 16 '21 17:11 kelvich

Some more notes.

What is the sequence of actions to test moving a tenant to a different pageserver?

  1. Init the old pageserver
  2. Create a new tenant
  3. Insert some data
  4. Run a checkpoint
  5. Wait for it to be uploaded to remote storage
  6. Bootstrap the new pageserver
  7. Wait for all the data to be downloaded
  8. Execute callmemaybe so the new pageserver starts replication from a safekeeper
  9. Wait for replication to catch up
  10. Restart the compute with the changed pageserver address
  11. Shut down the tenant on the old pageserver
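
A hedged sketch of how this sequence could map onto a test; every class and helper here (Pageserver, wait_until, the stubbed callmemaybe) is a logging placeholder, not the real Neon test_runner API:

```python
class Pageserver:
    def __init__(self, name): self.name = name
    def checkpoint(self, tenant): print(f"{self.name}: checkpoint {tenant}")
    def callmemaybe(self, tenant, sk): print(f"{self.name}: replicate {tenant} from {sk}")
    def detach(self, tenant): print(f"{self.name}: detach {tenant}")

def wait_until(what):
    print(f"wait for {what}")  # placeholder: poll a status API instead of sleeping

def test_tenant_relocation():
    ps_old = Pageserver("old")                    # 1. init the old pageserver
    tenant = "tenant-a"                           # 2. create a new tenant
    print("insert some data")                     # 3
    ps_old.checkpoint(tenant)                     # 4. run a checkpoint
    wait_until("checkpoint uploaded to S3")       # 5
    ps_new = Pageserver("new")                    # 6. bootstrap the new pageserver
    wait_until("tenant data downloaded")          # 7
    ps_new.callmemaybe(tenant, "sk-1")            # 8. start replication from a safekeeper
    wait_until("replication caught up")           # 9
    print("restart compute -> new pageserver")    # 10
    ps_old.detach(tenant)                         # 11. shut down tenant on the old pageserver

test_tenant_relocation()
```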

What bothers me the most is that it becomes possible for two pageservers to write checkpoints concurrently to the same S3 path. They will probably write the same data, but I wouldn't rely on that: there might be a version upgrade and the format might diverge. So I think we shouldn't allow it. This case can be guarded with a flag, e.g. when we attach a timeline to a different pageserver, it shouldn't try to upload anything to S3. But this is still quite tricky. We shouldn't crash because of OOM here, because InMemoryLayers can now be swapped to disk, but we still need to manage this somehow. I'll investigate possible solutions.
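
A minimal sketch of such a guard, assuming a hypothetical per-timeline upload_allowed flag consulted before every S3 write (this is not the real upload path):

```python
class TimelineUploadGuard:
    """Hypothetical flag: a timeline attached for migration never uploads;
    only the owning pageserver writes to the shared S3 path."""
    def __init__(self, upload_allowed: bool):
        self.upload_allowed = upload_allowed

    def maybe_upload(self, layer_path: str):
        if not self.upload_allowed:
            # Attached-but-not-owner: skip the write so two pageservers
            # never race on the same S3 object.
            print(f"skip upload of {layer_path}: not the owner")
            return
        print(f"upload {layer_path} to S3")

# The new pageserver attaches with uploads disabled until ownership is
# handed over after the compute restarts.
guard = TimelineUploadGuard(upload_allowed=False)
guard.maybe_upload("layers/000000016B59D8-000000016B5A51")
```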

LizardWizzard avatar Dec 07 '21 18:12 LizardWizzard

What is the sequence of actions to test moving a tenant to a different pageserver? [steps 1–11 quoted above]

I think for a v0 you can just go from 7 to 10 directly. It would be a bigger performance hiccup for the compute, but should be okay from a correctness standpoint. After that we can add more synchronization between steps 7–10. Note that you can connect the pageserver to a safekeeper without callmemaybe.

What bothers me the most is that it becomes possible for two pageservers to write checkpoints concurrently to the same s3 path. […]

OOM shouldn't be a problem here, since the pageserver can spill files to disk anyway. We can add an unconditional check: if we are about to overwrite existing valid files on S3, compare a CRC/hash and do nothing if they match. That should help with your case, IIUC.
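
A minimal sketch of that check, assuming a hypothetical remote-storage client exposing get_checksum/put methods (not any real S3 API):

```python
import hashlib

def file_sha256(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def upload_if_changed(local_path: str, remote_key: str, remote):
    """`remote` is a hypothetical client exposing get_checksum/put."""
    existing = remote.get_checksum(remote_key)  # None if the key is absent
    if existing is None:
        remote.put(local_path, remote_key)
    elif existing == file_sha256(local_path):
        pass  # identical contents: safe to do nothing
    else:
        # Diverging contents mean two pageservers produced different
        # layers for the same path: refuse rather than overwrite.
        raise RuntimeError(f"checksum mismatch for {remote_key}, refusing to overwrite")
```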

kelvich avatar Dec 08 '21 07:12 kelvich

I think for a v0 you can just go from 7 to 10 directly

Yeah, currently I just insert sleep calls wherever something needs waiting, with TODOs to replace them with API calls for proper synchronization.
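
For example, the sleeps could later be replaced with a small polling helper along these lines (a sketch; the predicate is whatever status API call eventually lands):

```python
import time

def wait_until(check, timeout=60.0, interval=0.5):
    """Poll check() until it returns truthy or the timeout expires;
    an explicit replacement for bare time.sleep() calls."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if check():
            return
        time.sleep(interval)
    raise TimeoutError("condition not met within timeout")

# e.g. wait_until(lambda: ps_new.tenant_status(tenant) == "active")
```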

do nothing if they match

What if they don't? Currently there is an archiving mechanism about to land in #874, and I've raised similar concerns there too. Can there be some non-determinism in the file layout? Can a different version of the pageserver change the file layout? So I think for now we might want to avoid both: overwriting files in S3, and ~~extracting S3 data into a non-empty local timeline~~ (we do not download anything for a timeline that is available locally, though it is good to check for race conditions etc.). What do you think?

I've created https://github.com/zenithdb/zenith/issues/971, so maybe it is better to continue the discussion there.

cc @SomeoneToIgnore

LizardWizzard avatar Dec 08 '21 12:12 LizardWizzard

Let me summarize what's left here:

  • There is a possible race with branch creation: if a branch is created while relocation is in progress, it can be lost. A possible fix is to track this on the client side, i.e. run some eventually consistent algorithm that converges to the proper branch set on the target pageserver. This solution was implemented but was not committed to the cloud repo.

  • The other issue is related to background operations (GC/compaction) that can corrupt state on remote storage (S3) if they run on both pageservers simultaneously.

This can be mitigated by suspending background operations before relocation; a minimal sketch of that idea follows at the end of this comment. Corresponding issue: https://github.com/neondatabase/neon/issues/2740. I attempted to implement this, but with the current state of tenant management it is hard to do reliably. PR with that attempt: https://github.com/neondatabase/neon/pull/2665

This approach has some downsides, and I wrote an RFC with a better proposal, but it is a much heavier change: https://github.com/neondatabase/neon/pull/2676. The RFC also covers future problems that will come up when we start thinking about scaling one tenant across multiple pageservers.

So I think it's better to start with suspending background operations now.
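
A minimal sketch of that suspension, assuming a hypothetical per-tenant pause switch that the GC/compaction loops consult (not the real task-management code):

```python
import threading

class BackgroundJobs:
    """Hypothetical per-tenant pause switch for GC/compaction."""
    def __init__(self):
        self._running = threading.Event()
        self._running.set()  # background jobs run by default

    def suspend(self):
        # Called before relocation: new GC/compaction iterations block
        # until resume() (or until the tenant is detached).
        self._running.clear()

    def resume(self):
        self._running.set()

    def compaction_iteration(self):
        self._running.wait()  # park here while relocation is in flight
        print("compacting...")

jobs = BackgroundJobs()
jobs.suspend()   # relocation begins: no more S3-mutating background work
# ... attach on the new pageserver, cut over the compute ...
jobs.resume()    # or detach the tenant on the old pageserver instead
```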

LizardWizzard avatar Dec 30 '22 12:12 LizardWizzard

Another bug that needs to be resolved: https://github.com/neondatabase/neon/issues/3478

LizardWizzard avatar Jan 30 '23 14:01 LizardWizzard

This ticket is quite old and describes an earlier migration approach, with downtime, that already exists. Closing in favour of https://github.com/neondatabase/neon/issues/5199, which tracks the ongoing work to enable seamless migration.

jcsp avatar Sep 05 '23 08:09 jcsp