neon
storage controller: optimize placement of tenant shards when a node comes back online
Already Done
Cases:

- During creation of shards (tenant creation or split), carry some context between `schedule()` calls so that shards have a soft anti-affinity against nodes where other shards in the same tenant are placed.
- In the background, if a shard has a "better" secondary location than its currently attached location, migrate the attachment to that secondary. "Better" is defined by the same shard-aware anti-affinity as in creation.
  - For example, after splitting, most shards will have preferable locations on their secondaries, because the primaries all collide on the parent shard's location. This kind of migration must be conditional on the secondary location having reasonably fresh content: there is no point starting a migration that will just have to wait minutes for the secondary to download things.
- In the background, if a shard's secondary has a "better" location than its current location, change it. For example, after splitting and migrating the attachments of child shards to their secondary locations, this will clean up the secondary locations that are all on the parent shard's node.
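The "carry some context between `schedule()` calls" idea can be sketched roughly as follows. This is an illustrative toy, not neon's actual implementation: the names `ScheduleContext`, `NodeId`, `record`, and `pick` are assumptions for the example.

```rust
use std::collections::HashMap;

type NodeId = u64;

/// Hypothetical context carried between schedule() calls for shards of one
/// tenant, so later shards avoid nodes already chosen for earlier shards.
#[derive(Default)]
pub struct ScheduleContext {
    /// How many shards of this tenant each node already hosts.
    shard_counts: HashMap<NodeId, usize>,
}

impl ScheduleContext {
    /// Record a placement decision so subsequent calls see it.
    pub fn record(&mut self, node: NodeId) {
        *self.shard_counts.entry(node).or_insert(0) += 1;
    }

    /// Soft anti-affinity: prefer the candidate hosting the fewest shards of
    /// this tenant (ties broken by lowest node id, just for determinism).
    pub fn pick(&self, candidates: &[NodeId]) -> Option<NodeId> {
        candidates
            .iter()
            .copied()
            .min_by_key(|n| (self.shard_counts.get(n).copied().unwrap_or(0), *n))
    }
}

fn main() {
    let mut ctx = ScheduleContext::default();
    let nodes = [1, 2, 3];
    // Place four shards of the same tenant across three nodes.
    let mut placements = Vec::new();
    for _ in 0..4 {
        let node = ctx.pick(&nodes).unwrap();
        ctx.record(node);
        placements.push(node);
    }
    // First three shards land on distinct nodes; the fourth wraps around.
    println!("{placements:?}"); // [1, 2, 3, 1]
}
```

The affinity is "soft" in that a node already holding same-tenant shards is deprioritized, not excluded, so placement still succeeds when there are more shards than nodes.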
To do

- In the background, if a shard has a "better" secondary location than its currently attached location, migrate the attachment to that secondary. "Better" is defined by the same shard-aware anti-affinity as in creation.
  - For example, after a node failure and recovery, the failed node will have no attached locations at all, but plenty of secondary locations. Some of these secondary locations should be considered "better" and promoted to attachments, to balance attachments across nodes.
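The migration condition above combines two checks: the secondary must score better under the anti-affinity heuristic, and it must be warm enough that cutover won't stall on downloads. A minimal sketch, assuming made-up names (`ShardPlacement`, `should_migrate`, warmth as a fraction of data already present):

```rust
/// Hypothetical per-shard view for the optimizer. Lower score means fewer
/// colliding same-tenant shards on that node.
pub struct ShardPlacement {
    pub attached_score: usize,
    pub secondary_score: usize,
    /// Fraction of the attached location's data already present on the
    /// secondary; 1.0 means fully warmed up. (Illustrative metric.)
    pub secondary_warmth: f64,
}

/// Migrate only if the secondary is strictly better by anti-affinity AND
/// warm enough that the cutover won't wait minutes for downloads.
pub fn should_migrate(p: &ShardPlacement, min_warmth: f64) -> bool {
    p.secondary_score < p.attached_score && p.secondary_warmth >= min_warmth
}

fn main() {
    // After a split: the primary collides with siblings, the secondary is
    // uncontended but still cold -- do not migrate yet.
    let cold = ShardPlacement { attached_score: 4, secondary_score: 1, secondary_warmth: 0.2 };
    assert!(!should_migrate(&cold, 0.9));

    // Once the secondary has downloaded most layers, migrate.
    let warm = ShardPlacement { attached_score: 4, secondary_score: 1, secondary_warmth: 0.95 };
    assert!(should_migrate(&warm, 0.9));
    println!("migration policy checks passed");
}
```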
Status: the optimization for sharding is done and working.
The outstanding part is to add an optimization for balancing, which would migrate work back to nodes after they were evacuated due to a failure or during an upgrade.
CC @VladLazar -- we should factor that into plans around rolling restart hooks (#7387) -- we can either think of this as a continuous background optimization, or perhaps we don't need to be that general if we just put a node into a particular state when it comes back online.
> migrate work back to nodes after they are evacuated for a failure

I think this is the only outstanding work item left for this ticket. It might be as simple as starting a fill background operation when we handle a node coming back online in `Service::node_configure`.
Yes, that's likely the shortest route to the goal, although we need to think about what happens when many nodes come back online around the same time.
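One way to handle many nodes returning at once is to cap how many fill operations run concurrently and queue the rest. A hedged sketch, not neon's actual code; `FillScheduler`, `node_online`, and `fill_done` are invented for illustration:

```rust
use std::collections::VecDeque;

type NodeId = u64;

/// Hypothetical throttle for background fills: at most `max_concurrent`
/// nodes rebalance at once; later arrivals queue until a slot frees up.
pub struct FillScheduler {
    max_concurrent: usize,
    running: Vec<NodeId>,
    queued: VecDeque<NodeId>,
}

impl FillScheduler {
    pub fn new(max_concurrent: usize) -> Self {
        Self { max_concurrent, running: Vec::new(), queued: VecDeque::new() }
    }

    /// Called from the node-online path (e.g. the handler that would also
    /// live near Service::node_configure).
    pub fn node_online(&mut self, node: NodeId) {
        if self.running.len() < self.max_concurrent {
            self.running.push(node); // start the fill immediately
        } else {
            self.queued.push_back(node); // defer until a slot frees up
        }
    }

    /// Called when a fill completes; promotes the next queued node, if any.
    pub fn fill_done(&mut self, node: NodeId) {
        self.running.retain(|n| *n != node);
        if let Some(next) = self.queued.pop_front() {
            self.running.push(next);
        }
    }
}

fn main() {
    let mut s = FillScheduler::new(2);
    for n in [1, 2, 3, 4] {
        s.node_online(n); // e.g. a whole AZ comes back after an outage
    }
    assert_eq!(s.running, vec![1, 2]);
    assert_eq!(s.queued.len(), 2);
    s.fill_done(1);
    assert_eq!(s.running, vec![2, 3]);
    println!("fill scheduler ok");
}
```

The cap keeps a mass recovery from turning into a thundering herd of migrations, while still draining the queue as each fill finishes.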
Given that we schedule secondaries into a different AZ than attached locations, once https://github.com/neondatabase/neon/issues/8264 is done this ticket will be implicitly fixed: after a node comes back online, shards whose home AZ matches that node's AZ will start getting optimized back onto it.