docs: Graceful storage controller cluster restarts RFC
RFC for "Graceful Restarts of Storage Controller Managed Clusters".
A POC that implements nearly everything mentioned here, apart from the optimizations and some of the failure handling: https://github.com/neondatabase/neon/pull/7682
Related: https://github.com/neondatabase/neon/issues/7387
2946 tests run: 2829 passed, 0 failed, 117 skipped (full report)
Code coverage* (full report)
functions: 32.7% (6910 of 21129 functions)
lines: 50.1% (54195 of 108189 lines)
* collected from Rust tests only
8faa4377a36ad8cec9cf753c48da0df17710815e at 2024-07-01T09:46:16.764Z :recycle:
It might also make sense to mention that draining is a no-op for tenants that lack secondaries, i.e. non-HA ones. Keeping this as a TODO in the implementation is fine, but ideally the RFC would mention it as well.
Right, in theory we can cater to non-HA tenants as well by changing their attached location. Depending on tenant size, this might be more disruptive than the restart itself, since the pageserver we've moved to will need to on-demand download the entire working set for the tenant. I can add this as a non-goal to the RFC.
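To illustrate the skip-non-HA behaviour being discussed, a drain pass could simply filter out shards that have no secondary location before scheduling any cutovers. This is only a sketch; the types and field names below are hypothetical and do not match the actual storage controller code.

```rust
// Sketch: when draining a node, only shards with at least one secondary
// (HA shards) are considered for cutover; non-HA shards stay put.

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct NodeId(u64); // hypothetical node identifier

#[derive(Debug)]
struct TenantShard {
    attached: NodeId,
    secondaries: Vec<NodeId>,
}

/// Return the shards that should be cut over when draining `node`.
/// Shards without a secondary (non-HA) are skipped, making the drain
/// a no-op for them.
fn shards_to_drain(shards: &[TenantShard], node: NodeId) -> Vec<&TenantShard> {
    shards
        .iter()
        .filter(|shard| shard.attached == node)
        .filter(|shard| !shard.secondaries.is_empty())
        .collect()
}

fn main() {
    let shards = vec![
        TenantShard { attached: NodeId(1), secondaries: vec![NodeId(2)] },
        TenantShard { attached: NodeId(1), secondaries: vec![] }, // non-HA: skipped
        TenantShard { attached: NodeId(3), secondaries: vec![NodeId(1)] },
    ];
    // Only the first shard is eligible for cutover when draining node 1.
    assert_eq!(shards_to_drain(&shards, NodeId(1)).len(), 1);
}
```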
Also, I wonder what to do if the secondary of a to-be-drained tenant is on an unresponsive pageserver. Do we still want to cut over to that secondary? Maybe the answer is still yes, and generations will take care of correctness, but the implementation should be robust to this, so that the draining process does not stall indefinitely while waiting for an okay from an unresponsive pageserver.
This is basically the same scenario as the previous one. If the node we are moving to is unresponsive, the reconciliation will fail, leaving the tenant on the original node. It's a good point, though. I think it's a good idea to make sure the node is online before explicitly setting the attached location; that saves us a reconcile.
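A minimal sketch of the "check the node is online first" idea, assuming a hypothetical availability enum rather than the real storage controller types:

```rust
// Sketch: before setting the attached location to a shard's secondary during
// a drain, check that the target node is known to be online. If it isn't,
// skip the move and keep the shard where it is, saving a reconcile that
// would fail anyway.

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum NodeAvailability {
    Active,
    Offline,
}

#[derive(Debug)]
struct Node {
    availability: NodeAvailability,
}

/// Decide whether to cut a shard over to `secondary` during a drain.
fn should_cut_over(secondary: &Node) -> bool {
    matches!(secondary.availability, NodeAvailability::Active)
}

fn main() {
    let offline = Node { availability: NodeAvailability::Offline };
    let online = Node { availability: NodeAvailability::Active };
    assert!(!should_cut_over(&offline)); // skip: the reconcile would fail
    assert!(should_cut_over(&online));
}
```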
If the node we are moving to is unresponsive, the reconciliation will fail, leaving the tenant on the original node.
Yeah, my point was mainly about that: there is a difference between retrying indefinitely and retrying but giving up at some point. We need to implement the latter because of the scenario I mentioned above.
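To make the "give up at some point" behaviour concrete, here is a sketch of a bounded retry helper; the error type, function names, and retry budget are illustrative assumptions, not the actual implementation.

```rust
// Sketch: retry a shard's drain reconciliation a bounded number of times, so
// an unresponsive pageserver cannot stall the whole drain indefinitely.

#[derive(Debug)]
struct ReconcileError(String);

/// Try `reconcile` up to `max_attempts` times. On exhaustion, give up and let
/// the drain move on to the next shard instead of blocking forever.
fn reconcile_with_budget<F>(mut reconcile: F, max_attempts: u32) -> Result<(), ReconcileError>
where
    F: FnMut() -> Result<(), ReconcileError>,
{
    let mut last_err = ReconcileError("no attempts made".into());
    for _ in 0..max_attempts {
        match reconcile() {
            Ok(()) => return Ok(()),
            Err(e) => last_err = e,
        }
    }
    Err(last_err)
}

fn main() {
    // A reconcile that always fails (e.g. the target pageserver never responds).
    let result = reconcile_with_budget(
        || Err(ReconcileError("target pageserver unresponsive".into())),
        3,
    );
    assert!(result.is_err()); // the drain gives up on this shard and continues
}
```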