Epic: storage controller (née sharding service)
Motivation
Enable deploying pageserver sharding into production.
Develop the code from https://github.com/neondatabase/neon/pull/6251 into a service we can deploy.
DoD
Implementation ideas
### Tasks to be able to deploy + use in staging
- [x] https://github.com/neondatabase/neon/pull/6468
- [ ] https://github.com/neondatabase/neon/pull/6471
- [ ] https://github.com/neondatabase/neon/pull/6394
- [ ] https://github.com/neondatabase/cloud/issues/9718
### Tasks to be production ready
- [x] Embed migrations in binary for ease of deployment
- [x] DB Connection pooling in persistence.rs
- [x] Clean up logs (spans etc)
- [x] Make scheduler more scalable (don't re-construct its state for every request that uses it)
- [ ] https://github.com/neondatabase/neon/issues/6847
- [ ] https://github.com/neondatabase/neon/issues/6876
- [x] Implement shard splitting (via https://github.com/neondatabase/neon/issues/6278)
- [x] Background schedule/reconcile to retry anything that has previously failed
- [x] Retry policy for HTTP client (e.g. handle 503s from /location_config)
- [ ] https://github.com/neondatabase/neon/issues/6844
- [ ] https://github.com/neondatabase/neon/issues/6878
- [ ] https://github.com/neondatabase/neon/issues/6875
- [x] Add observability API for tenants sufficient to implement "describe" CLI that shows most recent status/error for a tenant shard.
- [ ] https://github.com/neondatabase/neon/issues/7103
- [x] https://github.com/neondatabase/neon/pull/7114
- [ ] https://github.com/neondatabase/cloud/issues/10625
- [x] https://github.com/neondatabase/neon/pull/7088
- [x] Ensure helm chart isn't using rolling upgrades, to reduce risk of split brain
- [ ] https://github.com/neondatabase/neon/issues/7388
- [ ] https://github.com/neondatabase/neon/issues/7463
- [ ] https://github.com/neondatabase/neon/issues/6877
- [ ] https://github.com/neondatabase/neon/issues/6824
- [ ] Stress testing (integration test). Similar to location_conf_churn but for this service.
- [ ] Chaos self-testing mode (for enabling in staging). Background task that does arbitrary migrations, node drains, node failures, etc.
- [ ] Timeline creation/deletion vs. Reconciler in flight: must not send a request to an old node if a new node attach is in flight
### Miscellaneous/tech debt backlog
- [x] Add a "prod mode" that will refuse to run if auth isn't enabled (https://github.com/neondatabase/neon/pull/6585#discussion_r1476116622) (https://github.com/neondatabase/neon/pull/7105)
- [x] ~Put LocalEnv-using stuff behind a cfg(testing) macro~ We can't -- neon_local would break for anyone not using --testing
- [x] Ensure that when updating tenant conf via location config API, we don't spuriously bump generations
- [ ] https://github.com/neondatabase/neon/issues/7107
- [ ] https://github.com/neondatabase/neon/issues/7108
- [ ] https://github.com/neondatabase/neon/issues/6896
- [ ] Ensure that tenant config Duration/String fields are formatted consistently, to avoid spurious reconciliations (https://github.com/neondatabase/neon/pull/6329#discussion_r1450566336).
- [ ] Revisit delete API behavior: the control plane retries delete until it gets a 404 (goapp/internal/client/psclient/httppageserver/httppageserver.go), so we could drop the retry wrapping in the storage controller if we like
- [ ] Once we have embedded migrations, make the helm chart work with a default values.yaml and remove `--excluded-charts` (see thread on https://github.com/neondatabase/helm-charts/pull/61)
Other related tasks and Epics
Status:
- Database persistence landed Friday
- APIs for control plane integration are under review today
- Work has started on deploying what we currently have into staging, to unblock integration https://github.com/neondatabase/cloud/issues/9718
- It's realistic to see this up and running in staging by end of week.
Storage controller is deployed on prod us-east-1. Teleport RDS connection is there, but manual.