Epic: async walredo
Problem
walredo uses a blocking poll() system call, stalling the executor thread if the walredo process doesn't answer requests.
DoD
We use tokio to read from / write to the walredo pipes.
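For illustration, a minimal sketch of what tokio-based pipe I/O with the walredo child process could look like; the struct, method names, and page size here are placeholders, not the actual pageserver code:

```rust
// Illustrative sketch only, not the pageserver implementation.
use std::process::Stdio;
use tokio::io::{AsyncReadExt, AsyncWriteExt};
use tokio::process::{Child, Command};

struct WalRedoProcess {
    child: Child,
}

impl WalRedoProcess {
    async fn spawn(pg_bin: &str) -> std::io::Result<Self> {
        let child = Command::new(pg_bin)
            .arg("--wal-redo") // wal-redo mode of the postgres binary
            .stdin(Stdio::piped())
            .stdout(Stdio::piped())
            .spawn()?;
        Ok(Self { child })
    }

    /// Send a reconstruction request and read one 8 KiB page back.
    /// Both operations yield to the tokio executor instead of blocking
    /// the thread in poll(2) like the synchronous implementation does.
    async fn request(&mut self, request: &[u8]) -> std::io::Result<[u8; 8192]> {
        let stdin = self.child.stdin.as_mut().expect("stdin is piped");
        let stdout = self.child.stdout.as_mut().expect("stdout is piped");

        stdin.write_all(request).await?;
        stdin.flush().await?;

        let mut page = [0u8; 8192];
        stdout.read_exact(&mut page).await?;
        Ok(page)
    }
}
```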
Related
Do this after https://github.com/neondatabase/neon/issues/6565
Impl
- [ ] https://github.com/neondatabase/neon/pull/6573
- [ ] https://github.com/neondatabase/neon/pull/6568
- [ ] https://github.com/neondatabase/neon/issues/6648
- [ ] https://github.com/neondatabase/neon/pull/6548
How many bytes of RAM do we allow for a single page reconstruction? I'm pausing #6881 because it would have made pageserver OOMs more likely by yielding while applying, but of course, we could already have been near-OOM in the situations with logical replication slots before they were "fixed" by #6873.
What kind of concurrency limiter could we even use here? Because we cannot know how much get_value_reconstruct_data will allocate before we actually do the search, should we just have a hard limit of 16MB per key? Well, this is probably the wrong issue to discuss that.
The initial concurrency limit I'm thinking of is the number of vCPUs (= executor threads).
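A minimal sketch of such a limiter, assuming a tokio Semaphore sized to the available parallelism (all names are hypothetical, not pageserver code):

```rust
// Illustrative sketch of a vCPU-sized concurrency limiter; not pageserver code.
use std::sync::Arc;
use tokio::sync::Semaphore;

fn walredo_concurrency_limit() -> usize {
    // One in-flight reconstruction per executor thread / vCPU.
    std::thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1)
}

async fn with_walredo_permit<T>(
    limiter: &Arc<Semaphore>,
    work: impl std::future::Future<Output = T>,
) -> T {
    // Waits here when all permits are taken, bounding the number of
    // concurrent page reconstructions (and thus their peak memory use).
    let _permit = limiter.acquire().await.expect("semaphore is never closed");
    work.await
}
```

The semaphore would be created once, e.g. Arc::new(Semaphore::new(walredo_concurrency_limit())), and shared by all request handlers.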
- Action this week: benchmark changes and progress PRs.
Merging these changes will block on writing an improved benchmark
Update:
- the benchmark that measures single-tenant scalability is merged
- async walredo improves performance by up to 50% there: https://github.com/neondatabase/neon/pull/6548/files#diff-fbcb6117f0e507ed04478e6ef073af0b1a976f87111b04af1a9bf02e1bbb0106R33-R48
- however, multi-tenant scalability and system behavior when we're CPU-bottlenecked are still hard to understand and not quantified by that benchmark
- => switching pageserver to a single tokio runtime to eliminate some uncertainty and make things simpler overall (rough sketch below): https://github.com/neondatabase/neon/pull/6555
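For context, a rough sketch of what "single runtime" means here; this is illustrative only, not the contents of the PR:

```rust
// Illustrative: one shared multi-threaded runtime instead of several
// purpose-specific ones (background tasks, compute requests, walredo, ...).
use once_cell::sync::Lazy;
use tokio::runtime::Runtime;

pub static THE_RUNTIME: Lazy<Runtime> = Lazy::new(|| {
    tokio::runtime::Builder::new_multi_thread()
        .enable_all()
        .build()
        .expect("failed to build the pageserver runtime")
});

// Callers then spawn onto the shared runtime:
// THE_RUNTIME.spawn(async { /* background or request work */ });
```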
This week:
- observe single runtime change in prod
- rebase async walredo PR, figure out how to quantify multi-tenant scalability effects
Update:
observe single runtime change in prod
We reverted it (#7246) after prod-like cloudbench showed a ramp-up in walredo processes, presumably because compaction_loop didn't run. Some hypotheses / investigations are in this thread: https://neondb.slack.com/archives/C06K38EB05D/p1711451737661299?thread_ts=1711445451.915969&cid=C06K38EB05D
rebase async walredo PR
Didn't happen.
figure out how to quantify multi-tenant scalability effects
Did a comparison benchmark between sync and async: https://www.notion.so/neondatabase/2024-03-25-async-walredo-benchmarking-2586fdb14d504996bc2192c1bba2f1ab?pvs=4
- throughput: async delivers 8% better throughput (22.9k with async vs 21.2k with sync)
- tail latencies at a fixed throughput slightly below the sweet spot of both implementations (20k)
- Results
- async has slightly higher tail latencies (p99.99 is 16.68ms vs 15.41ms with sync)
- Overall getpage latency is way more muddy with Sync than with Async (see the histogram screenshots)
- but walredo request-response latency is much cleaner with Sync (because there’s no competition for CPU)
Update for this week:
Single runtime change revert investigation will continue in this issue => https://github.com/neondatabase/neon/issues/7312. Spent 2h on it. Could work with Peter to repro on a setup that I control, but it'd likely consume multiple days. An alternative is to wait for the prod-like cloudbench to work in us-east-2, then run a pre-merge build with just one runtime. Another alternative to explore is to make one-runtime a pageserver config option; that would make it easier to play around with, especially in staging.
While reviewing code, I found a usage of std::sync::RwLock in walredo.rs that has the potential to stall all executors while a spawn is ongoing. => PR'ed (already merged): https://github.com/neondatabase/neon/pull/7310
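For context, a sketch of the hazard (placeholder types, not the actual walredo.rs code or the fix): holding a std::sync::RwLock write guard across a slow, blocking operation such as a process spawn means every other task that touches the lock parks its OS thread, which can stall all executor threads.

```rust
// Illustration of the hazard only; types and functions are placeholders.
use std::sync::RwLock;

struct RedoManager {
    // Lazily-spawned process handle, guarded by a *synchronous* lock.
    process: RwLock<Option<ProcessHandle>>,
}

struct ProcessHandle;

impl RedoManager {
    fn get_or_spawn(&self) {
        let mut guard = self.process.write().unwrap();
        if guard.is_none() {
            // Blocking work (fork/exec, filesystem setup) happens while the
            // write lock is held. Any other async task that calls
            // self.process.read() now parks its executor thread until the
            // spawn finishes; with enough callers, the whole runtime stalls.
            *guard = Some(spawn_walredo_process());
        }
    }
}

fn spawn_walredo_process() -> ProcessHandle {
    // Stand-in for the expensive, blocking spawn.
    ProcessHandle
}
```

Possible mitigations are moving the blocking spawn out of the critical section (or onto spawn_blocking) and/or using tokio::sync::RwLock so waiters yield to the scheduler instead of parking their threads.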
Implemented runtime-configurability for sync vs async walredo: https://github.com/neondatabase/neon/pull/6548#issuecomment-2035401363 It's not maintainable: unlike the previous PR, which switched from sync to async, the runtime-configurability is achieved by restoring the old code, so now we have two copies that vary only in
- sync vs async
- has a walredo timeout vs does not
I suppose we could merge it and use the code for a few weeks to do experimentation / gradual rollout. I could add myself as a required reviewer in CODEOWNERS to ensure the two copies don't drift apart. But after a few weeks, we should throw out the old code or revert back to sync.
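For illustration, the dispatch shape such a config option could take; names are hypothetical, error handling assumes anyhow, and the actual PR keeps two full copies of the walredo code rather than a small switch like this:

```rust
// Hypothetical sketch of a runtime-selectable walredo mode; not the PR's code.
#[derive(Debug, Clone, Copy)]
pub enum WalRedoMode {
    Sync,  // old implementation: blocking pipe I/O
    Async, // new implementation: tokio pipe I/O
}

pub async fn apply_wal_records(
    mode: WalRedoMode,
    records: Vec<Vec<u8>>,
) -> anyhow::Result<Vec<u8>> {
    match mode {
        // Old implementation: blocking pipe I/O. Note that this blocks the
        // executor thread it runs on, which is exactly the problem this epic
        // addresses.
        WalRedoMode::Sync => apply_sync(records),
        // New implementation: stays on the executor and yields at I/O.
        WalRedoMode::Async => apply_async(records).await,
    }
}

fn apply_sync(_records: Vec<Vec<u8>>) -> anyhow::Result<Vec<u8>> {
    unimplemented!("old blocking implementation")
}

async fn apply_async(_records: Vec<Vec<u8>>) -> anyhow::Result<Vec<u8>> {
    unimplemented!("new async implementation")
}
```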
Not happy with the progress on this; generally, I think we're being overly cautious.
Ongoing: gradual rollout of sync/async configurable code + switching on the async mode.
EOW goal: deploy across staging (a prod region next week)
Same plan this week: config change in preprod after the release, then gradually in prod.
Rollout to remaining regions happening this week.
Rollout went without problems.
This week: address the remaining task items, i.e.,
- change default to async
- make staging & prod use default
- remove config option
- remove sync impl