Epic: async walredo
Problem
walredo uses a blocking poll() system call, stalling the executor thread if the walredo process doesn't answer requests.
DoD
We use tokio to read from / write to the walredo pipes.
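For illustration, a minimal sketch of what tokio-based pipe I/O with the walredo child process could look like; the struct, method names, and page size here are placeholders, not the actual pageserver code:

```rust
// Illustrative sketch only, not the pageserver implementation.
use std::process::Stdio;
use tokio::io::{AsyncReadExt, AsyncWriteExt};
use tokio::process::{Child, Command};

struct WalRedoProcess {
    child: Child,
}

impl WalRedoProcess {
    async fn spawn(pg_bin: &str) -> std::io::Result<Self> {
        let child = Command::new(pg_bin)
            .arg("--wal-redo") // wal-redo mode of the postgres binary
            .stdin(Stdio::piped())
            .stdout(Stdio::piped())
            .spawn()?;
        Ok(Self { child })
    }

    /// Send a reconstruction request and read one 8 KiB page back.
    /// Both operations yield to the tokio executor instead of blocking
    /// the thread in poll(2) like the synchronous implementation does.
    async fn request(&mut self, request: &[u8]) -> std::io::Result<[u8; 8192]> {
        let stdin = self.child.stdin.as_mut().expect("stdin is piped");
        let stdout = self.child.stdout.as_mut().expect("stdout is piped");

        stdin.write_all(request).await?;
        stdin.flush().await?;

        let mut page = [0u8; 8192];
        stdout.read_exact(&mut page).await?;
        Ok(page)
    }
}
```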
Related
Do this after https://github.com/neondatabase/neon/issues/6565
Impl
- [ ] https://github.com/neondatabase/neon/pull/6573
- [ ] https://github.com/neondatabase/neon/pull/6568
- [ ] https://github.com/neondatabase/neon/issues/6648
- [ ] https://github.com/neondatabase/neon/pull/6548
How many bytes of RAM do we allow for a single page reconstruction? I'm pausing #6881 because it would have made pageserver OOMs more likely by yielding while applying, but of course, we could already have been near-OOM in the situations with logical replication slots before they were "fixed" by #6873.
What kind of concurrency limiter could we even use here? Because we cannot know how much get_value_reconstruct_data will allocate before we actually do the search, should we just have a hard limit of 16MB per key? Well, this is probably the wrong issue to discuss that.
The initial concurrency limit I'm thinking of is the number of vCPUs (= executor threads).
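A minimal sketch of such a limiter, assuming a tokio Semaphore sized to the available parallelism (all names are hypothetical, not pageserver code):

```rust
// Illustrative sketch of a vCPU-sized concurrency limiter; not pageserver code.
use std::sync::Arc;
use tokio::sync::Semaphore;

fn walredo_concurrency_limit() -> usize {
    // One in-flight reconstruction per executor thread / vCPU.
    std::thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1)
}

async fn with_walredo_permit<T>(
    limiter: &Arc<Semaphore>,
    work: impl std::future::Future<Output = T>,
) -> T {
    // Waits here when all permits are taken, bounding the number of
    // concurrent page reconstructions (and thus their peak memory use).
    let _permit = limiter.acquire().await.expect("semaphore is never closed");
    work.await
}
```

The semaphore would be created once, e.g. Arc::new(Semaphore::new(walredo_concurrency_limit())), and shared by all request handlers.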
- Action this week: benchmark changes and progress PRs.
Merging these changes will block on writing an improved benchmark
Update:
- the benchmark that measures single-tenant scalability is merged
- async walredo improves performance by up to 50% there: https://github.com/neondatabase/neon/pull/6548/files#diff-fbcb6117f0e507ed04478e6ef073af0b1a976f87111b04af1a9bf02e1bbb0106R33-R48
- however, multi-tenant scalability and system behavior when we're CPU-bottlenecked are still hard to understand and not quantified by that benchmark
- => switching pageserver to a single tokio runtime to eliminate some uncertainty and make things simpler overall (rough sketch below): https://github.com/neondatabase/neon/pull/6555
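For context, a rough sketch of what "single runtime" means here; this is illustrative only, not the contents of the PR:

```rust
// Illustrative: one shared multi-threaded runtime instead of several
// purpose-specific ones (background tasks, compute requests, walredo, ...).
use once_cell::sync::Lazy;
use tokio::runtime::Runtime;

pub static THE_RUNTIME: Lazy<Runtime> = Lazy::new(|| {
    tokio::runtime::Builder::new_multi_thread()
        .enable_all()
        .build()
        .expect("failed to build the pageserver runtime")
});

// Callers then spawn onto the shared runtime:
// THE_RUNTIME.spawn(async { /* background or request work */ });
```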
This week:
- observe single runtime change in prod
- rebase async walredo PR, figure out how to quantify multi-tenant scalability effects
Update:
observe single runtime change in prod
We reverted it (#7246) after prod-like cloudbench showed a ramp-up in walredo processes, presumably because compaction_loop didn't run. Some hypotheses / investigations are in this thread: https://neondb.slack.com/archives/C06K38EB05D/p1711451737661299?thread_ts=1711445451.915969&cid=C06K38EB05D
rebase async walredo PR
Didn't happen.
figure out how to quantify multi-tenant scalability effects
Did a comparison benchmark between sync and async: https://www.notion.so/neondatabase/2024-03-25-async-walredo-benchmarking-2586fdb14d504996bc2192c1bba2f1ab?pvs=4
- throughput: async delivers 8% better throughput (22.9k with async vs 21.2k with sync)
- tail latencies at a fixed throughput slightly below the sweet spot of both implementations (20k)
- Results
- async has slightly higher tail latencies (p99.99 is 16.68ms vs 15.41ms with sync)
- Overall getpage latency is way more muddy with Sync than with Async (see the histogram screenshots)
- but walredo request-response latency is much cleaner with Sync (because there’s no competition for CPU)
Update for this week:
Single runtime change revert investigation will continue in this issue => https://github.com/neondatabase/neon/issues/7312. Spent 2h on it. Could work with Peter to repro on a setup that I control, but it'd likely consume multiple days. An alternative is to wait for the prod-like cloudbench to work in us-east-2, then run a pre-merge build with just one runtime. Another alternative to explore is to make one-runtime a pageserver config option; that would make it easier to play around with, especially in staging.
While reviewing code, I found a usage of std::sync::RwLock in walredo.rs that has the potential to stall all executors while a spawn is ongoing. => PR'ed (already merged): https://github.com/neondatabase/neon/pull/7310
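For context, a sketch of the hazard (placeholder types, not the actual walredo.rs code or the fix): holding a std::sync::RwLock write guard across a slow, blocking operation such as a process spawn means every other task that touches the lock parks its OS thread, which can stall all executor threads.

```rust
// Illustration of the hazard only; types and functions are placeholders.
use std::sync::RwLock;

struct RedoManager {
    // Lazily-spawned process handle, guarded by a *synchronous* lock.
    process: RwLock<Option<ProcessHandle>>,
}

struct ProcessHandle;

impl RedoManager {
    fn get_or_spawn(&self) {
        let mut guard = self.process.write().unwrap();
        if guard.is_none() {
            // Blocking work (fork/exec, filesystem setup) happens while the
            // write lock is held. Any other async task that calls
            // self.process.read() now parks its executor thread until the
            // spawn finishes; with enough callers, the whole runtime stalls.
            *guard = Some(spawn_walredo_process());
        }
    }
}

fn spawn_walredo_process() -> ProcessHandle {
    // Stand-in for the expensive, blocking spawn.
    ProcessHandle
}
```

Possible mitigations are moving the blocking spawn out of the critical section (or onto spawn_blocking) and/or using tokio::sync::RwLock so waiters yield to the scheduler instead of parking their threads.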
Implemented runtime-configurability for sync vs async walredo: https://github.com/neondatabase/neon/pull/6548#issuecomment-2035401363 It's not maintainable: unlike the previous PR, which switched from sync to async, the runtime-configurability is achieved by restoring the old code, so now we have two copies that vary only in
- sync vs async
- has a walredo timeout vs does not
I suppose we could merge it and use the code for a few weeks to do experimentation / gradual rollout. I could add myself as a required reviewer in CODEOWNERS to ensure the two copies don't drift apart. But after a few weeks, we should throw out the old code or revert back to sync.
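For illustration, the dispatch shape such a config option could take; names are hypothetical, error handling assumes anyhow, and the actual PR keeps two full copies of the walredo code rather than a small switch like this:

```rust
// Hypothetical sketch of a runtime-selectable walredo mode; not the PR's code.
#[derive(Debug, Clone, Copy)]
pub enum WalRedoMode {
    Sync,  // old implementation: blocking pipe I/O
    Async, // new implementation: tokio pipe I/O
}

pub async fn apply_wal_records(
    mode: WalRedoMode,
    records: Vec<Vec<u8>>,
) -> anyhow::Result<Vec<u8>> {
    match mode {
        // Old implementation: blocking pipe I/O. Note that this blocks the
        // executor thread it runs on, which is exactly the problem this epic
        // addresses.
        WalRedoMode::Sync => apply_sync(records),
        // New implementation: stays on the executor and yields at I/O.
        WalRedoMode::Async => apply_async(records).await,
    }
}

fn apply_sync(_records: Vec<Vec<u8>>) -> anyhow::Result<Vec<u8>> {
    unimplemented!("old blocking implementation")
}

async fn apply_async(_records: Vec<Vec<u8>>) -> anyhow::Result<Vec<u8>> {
    unimplemented!("new async implementation")
}
```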
Not happy with the progress on this; generally, I think we're being overly cautious.
Ongoing: gradual rollout of sync/async configurable code + switching on the async mode.
EOW goal: deploy across staging (a prod region next week)
Same plan this week: config change in preprod after the release, then gradually in prod.
Rollout to remaining regions happening this week.
Rollout went without problems.
This week: address the remaining task items, i.e.,
- change default to async
- make staging & prod use default
- remove config option
- remove sync impl