
Epic: async walredo

problame opened this issue

Problem

walredo uses a blocking poll() system call, stalling the executor if the walredo process doesn't answer requests.

DoD

We use tokio to read from / write to the walredo pipes.
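As a rough illustration of that target state (a sketch only; the struct name, request framing, and binary path below are made up, not the actual pageserver protocol), async pipe I/O to a walredo-style child process with tokio looks roughly like this:

```rust
use std::process::Stdio;

use tokio::io::{AsyncReadExt, AsyncWriteExt, BufReader, BufWriter};
use tokio::process::{Child, ChildStdin, ChildStdout, Command};

/// Illustrative wrapper around a walredo-like child process.
/// The real request/response framing is different; this only shows the
/// async pipe I/O that replaces the blocking poll()/read()/write() calls.
struct RedoProcess {
    child: Child, // kept so we can wait()/kill() it later
    stdin: BufWriter<ChildStdin>,
    stdout: BufReader<ChildStdout>,
}

impl RedoProcess {
    async fn spawn(path: &str) -> std::io::Result<Self> {
        let mut child = Command::new(path)
            .stdin(Stdio::piped())
            .stdout(Stdio::piped())
            .spawn()?;
        let stdin = BufWriter::new(child.stdin.take().expect("stdin is piped"));
        let stdout = BufReader::new(child.stdout.take().expect("stdout is piped"));
        Ok(Self { child, stdin, stdout })
    }

    /// Send one request and await a fixed-size response.
    /// Awaiting here parks only this task; the executor thread stays free
    /// to run other tasks while the child process is working.
    async fn request(&mut self, req: &[u8], page_size: usize) -> std::io::Result<Vec<u8>> {
        self.stdin.write_all(req).await?;
        self.stdin.flush().await?;
        let mut page = vec![0u8; page_size];
        self.stdout.read_exact(&mut page).await?;
        Ok(page)
    }
}
```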

Related

Do this after https://github.com/neondatabase/neon/issues/6565

Impl

- [ ] https://github.com/neondatabase/neon/pull/6573
- [ ] https://github.com/neondatabase/neon/pull/6568
- [ ] https://github.com/neondatabase/neon/issues/6648
- [ ] https://github.com/neondatabase/neon/pull/6548

problame avatar Feb 05 '24 11:02 problame

How many bytes of RAM do we allow for single page reconstruction? I'm pausing #6881 because it would have made pageserver OOMs more likely by yielding while applying, but of course, we could already have been near-OOM in situations with logical replication slots before they were "fixed" by #6873.

What kind of concurrency limiter could we even use here... Because we cannot know how much memory the value reconstruct data will take before we actually do the search, should we just have a hard limit of 16MB per key? Well, this is probably the wrong issue to discuss that.

koivunej avatar Feb 26 '24 08:02 koivunej

Initial concurrency limit I'm thinking of is number of vCPUs = executor threads.
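A minimal sketch of that kind of limiter, assuming we simply gate walredo requests on a global tokio::sync::Semaphore sized to the number of executor threads (the names here are illustrative, not actual pageserver code):

```rust
use std::sync::Arc;
use std::thread::available_parallelism;

use tokio::sync::Semaphore;

/// Hypothetical global limiter: at most one in-flight walredo request
/// per executor thread (= vCPU on the default multi-threaded runtime).
fn walredo_limiter() -> Arc<Semaphore> {
    let permits = available_parallelism().map(|n| n.get()).unwrap_or(1);
    Arc::new(Semaphore::new(permits))
}

async fn reconstruct_page(limiter: &Semaphore /*, key, records, ... */) {
    // Waiting here is async: the task is parked, the executor thread stays free.
    let _permit = limiter.acquire().await.expect("semaphore is never closed");
    // ... do the (async) walredo request/response while holding the permit ...
}
```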

problame avatar Feb 26 '24 10:02 problame

  • Action this week: benchmark changes and progress PRs.

jcsp avatar Mar 11 '24 11:03 jcsp

Merging these changes will block on writing an improved benchmark.

jcsp avatar Mar 18 '24 11:03 jcsp

Update:

  • benchmark that measures single-tenant scalability is merged
    • async walredo improves up to 50% here: https://github.com/neondatabase/neon/pull/6548/files#diff-fbcb6117f0e507ed04478e6ef073af0b1a976f87111b04af1a9bf02e1bbb0106R33-R48
  • however, multi-tenant scalability and system behavior when we're CPU-bottlenecked are still hard to understand and not quantified by that benchmark
  • => switching the pageserver to a single runtime to eliminate some uncertainty and make things simpler overall (see the sketch below): https://github.com/neondatabase/neon/pull/6555
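
For context, a rough sketch of what the single-runtime setup means (the name and builder options below are illustrative assumptions, not the actual pageserver code): all tasks get spawned onto one shared multi-threaded tokio runtime instead of several purpose-specific ones.

```rust
use once_cell::sync::Lazy;
use tokio::runtime::Runtime;

// Illustrative sketch: a single process-wide multi-threaded runtime.
// By default the worker thread count equals the number of CPUs, so request
// handling and background work share one executor pool.
pub static THE_RUNTIME: Lazy<Runtime> = Lazy::new(|| {
    tokio::runtime::Builder::new_multi_thread()
        .enable_all()
        .thread_name("pageserver-worker")
        .build()
        .expect("failed to build the tokio runtime")
});
```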

This week:

  • observe single runtime change in prod
  • rebase async walredo PR, figure out how to quantify multi-tenant scalability effects

problame avatar Mar 25 '24 10:03 problame

Update:

observe single runtime change in prod

We reverted it (#7246) after a prod-like cloudbench run showed a ramp-up in walredo processes, presumably because compaction_loop didn't run. Some hypotheses / investigations are in this thread: https://neondb.slack.com/archives/C06K38EB05D/p1711451737661299?thread_ts=1711445451.915969&cid=C06K38EB05D

rebase async walredo PR

Didn't happen.

figure out how to quantify multi-tenant scalability effects

Did a comparison benchmark between sync and async: https://www.notion.so/neondatabase/2024-03-25-async-walredo-benchmarking-2586fdb14d504996bc2192c1bba2f1ab?pvs=4

  • throughput: async delivers 8% better throughput (22.9k with async vs 21.2k with sync)
  • tail latencies at a well-defined throughput slightly below the sweet spot of both implementations (20k):
    • async has slightly higher tail latencies (p99.99 is 16.68ms vs 15.41ms with sync)
    • overall getpage latency is way more muddy with sync than with async (see the histogram screenshots)
    • but walredo request-response latency is much cleaner with sync (because there's no competition for CPU)

problame avatar Mar 28 '24 12:03 problame

Update for this week:


The single-runtime-change revert investigation will continue in this issue => https://github.com/neondatabase/neon/issues/7312. Spent 2h on it. I could work with Peter to repro on a setup that I control, but it'd likely consume multiple days. An alternative is to wait for the prod-like cloud bench to work in us-east-2, then run a pre-merge build with just one runtime. Another alternative to explore is to make one-runtime a pageserver config option, which would make it easier to play around with, especially in staging.


While reviewing code, I found a usage of std::sync::RwLock in walredo.rs that has the potential to stall all executor threads while a spawn is ongoing. => PR'ed (already merged): https://github.com/neondatabase/neon/pull/7310
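
To illustrate the general hazard (a simplified sketch, not the actual code or the fix in that PR): if a std::sync::RwLock is held while something slow and blocking like a process spawn runs, every executor thread that then touches the lock blocks on it, stalling the runtime. Using an async lock and keeping the slow work outside the critical section avoids that:

```rust
use tokio::process::{Child, Command};
use tokio::sync::RwLock; // async lock: contended waiters yield instead of blocking the thread

/// Simplified sketch of the hazard, not the actual walredo manager.
struct RedoManager {
    // With std::sync::RwLock here, a writer doing a slow spawn inside the
    // critical section would make every reader block its executor thread
    // until the spawn finishes.
    process: RwLock<Option<Child>>,
}

impl RedoManager {
    async fn ensure_process(&self) -> std::io::Result<()> {
        if self.process.read().await.is_some() {
            return Ok(());
        }
        // Do the potentially slow spawn *before* taking the write lock, so the
        // critical section only swaps the result in.
        let child = Command::new("/usr/local/bin/walredo-helper") // hypothetical binary
            .spawn()?;
        let mut slot = self.process.write().await;
        if slot.is_none() {
            *slot = Some(child);
        } // else: we raced with another task; a real implementation would clean up `child`
        Ok(())
    }
}
```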


Implemented runtime-configurability for sync vs async walredo: https://github.com/neondatabase/neon/pull/6548#issuecomment-2035401363. It's not maintainable: unlike the previous PR that switched from sync to async, the runtime-configurability is achieved by restoring the old code, so now we have two copies of it that vary only in

  • sync vs async
  • has a walredo timeout vs does not

I suppose we could merge it and use the code for a few weeks to do experimentation / gradual rollout. I could add myself as a required reviewer via CODEOWNERS to ensure it doesn't get out of sync. But after a few weeks, we should throw out the old code or revert back to sync.
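
For illustration, the kind of config-level switch meant here could look like the following sketch (the enum and function names are hypothetical, not the actual pageserver config):

```rust
/// Hypothetical config knob selecting the walredo implementation
/// (not the actual pageserver config field name).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum WalRedoMode {
    /// old blocking implementation (the one with the walredo timeout)
    Sync,
    /// new tokio-based implementation
    Async,
}

pub async fn apply_wal_records(mode: WalRedoMode /*, base_img, records, ... */) {
    match mode {
        // Both arms end up being near-identical copies of the same
        // request/response logic, which is the maintainability concern above.
        WalRedoMode::Sync => { /* dispatch to the sync implementation */ }
        WalRedoMode::Async => { /* dispatch to the async implementation */ }
    }
}
```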

Not happy with the progress on this; generally, I think we're being overly cautious.

problame avatar Apr 04 '24 19:04 problame

Ongoing: gradual rollout of sync/async configurable code + switching on the async mode.

EOW goal: deploy across staging (a prod region next week)

jcsp avatar Apr 15 '24 10:04 jcsp

Same plan this week: config change in preprod after the release, then gradually in prod.

jcsp avatar Apr 22 '24 13:04 jcsp

Rollout to remaining regions happening this week.

problame avatar May 06 '24 14:05 problame

Rollout went without problems.

This week: address the remaining task items, i.e.,

  • change default to async
  • make staging & prod use default
  • remove config option
  • remove sync impl

problame avatar May 13 '24 13:05 problame