
Epic: scalable async disk IO (tokio-epoll-uring)

problame opened this issue • 7 comments

The async get_value_reconstruct_data project (GH project, final commit with good summary) converted more of the pageserver code base to async fns.

We had to revert that final commit due to performance considerations. The current hypothesis is that most of the spawn_blocking'ed calls were completely CPU-bound and too short-lived (single-digit microsecond range).

Why were they short-lived? The current hypothesis is that

  1. the pageserver-internal page cache hit rate is not as terrible as we thought, and
  2. the kernel page cache held most of the remaining data

Under that hypothesis, spawn_blocking has too much overhead (CPU time => latency) for work that only takes single-digit microseconds. In fact, microbenchmarks suggest that the break-even point on our systems is at ca. 25us of work.

(More details: https://www.notion.so/neondatabase/Why-we-needed-to-revert-my-async-get_value_reconstruct_data-patch-or-what-I-learned-about-spawn_b-91f28c48b7314765bdeed6e8cb38fdce?pvs=4 )
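
To make the break-even point concrete, here is a toy sketch (not the benchmark referenced above; the numbers it produces will vary by machine) that compares running a short CPU-bound closure inline on the async task vs. dispatching it through spawn_blocking:

```rust
// Toy sketch of how the spawn_blocking break-even point can be measured.
// The ~25us figure quoted above came from dedicated microbenchmarks on our
// systems; this is only meant to illustrate the shape of the comparison.
use std::time::{Duration, Instant};

fn busy_work(duration: Duration) {
    // Simulates a short, purely CPU-bound piece of work
    // (e.g. processing a page that is already in the page cache).
    let start = Instant::now();
    while start.elapsed() < duration {
        std::hint::spin_loop();
    }
}

#[tokio::main]
async fn main() {
    const ITERS: u32 = 10_000;
    let work = Duration::from_micros(5); // single-digit-microsecond work item

    // Inline on the async task: no handoff overhead.
    let start = Instant::now();
    for _ in 0..ITERS {
        busy_work(work);
    }
    let inline = start.elapsed() / ITERS;

    // Through spawn_blocking: pays the thread-pool handoff on every call.
    let start = Instant::now();
    for _ in 0..ITERS {
        tokio::task::spawn_blocking(move || busy_work(work))
            .await
            .unwrap();
    }
    let offloaded = start.elapsed() / ITERS;

    println!("inline: {inline:?} per op, spawn_blocking: {offloaded:?} per op");
}
```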

We considered switching to a design with

  • large pageserver-internal page cache
  • O_DIRECT for reads, using spawn_blocking
  • hence, no kernel-page-cache hits

The problem is that we'd

  • lose pageserver-internal page cache contents on each pageserver restart
  • likely need to invest more time into the pageserver page cache (scalability issues).

So, this epic sets out to explore how we can continue to mostly rely on the kernel page cache for our read IO on the Timeline::get code path, in a way that is more scalable than spawn_blocking.
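
The direction explored here is an owned-buffer, io_uring-backed read API driven from the tokio runtime, so that reads which hit the kernel page cache complete cheaply without a blocking-pool handoff. As a rough sketch of what such a read looks like with tokio-epoll-uring, modeled on the crate's README example (treat the exact signatures as assumptions; the development repository linked further down in this thread is authoritative):

```rust
// Sketch only: based on the tokio-epoll-uring README example. The exact
// signatures (System::launch, SystemHandle::read) are an assumption here;
// check the repo for the current API.
use std::os::fd::OwnedFd;

#[tokio::main]
async fn main() {
    // Bring up an io_uring instance owned by the tokio-epoll-uring system.
    let system = tokio_epoll_uring::System::launch().await.unwrap();

    let file = std::fs::File::open("/dev/zero").unwrap();
    let fd: OwnedFd = file.into();

    // Ownership of the fd and the buffer is passed into the operation and
    // handed back when it completes.
    let buf = vec![0u8; 4096];
    let ((_fd, buf), res) = system.read(fd, 0, buf).await;

    println!("read {} bytes", res.unwrap());
    assert_eq!(buf.len(), 4096);
}
```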


This epic is the sister-epic to

  • #4743
### High-Level
- [x] prototype & benchmark tokio-epoll-uring
- [ ] https://github.com/neondatabase/neon/issues/5479
- [x] get tokio-epoll-uring to a level of code quality that's accepted by the team
- [x] (the revert-revert above is a blocker for integrating tokio-epoll-uring)
- [x] integrate epoll uring behind a config option, off by default
- [ ] https://github.com/neondatabase/neon/issues/6508
- [ ] https://github.com/neondatabase/neon/issues/6509
- [x] rollout to staging
- [ ] https://github.com/neondatabase/neon/issues/6665
- [ ] https://github.com/neondatabase/cloud/issues/10481
- [ ] change default on Linux to tokio-epoll-uring
### Impl
- [x] NB: incomplete list; there was a lot of tokio-epoll-uring work before I pushed it to GitHub
- [ ] https://github.com/neondatabase/tokio-epoll-uring/pull/21
- [ ] https://github.com/neondatabase/neon/pull/6355
- [ ] https://github.com/neondatabase/neon/pull/5824
- [ ] https://github.com/neondatabase/neon/issues/6373
- [ ] https://github.com/neondatabase/neon/pull/6492
- [ ] https://github.com/neondatabase/neon/pull/6501
- [ ] https://github.com/neondatabase/aws/pull/932
- [ ] https://github.com/neondatabase/neon/issues/6663
### Tasks
- [ ] https://github.com/neondatabase/neon/issues/6368

problame • Jul 18 '23 08:07

Development happens in https://github.com/neondatabase/tokio-epoll-uring

problame • Aug 22 '23 10:08

  • Last week: enabled in CI. No major issues. 1-2 flaky tests being investigated to confirm they're pre-existing issues.
  • This week: benchmarking
  • This week: enable in staging.

jcsp • Jan 29 '24 11:01

  • Deployed to staging: done (https://github.com/neondatabase/aws/pull/932)
  • Benchmarking: done (https://github.com/neondatabase/neon/issues/6509#issuecomment-1917551342)

Will PR the changes from my benchmarking baseline case over the next couple of days.

problame • Jan 30 '24 17:01

Moved from https://github.com/neondatabase/neon/issues/2975#issuecomment-1926769387 by @jcsp


This week:

  • Async write operations
  • Avoid using spawn_blocking(block_on) because of the overhead of churning io_uring runtimes (while keeping it for prod until we cut over to io_uring); see the sketch after this list
  • Cut over at least one prod region/pageserver to io_uring
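
For context, a minimal, self-contained sketch of the spawn_blocking(block_on) pattern referred to above; the names here are illustrative, not the actual pageserver code:

```rust
// Wrapping the async read in spawn_blocking + Handle::block_on works, but
// each blocking-pool thread that runs such a closure ends up bringing up
// (and later tearing down) its own io_uring instance, which is the "churn"
// mentioned above.
use tokio::runtime::Handle;

// Stand-in for an async, io_uring-backed read (the real call goes through
// tokio-epoll-uring / the pageserver's VirtualFile layer).
async fn fake_uring_read(buf: Vec<u8>) -> (Vec<u8>, usize) {
    let len = buf.len();
    (buf, len)
}

#[tokio::main]
async fn main() {
    let handle = Handle::current();
    let buf = vec![0u8; 8192];

    // The pattern kept for prod until the cut-over: hop to the blocking pool
    // and block that thread on the async read.
    let (buf, n) = tokio::task::spawn_blocking(move || handle.block_on(fake_uring_read(buf)))
        .await
        .unwrap();

    println!("read {n} bytes into a {}-byte buffer", buf.len());
}
```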

problame • Feb 05 '24 11:02

Monitoring staging with new metrics through this week, to get more insight into whether our limits on locked memory are going to be a problem in prod. Maybe prod next week.
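
For reference, the locked-memory limits in question are the RLIMIT_MEMLOCK resource limits; on some kernels, io_uring ring memory and registered buffers are accounted against them, which is one place io_uring setups can bump into those limits. A small standalone sketch (not pageserver code) that prints the current limits:

```rust
// Prints the current RLIMIT_MEMLOCK soft and hard limits.
// Requires the `libc` crate as a dependency.
use libc::{getrlimit, rlimit, RLIMIT_MEMLOCK};

fn main() {
    let mut lim = rlimit { rlim_cur: 0, rlim_max: 0 };
    // Safety: we pass a valid pointer to an rlimit struct.
    let ret = unsafe { getrlimit(RLIMIT_MEMLOCK, &mut lim) };
    assert_eq!(ret, 0, "getrlimit failed");
    println!(
        "RLIMIT_MEMLOCK: soft={} bytes, hard={} bytes",
        lim.rlim_cur, lim.rlim_max
    );
}
```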

jcsp • Feb 12 '24 11:02

This week:

  1. Merge the change that gets rid of spurious metadata reads (this unblocks the change that removes spurious writes). Once we're sure we aren't reverting the release, we can proceed to 2.
  2. Merge the changes that do the VirtualFile spawn_blockings, thereby removing the spawning of lots of io_uring instances. Observe this change in staging this week.
  3. Enable on one pageserver in ap-southeast-1 (not blocked by 1 or 2)

jcsp • Feb 19 '24 11:02

This week:

  • Do a mid-week deploy to enable tokio-epoll-uring on the remaining prod fleet.
  • PR to change the Linux default to tokio-epoll-uring + conditionally use it in dev if the kernel is new enough

jcsp • Feb 26 '24 11:02

> Do a mid-week deploy to enable tokio-epoll-uring on the remaining prod fleet.

Couldn't happen last week; will happen mid this week.

> PR to change the Linux default to tokio-epoll-uring + conditionally use it in dev if the kernel is new enough

Do it after we've switched prod over.


write path

  • get layer file creation patch stack merged (work happened last week)

on-demand downloads

  • get PR in reviewable shape
  • measure / prevent regressions using a new pagebench benchmark to churn on-demand downloads
  • maybe experiment with different buffer sizes

problame • Mar 04 '24 11:03

This week:

  • on-demand download should use epoll-uring

jcsp • Mar 11 '24 11:03

> on-demand download should use epoll-uring

Implemented & benchmarked last week, shipping this week.

This week

  • observe perf impact & resource usage in prod
    • possibly automate the benchmark

problame • Mar 18 '24 11:03