
Epic: scalable async disk IO (tokio-epoll-uring)

problame opened this issue • 7 comments

The async get_value_reconstruct_data project (GH project, final commit with good summary) converted more of the pageserver code base to async fns.

We had to revert that final commit due to performance considerations. The current hypothesis is that most of the spawn_blocking'ed calls were completely CPU-bound and too short-lived (single-digit microsecond range).

Why were they short-lived? The current hypothesis is that

  1. the pageserver-internal page cache hit rate is not as terrible as we thought, and
  2. the kernel page cache held most of the remaining data

Under that hypothesis, spawn_blocking has too much overhead (CPU time => latency) for work that only takes single-digit microseconds. In fact, microbenchmarks suggest that the break-even point on our systems is at ca. 25us of work.

(More details: https://www.notion.so/neondatabase/Why-we-needed-to-revert-my-async-get_value_reconstruct_data-patch-or-what-I-learned-about-spawn_b-91f28c48b7314765bdeed6e8cb38fdce?pvs=4 )
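
To make the break-even point concrete, here is a toy sketch (not the benchmark referenced above; the numbers it produces will vary by machine) that compares running a short CPU-bound closure inline on the async task vs. dispatching it through spawn_blocking:

```rust
// Toy sketch of how the spawn_blocking break-even point can be measured.
// The ~25us figure quoted above came from dedicated microbenchmarks on our
// systems; this is only meant to illustrate the shape of the comparison.
use std::time::{Duration, Instant};

fn busy_work(duration: Duration) {
    // Simulates a short, purely CPU-bound piece of work
    // (e.g. processing a page that is already in the page cache).
    let start = Instant::now();
    while start.elapsed() < duration {
        std::hint::spin_loop();
    }
}

#[tokio::main]
async fn main() {
    const ITERS: u32 = 10_000;
    let work = Duration::from_micros(5); // single-digit-microsecond work item

    // Inline on the async task: no handoff overhead.
    let start = Instant::now();
    for _ in 0..ITERS {
        busy_work(work);
    }
    let inline = start.elapsed() / ITERS;

    // Through spawn_blocking: pays the thread-pool handoff on every call.
    let start = Instant::now();
    for _ in 0..ITERS {
        tokio::task::spawn_blocking(move || busy_work(work))
            .await
            .unwrap();
    }
    let offloaded = start.elapsed() / ITERS;

    println!("inline: {inline:?} per op, spawn_blocking: {offloaded:?} per op");
}
```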

We considered switching to a design with

  • large pageserver-internal page cache
  • O_DIRECT for reads, using spawn_blocking
  • hence, no kernel-page-cache hits

The problem is that we'd

  • lose pageserver-internal page cache contents on each pageserver restart
  • likely need to invest more time into the pageserver page cache (scalability issues).

So, this epic sets out to explore how we can continue to mostly rely on the kernel page cache for our read IO on the Timeline::get code path, in a way that is more scalable than spawn_blocking.
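
The direction explored here is an owned-buffer, io_uring-backed read API driven from the tokio runtime, so that reads which hit the kernel page cache complete cheaply without a blocking-pool handoff. As a rough sketch of what such a read looks like with tokio-epoll-uring, modeled on the crate's README example (treat the exact signatures as assumptions; the development repository linked further down in this thread is authoritative):

```rust
// Sketch only: based on the tokio-epoll-uring README example. The exact
// signatures (System::launch, SystemHandle::read) are an assumption here;
// check the repo for the current API.
use std::os::fd::OwnedFd;

#[tokio::main]
async fn main() {
    // Bring up an io_uring instance owned by the tokio-epoll-uring system.
    let system = tokio_epoll_uring::System::launch().await.unwrap();

    let file = std::fs::File::open("/dev/zero").unwrap();
    let fd: OwnedFd = file.into();

    // Ownership of the fd and the buffer is passed into the operation and
    // handed back when it completes.
    let buf = vec![0u8; 4096];
    let ((_fd, buf), res) = system.read(fd, 0, buf).await;

    println!("read {} bytes", res.unwrap());
    assert_eq!(buf.len(), 4096);
}
```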


This epic is the sister-epic to

  • #4743
### High-Level
- [x] prototype & benchmark tokio-epoll-uring
- [ ] https://github.com/neondatabase/neon/issues/5479
- [x] get tokio-epoll-uring to a level of code quality that's accepted by the team
- [x] (the revert-revert above is a blocker for integrating tokio-epoll-uring)
- [x] integrate epoll uring behind a config option, off by default
- [ ] https://github.com/neondatabase/neon/issues/6508
- [ ] https://github.com/neondatabase/neon/issues/6509
- [x] rollout to staging
- [ ] https://github.com/neondatabase/neon/issues/6665
- [ ] https://github.com/neondatabase/cloud/issues/10481
- [ ] change default on Linux to tokio-epoll-uring
### Impl
- [x] NB: incomplete list; there was a lot of tokio-epoll-uring work before I pushed it to GitHub
- [ ] https://github.com/neondatabase/tokio-epoll-uring/pull/21
- [ ] https://github.com/neondatabase/neon/pull/6355
- [ ] https://github.com/neondatabase/neon/pull/5824
- [ ] https://github.com/neondatabase/neon/issues/6373
- [ ] https://github.com/neondatabase/neon/pull/6492
- [ ] https://github.com/neondatabase/neon/pull/6501
- [ ] https://github.com/neondatabase/aws/pull/932
- [ ] https://github.com/neondatabase/neon/issues/6663
### Tasks
- [ ] https://github.com/neondatabase/neon/issues/6368

problame • Jul 18 '23 08:07

Development happens in https://github.com/neondatabase/tokio-epoll-uring

problame • Aug 22 '23 10:08

  • Last week: enabled in CI. No major issues. 1-2 flaky tests being investigated to confirm they're pre-existing issues.
  • This week: benchmarking
  • This week: enable in staging.

jcsp • Jan 29 '24 11:01

  • Deployed to staging: done (https://github.com/neondatabase/aws/pull/932)
  • Benchmarking: done (https://github.com/neondatabase/neon/issues/6509#issuecomment-1917551342)

Will PR the changes from my benchmarking baseline case over the next couple of days.

problame • Jan 30 '24 17:01

Moved from https://github.com/neondatabase/neon/issues/2975#issuecomment-1926769387 by @jcsp


This week:

  • Async write operations
  • Avoid using spawn_blocking(block_on) because of the overhead of churning io_uring runtimes (while keeping it for prod until we cut over to io_uring); see the sketch after this list
  • Cut over at least one prod region/pageserver to io_uring
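
For context, a minimal, self-contained sketch of the spawn_blocking(block_on) pattern referred to above; the names here are illustrative, not the actual pageserver code:

```rust
// Wrapping the async read in spawn_blocking + Handle::block_on works, but
// each blocking-pool thread that runs such a closure ends up bringing up
// (and later tearing down) its own io_uring instance, which is the "churn"
// mentioned above.
use tokio::runtime::Handle;

// Stand-in for an async, io_uring-backed read (the real call goes through
// tokio-epoll-uring / the pageserver's VirtualFile layer).
async fn fake_uring_read(buf: Vec<u8>) -> (Vec<u8>, usize) {
    let len = buf.len();
    (buf, len)
}

#[tokio::main]
async fn main() {
    let handle = Handle::current();
    let buf = vec![0u8; 8192];

    // The pattern kept for prod until the cut-over: hop to the blocking pool
    // and block that thread on the async read.
    let (buf, n) = tokio::task::spawn_blocking(move || handle.block_on(fake_uring_read(buf)))
        .await
        .unwrap();

    println!("read {n} bytes into a {}-byte buffer", buf.len());
}
```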

problame • Feb 05 '24 11:02

Monitoring staging with new metrics through this week, to get more insight into whether our limits on locked memory are going to be a problem in prod. Maybe prod next week.
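
For reference, the locked-memory limits in question are the RLIMIT_MEMLOCK resource limits; on some kernels, io_uring ring memory and registered buffers are accounted against them, which is one place io_uring setups can bump into those limits. A small standalone sketch (not pageserver code) that prints the current limits:

```rust
// Prints the current RLIMIT_MEMLOCK soft and hard limits.
// Requires the `libc` crate as a dependency.
use libc::{getrlimit, rlimit, RLIMIT_MEMLOCK};

fn main() {
    let mut lim = rlimit { rlim_cur: 0, rlim_max: 0 };
    // Safety: we pass a valid pointer to an rlimit struct.
    let ret = unsafe { getrlimit(RLIMIT_MEMLOCK, &mut lim) };
    assert_eq!(ret, 0, "getrlimit failed");
    println!(
        "RLIMIT_MEMLOCK: soft={} bytes, hard={} bytes",
        lim.rlim_cur, lim.rlim_max
    );
}
```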

jcsp • Feb 12 '24 11:02

This week:

  1. Merge the change that gets rid of spurious metadata reads (this unblocks the change that removes spurious writes). Once we're sure we aren't reverting the release, we can proceed to 2.
  2. Merge the changes that do the VirtualFile spawn_blockings, thereby removing the spawning of lots of io_uring instances. Observe this change in staging this week.
  3. Enable on one pageserver in ap-southeast-1 (not blocked by 1 or 2)

jcsp • Feb 19 '24 11:02

This week:

  • Do a mid-week deploy to enable tokio-epoll-uring on the remaining prod fleet.
  • PR to change the Linux default to tokio-epoll-uring + conditionally use it in dev if the kernel is new enough

jcsp • Feb 26 '24 11:02

> Do a mid-week deploy to enable tokio-epoll-uring on the remaining prod fleet.

Couldn't happen last week; will happen mid this week.

> PR to change the Linux default to tokio-epoll-uring + conditionally use it in dev if the kernel is new enough

Do it after we've switched prod over.


write path

  • get layer file creation patch stack merged (work happened last week)

on-demand downloads

  • get PR in reviewable shape
  • measure / prevent regressions using a new pagebench benchmark to churn on-demand downloads
  • maybe experiment with different buffer sizes

problame • Mar 04 '24 11:03

This week:

  • on-demand download should use epoll-uring

jcsp • Mar 11 '24 11:03

> on-demand download should use epoll-uring

Implemented & benchmarked last week, shipping this week.

This week

  • observe perf impact & resource usage in prod
    • possibly automate the benchmark

problame • Mar 18 '24 11:03