Epic: scalable async disk IO (tokio-epoll-uring)
The async `value_get_reconstruct_data` project (GH project, final commit with good summary) converted more of the pageserver code base to `async fn`s.
We had to revert that final commit due to performance considerations.
The current hypothesis is that most of the `spawn_blocking`'ed calls were completely CPU-bound and too short-lived (single-digit microsecond range).
Why were they short-lived? The current hypothesis is that
- the pageserver-internal page cache hit rate is not as terrible as we thought, and
- the kernel page cache held most of the remaining data.
Under that hypothesis, `spawn_blocking` has too much overhead (CPU time => latency) for work that takes single-digit microseconds. In fact, microbenchmarks suggest that the break-even point is at ca. 25us of work on our systems.
(More details: https://www.notion.so/neondatabase/Why-we-needed-to-revert-my-async-get_value_reconstruct_data-patch-or-what-I-learned-about-spawn_b-91f28c48b7314765bdeed6e8cb38fdce?pvs=4 )
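To make the overhead concrete, here is a minimal, hypothetical sketch of the pattern in question (names and buffer size are illustrative, not the pageserver's actual API): a short, usually page-cache-hit read wrapped in `spawn_blocking`, where the thread-pool handoff costs more than the read itself.

```rust
use std::os::unix::fs::FileExt;
use std::sync::Arc;

// Hypothetical sketch: a ~single-digit-microsecond read (kernel page cache hit)
// wrapped in spawn_blocking. The handoff to the blocking pool and back costs
// on the order of the ~25us break-even measured above, dominating the read itself.
async fn read_block(file: Arc<std::fs::File>, offset: u64) -> std::io::Result<Vec<u8>> {
    tokio::task::spawn_blocking(move || {
        let mut buf = vec![0u8; 8192];
        file.read_exact_at(&mut buf, offset)?; // usually a kernel-page-cache hit
        Ok(buf)
    })
    .await
    .expect("spawn_blocking task panicked")
}
```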
We considered switching to a design with
- a large pageserver-internal page cache
- O_DIRECT for reads, using `spawn_blocking` (sketched below)
- hence, no kernel-page-cache hits
The problem is that we'd
- lose pageserver-internal page cache contents on each pageserver restart
- likely need to invest more time into the pageserver page cache (scalability issues).
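For illustration, a minimal sketch of what the considered O_DIRECT open path would look like (assumes the `libc` crate; the function name is a placeholder, not actual pageserver code):

```rust
use std::fs::{File, OpenOptions};
use std::os::unix::fs::OpenOptionsExt;
use std::path::Path;

// Sketch only: open a layer file with O_DIRECT so reads bypass the kernel page
// cache. Note that O_DIRECT also requires aligned buffers, offsets, and lengths
// (typically 512B or 4KiB), which the read path would have to guarantee.
fn open_for_direct_reads(path: &Path) -> std::io::Result<File> {
    OpenOptions::new()
        .read(true)
        .custom_flags(libc::O_DIRECT)
        .open(path)
}
```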
So, this epic sets out to explore how we can continue to mostly rely on the kernel page cache for our read IO on the `Timeline::get` code path, in a way that is more scalable than `spawn_blocking`.
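As background, the mechanism tokio-epoll-uring builds on is io_uring-based reads that still go through the kernel page cache. A minimal, hedged sketch of that mechanism using the `io-uring` crate directly (this is not tokio-epoll-uring's own API; file name and buffer size are placeholders):

```rust
use io_uring::{opcode, types, IoUring};
use std::os::unix::io::AsRawFd;

// Sketch: submit one read through io_uring and wait for its completion.
// A page-cache hit completes quickly and without handing work to a thread pool;
// tokio-epoll-uring wraps this submission/completion flow in a tokio-friendly,
// owned-buffer API.
fn main() -> std::io::Result<()> {
    let file = std::fs::File::open("some-layer-file")?; // placeholder path
    let mut ring = IoUring::new(8)?;
    let mut buf = vec![0u8; 8192];

    let read_e = opcode::Read::new(types::Fd(file.as_raw_fd()), buf.as_mut_ptr(), buf.len() as u32)
        .offset(0)
        .build()
        .user_data(0x42);

    // Safety: the buffer must stay alive until the corresponding completion is reaped.
    unsafe { ring.submission().push(&read_e).expect("submission queue full") };
    ring.submit_and_wait(1)?;

    let cqe = ring.completion().next().expect("completion queue is empty");
    assert!(cqe.result() >= 0, "read failed: {}", cqe.result());
    println!("read {} bytes", cqe.result());
    Ok(())
}
```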
This epic is the sister-epic to
- #4743
### High-Level
- [x] prototype & benchmark tokio-epoll-uring
- [ ] https://github.com/neondatabase/neon/issues/5479
- [x] get tokio-epoll-uring to a level of code quality that's accepted by the team
- [x] (the revert-revert above is a blocker for integrating tokio-epoll-uring)
- [x] integrate tokio-epoll-uring behind a config option, off by default
- [ ] https://github.com/neondatabase/neon/issues/6508
- [ ] https://github.com/neondatabase/neon/issues/6509
- [x] rollout to staging
- [ ] https://github.com/neondatabase/neon/issues/6665
- [ ] https://github.com/neondatabase/cloud/issues/10481
- [ ] change default on Linux to tokio-epoll-uring
### Impl
- [x] NB: incomplete list; there was a lot of tokio-epoll-uring work before I pushed it to GitHub
- [ ] https://github.com/neondatabase/tokio-epoll-uring/pull/21
- [ ] https://github.com/neondatabase/neon/pull/6355
- [ ] https://github.com/neondatabase/neon/pull/5824
- [ ] https://github.com/neondatabase/neon/issues/6373
- [ ] https://github.com/neondatabase/neon/pull/6492
- [ ] https://github.com/neondatabase/neon/pull/6501
- [ ] https://github.com/neondatabase/aws/pull/932
- [ ] https://github.com/neondatabase/neon/issues/6663
### Tasks
- [ ] https://github.com/neondatabase/neon/issues/6368
Development happens in https://github.com/neondatabase/tokio-epoll-uring
- Last week: enabled in CI. No major issues. 1-2 flaky tests being investigated to confirm they're pre-existing issues.
- This week: benchmarking
- This week: enable in staging.
- Deployed to staging: done: https://github.com/neondatabase/aws/pull/932
- Benchmarking: done: https://github.com/neondatabase/neon/issues/6509#issuecomment-1917551342
Will PR the changes from my benchmarking baseline case over the next couple of days.
Moved from https://github.com/neondatabase/neon/issues/2975#issuecomment-1926769387 by @jcsp
This week:
- Async write operations
- Avoid using spawn_blocking(block_on) because of the overhead of churning io_uring runtimes (while keeping it for prod until we cut over io_uring)
- Cut over at least one prod region/pageserver to io_uring
Monitoring staging with new metrics through this week, to get more insight on whether our limits on locked memory are going to be a problem in prod. Maybe prod next week.
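For reference, a small, hedged sketch (using the `libc` crate) of inspecting one relevant knob, the locked-memory rlimit, which io_uring setups may be accounted against depending on kernel version; comparing the soft/hard limits on staging vs. prod hosts is one way to judge whether they will be a problem:

```rust
// Hedged sketch: print the RLIMIT_MEMLOCK soft/hard limits of the current
// process. How much of io_uring's memory counts against this limit depends on
// kernel version and configuration, so treat this only as a starting point.
fn main() {
    let mut lim = libc::rlimit { rlim_cur: 0, rlim_max: 0 };
    let rc = unsafe { libc::getrlimit(libc::RLIMIT_MEMLOCK, &mut lim) };
    assert_eq!(rc, 0, "getrlimit(RLIMIT_MEMLOCK) failed");
    println!("RLIMIT_MEMLOCK: soft={} hard={}", lim.rlim_cur, lim.rlim_max);
}
```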
This week:
1. Merge the change to get rid of spurious metadata reads (unblocks the change to remove spurious writes). Once we're sure we aren't reverting the release, we can proceed to 2.
2. Merge the changes that do the VirtualFile `spawn_blocking`s, thereby removing the spawning of lots of io_urings. Observe this change in staging this week.
3. Enable on one pageserver in ap-southeast-1 (not blocked by 1, 2).
This week:
- Do a mid-week deploy to enable tokio-epoll-uring on the remaining prod fleet.
- PR to change the Linux default to tokio-epoll-uring + conditionally use it in dev if the kernel is new enough
- Do a mid-week deploy to enable tokio-epoll-uring on the remaining prod fleet: couldn't happen last week; will happen mid this week.
- PR to change the Linux default to tokio-epoll-uring + conditionally use it in dev if the kernel is new enough: do it after we've switched prod over (a kernel-version check sketch follows below).
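A hedged sketch of the "use tokio-epoll-uring in dev only if the kernel is new enough" idea; the use of `uname` and the version cutoff are assumptions, not the actual implementation:

```rust
use std::ffi::CStr;

// Sketch: read the running kernel release via uname(2) and compare against a
// conservative cutoff (the exact minimum version here is an assumption).
fn kernel_is_new_enough_for_io_uring() -> bool {
    let mut uts: libc::utsname = unsafe { std::mem::zeroed() };
    if unsafe { libc::uname(&mut uts) } != 0 {
        return false;
    }
    let release = unsafe { CStr::from_ptr(uts.release.as_ptr()) }.to_string_lossy();
    // e.g. "5.10.0-28-cloud-amd64" -> major=5, minor=10
    let mut parts = release.split('.');
    let major: u32 = parts.next().and_then(|s| s.parse().ok()).unwrap_or(0);
    let minor: u32 = parts
        .next()
        .map(|s| s.trim_end_matches(|c: char| !c.is_ascii_digit()))
        .and_then(|s| s.parse().ok())
        .unwrap_or(0);
    (major, minor) >= (5, 10) // assumed cutoff; use whatever the epic settles on
}

fn main() {
    println!("io_uring engine eligible: {}", kernel_is_new_enough_for_io_uring());
}
```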
Write path:
- get the layer file creation patch stack merged (work happened last week)

On-demand downloads:
- get the PR into reviewable shape
- measure / prevent regressions using a new pagebench benchmark to churn on-demand downloads
- maybe experiment with different buffer sizes
This week:
- on-demand download should use tokio-epoll-uring

Implemented & benchmarked last week; shipping this week.
This week:
- observe perf impact & resource usage in prod
- maybe: automate the benchmark