Epic: vectored Timeline::get
RFC: #6250. This epic description is a condensed form of that RFC, plus tracking of the implementation issues.
## Motivation
We should have a vectored (aka batched, aka scatter-gather) alternative API to `Timeline::get`. Having such an API unlocks:
- more efficient basebackup
- batched IO during compaction (useful for strides of unchanged pages)
- page_service: expose a vectored `get_page_at_lsn` for compute (good for seqscan / prefetch)
- on-demand SLRU downloads: even if they land before vectored `Timeline::get`, they will still benefit from this API
## DoD
There is a new variant of `Timeline::get`, called `Timeline::get_vectored`. It takes as arguments an `lsn: Lsn` and a `src: &[KeyVec]`, where `struct KeyVec { base: Key, count: usize }`.
It is up to the implementor to figure out a suitable and efficient way to return the reconstructed page images.
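To make the shape of the API concrete, here is a minimal, self-contained Rust sketch. Only `KeyVec` and the argument list come from the description above; `Key`, `Lsn`, and the placeholder return type are stand-ins, since the DoD deliberately leaves the return representation to the implementor.

```rust
// Sketch only: `Key` and `Lsn` are stand-ins for the pageserver's real types,
// and the return type is a placeholder (the DoD leaves it to the implementor).

#[derive(Clone, Copy)]
pub struct Key(pub u128); // stand-in for the pageserver key type
#[derive(Clone, Copy)]
pub struct Lsn(pub u64); // stand-in for the real Lsn type

/// A contiguous run of keys: `base`, `base + 1`, ..., `base + count - 1`.
pub struct KeyVec {
    pub base: Key,
    pub count: usize,
}

pub struct Timeline;

impl Timeline {
    /// Batched counterpart to `Timeline::get`: reconstruct one page image per
    /// key in each `KeyVec`, all at the same `lsn`.
    pub fn get_vectored(&self, lsn: Lsn, src: &[KeyVec]) -> Vec<(Key, Vec<u8>)> {
        let _ = lsn; // the real implementation reconstructs each page at this LSN
        let mut out = Vec::with_capacity(src.iter().map(|kv| kv.count).sum());
        for kv in src {
            for i in 0..kv.count {
                // A real implementation traverses the layer map once for the
                // whole batch; here we only emit placeholder page images.
                out.push((Key(kv.base.0 + i as u128), Vec::new()));
            }
        }
        out
    }
}
```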
An ideal solution will:
- Visit each `struct Layer` at most once.
- For each visited layer, call `Layer::get_value_reconstruct_data` at most once. This means reading each `DiskBtree` page at most once.
- Facilitate merging of the reads we issue to the OS and eventually NVMe (see the sketch after this list).
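To illustrate the last point, here is a minimal sketch of read coalescing, assuming reads within a layer file are described as `(offset, length)` pairs. The function name and the `max_gap` knob are hypothetical, not the pageserver's actual API.

```rust
/// Merge sorted byte ranges whose gap is at most `max_gap` bytes, so that a
/// batch of small reads can be issued to the OS/NVMe as fewer, larger reads.
fn coalesce_reads(mut reads: Vec<(u64, u64)>, max_gap: u64) -> Vec<(u64, u64)> {
    reads.sort_by_key(|&(off, _)| off);
    let mut merged: Vec<(u64, u64)> = Vec::new();
    for (off, len) in reads {
        match merged.last_mut() {
            // Extend the previous range when this read starts within `max_gap`
            // of its end; over-reading a small gap is cheaper than a second IO.
            Some((m_off, m_len)) if off <= *m_off + *m_len + max_gap => {
                *m_len = (*m_len).max(off + len - *m_off);
            }
            _ => merged.push((off, len)),
        }
    }
    merged
}

fn main() {
    // Three 8 KiB reads, two of them adjacent: coalesce into two IOs.
    let reads = vec![(0, 8192), (8192, 8192), (1 << 20, 8192)];
    assert_eq!(coalesce_reads(reads, 0), vec![(0, 16384), (1 << 20, 8192)]);
}
```

Allowing a small non-zero `max_gap` trades a little wasted read bandwidth for fewer IOs, which is usually a good trade on NVMe.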
For more details, see RFC #6250.
## Implementation Ideas & Hints
See the RFC: #6250
## Issue Tracking
### High-Level Steps
- [x] Introduce vectored `Timeline::get` API
- [x] Vectored Layer Map traversal
- [x] Vectored `Layer::get_value_reconstruct_data` / `DiskBtree`
- [ ] Facilitate merging of the reads we issue to the OS and eventually NVMe.
- [ ] https://github.com/neondatabase/neon/issues/6904
### PRs, for Future Reference
- [ ] https://github.com/neondatabase/neon/pull/6372
- [ ] https://github.com/neondatabase/neon/pull/6469
- [ ] https://github.com/neondatabase/neon/pull/6461
- [ ] https://github.com/neondatabase/neon/pull/6543
- [ ] https://github.com/neondatabase/neon/pull/6544
- [ ] https://github.com/neondatabase/neon/pull/6576
- [ ] https://github.com/neondatabase/neon/pull/6780
### Follow-Ups
- [ ] https://github.com/neondatabase/neon/issues/6435
- [ ] https://github.com/neondatabase/neon/issues/6434
Status:
- Merged range search for layer map
- This week: get it working end to end with a test -> then spin off smaller PRs.
- Remaining: 2 weeks optimistic, perhaps 3.
Status:
- Implementation PR is open - needs debugging of regress test failures, but otherwise reviewable
- This week:
- fix bugs surfaced by regress tests
- update pagebench to use vectored get
- I/O coalescing
Status:
Last week:
- stabilised impl https://github.com/neondatabase/neon/pull/6576
- went through one round of review
This week:
- pagebench
- Disk IO improvements
This week:
- Aim to get both changes merged (vectored get + IO optimization extensions)
- Define deployment plan.
- Enable in staging as soon as merged (end of this week or start of next)
Last week:
- merged https://github.com/neondatabase/neon/pull/6576
- benchmarking
- opened https://github.com/neondatabase/neon/pull/6780
- more validation testing using the `get_page_latest_lsn` bench
This week:
- merge disk IO stuff
- start deployment
Last week:
- released to staging, which caused panics
- identified the issues in delta layer index traversal
This week:
- fix the issues mentioned above and write tests for them
- enable vectored get in staging again
This week:
- Fixed a bug, will re-enable in staging.
- Go to prod once we've had a clean week in staging
No panics since last Monday in staging.
Not targeting prod this week: we want more runs of Peter's tests, plus more CI runs now that the parametrization was fixed.
Consider rolling to prod midweek or next week.
Last week:
- Deployed to IL and monitored: looking good
- basebackup latency stayed stable, which was expected (low SLRU count)
- vectored latency was lower after deploy (promising)
This week:
- Deploy to AP regions & monitor
Just monitoring -- already out in prod in all regions, and it looks OK so far.
Last week:
- Perf gains held up (e.g., p999 in eu-central-1)
- No major latency swings and degradation pattern looks ok
This week:
- Close this? (or do we wait for the follow-up #7381)
Let's declare victory on this: #7381 is its own thing.