
pageserver: separate metadata and data pages in DatadirModification


Problem

Currently, DatadirModification keeps a key-indexed map of all pending writes, even though the only dirty pages we (almost) ever need to read back are metadata pages (e.g. relation sizes).

Related: https://github.com/neondatabase/neon/issues/6345

Summary of changes

  • commit() pending modifications before ingesting database-creation WAL records, so that those records are guaranteed to be able to get() everything they need directly from the underlying Timeline.
  • Split dirty pages in DatadirModification into pending_metadata_pages and pending_data_pages. Data pages never need to be read back by key, so they just go in a Vec instead of a map.
  • Special-case handling of zero-page writes in DatadirModification, putting them in a map that is flushed at the end of each WAL record. This handles the case where, during ingest, we first write a zero page and then ingest a postgres write to the same page. We used to handle this via the key-indexed map of writes, but this PR changes the data page write path to no longer index writes by key (see the sketch after this list).
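
As a rough illustration, the resulting split might look like the following minimal sketch. The field names come from the PR description, but the type aliases and element types are assumptions, not the pageserver's actual definitions:

use std::collections::HashMap;

// Simplified stand-ins so the sketch is self-contained; these aliases
// are assumptions, not the real pageserver types.
type Key = u128;
type Lsn = u64;
type Value = Vec<u8>;

struct DatadirModification {
    // Metadata pages (e.g. relation sizes) stay key-indexed so that
    // get() can read back dirty values during ingest.
    pending_metadata_pages: HashMap<Key, Value>,
    // Data pages are never read back by key during ingest, so a flat
    // Vec suffices; no key-addressable lookup is needed.
    pending_data_pages: Vec<(Key, Lsn, Value)>,
    // Zero-page writes seen within the current WAL record, flushed
    // into pending_data_pages at the end of the record unless a later
    // write to the same key supersedes them first.
    pending_zero_data_pages: HashMap<Key, Lsn>,
}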

My least favorite thing about this PR is that I needed to change the DatadirModification interface to add the on_record_end call. This is not very invasive because there's really only one place we use it, but it changes the object's behaviour from being clearly an aggregation of many records to having some per-record state. I could avoid this by implicitly doing the work when someone calls set_lsn or commit -- I'm open to opinions on whether that's cleaner or dirtier.
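
Continuing the sketch above, the end-of-record flush could look roughly like this; on_record_end is the name the PR uses, but the body and the ZERO_PAGE constant here are assumptions, not the PR's actual code:

// An 8 KiB page of zeroes, defined here only so the sketch compiles;
// the real constant lives elsewhere in the pageserver.
const ZERO_PAGE: [u8; 8192] = [0u8; 8192];

impl DatadirModification {
    // Called once per ingested WAL record (sketch; the real signature
    // and flush details may differ).
    fn on_record_end(&mut self) {
        // Any zero-page write not superseded by a later write within
        // this record becomes an ordinary data-page write.
        for (key, lsn) in self.pending_zero_data_pages.drain() {
            self.pending_data_pages.push((key, lsn, ZERO_PAGE.to_vec()));
        }
    }
}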

Performance

There may be some efficiency improvement here, but the primary motivation is to enable an earlier stage of ingest to operate without access to a Timeline. The pending_data_pages part is the "fast path" bulk write data that can in principle be generated without a Timeline, in parallel with other ingest batches, and ultimately on the safekeeper.

test_bulk_insert on AX102 shows approximately the same results as in the previous PR #8591:

------------------------------ Benchmark results -------------------------------
test_bulk_insert[neon-release-pg16].insert: 23.577 s
test_bulk_insert[neon-release-pg16].pageserver_writes: 5,428 MB
test_bulk_insert[neon-release-pg16].peak_mem: 637 MB
test_bulk_insert[neon-release-pg16].size: 0 MB
test_bulk_insert[neon-release-pg16].data_uploaded: 1,922 MB
test_bulk_insert[neon-release-pg16].num_files_uploaded: 8 
test_bulk_insert[neon-release-pg16].wal_written: 1,382 MB
test_bulk_insert[neon-release-pg16].wal_recovery: 18.264 s
test_bulk_insert[neon-release-pg16].compaction: 0.052 s

Checklist before requesting a review

  • [ ] I have performed a self-review of my code.
  • [ ] If it is a core feature, I have added thorough tests.
  • [ ] Do we need to implement analytics? If so, did you add the relevant metrics to the dashboard?
  • [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • [ ] Do not forget to reformat the commit message so it does not include the above checklist.

jcsp avatar Aug 06 '24 16:08 jcsp

3834 tests run: 3728 passed, 0 failed, 106 skipped (full report)


Flaky tests (3): Postgres 16

Code coverage* (full report)

  • functions: 32.5% (7423 of 22858 functions)
  • lines: 50.6% (60119 of 118790 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
aa0dc4fa2f749a775b1f2bf6e609ef9429cf0b6c at 2024-09-03T15:00:16.337Z ♻️

github-actions[bot] avatar Aug 27 '24 13:08 github-actions[bot]

> I'm not sure the zero_data_pages is worth it. You can just use a Bytes::from_static(ZERO_PAGE).

The purpose isn't to avoid storing the zero page: it's to avoid storing the Key-indexed map for data page writes in general. Previously we had all writes in the pending_updates map indexed by Key, and the only time that was useful for data pages was for figuring out if there was a previous zero page write that we needed to replace with a subsequent write. So in this PR, pending_zero_data_pages acts as a much smaller map that just serves that particular purpose, rather than storing a total Key->value map for all data pages.
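
In terms of the earlier sketch, that superseding logic might behave like this; put_data_page and put_zero_data_page are invented names for illustration, not the PR's actual methods:

impl DatadirModification {
    // Queue a zero-page write in the small per-record map; it stays
    // there until on_record_end() flushes it.
    fn put_zero_data_page(&mut self, key: Key, lsn: Lsn) {
        self.pending_zero_data_pages.insert(key, lsn);
    }

    // A later write to the same key within the record cancels the
    // pending zero page, so only this small map is ever searched by
    // key; the bulk pending_data_pages Vec never is.
    fn put_data_page(&mut self, key: Key, lsn: Lsn, value: Value) {
        self.pending_zero_data_pages.remove(&key);
        self.pending_data_pages.push((key, lsn, value));
    }
}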

jcsp avatar Sep 02 '24 14:09 jcsp

When merging main, I noticed that changes from #8648 were calling is_valid_key_on_write_path during ingest at some points where we can't do that any more (because the keys get converted down to i128 earlier, in non-fallible functions).

Rather than making lots more functions fallible, I'm inclined to leave those calls out (that's what I've done in the merge commit), and rely on the single check in Timeline::put, which all writes will hit eventually. @skyzh does that sound okay to you?

jcsp avatar Sep 02 '24 17:09 jcsp

sounds good :)

skyzh avatar Sep 02 '24 18:09 skyzh

> and rely on the single check in Timeline::put, which all writes will hit eventually

it seems that Timeline::put has #[cfg(test)]; is it only used in test cases? 🤣

skyzh avatar Sep 03 '24 18:09 skyzh

Re-reviewed this change, just to be completely sure we only call DatadirModification::get with metadata keys, except for the one legacy edge case. Can confirm it's true.

I'm wondering how much effort / code churn it would be to require a different Key type argument for DatadirModification::get that, at compile time, ensures the key is metadata.

Like, a wrapper type

struct MetadataKey(Key);

impl From<MetadataKey> for Key {
    fn from(key: MetadataKey) -> Key { key.0 }
}

and change all the constants like DBDIR_KEY and functions like rel_size_to_key to return MetadataKey.
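
For example, reusing the types from the sketches above (hypothetical; the real get() signature and return type differ):

impl DatadirModification {
    // Only keys proven to be metadata at compile time can reach get().
    fn get(&self, key: MetadataKey) -> Option<&Value> {
        let key: Key = key.into();
        self.pending_metadata_pages.get(&key)
    }
}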

problame avatar Sep 07 '24 11:09 problame