page_service: more efficient page_service -> shard lookup
Problem
In #5980 the page service connection handler gained a simple piece of logic for finding the right Timeline: at connection time it picks an arbitrary Timeline, and then when handling each page request it checks whether that timeline is the correct shard, looking up another one if not.
This is pretty slow in the case where we have to go look up the other timeline, because we take the big TenantsManager lock.
Summary of changes
- Add a `shard_timelines` map of ShardIndex to Timeline on the page service connection handler (see the sketch after this list)
- When looking up a Timeline for a particular ShardIndex, consult `shard_timelines` to avoid hitting the TenantsManager unless we really need to.
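To make the shape of the change concrete, here is a minimal sketch of the lookup path. This is a hedged illustration only: `ShardIndex`, `Timeline`, `TenantManager`, and `get_timeline_for_shard` are hypothetical stand-ins, not the pageserver's real API.

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Stand-in types: the real ones live in the pageserver crate.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct ShardIndex(u8);
struct Timeline;
struct TenantManager;

impl TenantManager {
    // Stand-in for the slow path that takes the big tenants-manager lock.
    fn get_timeline_for_shard(&self, _shard: ShardIndex) -> Arc<Timeline> {
        Arc::new(Timeline)
    }
}

struct PageServerHandler {
    // Timelines this connection has already resolved, keyed by shard.
    shard_timelines: HashMap<ShardIndex, Arc<Timeline>>,
}

impl PageServerHandler {
    fn timeline_for_shard(
        &mut self,
        shard: ShardIndex,
        tenant_manager: &TenantManager,
    ) -> Arc<Timeline> {
        // Fast path: a local map hit, no shared lock taken.
        if let Some(tl) = self.shard_timelines.get(&shard) {
            return tl.clone();
        }
        // Slow path: hit the tenants manager once, then cache the result so
        // subsequent requests for this shard skip the big lock.
        let tl = tenant_manager.get_timeline_for_shard(shard);
        self.shard_timelines.insert(shard, tl.clone());
        tl
    }
}
```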
Checklist before requesting a review
- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? If so, did you add the relevant metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.
Checklist before merging
- [ ] Do not forget to reformat commit message to not include the above checklist
2244 tests run: 2160 passed, 0 failed, 84 skipped (full report)
Code coverage (full report)
- functions: 54.6% (10201 of 18669 functions)
- lines: 81.6% (58713 of 71927 lines)
90fd1f9509ae0326d73f822167410e2dbbe8da45 at 2024-01-11T19:53:21.530Z ♻️
Why is it necessary to intermingle the cancellation changes with the more efficient timeline lookups?
The cancellation changes take us from respecting one timeline's cancellation to respecting multiple timelines'. That is necessary to do the more efficient lookups (with a cache) while remaining well behaved during shutdown, if we want to keep the GateGuard for a timeline across multiple requests (which we do; that is where the efficiency comes from).
I didn't plan on adding the cancellation changes until I made the main change, and realized that I had nothing sensible to pass into flush_cancellable any more :-)
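For context, the new requirement looks roughly like the following. This is a hedged sketch, not the real `flush_cancellable` replacement: `run_cancellable` and `Cancelled` are hypothetical names, and it only shows the core idea of a request future bailing out when any cached timeline's CancellationToken fires.

```rust
use futures::future::select_all;
use tokio_util::sync::CancellationToken;

struct Cancelled;

/// Run `fut`, but bail out if ANY of the cached timelines' cancellation
/// tokens fires: while we hold a GateGuard per cached timeline, each of
/// them must be able to interrupt us during shutdown.
async fn run_cancellable<F, T>(
    fut: F,
    timeline_cancels: &[CancellationToken],
) -> Result<T, Cancelled>
where
    F: std::future::Future<Output = T>,
{
    // select_all panics on an empty iterator; a connection serving requests
    // always has at least one cached timeline.
    assert!(!timeline_cancels.is_empty());
    let any_cancelled =
        select_all(timeline_cancels.iter().map(|c| Box::pin(c.cancelled())));
    tokio::select! {
        out = fut => Ok(out),
        _ = any_cancelled => Err(Cancelled),
    }
}
```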
Please update the PR description / commit message to include this; it helps set the context for reviewing.
(If there's no rush, let's wait with merging this until after next week's release, staging soak time over the weekend isn't super effective)
I wonder if the "keep gate open" aspect could have been solved more cleanly by another level of indirection. What we're really keeping open is our PageServerHandler instance's view on the shard configuration.
The sharding aspect is only part of the reason we hold the gate: holding the gate over a get() call is also important so that we can't e.g. detach a tenant while some request handler is in the middle of servicing a read. This matters because reads also implicitly do writes when promoting data into the local disk cache, and we must not allow that to continue happening after a timeline's shutdown() has exited.
So the primary reason for holding the gate is that we don't want to let timelines finish shutting down while some task is still reading/writing their local disk state; the way the GateGuard helps with sharding correctness is a pleasant case of the same mechanism serving two purposes.
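For readers unfamiliar with the pattern, here is a self-contained illustration of such a gate. The pageserver has its own Gate/GateGuard types; this stand-in is built on a tokio Semaphore purely to show the shutdown interaction, and all names here are hypothetical.

```rust
use std::sync::Arc;
use tokio::sync::{OwnedSemaphorePermit, Semaphore};

// Upper bound on concurrent entrants; any comfortably large value works.
const MAX_ENTRANTS: u32 = 1 << 20;

struct Gate {
    permits: Arc<Semaphore>,
}

/// Held across a whole request, including the implicit writes a read can do
/// to local disk state; dropping it releases the gate.
struct GateGuard(OwnedSemaphorePermit);

impl Gate {
    fn new() -> Self {
        Gate {
            permits: Arc::new(Semaphore::new(MAX_ENTRANTS as usize)),
        }
    }

    /// Request side: returns None once shutdown has closed the gate.
    fn enter(&self) -> Option<GateGuard> {
        self.permits.clone().try_acquire_owned().ok().map(GateGuard)
    }

    /// Shutdown side: completes only after every guard has been dropped,
    /// i.e. no task is still reading/writing the timeline's local state.
    /// `forget` keeps the permits taken so no new entrant can slip in.
    async fn close(&self) {
        self.permits
            .acquire_many(MAX_ENTRANTS)
            .await
            .expect("the semaphore is never explicitly closed")
            .forget();
    }
}
```

With this shape, a request handler holds a guard for the duration of its get() call, and a close() in timeline shutdown naturally waits out any in-flight read, including its implicit cache-promotion writes.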