page_service: more efficient page_service -> shard lookup
Problem
In #5980 the page service connection handler gained a simple piece of logic for finding the right Timeline: at connection time it picks an arbitrary Timeline, and then when handling each page request it checks whether that timeline is the correct shard, looking up another one if not.
This is pretty slow in the case where we have to go look up the other timeline, because we take the big TenantsManager lock.
Summary of changes
- Add a `shard_timelines` map of ShardIndex to Timeline on the page service connection handler (see the sketch after this list)
- When looking up a Timeline for a particular ShardIndex, consult `shard_timelines` to avoid hitting the TenantsManager unless we really need to.
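To make the shape of the change concrete, here is a minimal sketch of the lookup path. This is a hedged illustration only: `ShardIndex`, `Timeline`, `TenantManager`, and `get_timeline_for_shard` are hypothetical stand-ins, not the pageserver's real API.

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Stand-in types: the real ones live in the pageserver crate.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct ShardIndex(u8);
struct Timeline;
struct TenantManager;

impl TenantManager {
    // Stand-in for the slow path that takes the big tenants-manager lock.
    fn get_timeline_for_shard(&self, _shard: ShardIndex) -> Arc<Timeline> {
        Arc::new(Timeline)
    }
}

struct PageServerHandler {
    // Timelines this connection has already resolved, keyed by shard.
    shard_timelines: HashMap<ShardIndex, Arc<Timeline>>,
}

impl PageServerHandler {
    fn timeline_for_shard(
        &mut self,
        shard: ShardIndex,
        tenant_manager: &TenantManager,
    ) -> Arc<Timeline> {
        // Fast path: a local map hit, no shared lock taken.
        if let Some(tl) = self.shard_timelines.get(&shard) {
            return tl.clone();
        }
        // Slow path: hit the tenants manager once, then cache the result so
        // subsequent requests for this shard skip the big lock.
        let tl = tenant_manager.get_timeline_for_shard(shard);
        self.shard_timelines.insert(shard, tl.clone());
        tl
    }
}
```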
Checklist before requesting a review
- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? If so, did you add the relevant metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.
Checklist before merging
- [ ] Do not forget to reformat commit message to not include the above checklist
2244 tests run: 2160 passed, 0 failed, 84 skipped (full report)
Code coverage (full report)
- functions: 54.6% (10201 of 18669 functions)
- lines: 81.6% (58713 of 71927 lines)
90fd1f9509ae0326d73f822167410e2dbbe8da45 at 2024-01-11T19:53:21.530Z ♻️
Why is it necessary to intermingle the cancellation changes with the more efficient timeline lookups?
The cancellation changes take us from respecting one timeline's cancellation to respecting multiple timelines'. That is necessary to do the more efficient lookups (with a cache) while remaining well behaved during shutdown, if we want to keep the GateGuard for a timeline across multiple requests (which we do; that is where the efficiency comes from).
I didn't plan on adding the cancellation changes until I made the main change, and realized that I had nothing sensible to pass into flush_cancellable any more :-)
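For context, the new requirement looks roughly like the following. This is a hedged sketch, not the real `flush_cancellable` replacement: `run_cancellable` and `Cancelled` are hypothetical names, and it only shows the core idea of a request future bailing out when any cached timeline's CancellationToken fires.

```rust
use futures::future::select_all;
use tokio_util::sync::CancellationToken;

struct Cancelled;

/// Run `fut`, but bail out if ANY of the cached timelines' cancellation
/// tokens fires: while we hold a GateGuard per cached timeline, each of
/// them must be able to interrupt us during shutdown.
async fn run_cancellable<F, T>(
    fut: F,
    timeline_cancels: &[CancellationToken],
) -> Result<T, Cancelled>
where
    F: std::future::Future<Output = T>,
{
    // select_all panics on an empty iterator; a connection serving requests
    // always has at least one cached timeline.
    assert!(!timeline_cancels.is_empty());
    let any_cancelled =
        select_all(timeline_cancels.iter().map(|c| Box::pin(c.cancelled())));
    tokio::select! {
        out = fut => Ok(out),
        _ = any_cancelled => Err(Cancelled),
    }
}
```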
Please update the PR description / commit message to include this; it helps set the context for reviewing.
(If there's no rush, let's wait with merging this until after next week's release, staging soak time over the weekend isn't super effective)
I wonder if the "keep gate open" aspect could have been solved more cleanly by another level of indirection. What we're really keeping open is our PageServerHandler instance's view on the shard configuration.
The sharding aspect is only part of the reason we hold the gate: holding the gate over a get() call is also important so that we can't e.g. detach a tenant while some request handler is in the middle of servicing a read. This matters because reads also implicitly do writes when promoting data into the local disk cache, and we must not allow that to continue happening after a timeline's shutdown() has exited.
So the primary reason for holding the gate is that we don't want to let timelines finish shutting down while some task is still reading/writing their local disk state; the way the GateGuard helps with sharding correctness is a pleasant case of the same mechanism serving two purposes.
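For readers unfamiliar with the pattern, here is a self-contained illustration of such a gate. The pageserver has its own Gate/GateGuard types; this stand-in is built on a tokio Semaphore purely to show the shutdown interaction, and all names here are hypothetical.

```rust
use std::sync::Arc;
use tokio::sync::{OwnedSemaphorePermit, Semaphore};

// Upper bound on concurrent entrants; any comfortably large value works.
const MAX_ENTRANTS: u32 = 1 << 20;

struct Gate {
    permits: Arc<Semaphore>,
}

/// Held across a whole request, including the implicit writes a read can do
/// to local disk state; dropping it releases the gate.
struct GateGuard(OwnedSemaphorePermit);

impl Gate {
    fn new() -> Self {
        Gate {
            permits: Arc::new(Semaphore::new(MAX_ENTRANTS as usize)),
        }
    }

    /// Request side: returns None once shutdown has closed the gate.
    fn enter(&self) -> Option<GateGuard> {
        self.permits.clone().try_acquire_owned().ok().map(GateGuard)
    }

    /// Shutdown side: completes only after every guard has been dropped,
    /// i.e. no task is still reading/writing the timeline's local state.
    /// `forget` keeps the permits taken so no new entrant can slip in.
    async fn close(&self) {
        self.permits
            .acquire_many(MAX_ENTRANTS)
            .await
            .expect("the semaphore is never explicitly closed")
            .forget();
    }
}
```

With this shape, a request handler holds a guard for the duration of its get() call, and a close() in timeline shutdown naturally waits out any in-flight read, including its implicit cache-promotion writes.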