bypass PageCache for `compact_level0_phase1`
Parent epic:
- https://github.com/neondatabase/neon/issues/7386
Tasks
- [ ] https://github.com/neondatabase/neon/pull/8543
- [ ] https://github.com/neondatabase/aws/pull/1666
- [x] create staging log scraping alert for the warnings added: https://neonprod.grafana.net/alerting/grafana/cdtf6y667hatcc/view
- [x] create prod log scraping alert for the warnings added: https://neonprod.grafana.net/alerting/grafana/edtf7wry2qzggb/view
- [ ] https://github.com/neondatabase/aws/pull/1663
- [ ] https://github.com/neondatabase/aws/pull/1724
- [ ] https://github.com/neondatabase/aws/pull/1743
- [ ] https://github.com/neondatabase/infra/pull/1745
- [ ] https://github.com/neondatabase/neon/pull/8769
- [ ] https://github.com/neondatabase/infra/pull/1827
- [x] pre-prod perf evaluation: https://github.com/neondatabase/neon/issues/8184#issuecomment-2315672624
- [x] rollout: first regions
- [ ] https://github.com/neondatabase/infra/pull/1883
- [x] rollout: wait for deploy & inspect results
- [ ] https://github.com/neondatabase/infra/pull/1905
- [ ] https://github.com/neondatabase/neon/pull/8933
- [ ] https://github.com/neondatabase/neon/pull/8934
- [x] wait for deploy
- [ ] https://github.com/neondatabase/infra/pull/1903
- [ ] https://github.com/neondatabase/neon/pull/8935
- [x] wait for deploy
`compact_level0_phase1` currently uses `ValueRef::load` here, which internally uses `read_blob` with the `FileBlockReader` against the delta layer's `VirtualFile`s. This still goes through the PageCache for the data pages.
(We do use vectored get for create_image_layers, which also happens during compaction. But I missed the compact_level0_phase1.)
Complete PageCache Bypass
We can extend the load_keys step here to also load the lengths of each blob into memory (instead of just the offset)
https://github.com/neondatabase/neon/blob/9b98823d615c991422b6edd3ec3197192f763cf2/pageserver/src/tenant/timeline/compaction.rs#L498-L503
This allows us to go directly to the VirtualFile when we use the ValueRef here:
https://github.com/neondatabase/neon/blob/9b98823d615c991422b6edd3ec3197192f763cf2/pageserver/src/tenant/timeline/compaction.rs#L623
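A minimal sketch of that idea, using plain std file I/O for brevity rather than the pageserver's `VirtualFile`/async machinery; `BlobRef` and `read_blob_direct` are hypothetical names, not existing pageserver APIs:

```rust
// Minimal sketch (not the actual pageserver code): remember (offset, len) per blob
// while scanning keys, then read the value's exact byte range straight from the file.
// Uses std blocking I/O for brevity; the real code path would go through VirtualFile.
use std::fs::File;
use std::os::unix::fs::FileExt;

/// What the extended `load_keys` step would record for each delta entry.
struct BlobRef {
    off: u64,
    len: usize,
}

/// Read one blob directly at its byte range, with no page-granular cache in between.
fn read_blob_direct(file: &File, blob: &BlobRef) -> std::io::Result<Vec<u8>> {
    let mut buf = vec![0u8; blob.len];
    file.read_exact_at(&mut buf, blob.off)?;
    Ok(buf)
}
```

The point is that once `load_keys` records the length alongside the offset, each value becomes a single positional read of exactly the blob's bytes, with no page-level caching layer in between.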
The problem with this: we'd lose the hypothetical benefits of PageCache'ing the data block if multiple ValueRefs are on the same page.
Do we rely on the PageCache for performance in this case?
Yes, production shows a >80% hit rate for compaction, even on very busy pageservers. One example instance:
Quick Fix 1: RequestContext-scoped mini page cache.
In earlier experiments, I used a RequestContext-scoped mini page cache for this.
The problem with this: if more layers need to be compacted than the mini cache has pages, it starts thrashing.
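For illustration, a rough sketch of what such a request-scoped mini cache looks like and why it thrashes; `MiniBlockCache` is a hypothetical type, not anything in the pageserver:

```rust
// Rough sketch of a request-scoped mini block cache: a fixed number of slots, filled
// on demand. It works while the compaction's block working set fits into `capacity`,
// and thrashes once it doesn't.
use std::collections::HashMap;

struct MiniBlockCache {
    capacity: usize,
    blocks: HashMap<u64, Vec<u8>>, // block number -> block contents
}

impl MiniBlockCache {
    fn new(capacity: usize) -> Self {
        Self { capacity, blocks: HashMap::new() }
    }

    /// Return the cached block, or load it with `read_block`, evicting an arbitrary
    /// entry when full (a real implementation would use an LRU policy).
    fn get_or_load(
        &mut self,
        blkno: u64,
        read_block: impl FnOnce(u64) -> std::io::Result<Vec<u8>>,
    ) -> std::io::Result<&Vec<u8>> {
        if !self.blocks.contains_key(&blkno) {
            if self.blocks.len() >= self.capacity {
                let victim = self.blocks.keys().next().copied();
                if let Some(victim) = victim {
                    self.blocks.remove(&victim);
                }
            }
            self.blocks.insert(blkno, read_block(blkno)?);
        }
        Ok(&self.blocks[&blkno])
    }
}
```

Once the set of distinct blocks touched by one compaction exceeds `capacity`, almost every `get_or_load` misses, evicts, and re-reads, which is the thrashing described above.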
Proper Fix
Use streaming compaction with iterators where each iterator caches the current block.
We do have the diskbtree async stream now.
We could wrap that stream to provide a cache for the last-read block.
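A hedged sketch of that wrapping idea: each iterator keeps only the most recently read block, which is enough because a streaming merge reads values in on-disk order, so consecutive values usually share a block. `LastBlockCache` and `read_block` are illustrative names, not the disk-btree stream's real API:

```rust
// Per-iterator cache of the last-read block (illustrative, not pageserver code).
struct LastBlockCache {
    last: Option<(u64, Vec<u8>)>, // (block number, block contents)
}

impl LastBlockCache {
    fn new() -> Self {
        Self { last: None }
    }

    /// Return the block if it is the one read last, otherwise load and remember it.
    fn get_or_load(
        &mut self,
        blkno: u64,
        read_block: impl FnOnce(u64) -> std::io::Result<Vec<u8>>,
    ) -> std::io::Result<&[u8]> {
        let hit = matches!(&self.last, Some((cached, _)) if *cached == blkno);
        if !hit {
            self.last = Some((blkno, read_block(blkno)?));
        }
        Ok(self.last.as_ref().unwrap().1.as_slice())
    }
}
```

Compared to the mini page cache above, memory use is bounded at one block per open iterator, so it cannot thrash regardless of how many layers are being compacted.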
Did some initial scouting work on this:
- the `holes` functionality of compaction (introduced in https://github.com/neondatabase/neon/pull/3597) requires scanning all keys before scanning all values
- tl;dr for what holes does:
  - So the trade-off is that we'd rather create smaller L1s than create sparse delta space on top of image layers.
  - And if we're ingesting a lot of data on both sides of the hole, then this is definitely the right trade-off, because we will have full-sized L1s on either side.
  - But if we have little L0 data, then we create the small L1s.
  - And all of this is necessary because we have the stupid count_deltas as the trigger for image layer creation.
  - More details in a private Slack DM
- we need to preserve the holes functionality until we have a better approach for image layer creation at the top
- testing: I can't find any dedicated Rust unit tests. Would be nice to extract the existing logic enough to get coverage for existing behaviors, but that's a lot of work.
Status update:
- Initial PR merged #8543
- Read the PR description, it contains a detailed rollout strategy.
- by default, neon_local, unit tests, regression tests, and Staging run in validation mode (see the sketch after this list)
- log scraping alerts for staging and prod are configured
- these will detect validation failures and forward to #on-call-storage-{staging,prod}-stream
- after the above PR merged, I ran some manual tests to inspect the overhead of validation mode
- results here: https://github.com/neondatabase/neon/pull/8543#issuecomment-2260710812
- pre-prod and prod require the following PR to be merged before the release next week:
- https://github.com/neondatabase/aws/pull/1663
- If we don't merge it, pre-prod and prod will run in validating mode as well, which we probably don't want yet.
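For context, a loose sketch of what validation mode means conceptually; this is my reading of the rollout mechanism (consistent with the roughly doubled VirtualFile call count seen in the pre-prod analysis below), and `read_direct` / `read_via_page_cache` are placeholders, not pageserver APIs:

```rust
// Validation-mode concept: perform the read via both paths and warn on any mismatch.
fn validated_read(
    blkno: u64,
    read_direct: impl Fn(u64) -> std::io::Result<Vec<u8>>,
    read_via_page_cache: impl Fn(u64) -> std::io::Result<Vec<u8>>,
) -> std::io::Result<Vec<u8>> {
    let new = read_direct(blkno)?;
    let old = read_via_page_cache(blkno)?;
    if new != old {
        // the real code emits a warning that the log-scraping alerts above pick up
        eprintln!("validation failure: direct read differs from PageCache read at block {blkno}");
    }
    Ok(new)
}
```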
Status update:
- No validation failures in Staging
- Spot-checked the flaky tests dashboard & error messages; no indication that validation failures are happening there, either.
- https://github.com/neondatabase/cloud/issues/16404 would be better than spot-checking
Plan / needs decision:
- Validation mode in the pre-prod region? Measure CPU impact, see if it is low enough to enable validation mode in prod. Probabilistic validation?
- https://github.com/neondatabase/aws/pull/1724
Status update: validation mode enabled in pre-prod
Pre-Prod Analysis
First night's prodlike cloudbench run had concurrent activity from another benchmark, smearing results: https://neondb.slack.com/archives/C06K38EB05D/p1723797560693199
However, here's the list of dashboards I looked at:
- Cloudbench metrics
- 4GS
- Node Exporter Full
- CPU
- VirtualFile
- /storage disk IOPS
- Memory
- Fleet overview => memory usage
- Time spent doing the compactions, scraped from logs
Preliminary interpretation (compare time range from 0:00 to 8:00, that's where the load happens)
- no noticeable CPU / disk IOPS impact
- but compaction iterations take about 2x wall clock time
- makes sense, because validation does about twice the number of VirtualFile calls, with no concurrency
- we're not bottlenecked on disk, however
Screenshot from the log scraping query, which I found quite insightful
Can we enable it in prod?
What's the practical impact? Compactions that are 2x slower in wall clock time mean double the wait time on the global semaphore for compactions (assuming that semaphore is the practical throughput bottleneck, which I believe is the case). In other words, we only achieve half the usual compaction throughput.
So, is prod compaction throughput bottlenecked on the global semaphore?
We can use the following query to approximate how busy the semaphore is (percentage of tenants waiting for a permit):
(pageserver_background_loop_semaphore_wait_start_count{instance="pageserver-8.eu-west-1.aws.neon.build",task="compaction"} - pageserver_background_loop_semaphore_wait_finish_count)
/on(instance) pageserver_tenant_states_count{state="Active"}
There are some places where we have sampling skew, so we clamp:
clamp(
(pageserver_background_loop_semaphore_wait_start_count{task="compaction"} - pageserver_background_loop_semaphore_wait_finish_count)
/on(instance) sum by (instance) (pageserver_tenant_states_count)
, 0, 1)
the p99.9 instance in that plot looks like this
quantile(0.999,
clamp(
(pageserver_background_loop_semaphore_wait_start_count{task="compaction"} - pageserver_background_loop_semaphore_wait_finish_count)
/on(instance) sum by (instance) (pageserver_tenant_states_count)
, 0, 1)
)
average like this
avg(
clamp(
(pageserver_background_loop_semaphore_wait_start_count{task="compaction"} - pageserver_background_loop_semaphore_wait_finish_count)
/on(instance) sum by (instance) (pageserver_tenant_states_count)
, 0, 1)
)
For posterity, there was a Slack thread discussing these results / next steps: https://neondb.slack.com/archives/C033RQ5SPDH/p1723810312846849
Decision from today's sync meeting:
- https://github.com/neondatabase/infra/pull/1745
- Create metric to measure semaphore contention.
- https://github.com/neondatabase/neon/pull/8769
- Table the decision for remaining regions until EOW / next week.
- discussion thread: https://neondb.slack.com/archives/C033RQ5SPDH/p1724404251430089
This week, as per discussion thread:
- analyze perf impact in pre-prod (enable new mode, without validation)
- AFTER qualifying this week's release
- https://github.com/neondatabase/infra/pull/1827
- no changes to prod
Results from pre-prod are looking good.
- cloudbench results for that time range
- Less wall clock time for the same workload (see screenshot below) => GOOD
- unchanged memory for that time range
- unchanged CPU for that time range
Plan:
- Roll the non-validating mode into more prod regions this week.
- https://github.com/neondatabase/infra/pull/1883
Results from rollout shared in this Slack thread
tl;dr:
- halved the PS PageCache eviction rate, and stabilized it a lot
- halved the metric "wall clock time spent on compaction / ingested bytes" (see query below)
sum by (neon_region) (rate(pageserver_storage_operations_seconds_global_sum{operation="compact",neon_region=~"$neon_region"}[$__rate_interval]))
/
sum by (neon_region) (rate(pageserver_wal_ingest_bytes_received[$__rate_interval]) / 1e6)