
bypass PageCache for `compact_level0_phase1`

problame opened this issue 1 year ago • 10 comments

Parent epic:

  • https://github.com/neondatabase/neon/issues/7386

Tasks

  • [ ] https://github.com/neondatabase/neon/pull/8543
  • [ ] https://github.com/neondatabase/aws/pull/1666
  • [x] create staging log scraping alert for the warnings added: https://neonprod.grafana.net/alerting/grafana/cdtf6y667hatcc/view
  • [x] create prod log scraping alert for the warnings added: https://neonprod.grafana.net/alerting/grafana/edtf7wry2qzggb/view
  • [ ] https://github.com/neondatabase/aws/pull/1663
  • [ ] https://github.com/neondatabase/aws/pull/1724
  • [ ] https://github.com/neondatabase/aws/pull/1743
  • [ ] https://github.com/neondatabase/infra/pull/1745
  • [ ] https://github.com/neondatabase/neon/pull/8769
  • [ ] https://github.com/neondatabase/infra/pull/1827
  • [x] pre-prod perf evaluation: https://github.com/neondatabase/neon/issues/8184#issuecomment-2315672624
  • [x] rollout: first regions
  • [ ] https://github.com/neondatabase/infra/pull/1883
  • [x] rollout: wait for deploy & inspect results
  • [ ] https://github.com/neondatabase/infra/pull/1905
  • [ ] https://github.com/neondatabase/neon/pull/8933
  • [ ] https://github.com/neondatabase/neon/pull/8934
  • [x] wait for deploy
  • [ ] https://github.com/neondatabase/infra/pull/1903
  • [ ] https://github.com/neondatabase/neon/pull/8935
  • [x] wait for deploy

The compact_level0_phase1 currently uses ValueRef::load here, which internally uses read_blob with the FileBlockReader against the delta layer's VirtualFiles. This still goes through the PageCache for the data pages.

(We do use vectored get for create_image_layers, which also happens during compaction. But I missed the compact_level0_phase1.)

Complete PageCache Bypass

We can extend the load_keys step here to also load the length of each blob into memory (instead of just its offset):

https://github.com/neondatabase/neon/blob/9b98823d615c991422b6edd3ec3197192f763cf2/pageserver/src/tenant/timeline/compaction.rs#L498-L503

This allows us to go directly to the VirtualFile when we use the ValueRef here:

https://github.com/neondatabase/neon/blob/9b98823d615c991422b6edd3ec3197192f763cf2/pageserver/src/tenant/timeline/compaction.rs#L623
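To make the idea concrete, here is a minimal sketch (hypothetical types and method names, not the actual pageserver code): once load_keys has recorded each blob's length next to its offset, the later value load can be a single positioned read against the VirtualFile, with no PageCache lookup.

```rust
// Hypothetical sketch, not the actual pageserver types: the point is that once
// load_keys has recorded (offset, len) for a blob, loading the value is a single
// positioned read against the underlying file, with no PageCache lookup.
use std::os::unix::fs::FileExt;

/// Stand-in for VirtualFile: supports positioned reads on the delta layer file.
struct VirtualFile {
    file: std::fs::File,
}

/// What load_keys would produce per value: the blob's offset *and* its length,
/// instead of the offset only.
struct ValueRef {
    offset: u64,
    len: usize,
}

impl ValueRef {
    /// One pread-style call straight to the file; nothing is cached.
    fn load(&self, vfile: &VirtualFile) -> std::io::Result<Vec<u8>> {
        let mut buf = vec![0u8; self.len];
        vfile.file.read_exact_at(&mut buf, self.offset)?;
        Ok(buf)
    }
}
```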

The problem with this: we'd lose the hypothetical benefits of PageCache'ing the data block if multiple ValueRefs are on the same page.

Do we rely on the PageCache for performance in this case?

Yes, production shows we have a >80% hit rate for compaction, even on very busy pageservers. One instance, for example:

[screenshot: PageCache hit rate on one pageserver]

Quick Fix 1: RequestContext-scoped mini page cache.

In earlier experiments, I used a RequestContext-scoped mini page cache for this.

The problem with this is that if more layers need to be compacted than the mini page cache has pages, it will start thrashing.
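For illustration, the shape of such a per-RequestContext cache might look roughly like this (a sketch with made-up names, not the code from those experiments); the fixed capacity is exactly where the thrashing risk comes from:

```rust
use std::collections::HashMap;

/// Hypothetical RequestContext-scoped block cache with a fixed capacity.
/// Hits avoid the global PageCache; once the working set exceeds
/// `capacity`, we evict on every miss, i.e. we thrash.
struct MiniPageCache {
    capacity: usize,
    blocks: HashMap<(u64 /* file id */, u32 /* block no */), Vec<u8>>,
}

impl MiniPageCache {
    fn new(capacity: usize) -> Self {
        Self { capacity, blocks: HashMap::new() }
    }

    /// Return the cached block, or read it via `read_block` and cache it.
    fn get_or_read<E>(
        &mut self,
        key: (u64, u32),
        read_block: impl FnOnce() -> Result<Vec<u8>, E>,
    ) -> Result<&[u8], E> {
        if !self.blocks.contains_key(&key) {
            if self.blocks.len() >= self.capacity {
                // naive eviction of an arbitrary block; a real implementation
                // would at least use LRU, but the capacity problem remains
                let victim = self.blocks.keys().next().copied();
                if let Some(victim) = victim {
                    self.blocks.remove(&victim);
                }
            }
            self.blocks.insert(key, read_block()?);
        }
        Ok(self.blocks.get(&key).unwrap().as_slice())
    }
}
```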

Proper Fix

Use streaming compaction with iterators where each iterator caches the current block.

We do have the diskbtree async stream now.

We could wrap that stream to provide a cache for the last-read block.
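A rough sketch of what that wrapper could look like (hypothetical trait and names; the real thing would sit on top of the disk btree stream): remember the last block read and only go back to the VirtualFile when the next value lives in a different block.

```rust
/// Hypothetical sketch: a reader that remembers the single most recently read
/// block, so consecutive values that live in the same block trigger only one
/// read against the VirtualFile.
trait BlockReader {
    fn read_block(&mut self, block_no: u32) -> std::io::Result<Vec<u8>>;
}

struct LastBlockCache<R> {
    reader: R,
    last_block: Option<(u32, Vec<u8>)>, // (block number, block contents)
}

impl<R: BlockReader> LastBlockCache<R> {
    fn new(reader: R) -> Self {
        Self { reader, last_block: None }
    }

    /// Return the requested block, re-reading from the file only when it is
    /// not the block we served last time.
    fn block(&mut self, block_no: u32) -> std::io::Result<&[u8]> {
        let hit = matches!(&self.last_block, Some((cached_no, _)) if *cached_no == block_no);
        if !hit {
            let data = self.reader.read_block(block_no)?;
            self.last_block = Some((block_no, data));
        }
        Ok(self.last_block.as_ref().unwrap().1.as_slice())
    }
}
```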

problame · Jun 27 '24

Did some initial scouting work on this:

  • the holes functionality of compaction (introduced in https://github.com/neondatabase/neon/pull/3597) requires scanning of all keys before scanning all values
    • tl;dr for what holes does:
      • So the trade-off is that we'd rather create smaller L1s than create sparse delta space on top of image layers.
      • And if we're ingesting a lot of data on both sides of the hole, then this is definitely the right trade-off because we will have full-sized L1s on either side.
      • But if we have little L0 data, then we create the small L1s.
      • And all of this is necessary because we have the stupid count_deltas as the trigger for image layer creation.
    • More details in private Slack DM
  • we need to preserve the holes functionality until we have a better approach for image layer creation at the top
  • testing: I can't find any dedicated Rust unit tests. Would be nice to extract the existing logic enough to get coverage for existing behaviors, but that's a lot of work.

problame · Jul 22 '24

Status update:

  • Initial PR merged #8543
    • Read the PR description, it contains a detailed rollout strategy.
  • by default, neon_local, unit tests, regression tests, and Staging are running in validation mode (rough sketch of the idea after this list)
    • log scraping alerts for staging and prod are configured
    • these will detect validation failures and forward to #on-call-storage-{staging,prod}-stream
  • post-merge of above PR, I ran some manual tests to inspect overhead of validation mode
    • results here: https://github.com/neondatabase/neon/pull/8543#issuecomment-2260710812
  • pre-prod and prod require the following PR to be merged before the release next week:
    • https://github.com/neondatabase/aws/pull/1663
    • If we don't merge it, pre-prod and prod will run in validating mode as well, which we probably don't want yet.
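For context, a rough sketch of what validation mode does conceptually (invented names, not the actual code from #8543): read each value through both the new direct-VirtualFile path and the old PageCache path, and emit the warning that the log-scraping alerts above look for whenever the two disagree.

```rust
/// Rough sketch with invented names, not the actual code from #8543.
enum ValueAccessMode {
    /// old behavior: read_blob through the PageCache
    PageCache,
    /// new behavior: read straight from the VirtualFile; with `validate`
    /// set, also run the old path and compare the results
    DirectVirtualFile { validate: bool },
}

fn load_value(
    mode: &ValueAccessMode,
    read_via_page_cache: impl Fn() -> std::io::Result<Vec<u8>>,
    read_via_virtual_file: impl Fn() -> std::io::Result<Vec<u8>>,
) -> std::io::Result<Vec<u8>> {
    match mode {
        ValueAccessMode::PageCache => read_via_page_cache(),
        ValueAccessMode::DirectVirtualFile { validate } => {
            let new = read_via_virtual_file()?;
            if *validate {
                // the second, serial read is why validation roughly doubles the
                // VirtualFile calls and the compaction wall clock time
                let old = read_via_page_cache()?;
                if old != new {
                    // stand-in for the warning the staging/prod log-scraping
                    // alerts fire on
                    eprintln!("compact_level0_phase1: value access validation failed");
                }
            }
            Ok(new)
        }
    }
}
```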

problame · Jul 31 '24

Status update:

  • No validation failures in Staging
  • Spot-checked the flaky tests dashboard & error messages; no indication that validation failures are happening there, either.
    • https://github.com/neondatabase/cloud/issues/16404 would be better than spot-checking

Plan / needs decision:

  • Validation mode in the pre-prod region? Measure CPU impact, see if it's low enough that we can enable validation mode in prod. Probabilistic validation?
    • https://github.com/neondatabase/aws/pull/1724

problame · Aug 12 '24

Status update: validation mode enabled in pre-prod

Pre-Prod Analysis

First night's prodlike cloudbench run had concurrent activity from another benchmark, smearing results: https://neondb.slack.com/archives/C06K38EB05D/p1723797560693199

However, here's the list of dashboards I looked at:

Preliminary interpretation (compare the time range from 0:00 to 8:00; that's where the load happens):

  • no noticeable CPU / disk IOPS impact
  • but compaction iterations take about 2x wall clock time
    • makes sense because validation does about twice the number of VirtualFile calls, with no concurrency
    • we're not bottlenecked on disk, however

[screenshot from the log scraping query, which I found quite insightful]

Can we enable it in prod?

What's the practical impact? Compactions taking 2x the wall clock time means double the wait time on the global semaphore for compactions (assuming that semaphore is the practical throughput bottleneck, which I believe is the case). In other words, we only achieve half the usual compaction throughput.

So, is prod compaction throughput bottlenecked on the global semaphore?

We can use the following query to approximate busyness of the semaphore (percentage of tenants waiting for a permit):

(pageserver_background_loop_semaphore_wait_start_count{instance="pageserver-8.eu-west-1.aws.neon.build",task="compaction"} - pageserver_background_loop_semaphore_wait_finish_count)
/on(instance) pageserver_tenant_states_count{state="Active"}

There are some places where we have sampling skew, so the ratio can fall outside [0, 1]; clamp it:

clamp(
(pageserver_background_loop_semaphore_wait_start_count{task="compaction"} - pageserver_background_loop_semaphore_wait_finish_count)
/on(instance) sum by (instance) (pageserver_tenant_states_count)
, 0, 1)

[screenshot: per-instance semaphore busyness]

The p99.9 instance in that plot looks like this:

quantile(0.999,
clamp(
(pageserver_background_loop_semaphore_wait_start_count{task="compaction"} - pageserver_background_loop_semaphore_wait_finish_count)
/on(instance) sum by (instance) (pageserver_tenant_states_count)
, 0, 1)
)

[screenshot: p99.9 semaphore busyness]

The average looks like this:

avg(
clamp(
(pageserver_background_loop_semaphore_wait_start_count{task="compaction"} - pageserver_background_loop_semaphore_wait_finish_count)
/on(instance) sum by (instance) (pageserver_tenant_states_count)
, 0, 1)
)

[screenshot: average semaphore busyness]

problame · Aug 16 '24

For posterity, there was a Slack thread discussing these results / next steps: https://neondb.slack.com/archives/C033RQ5SPDH/p1723810312846849

problame · Aug 16 '24

Decision from today's sync meeting:

  1. https://github.com/neondatabase/infra/pull/1745
  2. Create metric to measure semaphore contention.
     • https://github.com/neondatabase/neon/pull/8769
  3. Table the decision for remaining regions until EOW / next week.
     • discussion thread: https://neondb.slack.com/archives/C033RQ5SPDH/p1724404251430089

problame · Aug 19 '24

This week, as per discussion thread:

  • analyze perf impact in pre-prod (enable new mode, without validation)
    • AFTER qualifying this week's release
    • https://github.com/neondatabase/infra/pull/1827
  • no changes to prod

problame · Aug 26 '24

Results from pre-prod are looking good.

[screenshot: pre-prod results]

problame · Aug 28 '24

Plan:

  • Roll the non-validating mode into more prod regions this week.
    • https://github.com/neondatabase/infra/pull/1883

problame · Sep 02 '24

Results from the rollout were shared in this Slack thread.

tl;dr:

  • halved the PS PageCache eviction rate, and stabilized it a lot
  • halved the metric "wall clock time spent on compaction / ingested bytes" (see query below)
sum by (neon_region) (rate(pageserver_storage_operations_seconds_global_sum{operation="compact",neon_region=~"$neon_region"}[$__rate_interval]))
/
(sum by (neon_region) (rate(pageserver_wal_ingest_bytes_received[$__rate_interval])) / 1e6)

problame · Sep 05 '24