Christian Schwarz

Results 385 comments of Christian Schwarz

Status update: validation mode enabled in pre-prod

# Pre-Prod Analysis

[First night's prodlike cloudbench run](https://github.com/neondatabase/cloud/actions/runs/10408266450) had concurrent activity from another benchmark, smearing results: https://neondb.slack.com/archives/C06K38EB05D/p1723797560693199

However, here's the list of dashboards...

For posterity, there was a Slack thread discussing these results / next steps: https://neondb.slack.com/archives/C033RQ5SPDH/p1723810312846849

Decision from today's sync meeting:

1. https://github.com/neondatabase/infra/pull/1745
2. Create metric to measure semaphore contention.
   - https://github.com/neondatabase/neon/pull/8769
3. Table decision for remaining regions until EOW / next week.
   - discussion thread:...

This week, as per discussion thread:

* analyze perf impact in pre-prod (enable new mode, without validation)
* AFTER qualifying this week's release
* https://github.com/neondatabase/infra/pull/1827
* no changes to prod

Results from pre-prod are looking good.

* [cloudbench results for that time range](https://neonprod.grafana.net/d/fdbl98ifhoc1se/cloudbench-productionlike-staging?orgId=1&from=1724718932000&to=1724858338000&var-datasource=HUNg6jvVk&var-region=eu-west-1&var-percentiles=0.99)
* **[Less wall clock time](https://neonprod.grafana.net/explore?schemaVersion=1&panes=%7B%22st4%22:%7B%22datasource%22:%22xHHYY0dVz%22,%22queries%22:%5B%7B%22refId%22:%22C%22,%22expr%22:%22sum%20by%20%28hostname%29%20%28%5Cnsum_over_time%28%7Bneon_region%3D%5C%22eu-west-1%5C%22,unit%3D%5C%22pageserver.service%5C%22%7D%7C~%20%60compact_level0_phase1.%2Astats_json%60%20%7C%20regexp%20%60.%2Astats_json%3D%28%3FP%3Cstats_json%3E.%2A%29%60%20%7C%20line_format%20%60%7B%7B.stats_json%7D%7D%60%20%7C%20regexp%20%60write_layer_files_micros%5C%5C%5C%5C%5C%22:%28%3FP%3Cwrite_layer_files_micros%3E%5C%5Cd%2B%29%60%20%7C%20unwrap%20write_layer_files_micros%20%5B14h%5D%29%5Cn%29%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22xHHYY0dVz%22%7D,%22editorMode%22:%22code%22,%22hide%22:false,%22step%22:%221h%22%7D%5D,%22range%22:%7B%22from%22:%221724718932162%22,%22to%22:%221724858338297%22%7D%7D%7D&orgId=1)** for the same workload (see screenshot below) => GOOD
* unchanged [memory...

Plan:

* Roll the non-validating mode into more prod regions this week.
* https://github.com/neondatabase/infra/pull/1883

Results from rollout shared in [this Slack thread](https://neondb.slack.com/archives/C033RQ5SPDH/p1725530975830579)

tl;dr:

* halved the PS PageCache eviction rate, and stabilized it a lot
* halved the metric "wall clock time spent on...

> Could we rework this to avoid building the monolithic ancestor refs map?

Yes, but at the cost of having to first build a global list of tenant shards, then go...

This week: plumb through RequestContext on read path.

Also, Vlad informed me that the switch to vectored get for all Timeline::get means that we'll stop using the PageCache for user...

This week:

* conclude VirtualFile RequestContext'ification
* slab-allocated RequestContext
* include Tenant/Shard/Timeline in RequestContext
* use that for VirtualFile metrics
* review whether/where PS-PageCache-ing of data blocks is still relevant...