neon
pageserver: slow basebackup when there are too many aux files
https://neondb.slack.com/archives/C03438W3FLZ/p1728330685012419
The devprod team has a testing project that cannot start because basebackup is too slow. Pageserver takes ~5min to scan all aux files, and no data is sent over the wire protocol during that time, so the compute times out.
For the project in staging, we decided to extend the start timeout so that it can at least start.
this week: continue investigation
Running the previous aux file pagebench (cargo run --bin pagebench aux-files) shows no perf regression with 10k files.
So the slow basebackup for this tenant is probably caused by overwriting existing aux key-value pairs or by an inefficient layer structure, rather than by vectored read path perf issues. Need to look into the layer map and reproduce it.
For this tenant, no aux image file has been generated since LSN 00000210492E07B8 (generation 000001fa).
Forcing a manual compaction resolves this; I can run it in staging while I keep investigating locally. It seems that aux compaction is not triggered correctly in this specific case.
Basebackup is unstuck for that project (temporarily). Compute still cannot start due to wal_level (the compute team is taking over that investigation). Still need to investigate why image layers are not generated.
this week: investigate if we can improve read path by not tracking keys on sparse keyspace
Another interesting observation is that the aux file layers are usually very small after L0->L1 compaction:
-rw-r--r-- 1 skyzh staff 224K Nov 4 15:46 030000000000000000000000000000000001-62000002011CDE14934B1DC19112DDCD798B__00000216783D4AC9-0000021678403A51-v1-000001fc
-rw-r--r-- 1 skyzh staff 224K Nov 4 15:46 030000000000000000000000000000000001-62000002011CDE14934B1DC19112DDCD798B__0000021678403A51-00000216784329F1-v1-000001fc
-rw-r--r-- 1 skyzh staff 224K Nov 4 15:46 030000000000000000000000000000000001-62000002011CDE14934B1DC19112DDCD798B__00000216784329F1-0000021678461551-v1-000001fc
-rw-r--r-- 1 skyzh staff 224K Nov 4 15:46 030000000000000000000000000000000001-62000002011CDE14934B1DC19112DDCD798B__0000021678461551-00000216784904F1-v1-000001fc
-rw-r--r-- 1 skyzh staff 208K Nov 4 15:46 030000000000000000000000000000000001-62000002011CDE14934B1DC19112DDCD798B__00000216784B9FC9-00000216784E52A1-v1-00000200
-rw-r--r-- 1 skyzh staff 224K Nov 4 15:27 030000000000000000000000000000000001-62000002011CDE14934B1DC19112DDCD798B__000002207A723049-000002207A751441-v1-00000209
-rw-r--r-- 1 skyzh staff 24K Nov 4 15:27 030000000000000000000000000000000002-630000000000000000000000000000010000__00000210492E07B8-v1-000001fa
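To make the listing easier to reason about, here is a minimal sketch that parses layer file names of the shape shown above and counts how many delta layers sit above the latest image layer. It assumes the naming pattern visible in the listing (key range, then `__`, then either an LSN range for a delta or a single LSN for an image, then `-v1-` and a hex generation). The types and function names (`LayerName`, `parse_layer_name`, `deltas_above_latest_image`) are hypothetical and are not pageserver code. Key ranges are ignored for simplicity.

```rust
/// Kind of layer, inferred from the LSN part of the file name.
#[derive(Debug, PartialEq)]
enum LayerKind {
    /// The name carries two LSNs: start and end of the delta.
    Delta { lsn_start: u64, lsn_end: u64 },
    /// The name carries a single LSN: the image's LSN.
    Image { lsn: u64 },
}

/// Hypothetical parsed form of a layer file name (not a pageserver type).
#[derive(Debug, PartialEq)]
struct LayerName {
    key_start: String,
    key_end: String,
    kind: LayerKind,
    generation: u32,
}

/// Parse `<key_start>-<key_end>__<lsn>[-<lsn_end>]-v1-<generation>`.
/// Returns None if the name does not match that shape.
fn parse_layer_name(name: &str) -> Option<LayerName> {
    let (keys, rest) = name.split_once("__")?;
    let (key_start, key_end) = keys.split_once('-')?;
    // Split off the "-v1-<generation>" suffix.
    let (lsns, generation) = rest.split_once("-v1-")?;
    let generation = u32::from_str_radix(generation, 16).ok()?;
    let kind = match lsns.split_once('-') {
        Some((start, end)) => LayerKind::Delta {
            lsn_start: u64::from_str_radix(start, 16).ok()?,
            lsn_end: u64::from_str_radix(end, 16).ok()?,
        },
        None => LayerKind::Image {
            lsn: u64::from_str_radix(lsns, 16).ok()?,
        },
    };
    Some(LayerName {
        key_start: key_start.to_string(),
        key_end: key_end.to_string(),
        kind,
        generation,
    })
}

/// Returns the highest image LSN (if any) and the number of delta layers
/// whose end LSN is above it. Unparseable names are skipped.
/// Simplification: key ranges are not checked for overlap.
fn deltas_above_latest_image(names: &[&str]) -> (Option<u64>, usize) {
    let layers: Vec<LayerName> = names.iter().filter_map(|n| parse_layer_name(n)).collect();
    let latest_image = layers
        .iter()
        .filter_map(|l| match l.kind {
            LayerKind::Image { lsn } => Some(lsn),
            LayerKind::Delta { .. } => None,
        })
        .max();
    let floor = latest_image.unwrap_or(0);
    let count = layers
        .iter()
        .filter(|l| matches!(l.kind, LayerKind::Delta { lsn_end, .. } if lsn_end > floor))
        .count();
    (latest_image, count)
}
```

Feeding the file names from the listing into `deltas_above_latest_image` gives the image LSN 00000210492E07B8 and a count of the small delta layers stacked above it. That count is roughly what a read of the aux keyspace has to walk through before reaching an image.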
We have two fixes: make the read path faster, and make compaction more aggressive. The latter shouldn't affect the amplification, because the aux files are really small once the updates are turned into deltas.
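As an illustration of the "more aggressive compaction" option, a trigger could be as simple as a cap on the number of delta layers above the latest aux image. This is only a sketch under that assumption: `AuxImagePolicy`, `max_deltas_above_image`, and `decide` are hypothetical names, and this is not how the pageserver currently decides to create image layers.

```rust
/// Hypothetical policy knob (not a real pageserver config option).
#[derive(Debug, Clone, Copy)]
struct AuxImagePolicy {
    /// Create a new aux image once this many delta layers sit above the latest image.
    max_deltas_above_image: usize,
}

#[derive(Debug, PartialEq)]
enum Decision {
    /// Few enough deltas: reads stay cheap, do nothing.
    Skip,
    /// Too many deltas to walk on read: rewrite the aux keyspace as an image.
    CreateImage,
}

/// Decide whether to generate an aux image layer, based only on how many
/// delta layers have accumulated above the latest image.
fn decide(policy: AuxImagePolicy, deltas_above_image: usize) -> Decision {
    if deltas_above_image >= policy.max_deltas_above_image {
        Decision::CreateImage
    } else {
        Decision::Skip
    }
}
```

Because the aux layers in the listing are only a few hundred KB each, a low threshold here would rewrite little data per image, which matches the reasoning above that amplification should stay small.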