pageserver: understand resident size spike after deployment + fix it (likely image compaction)
https://neondb.slack.com/archives/C03H1K0PGKH/p1744773290457459
We get resident size spikes right after releases, likely due to image compaction pulling in a lot of layers. If resident size reaches the disk eviction threshold, it might affect tenant availability.
Perhaps something we can observe via the existing layer visibility mechanism: we should mark older layers as not-visible when they are covered by image layers (see the sketch after this list), but:
- maybe something is broken there
- maybe computes are reading from historic LSNs before image layers, marking layers visible again?
- maybe something other than compaction is touching layers somehow?
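For reference, a minimal sketch of the coverage rule we expect the visibility mechanism to enforce. The types and the "fully contained key range" check are simplifying assumptions for illustration, not the actual pageserver layer map:

```rust
/// Hypothetical visibility states; the sketch only distinguishes "needed for
/// reads at recent LSNs" from "hidden behind a newer image layer".
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Visibility {
    Visible,
    Covered,
}

/// Simplified layer descriptor: a key range plus a single LSN
/// (for delta layers, think of it as the end of their LSN range).
#[derive(Debug)]
struct Layer {
    key_start: u64,
    key_end: u64, // exclusive
    lsn: u64,
    is_image: bool,
    visibility: Visibility,
}

/// Mark a layer as Covered if some image layer at a higher LSN fully contains
/// its key range. The real layer map has to handle partial overlaps and
/// per-key coverage; this only shows the rule we expect to hold.
fn update_visibility(layers: &mut [Layer]) {
    let images: Vec<(u64, u64, u64)> = layers
        .iter()
        .filter(|l| l.is_image)
        .map(|l| (l.key_start, l.key_end, l.lsn))
        .collect();

    for layer in layers.iter_mut() {
        let covered = images
            .iter()
            .any(|&(ks, ke, lsn)| lsn > layer.lsn && ks <= layer.key_start && layer.key_end <= ke);
        layer.visibility = if covered { Visibility::Covered } else { Visibility::Visible };
    }
}
```

Note that, per the second bullet above, a compute read at a historic LSN below the image layer would have to flip a Covered layer back to Visible, and that transition is exactly what we can't see well today.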
We lack a good automated test that demonstrates this "storage bloat across restarts" behavior.
We might benefit from a little more observability around layer visibility + compaction: e.g., what is the visible size before/after a compaction? For the layer files we read during compaction, what was their visibility before and after we touched them?
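As a strawman for that observability, something like the following per-compaction report would answer both questions. The struct, field names, and log format are made up for illustration, not existing pageserver metrics:

```rust
/// Made-up per-compaction report covering both questions above.
#[derive(Debug, Default)]
struct CompactionVisibilityReport {
    /// Sum of file sizes of layers marked visible before / after the pass.
    visible_bytes_before: u64,
    visible_bytes_after: u64,
    /// For each layer file the pass read: (file name, visibility before, visibility after).
    touched_layers: Vec<(String, &'static str, &'static str)>,
}

impl CompactionVisibilityReport {
    /// One log line per compaction pass, cheap enough to keep on all the time.
    fn log(&self) {
        println!(
            "compaction visibility: before={}B after={}B delta={}B touched_layers={}",
            self.visible_bytes_before,
            self.visible_bytes_after,
            self.visible_bytes_after as i64 - self.visible_bytes_before as i64,
            self.touched_layers.len(),
        );
    }
}
```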
Staging metrics reveal that initial logical size calculation is the reason; deploying the metrics to prod this week.
The spikes in some regions happen right before the deployments, so we don't have metrics for them yet this week :(
So every deployment will cause a surge of on-demand downloads for every task type. Two options:
- Find a way to pace those tasks so that they don't create pressure at startup, e.g. by caching the previous state on disk.
- Set a different eviction period for task kinds other than pageserver reads, so that their layers don't take up too much space (see the sketch after this list).
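A minimal sketch of the second option, with hypothetical task kinds and made-up periods, just to show the shape of a per-task-kind eviction policy:

```rust
use std::time::{Duration, Instant};

/// Hypothetical task kinds; only the distinction between compute-driven reads
/// and background work matters for this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TaskKind {
    PageRequest,        // reads on behalf of a compute
    InitialLogicalSize, // startup logical size calculation
    Compaction,
    Other,
}

/// Made-up periods: layers last touched only by background tasks become
/// evictable much sooner than layers serving compute reads.
fn eviction_period(kind: TaskKind) -> Duration {
    match kind {
        TaskKind::PageRequest => Duration::from_secs(24 * 3600),
        TaskKind::InitialLogicalSize | TaskKind::Compaction | TaskKind::Other => {
            Duration::from_secs(3600)
        }
    }
}

fn is_evictable(last_access: Instant, last_access_kind: TaskKind, now: Instant) -> bool {
    now.duration_since(last_access) >= eviction_period(last_access_kind)
}
```

The point of this shape is that layers pulled in only by initial logical size calculation or compaction would stop counting against resident size much sooner than layers that computes actually read.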
If we look closely at the storage used during restarts, it seems that we clean up a lot of data when draining the nodes, and later on we need to re-download it.
Let's look at a specific pageserver whose restart happens at 8:58.
It's already deleting things when the rollout starts ~30 min before its restart. Several minutes after its restart, it deletes data worth about 10% of the disk space, and then storage usage starts to grow.
Looking at the reasons that trigger on-demand downloads, there are already a lot of downloads caused by initial logical size calculation before the restart; after the restart, the main contributor to the space growth is compaction.
TL;DR: the root cause of the space usage surge is compaction, but it's also worth investigating why we delete data during rollouts that we will potentially need again later.
Looked at all surges that happen without deployments: they're all caused by user workloads. So we can constrain this issue to investigating the space surge during deployments/restarts.
Looks like a problem with the secondary downloader:
2025-06-03T11:56:36.131173Z INFO secondary_download{tenant_id=X shard_id=0000}: Removing secondary local layer 000000000000000000000000000000000000-000000067F000040050000407B0300000000__00000057DD68D0F8 because it's absent in heatmap
Right after the restart, while the tenant is still attached as a secondary, we remove a lot of layers that are still used by logical size calculation / compaction / etc.; once the tenant gets attached back, we have to download a lot of them again.
I think we can have a quick fix that avoids deleting anything when the tenant has been attached as a secondary for fewer than 5 minutes? (Rough sketch below.)
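Roughly what such a guard could look like. The 5-minute grace period comes from the suggestion above; the names and the struct are hypothetical, not the actual secondary downloader code:

```rust
use std::time::{Duration, Instant};

/// Hypothetical grace period: don't trust the heatmap for cleanup until the
/// tenant has been a secondary for at least this long after a restart.
const SECONDARY_DELETE_GRACE: Duration = Duration::from_secs(5 * 60);

struct SecondaryTenantState {
    attached_as_secondary_at: Instant,
}

impl SecondaryTenantState {
    /// The downloader would consult this before removing local layers that are
    /// absent from the heatmap, so freshly restarted nodes keep layers that
    /// logical size calculation / compaction will want right after re-attach.
    fn may_delete_absent_layers(&self, now: Instant) -> bool {
        now.duration_since(self.attached_as_secondary_at) >= SECONDARY_DELETE_GRACE
    }
}
```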
Vlad got a fix for this issue: https://github.com/neondatabase/neon/pull/12206
We should see whether the dip issue gets resolved with this week's deployment, and decide on the next steps.
We still see some dips during the deployment (and a spike after the deployment), so there may be more issues under the hood.
This issue was moved to Jira: LKB-1680