pageserver: understand resident size spike after deployment + fix it (likely image compaction)
https://neondb.slack.com/archives/C03H1K0PGKH/p1744773290457459
We get resident size spikes right after releases, likely due to image compaction pulling in a lot of layers. If resident size reaches the disk eviction threshold, it might affect tenant availability.
Perhaps something we can observe via the existing layer visibility mechanism: we should mark older layers as not-visible when they are covered by image layers (see the sketch after this list), but:
- maybe something is broken there
- maybe computes are reading from historic LSNs before image layers, marking layers visible again?
- maybe something other than compaction is touching layers somehow?
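For reference, a minimal sketch of the coverage rule we expect the visibility mechanism to enforce. The types and the "fully contained key range" check are simplifying assumptions for illustration, not the actual pageserver layer map:

```rust
/// Hypothetical visibility states; the sketch only distinguishes "needed for
/// reads at recent LSNs" from "hidden behind a newer image layer".
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Visibility {
    Visible,
    Covered,
}

/// Simplified layer descriptor: a key range plus a single LSN
/// (for delta layers, think of it as the end of their LSN range).
#[derive(Debug)]
struct Layer {
    key_start: u64,
    key_end: u64, // exclusive
    lsn: u64,
    is_image: bool,
    visibility: Visibility,
}

/// Mark a layer as Covered if some image layer at a higher LSN fully contains
/// its key range. The real layer map has to handle partial overlaps and
/// per-key coverage; this only shows the rule we expect to hold.
fn update_visibility(layers: &mut [Layer]) {
    let images: Vec<(u64, u64, u64)> = layers
        .iter()
        .filter(|l| l.is_image)
        .map(|l| (l.key_start, l.key_end, l.lsn))
        .collect();

    for layer in layers.iter_mut() {
        let covered = images
            .iter()
            .any(|&(ks, ke, lsn)| lsn > layer.lsn && ks <= layer.key_start && layer.key_end <= ke);
        layer.visibility = if covered { Visibility::Covered } else { Visibility::Visible };
    }
}
```

Note that, per the second bullet above, a compute read at a historic LSN below the image layer would have to flip a Covered layer back to Visible, and that transition is exactly what we can't see well today.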
We lack a good automated test that demonstrates this "storage bloat across restarts" behavior.
We might benefit from a little more observability around layer visibility + compaction: e.g., what is the visible size before/after a compaction? For the layer files we read during compaction, what was their visibility before and after we touched them?
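As a strawman for that observability, something like the following per-compaction report would answer both questions. The struct, field names, and log format are made up for illustration, not existing pageserver metrics:

```rust
/// Made-up per-compaction report covering both questions above.
#[derive(Debug, Default)]
struct CompactionVisibilityReport {
    /// Sum of file sizes of layers marked visible before / after the pass.
    visible_bytes_before: u64,
    visible_bytes_after: u64,
    /// For each layer file the pass read: (file name, visibility before, visibility after).
    touched_layers: Vec<(String, &'static str, &'static str)>,
}

impl CompactionVisibilityReport {
    /// One log line per compaction pass, cheap enough to keep on all the time.
    fn log(&self) {
        println!(
            "compaction visibility: before={}B after={}B delta={}B touched_layers={}",
            self.visible_bytes_before,
            self.visible_bytes_after,
            self.visible_bytes_after as i64 - self.visible_bytes_before as i64,
            self.touched_layers.len(),
        );
    }
}
```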
Staging metrics reveal that initial logical size calculation is the reason; deploying the metrics to prod this week.
The spikes in some regions happen right before the deployments, so we don't have metrics for them yet this week :(
So every deployment will cause a surge of on-demand downloads for every task type. Two options:
- Find a way to pace those tasks so that they don't create pressure at startup, e.g. by caching the previous state on disk.
- Set a different eviction period for task kinds other than pageserver reads, so that their layers don't take up too much space (see the sketch after this list).
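A minimal sketch of the second option, with hypothetical task kinds and made-up periods, just to show the shape of a per-task-kind eviction policy:

```rust
use std::time::{Duration, Instant};

/// Hypothetical task kinds; only the distinction between compute-driven reads
/// and background work matters for this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TaskKind {
    PageRequest,        // reads on behalf of a compute
    InitialLogicalSize, // startup logical size calculation
    Compaction,
    Other,
}

/// Made-up periods: layers last touched only by background tasks become
/// evictable much sooner than layers serving compute reads.
fn eviction_period(kind: TaskKind) -> Duration {
    match kind {
        TaskKind::PageRequest => Duration::from_secs(24 * 3600),
        TaskKind::InitialLogicalSize | TaskKind::Compaction | TaskKind::Other => {
            Duration::from_secs(3600)
        }
    }
}

fn is_evictable(last_access: Instant, last_access_kind: TaskKind, now: Instant) -> bool {
    now.duration_since(last_access) >= eviction_period(last_access_kind)
}
```

The point of this shape is that layers pulled in only by initial logical size calculation or compaction would stop counting against resident size much sooner than layers that computes actually read.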
If we look closely at the storage used during restarts, it seems that we clean up a lot of data when draining the nodes, and later on we need to re-download it.
Let's look at a specific pageserver whose restart happens at 8:58.
It's already deleting things when the rollout starts ~30 min before its restart. Several minutes after its restart, it deletes data worth about 10% of the disk space, and then storage usage starts to grow.
Looking at the reasons that trigger on-demand downloads, there are already a lot of downloads caused by initial logical size calculation before the restart; after the restart, the main contributor to the space growth is compaction.
TL;DR: the root cause of the space usage surge is compaction, but it's also worth investigating why we delete data during rollouts that we will potentially need again later.
Looked at all surges that happen without deployments: they're all caused by user workloads. So we can constrain this issue to investigating the space surge during deployments/restarts.
Looks like a problem with the secondary downloader:
2025-06-03T11:56:36.131173Z INFO secondary_download{tenant_id=X shard_id=0000}: Removing secondary local layer 000000000000000000000000000000000000-000000067F000040050000407B0300000000__00000057DD68D0F8 because it's absent in heatmap
Right after the restart, while the tenant is still attached as a secondary, we remove a lot of layers that are still used by logical size calculation / compaction / etc.; once the tenant gets attached back, we have to download a lot of them again.
I think we can have a quick fix that avoids deleting anything when the tenant has been attached as a secondary for fewer than 5 minutes? (Rough sketch below.)
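Roughly what such a guard could look like. The 5-minute grace period comes from the suggestion above; the names and the struct are hypothetical, not the actual secondary downloader code:

```rust
use std::time::{Duration, Instant};

/// Hypothetical grace period: don't trust the heatmap for cleanup until the
/// tenant has been a secondary for at least this long after a restart.
const SECONDARY_DELETE_GRACE: Duration = Duration::from_secs(5 * 60);

struct SecondaryTenantState {
    attached_as_secondary_at: Instant,
}

impl SecondaryTenantState {
    /// The downloader would consult this before removing local layers that are
    /// absent from the heatmap, so freshly restarted nodes keep layers that
    /// logical size calculation / compaction will want right after re-attach.
    fn may_delete_absent_layers(&self, now: Instant) -> bool {
        now.duration_since(self.attached_as_secondary_at) >= SECONDARY_DELETE_GRACE
    }
}
```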
Vlad got a fix for this issue: https://github.com/neondatabase/neon/pull/12206
We should see whether the dip issue gets resolved with this week's deployment, and decide on the next steps.
We still see some dips during the deployment (and a spike after the deployment), so there may be more issues under the hood.
This issue was moved to Jira: LKB-1680