
Neon handling of a large table


Note: This is a long issue. We would like to collect early feedback on the proposed directions before diving too deeply into any specific area of improvement.

Background

We ingested a table with one secondary index into a single standard Neon cluster:

  • Row count: 110,728,822,784 rows
  • Size: 16TB
  • Secondary index size: 13TB

We are running a single tenant on a Pageserver with 42TB of SSD storage and 24 cores. Resource usage on the Pageserver is never saturated.

Configuration

We are running a single shard pageserver.

…
disk_usage_based_eviction = { max_usage_pct = 90, min_avail_bytes = 4200000000000, period = '1m' }                                                                    
tenant_config = { pitr_interval = '1 hour' }
…

Status

After ingestion and index creation, the L1 layers have accumulated to a degree that makes reads extremely slow.

Specifically, we enabled a workload that generated about 1200 GetPage@LSN requests per second, and collected the following stats:

  • P99 GetPage@LSN: 2.5s
  • P99 number of deltas per read: > 256 (largest bucket)
  • Layer count
    • L1: 175K (40TB)
    • Images: 133K (10TB)

Investigation

Our hypotheses:

  • A large number of L1 layers causes high tail read latency;
  • A prolonged ingestion period led to this accumulation of L1 layers;
  • Auto-compaction failed to compact these layers effectively.

To explore this further, we captured a snapshot of layer files stored locally in the Pageserver. One key range stood out with an unusually large number of LSN ranges (L1):

000000067F000040490002800000FFFFFFFF-030000000000000000000000000000000002

There are 2408 L1 files for this key range:

000000067F000040490002800000FFFFFFFF-030000000000000000000000000000000002__00002D2B06FE7E81-00002D2BB7E3EA59-v1-00000001
…
000000067F000040490002800000FFFFFFFF-030000000000000000000000000000000002__00003E79B55FEA09-00003E7A53EDE611-v1-00000001
000000067F000040490002800000FFFFFFFF-030000000000000000000000000000000002__00003E7A53EDE611-00003E7AF27BFD19-v1-00000001

Each file is about 9MB. Upon learning more about the key format, we found that this range is a metadata range that contains relsize and aux files.
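
For reference, here is roughly how we decoded these layer file names during the investigation. This is an illustrative sketch only: the field layout (key range, then LSN range, then a version/generation suffix) is inferred from the file names listed above, not taken from the pageserver source, and the struct and field names are ours.

// Rough decoder for delta-layer file names of the form
// "<key_start>-<key_end>__<lsn_start>-<lsn_end>-v1-00000001".
// Field layout is inferred from the examples above, not from pageserver code.
#[derive(Debug)]
struct LayerName {
    key_start: String, // 36 hex chars in the examples above
    key_end: String,
    lsn_start: u64,    // 16 hex chars, parsed as a u64 LSN
    lsn_end: u64,
    suffix: String,    // "v1-00000001" style version/generation tag
}

fn parse_layer_name(name: &str) -> Option<LayerName> {
    let (keys, rest) = name.split_once("__")?;
    let (key_start, key_end) = keys.split_once('-')?;
    let mut parts = rest.splitn(3, '-');
    let lsn_start = u64::from_str_radix(parts.next()?, 16).ok()?;
    let lsn_end = u64::from_str_radix(parts.next()?, 16).ok()?;
    Some(LayerName {
        key_start: key_start.to_string(),
        key_end: key_end.to_string(),
        lsn_start,
        lsn_end,
        suffix: parts.next()?.to_string(),
    })
}

fn main() {
    let name = "000000067F000040490002800000FFFFFFFF-030000000000000000000000000000000002__00002D2B06FE7E81-00002D2BB7E3EA59-v1-00000001";
    let layer = parse_layer_name(name).expect("unexpected file name shape");
    // The LSN span of each ~9MB file shows how little history a single layer covers.
    println!("{:?}\nLSN span: {} bytes", layer, layer.lsn_end - layer.lsn_start);
}

Sorting the files by the parsed lsn_start makes it easy to see how the 2408 layers stack up as one long, contiguous history over this metadata key range.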

We don’t see significant skew on other key ranges, but we’re happy to provide more detail on them if needed.

Actions

To reduce the number of L1 files in this range, we attempted a manual compaction:

curl -X PUT 'http://hostname:9898/v1/tenant/cae15f273a43f591dc980aab10ca7c21/timeline/bf84cf04d98d17798586d6a322630578/compact' -H 'Content-Type: application/json' -d '{"compact_key_range": {"start": "000000067F00004049000040500004D1DF1A", "end": "030000000000000000000000000000000002"}}'

Since the legacy compaction does not support range-based compaction, we switched to the new tiered_compaction algorithm.

The tiered_compaction job is now running. It is processing ~200K jobs, each taking about 7 seconds to complete, which puts the estimated time to completion at around 16 days (200,000 jobs × 7 s ≈ 1.4M s ≈ 16.2 days). We hope this will reduce the number of L1 files to a manageable level so that we can re-enable read traffic on the cluster.

Questions

  • Would a metadata range with a long history cause high tail latency on GetPage@LSN on the data path?
  • What could have led to this state, where one range accumulated such a long history without being compacted?
  • Are there any key metrics that we should have paid attention to during the ingestion?

Proposals

Compaction

We have identified several points in compaction paths that can be parallelized:

  • tiered_compaction job execution: Would help reduce the compaction time;

  • Image creation: would help reduce the image creation time (currently P99 ≈ 16 minutes);

  • While the code itself is parallelizable, compaction is so CPU-heavy that it might starve other jobs under the current execution model (even with FuturesUnordered). We propose:

  • Introducing a dedicated thread pool (or runtime) for compaction jobs to isolate and parallelize them better (see the sketch after this list).

  • Making compaction jobs resumable. Currently, if the Pageserver restarts mid-run, compaction state is lost. This is especially painful for long-running jobs like ours.

  • In legacy compaction, the different levels of compaction and even GC always run in a fixed order. In theory these steps are independent and could be split into separate, individually schedulable steps to improve efficiency.

  • Improve eviction algorithm: https://github.com/neondatabase/neon/issues/12123
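
To make the dedicated thread pool / runtime idea above more concrete, here is a minimal sketch, assuming a tokio-based setup like the pageserver's: a separate multi-threaded runtime with an explicit worker-thread cap, onto which compaction jobs are spawned so they cannot starve latency-sensitive tasks on the main runtime. All names here (CompactionPool, run_compaction_job) are hypothetical and for illustration only; this is not an existing pageserver API.

use tokio::runtime::{Builder, Runtime};

// Illustrative only: a dedicated runtime for CPU-heavy compaction work, kept
// separate from the runtime that serves GetPage@LSN requests. None of these
// types exist in the pageserver today; they sketch the proposal above.
struct CompactionPool {
    runtime: Runtime,
}

impl CompactionPool {
    fn new(worker_threads: usize) -> std::io::Result<Self> {
        let runtime = Builder::new_multi_thread()
            .worker_threads(worker_threads) // explicit cap on compaction parallelism
            .thread_name("compaction-worker")
            .enable_all()
            .build()?;
        Ok(Self { runtime })
    }

    // Hand a compaction job to the dedicated runtime. The caller only awaits the
    // JoinHandle, so the main runtime's workers stay free for read traffic.
    fn spawn_job<F>(&self, job: F) -> tokio::task::JoinHandle<()>
    where
        F: std::future::Future<Output = ()> + Send + 'static,
    {
        self.runtime.spawn(job)
    }
}

// Hypothetical usage from a background-task loop:
//
//   let pool = CompactionPool::new(8)?;      // e.g. 8 of the host's 24 cores
//   let handle = pool.spawn_job(async move {
//       run_compaction_job(job_desc).await;  // placeholder for the real job
//   });
//   handle.await?;

Capping worker_threads (say, at a fraction of the 24 cores) is what bounds how much CPU compaction can take away from GetPage@LSN handling.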

Storage broker

  • Not related to this particular large table, but one more thing we noticed is that the pageserver has a hard dependency on storage-broker in recent commits. Can we make this a soft dependency? The behavior we saw is that the pageserver cannot advance the persistent_lsn when it fails to talk to storage-broker, which in turn makes it impossible to acknowledge the WAL received from the Safekeeper. As a result, the Safekeeper cannot delete its WAL and eventually runs out of disk.

CC: @jcsp @skyzh

trungda avatar Jun 18 '25 00:06 trungda

You don't mention how the tenant is sharded. The pageserver isn't meant to handle arbitrarily large shards -- large databases should be broken up into many shards, even if you are only using one physical pageserver.

jcsp avatar Jun 18 '25 08:06 jcsp

You don't mention how the tenant is sharded. The pageserver isn't meant to handle arbitrarily large shards -- large databases should be broken up into many shards, even if you are only using one physical pageserver.

Thanks! I updated the description but yes, we are running a single shard PS.

We are still setting up our infra to run a multi-shard PS, so in the meantime we are trying to see how far we can push a single-shard PS. That being said, some of the proposals still make sense even for multi-shard, e.g., a thread pool for compaction and resumable compaction?

trungda avatar Jun 18 '25 13:06 trungda

Regarding concurrency: the way I think about compaction throughput is "how fast does it need to be to support peak write rate?". With sharding, as well as each shard being a smaller unit of space, it also only needs to handle a lower rate of MB/s for the total tenant to keep up with the ingest rate (as the peak WAL rate is global per tenant, not per shard). If one thread is fast enough to keep up, then it's not really desirable to use more threads, because that introduces new issues of compaction competing for resources with more latency-sensitive work.

On the other hand, resumable compaction is clearly useful because it avoids wasting work when we have to shut down or migrate a shard during a long-running compaction. It also enables us to do a better job of prioritisation, because we can do smaller batches of compaction work, and avoid a big, long-running, low-priority GC compaction holding up a much more urgent L0-L1 compaction.

Chi has done some work on breaking up GC compaction into smaller phases with this in mind.

(but the primary thing to change about your setup is to use at least ~16 shards for a tenant this size)

jcsp avatar Jun 18 '25 16:06 jcsp

If one thread is fast enough to keep up, then it's not really desirable to use more threads, because that introduces new issues of compaction competing for resources with more latency-sensitive work.

Thanks! We will be trying to shard the existing tenant. We are not running many tenants on a single pageserver. Our setup is that we have 1 tenant running on a very powerful pageserver machine, and we want to utilize all the CPU on this host.

On the other hand, resumable compaction is clearly useful

Thank you! Is there anything we can help with in this area?

trungda avatar Jun 18 '25 18:06 trungda