Limit number of AUX files deltas to reduce reconstruct time
Problem
After commit 840abe395413508db40d0428e30f09343c051fed (store AUX files as deltas) we avoid quadratic growth of storage size when storing LR snapshots, but we get a quadratic slowdown of reconstruct time. As a result, storing 70k snapshots at my local Neon instance took more than 3 hours, and starting the node (creation of the basebackup) took ~10 minutes. In prod, 70k AUX files increase startup time to 40 minutes:
https://neondb.slack.com/archives/C03F5SM1N02/p1708513010480179
Summary of changes
Enforce storing the full AUX directory (an analog of an FPI) every 1024 files. The time to create 70k snapshots is reduced to 6 minutes, and startup time to 1.5 minutes (100 seconds).
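For illustration, a rough sketch of this approach (the type and function names below are assumptions made for the example, not the actual pageserver code):

```rust
use std::collections::HashMap;

/// Store a full AUX directory image (an analog of an FPI) every IMAGE_PERIOD
/// files and per-file deltas in between, so reconstruction never has to
/// replay more than ~IMAGE_PERIOD deltas.
const IMAGE_PERIOD: usize = 1024;

/// Value written for the AUX-files key in this sketch.
enum AuxFilesValue {
    /// Full snapshot of the directory: path -> file contents.
    Image(HashMap<String, Vec<u8>>),
    /// Single-file change; `None` means the file was deleted.
    Delta { path: String, content: Option<Vec<u8>> },
}

fn next_aux_value(
    current: &HashMap<String, Vec<u8>>, // directory reconstructed so far
    path: String,
    content: Option<Vec<u8>>,
) -> AuxFilesValue {
    if current.len() % IMAGE_PERIOD == 0 {
        // Periodic full image, keyed on the file count crossing a multiple
        // of IMAGE_PERIOD (as in this PR).
        let mut full = current.clone();
        match content {
            Some(c) => {
                full.insert(path, c);
            }
            None => {
                full.remove(&path);
            }
        }
        AuxFilesValue::Image(full)
    } else {
        AuxFilesValue::Delta { path, content }
    }
}
```

Note that, as discussed further down in the thread, keying the image emission on the file count alone means a map whose size stays roughly constant under inserts and deletes may never emit an image.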
Checklist before requesting a review
- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.
Checklist before merging
- [ ] Do not forget to reformat commit message to not include the above checklist
2418 tests run: 2295 passed, 0 failed, 123 skipped (full report)
Flaky tests (3)
Postgres 16
Code coverage* (full report)
- functions: 28.8% (6781 of 23527 functions)
- lines: 47.7% (41251 of 86568 lines)
* collected from Rust tests only
e57afa3b3811beab8da03a4274a83bbdb134e0f7 at 2024-02-24T07:39:38.060Z :recycle:
Questions:
- How do we write an automated test for this?
- Is 70K writes a healthy postgres behavior, or a pathological behavior? If the latter, how do we put limits on it on the compute side?
Hmm, why is it quadratic at basebackup? AFAICS you need to reconstruct the aux-files key-value pair only once, which is O(n) where n is the number of delta records since last image.
How is this different from Postgres updating a single page very frequently? Do we have the same problem there? We do have a performance test (test_hot_page.py) for that.
In the worst cases of #6626, I recall that we had a 30MB logical size tenant with 2TB of physical storage. If we're making image layers every 1024 writes, then we would be re-introducing the storage bloat issue to some extent: that tenant would still end up with a multi-gigabyte physical size in spite of having a tiny logical size.
Writing out the periodic image value is only okay if we also have a limit on the number of aux files. e.g. if there is just one slot, then it's painless to write an image value every few thousand writes. If there are many thousands of keys in this map, then writing out those periodic image values re-introduces the space amplification problem.
To give examples: assuming a 100 byte size of kv pairs in the map:
- A) Okay: a single key in the map, written 70,000 times: that's 69000 deltas and 1000 100-byte images, total size ~6MB
- B) Not okay: 70,000 unique keys, each written once: that's 69000 deltas and 1000 images that are of size ~(70000 * 100 / 2), total size 23GiB.
The only way we can ensure we're in the (A) category is to have some robust limit on the number of keys.
Hmm, why is it quadratic at basebackup? AFAICS you need to reconstruct the aux-files key-value pair only once, which is O(n) where n is the number of delta records since last image.
Each step in reconstruction deserializes and reserializes the whole map (the working BufMut carries it from one delta to the next when applying).
It can be made O(N) by changing the reconstruction code to do something smarter, like building up the deltas in the BufMut in some intermediate form and then building the actual map one time, or teaching the reconstruction code to modify the binary encoding of the map in place rather than decoding/encoding the whole thing.
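As a minimal sketch of the first option (AuxDelta here is a hypothetical type, not the existing reconstruction code):

```rust
use std::collections::HashMap;

// Hypothetical delta record for the AUX-files key.
struct AuxDelta {
    path: String,
    content: Option<Vec<u8>>, // None = file removed
}

/// O(N) reconstruction: start from the last image (if any), fold every delta
/// into one in-memory map, and serialize the result once at the end, instead
/// of decoding and re-encoding the whole map for each delta.
fn reconstruct(
    base_image: Option<HashMap<String, Vec<u8>>>,
    deltas: impl IntoIterator<Item = AuxDelta>,
) -> HashMap<String, Vec<u8>> {
    let mut dir = base_image.unwrap_or_default();
    for d in deltas {
        match d.content {
            Some(c) => {
                dir.insert(d.path, c);
            }
            None => {
                dir.remove(&d.path);
            }
        }
    }
    dir // encode into the value format once, here
}
```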
How is this different from Postgres updating a single page very frequently? Do we have the same problem there? We do have a performance test (test_hot_page.py) for that.
It isn't, really. We lean heavily on postgres being "reasonable" in the number of overwrites of the same page: i.e. that by the time the number becomes a problem, we should have built up enough L0s that we start compacting. I imagine that in the case Konstantin is testing by hand, there isn't enough other stuff going on to generate a deep enough stack of L0s to let compaction do its thing.
Questions:
- How do we write an automated test for this?
There is test_layer_bloating.py reproducing it with the pg_log_standby_snapshot() function.
- Is 70K writes a healthy postgres behavior, or a pathological behavior? If the latter, how do we put limits on it on the compute side?
I am not sure whether it is normal behaviour or not, but it is quite easy to reproduce:
- create a subscription from one database
- stop this database
- create a subscription from another database
- wait for some time (70k * 15 sec = 12 days)
Hmm, why is it quadratic at basebackup? AFAICS you need to reconstruct the aux-files key-value pair only once, which is O(n) where n is the number of delta records since last image.
It is not quadratic at basebackup extraction - it just has to traverse and append 70k deltas=files. This is why it takes about 7 minutes.
It is quadratic on ingesting these files: to append the N+1'st delta we first need to extract the aux directory, i.e. apply all previous N deltas. This is why it is N*N and applying this 7MB of WAL takes 3 hours.
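To make the asymptotics explicit: if appending the k-th delta first replays the k-1 deltas before it, ingesting N files costs roughly

$$\sum_{k=1}^{N} (k-1) = \frac{N(N-1)}{2} = O(N^2)$$

delta applications - about 2.4 billion for N = 70k, versus ~70k for a linear scheme.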
How is this different from Postgres updating a single page very frequently? Do we have the same problem there? We do have a performance test (test_hot_page.py) for that.
There are two main differences:
- Checkpoints periodically force the creation of FPIs, restricting the number of deltas which have to be applied. It is actually not necessary for the Neon architecture to perform checkpoints at all (because we do not have a problem with non-atomic page writes), but maybe it is still a good idea to perform checkpoints because it limits reconstruct time.
- Each update produces a new version, so very soon you exhaust space on the page and need to allocate a new one. Certainly Postgres is able to perform some "micro vacuum" in place in the case of HOT updates, but it is still impossible to construct an arbitrarily long chain of WAL records by updating the same tuple multiple times (even if we do not take into account compaction and image layer generation).
In the worst cases of #6626, I recall that we had a 30MB logical size tenant with 2TB of physical storage. If we're making image layers every 1024 writes, then we would be re-introducing the storage bloat issue to some extent: that tenant would still end up with a multi-gigabyte physical size in spite of having a tiny logical size.
Writing out the periodic image value is only okay if we also have a limit on the number of aux files. e.g. if there is just one slot, then it's painless to write an image value every few thousand writes. If there are many thousands of keys in this map, then writing out those periodic image values re-introduces the space amplification problem.
To give examples: assuming a 100 byte size of kv pairs in the map:
- A) Okay: a single key in the map, written 70,000 times: that's 69000 deltas and 1000 100-byte images, total size ~6MB
- B) Not okay: 70,000 unique keys, each written once: that's 69000 deltas and 1000 images that are of size ~(70000 * 100 / 2), total size 23GiB.
The only way we can ensure we're in the (A) category is to have some robust limit on the number of keys.
Why 1000 images if we are saving one image per 1024 deltas? For 70k snapshots there will be ~70 images: 70 * 70k * 100 bytes ≈ 0.5GB. Not so small, but not so fatal. Especially taking into account that we will in any case try to avoid the creation of such a large number of snapshots (PR #6739)
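A rough version of the same estimate, assuming image i holds only the ~1024·i files accumulated so far (rather than all 70k):

$$\sum_{i=1}^{68} 1024 \cdot i \cdot 100\ \text{bytes} \approx 0.24\ \text{GB}$$

i.e. about half of the ~0.5GB upper bound - the "divide it by 2" correction mentioned below.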
It is quadratic on ingesting these files: to append the N+1'st delta we first need to extract the aux directory, i.e. apply all previous N deltas. This is why it is N*N and applying this 7MB of WAL takes 3 hours.
Oops. Can we fix that? You shouldn't need to reconstruct the previous image to create a new delta record.
It is quadratic on ingesting these files: to append the N+1'st delta we first need to extract the aux directory, i.e. apply all previous N deltas. This is why it is N*N and applying this 7MB of WAL takes 3 hours.
Oops. Can we fix that? You shouldn't need to reconstruct the previous image to create a new delta record.
To be precise, we cache AUX_DIR in DatadirModification, so we have to extract it only once, when Timeline.begin_modification() is called. And it is called once per WAL segment (~128kb) received from the SK. In my test (creation of snapshots using pg_log_standby_snapshot()) the WAL consists just of snapshot records, and so it is done several hundred times per 70k. But it looks like that is enough to make the PS consume 100% CPU for three hours. In a more realistic scenario - when a snapshot produced every 15 seconds is "hidden" in tons of other WAL records - we will really have to read and reconstruct the directory for each uploaded snapshot.
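Schematically, the caching pattern described here looks something like this (illustrative only; the names and the real DatadirModification API are simplified assumptions):

```rust
use std::collections::HashMap;

/// Illustrative stand-in for the modification object described above.
struct ModificationSketch {
    /// AUX directory extracted at most once per begin_modification() and
    /// then mutated in memory for every record in the batch.
    aux_dir_cache: Option<HashMap<String, Vec<u8>>>,
}

impl ModificationSketch {
    fn put_aux_file(&mut self, path: String, content: Vec<u8>) {
        // The expensive reconstruction happens only on the first AUX-file
        // record of the batch; later records in the batch reuse the cache.
        let dir = self.aux_dir_cache.get_or_insert_with(reconstruct_aux_dir);
        dir.insert(path, content);
    }
}

// Placeholder for the reconstruction that the cache is meant to avoid.
fn reconstruct_aux_dir() -> HashMap<String, Vec<u8>> {
    HashMap::new()
}
```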
Why 1000 images if are are saving one image per 1024 deltas? So for 70k snapshots there will be 70 images. 7070k100=0.5Gb. Not so small, but not so fatal. Especially taken in account that we in any case will try to avoid creation of such larger number of snapshots (PR https://github.com/neondatabase/neon/pull/6739)
Right, my maths was wrong there, but 500MB is still a 10x amplification on a near-empty tenant with 30MB logical size that enables logical replication. I think the reality is worse: because we saw a tenant with logical size 30MB and physical size 2TiB, I anticipate that if we merge this change, that tenant's size will go to ~2GiB. So yes, it's 1000x less, but it's still an enormous amplification, that's unacceptable for tenants that are on the free tier.
There is test_layer_bloating.py reproducing it with the pg_log_standby_snapshot() function.
Yes, but it passed before this PR (and also before #6742, right?), so I don't think its assertions are covering the storage piece of this issue.
Is 70K writes a healthy postgres behavior, or a pathological behavior? If the latter, how do we put limits on it on the compute side? I am not sure whether it is normal behaviour or not, but it is quite easy to reproduce: create a subscription from one database, stop this database, create a subscription from another database, wait for some time (70k*15sec = 12 days)
I should have asked a slightly different question: how many keys can a healthy system produce? 70k writes to the same key is fine, but 1 write to 70k keys is not. I think we urgently need to move from uncertainty over number of keys to defining a specific maximum, and enforcing that in postgres: then we can reason about the worst-case behavior for the storage.
We need maximums for:
- Number of keys
- Max size of key
Once we have that, we can make a well informed decision on whether the cost of a change like the one in this PR is acceptable -- I think that gives us a way to move forward with this change.
Right, my maths was wrong there, but 500MB is still a 10x amplification on a near-empty tenant with 30MB logical size that enables logical replication. I think the reality is worse: because we saw a tenant with logical size 30MB and physical size 2TiB, I anticipate that if we merge this change, that tenant's size will go to ~2GiB. So yes, it's 1000x less, but it's still an enormous amplification, that's unacceptable for tenants that are on the free tier.
Actually we should divide it by 2 :) So 250MB, or 5x amplification. We can easily replace 1024 with 8192 or any other period...
But please compare 3 hours of 100% CPU time with 7 minutes. If our image layer generation mechanism were more responsive, maybe we would not need this optimization. But right now, IMHO, using an extra 250MB on disk is less critical than spending 3 hours of CPU.
Yes, but it passed before this PR (and also before https://github.com/neondatabase/neon/pull/6742, right?), so I don't think its assertions are covering the storage piece of this issue.
This test actually checks that huge image layers are not generated. That was fixed some time ago, and now with https://github.com/neondatabase/neon/pull/6742 it is not relevant at all. But after #6742 this test is not usable at all (as I wrote, it took the PS 3 hours to ingest the WAL).
I should have asked a slightly different question: how many keys can a healthy system produce? 70k writes to the same key is fine, but 1 write to 70k keys is not. I think we urgently need to move from uncertainty over number of keys to defining a specific maximum, and enforcing that in postgres: then we can reason about the worst-case behavior for the storage.
I think you are asking only about AUX files. Right now we are storing AUX files of three kinds:
- snapshots
- slot state
- replication origins

1 and 2 are produced together, once per 15 seconds. But only snapshots have unique names (this is why my proposal was to remove the LSN from the name and store it inside the value; your solution with deltas seems more elegant... although it causes these reconstruction problems). 3 also uses the same name and is produced less frequently. If we support persisting some other data files, like Postgres statistics or pg_prewarm cache state, then they are produced only at system shutdown.
So the number of AUX files is expected to be very small and the frequency of their generation not so large. This is why I could not imagine this problem with snapshots :(
Number of keys: with #6739 already merged, the number of snapshots is limited to 300. So the total number of keys can be estimated as a few hundred. But please notice that it doesn't limit the number of deltas. If we do not periodically store the full directory, then we can still suffer from 40 minutes of preparing the basebackup.
Max size of key: it is more difficult to estimate. Right now the size of a typical snapshot is ~100 bytes. But in case of a larger number of active transactions it can be larger. And if we are going to persist statistics or shared buffers state, then it can be almost arbitrarily larger (but still I do not think that it will be larger than a few MB).
Once we have that, we can make a well informed decision on whether the cost of a change like the one in this PR is acceptable -- I think that gives us a way to move forward with this change.
Having a 40 minute node startup is already not acceptable. And I see only three ways to fix it:
- Produce image layers more frequently when it is needed. It will be hard to implement, and it may increase write amplification and have a negative impact on total system throughput.
- Periodically write an "FPI", as is done now by the Postgres checkpointer. This is what I tried to implement with this PR.
- As I suggested before: remove the LSN from the file name and store it separately. In this case the number of files will be small (~10), so there is no need to use deltas.
with https://github.com/neondatabase/neon/pull/6739 already merged, the number of snapshots is limited to 300. So the total number of keys can be estimated as a few hundred. But please notice that it doesn't limit the number of deltas. If we do not periodically store the full directory, then we can still suffer from 40 minutes of preparing the basebackup.
Okay, so can we robustly count on the 300 number? Like, if I put an assert!(dir.files.len() < 301) is that valid? I'm trying to distinguish the general comments about expected counts from what is fully guaranteed. The numbers we need are the guaranteed limits.
We don't need to break this down into different postgres use cases: in the pageserver, we just need the total number of keys in the map.
It is more difficult to estimate. Right now the size of a typical snapshot is ~100 bytes. But in case of a larger number of active transactions it can be larger. And if we are going to persist statistics or shared buffers state, then it can be almost arbitrarily larger (but still I do not think that it will be larger than a few MB).
Let's leave the statistics aside for the moment and stay on aux files. The problem we have is that I read "in case of a larger number of active transactions it can be larger" as meaning "unbounded". I don't know if postgres has a limit on the number of active transactions (maybe it doesn't?). To build a robust system there needs to be some limit on this: otherwise we have to build a backend that can support an arbitrarily large collection of keys.
Can you pick a specific number that you would feel comfortable with in production? e.g. if we threw away extra keys once the map reached size 1000, would that work?
Having a 40 minute node startup is already not acceptable.
My primary concern is to ensure any LR-related bugs don't affect other tenants (e.g. by consuming excessive disk space). So my bar for what's acceptable is different for folks who have enrolled in the beta vs. for the system as a whole.
I just looked again at the change in this PR, and realized it's not emitting an image on every 1024 updates, but only when the key count crosses a multiple of 1024. So if the size of the map is somewhat steady with keys added/removed, it won't emit any image values at all.
For this to work without carrying more state about the number of delta values since last image value, one could use some stochastic approach to statistically emit an image roughly every N writes. The storage amplification could be bounded by using the size of the map to control how frequently we emit image values.
e.g. if someone has a 1000-value map, and emits images approximately every 1000 writes, then the approximate amplification in storage is 2x.
If we decided to tolerate a 10x amplification in storage (i.e. a 30MB free tier tenant takes 300MB storage, which is still unfortunate but at least bounded), then the calculation would be something like:
// Try to keep storage amplification below this factor
const MAX_AMPLIFICATION_FACTOR: usize = 10;
// Never bother emitting image values more often than this
const MIN_IMAGE_VAL_PERIOD: usize = 1000;
let image_val_period = std::cmp::max(MIN_IMAGE_VAL_PERIOD, dir.files.len() * MAX_AMPLIFICATION_FACTOR);
// (needs `use rand::Rng;` in scope)
if rand::thread_rng().gen::<usize>() % image_val_period == 0 {
    // emit an image value
}
That's a huge kludge, but it would be a similar improvement to this PR's intent, while avoiding a storage amplification which is unbounded with the key count.
I just looked again at the change in this PR, and realized it's not emitting an image on every 1024 updates, but only when the key count crosses a multiple of 1024. So if the size of the map is somewhat steady with keys added/removed, it won't emit any image values at all.
Thank you for pointing it out. It is a real problem - I forgot about deletes.
For this to work without carrying more state about the number of delta values since last image value, one could use some stochastic approach to statistically emit an image roughly every N writes. The storage amplification could be bounded by using the size of the map to control how frequently we emit image values.
Why not just add an update counter to AuxFilesDirectory? An extra 4-byte field will not cause significant write amplification, will it?
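Something like the following, where the counter advances on inserts and deletes alike (the field and type names are assumptions):

```rust
use std::collections::HashMap;

const IMAGE_PERIOD: u32 = 1024;

/// Illustrative shape of the suggestion: keep a write counter inside the
/// serialized directory so that deletes also count towards the period.
struct AuxFilesDirectorySketch {
    files: HashMap<String, Vec<u8>>,
    n_updates: u32, // incremented on every insert and delete
}

impl AuxFilesDirectorySketch {
    /// Returns true when it is time to emit a full image value.
    fn record_update(&mut self) -> bool {
        self.n_updates = self.n_updates.wrapping_add(1);
        self.n_updates % IMAGE_PERIOD == 0
    }
}
```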
e.g. if someone has a 1000-value map, and emits images approximately every 1000 writes, then the approximate amplification in storage is 2x.
If we decided to tolerate a 10x amplification in storage (i.e. a 30MB free tier tenant takes 300MB storage, which is still unfortunate but at least bounded), then the calculation would be something like:
Looks like we are trying to solve a problem which doesn't exist any more. With #6739 there are no more than 300 snapshot files. It means that the full hash map size is not larger than 30kb (assuming the size of a snapshot is 100 bytes). And it is written once per 1000*15 sec ≈ 4 hours. Can you call it "write amplification"?
// Try to keep storage amplification below this factor
const MAX_AMPLIFICATION_FACTOR: usize = 10;
// Never bother emitting image values more often than this
const MIN_IMAGE_VAL_PERIOD: usize = 1000;
let image_val_period = std::cmp::max(MIN_IMAGE_VAL_PERIOD, dir.files.len() * MAX_AMPLIFICATION_FACTOR);
if rand::thread_rng().gen::<usize>() % image_val_period == 0 {
    // emit an image value
}
That's a huge kludge, but it would be a similar improvement to this PR's intent, while avoiding a storage amplification which is unbounded with the key count.
Let's leave the statistics aside for the moment and stay on aux files. The problem we have is that I read "in case of a larger number of active transactions it can be larger" as meaning "unbounded". I don't know if postgres has a limit on the number of active transactions (maybe it doesn't?).
The number of active transactions is limited by the number of connections. It is 100 by default, and it is very unlikely that it will be set larger than 1000 (Postgres does not scale well to a large number of active transactions, so if it is necessary to support more active connections, it is better to use pgbouncer to map them to backends).
To build a robust system there needs to be some limit on this: otherwise we have to build a backend that can support an arbitrarily large collection of keys.
So, a snapshot really cannot be larger than a few kilobytes. With statistics the situation is less clear - I am going to investigate it. It should depend on the number of relations, but I am not sure.
My primary concern is to ensure any LR-related bugs don't affect other tenants (e.g. by consuming excessive disk space). So my bar for what's acceptable is different for folks who have enrolled in the beta vs. for the system as a whole.
I wonder what is the principal difference between LR bugs and any other bugs which may cause bloating of storage. One example of such a bug we have discussed in another ticket: stairs of layers which caused 30x amplification of storage space.
So my bar for what's acceptable is different for folks who have enrolled in the beta vs. for the system as a whole.
For us the most important thing is the customer's feedback. This incident with storage bloating caused a lot of trouble for us, but it is almost not noticeable for the customers (the fact that we have 100x write amplification is not visible to customers - they are not paying for resident size). But a 40 minute startup is a show stopper for the customer: from his point of view it just means that Neon doesn't work.
So the priority here seems to be obvious.
I have slightly adjusted the test_layer_bloating.py test by adding wait_for_flush_lsn.
Now it fails on main and passes in this PR, illustrating the problem.
Okay, so can we robustly count on the 300 number? Like, if I put an assert!(dir.files.len() < 301) is that valid?
The limit from https://github.com/neondatabase/neon/pull/6739 is a bit soft, i.e. several minutes might pass before it takes effect. Also, it limits only .snap files, but there are also heap rewrite files; typically there are not many of them, but if someone really wanted they could likely be abused. Plus there are some more (slots, origins); there are a few of them normally, but similarly this can probably be abused. So it makes sense to me to have some additional restriction on the pageserver side with a value which presumably should never be reached, e.g. 10k. Violating it should be an error, not a panic (assert). Ensuring a hard boundary on the PG side is much harder.
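A sketch of such a pageserver-side check, under the assumption of a 10k bound and anyhow-style error handling (not existing code):

```rust
/// Presumably-never-reached bound on the number of AUX files per timeline.
const MAX_AUX_FILES: usize = 10_000;

/// Reject oversized AUX directories with an error rather than a panic, so a
/// misbehaving tenant cannot take the pageserver down.
fn check_aux_file_count(n_files: usize) -> anyhow::Result<()> {
    if n_files > MAX_AUX_FILES {
        anyhow::bail!(
            "AUX file directory has {} entries, exceeding the limit of {}",
            n_files,
            MAX_AUX_FILES
        );
    }
    Ok(())
}
```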