Persist pg_stat information in PS
Problem
Statistics are saved in a local file and so are lost on compute restart.
Persist them in the page server using the same AUX file mechanism used for replication slots.
See more about motivation in https://neondb.slack.com/archives/C04DGM6SMTM/p1703077676522789
Summary of changes
Persist the pgstat file using the AUX mechanism
Postgres PRs: https://github.com/neondatabase/postgres/pull/446 https://github.com/neondatabase/postgres/pull/445
Checklist before requesting a review
- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? If so, did you add the relevant metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.
Checklist before merging
- [ ] Do not forget to reformat commit message to not include the above checklist
2958 tests run: 2835 passed, 0 failed, 123 skipped (full report)
Code coverage* (full report)
functions: 32.7% (6904 of 21123 functions), lines: 50.1% (54163 of 108175 lines)
* collected from Rust tests only
75c06e0ce0c18cd08794c9ac1781accf6e875bbe at 2024-06-29T11:44:02.686Z :recycle:
Thanks for the ping. Can we enforce a size limit on this WAL write on the Postgres side? (i.e. check the size in wallog_file_descriptor, and if it exceeds a threshold, log a warning and don't write it).
I'm thinking that since pgstat file is meant to be small, a limit shouldn't affect it, but it will protect us in case this generic "write any file to the WAL" function is misused in future.
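For illustration, a minimal sketch of the kind of guard being suggested. The real check would live in the C function wallog_file_descriptor on the Postgres side; the Rust below, the function name, and the 8 MiB threshold are assumptions for illustration only, not the actual implementation:

```rust
// Hypothetical sketch of the suggested guard (the real check would be C code in
// wallog_file_descriptor); the threshold and names are assumptions.
const MAX_WALLOGGED_FILE_SIZE: usize = 8 * 1024 * 1024; // comfortably above the ~4 MiB pgstat file

fn maybe_wallog_file(path: &str, contents: &[u8]) -> bool {
    if contents.len() > MAX_WALLOGGED_FILE_SIZE {
        // Log a warning and skip the write instead of pushing a huge record into the WAL.
        eprintln!(
            "skipping WAL-logging of {path}: {} bytes exceeds limit of {} bytes",
            contents.len(),
            MAX_WALLOGGED_FILE_SIZE
        );
        return false;
    }
    // ... emit the AUX file WAL record as usual ...
    true
}

fn main() {
    let small = vec![0u8; 64 * 1024];
    assert!(maybe_wallog_file("pg_stat/pgstat.stat", &small));
}
```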
I just checked how large the statistics file can be: I created 10000 tables, performed some manipulations with them, and ran vacuum analyze. Is that enough to force generation of a large statistics file?
The size of pgstat.stat is 4MB.
To factor in the issues we've had with capacity/replay scaling, what do you think about having separate stores for "big values" (such as pgstats) and "small values" (such as logical replication state)?
Whatever the implementation on the pageserver evolves to, I think it will be helpful to store the ~< 8KiB items separately from the ~<4MiB items. This doesn't have to be a whole separate API, we just need some recognizable bit that the pageserver can use to store them separately.
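As a sketch of what such a "recognizable bit" could look like (all names here are hypothetical, not existing pageserver APIs): classify each aux file by expected size class when it is written, and let the storage layer keep the two classes in separate structures.

```rust
/// Hypothetical size classes for aux files; just an illustration of the
/// "recognizable bit" suggested above, not an existing pageserver type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum AuxValueClass {
    /// Small, frequently rewritten state, e.g. logical replication slot files (~< 8 KiB).
    Small,
    /// Large, rarely written blobs, e.g. the pgstat file (up to a few MiB).
    Large,
}

/// Classify an aux file by its path so the storage layer can keep the two
/// classes apart (separate keyspaces / directories / image policies).
fn classify_aux_file(path: &str) -> AuxValueClass {
    if path.starts_with("pg_stat/") {
        AuxValueClass::Large
    } else {
        // pg_logical/, pg_replslot/, ...: small replication state.
        AuxValueClass::Small
    }
}

fn main() {
    assert_eq!(classify_aux_file("pg_stat/pgstat.stat"), AuxValueClass::Large);
    assert_eq!(classify_aux_file("pg_logical/replorigin_checkpoint"), AuxValueClass::Small);
}
```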
Cross referencing with https://github.com/neondatabase/neon/pull/6874#pullrequestreview-1901765234 -- since that PR is writing image values regularly for the AuxFilesDirectory structure, it's important that we don't put the multi-megabyte stats value in the same structure, to avoid using unreasonable amounts of storage when logical replication is used on small tenants. The scenario described in that PR was 70,000 writes, so if we were writing a 4MB image every 1000 writes, that would be 280MB of storage for a tenant that could be as small as 30MB -- the rule of thumb I'm using here is that logical replication metadata should never account for the majority of storage in a tenant.
pgstat information is saved only on system shutdown, and unlike snapshot files it is a single key. So even if the compute is restarted every 5 minutes and there are 10k tables, it still could not cause storage bloat.
Please also notice that:
- 10k relations is really an extreme case. It is very unlikely that any client will have such a large number of relations at all. For a typical number of relations (~100), the size of the statistics is a few kilobytes.
- These statistics are naturally versioned: if we have some static RO replica, it should take the statistics relevant at that moment in time. Providing an alternative versioning mechanism is IMHO overkill.
- In any case we have DBDIR, which contains a serialised hash of all database relations. So in case of a large number of relations, storage bloat will be caused not by the statistics but by DBDIR rewriting.
So even if the compute is restarted every 5 minutes and there are 10k tables, it still could not cause storage bloat.
It doesn't matter how often the compute is restarted if its large stats object is written to the same AuxFilesDirectory as all the logical replication keys: every 1024 writes to LR slots will trigger a large (>4MiB) image value being written.
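A rough back-of-the-envelope illustration of that point (the LR write rate here is an assumed example, not a measurement): the image write rate is driven by the number of LR slot writes, not by compute restarts.

```rust
fn main() {
    // Assumed example workload: one logical-replication slot write per second.
    let lr_writes_per_day: u64 = 24 * 60 * 60;
    // Per the discussion: an image of the AuxFilesDirectory is written every 1024 writes,
    // and with a 10k-relation stats blob in it, each image is > 4 MiB.
    let images_per_day = lr_writes_per_day / 1024; // ~84 images
    let image_size_mib: u64 = 4;
    let storage_per_day_mib = images_per_day * image_size_mib; // ~336 MiB/day
    println!("{images_per_day} images/day -> ~{storage_per_day_mib} MiB/day of image values");
    // None of this depends on how often the compute restarts.
}
```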
10k relations is really an extreme case. It is very unlikely that any client will have such a large number of relations at all.
Let's design defensively so that we are not relying on good luck to avoid a situation where someone creates a lot of relations (big stats) and then enables logical replication (frequent updates).
In any case we have DBDIR, which contains a serialised hash of all database relations. So in case of a large number of relations, storage bloat will be caused not by the statistics but by DBDIR rewriting.
dbdir is indeed an issue, but the dbdir key isn't re-written continuously as a result of logical replication slot updates, and each value in the dbdir is only a couple of bytes. The difference is that the AuxFilesDirectory is subject to frequent, continuous writes, so we must not put anything big in that structure because the frequency of writes will lead to repetition of that big object.
Yes, that is true. The size of each hash map element is only 9 bytes. But still, the storage consumed by creating 10k relations is 4.4GB. I do not think that it is unacceptable, but it is 1000x larger than the pgstat file, which is written only once at compute shutdown.
Generally LGTM for the storage part of the change, but AUX file v2 is not yet enabled in all regions, so I don't plan to get this merged for now, unless there is a flag to control whether or not to enable this feature.
I see the Postgres side of the pull request writes pg_stat data during the shutdown checkpoint, so I assume that's not a lot of write traffic and the page server should be able to handle it relatively easily. Please correct me if I'm wrong :)
Yes, pgstat data is written at the shutdown checkpoint. The normal size of this file is about 64kB; for 10k relations it is 4MB. Since computes are suspended relatively rarely, I do not expect any significant extra load on the PS either.
Also, I didn't see pg_stat-related files getting encoded in the aux file v2 mapping? What are the paths of the files produced by pg_stat? We probably need to assign a prefix ID to them (see encode_aux_file_key in aux_file.rs).
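For illustration, a hedged sketch of the kind of prefix-ID assignment being asked about. This is not the actual encode_aux_file_key code in aux_file.rs; the prefix values, hash choice, and key layout below are assumptions.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical sketch of a per-directory prefix ID plus a hash of the file path,
/// in the spirit of encode_aux_file_key; the concrete prefix bytes and key layout
/// in aux_file.rs may differ.
fn encode_aux_file_key_sketch(path: &str) -> [u8; 9] {
    let prefix: u8 = if path.starts_with("pg_logical/") {
        0x01
    } else if path.starts_with("pg_replslot/") {
        0x02
    } else if path.starts_with("pg_stat/") {
        0x03 // new prefix ID that would need to be assigned for pgstat files
    } else {
        0xFF // catch-all for unrecognized paths
    };

    let mut hasher = DefaultHasher::new();
    path.hash(&mut hasher);
    let file_hash = hasher.finish().to_be_bytes();

    let mut key = [0u8; 9];
    key[0] = prefix;
    key[1..].copy_from_slice(&file_hash);
    key
}

fn main() {
    let key = encode_aux_file_key_sketch("pg_stat/pgstat.stat");
    println!("{key:02x?}");
}
```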
Sorry, I missed it. Fixed.