Persist pg_stat information in PS
Problem
Statistics are saved in a local file and so are lost on compute restart.
Persist them in the page server using the same AUX file mechanism used for replication slots.
See more about motivation in https://neondb.slack.com/archives/C04DGM6SMTM/p1703077676522789
Summary of changes
Persist the pgstat file using the AUX mechanism
Postgres PRs: https://github.com/neondatabase/postgres/pull/446 https://github.com/neondatabase/postgres/pull/445
Checklist before requesting a review
- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? If so, did you add the relevant metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.
Checklist before merging
- [ ] Do not forget to reformat commit message to not include the above checklist
2958 tests run: 2835 passed, 0 failed, 123 skipped (full report)
Code coverage* (full report)
functions: 32.7% (6904 of 21123 functions), lines: 50.1% (54163 of 108175 lines)
* collected from Rust tests only
75c06e0ce0c18cd08794c9ac1781accf6e875bbe at 2024-06-29T11:44:02.686Z :recycle:
Thanks for the ping. Can we enforce a size limit on this WAL write on the Postgres side? (i.e. check the size in wallog_file_descriptor, and if it exceeds a threshold, log a warning and don't write it).
I'm thinking that since pgstat file is meant to be small, a limit shouldn't affect it, but it will protect us in case this generic "write any file to the WAL" function is misused in future.
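For illustration, a minimal sketch of the kind of guard being suggested. The real check would live in the C function wallog_file_descriptor on the Postgres side; the Rust below, the function name, and the 8 MiB threshold are assumptions for illustration only, not the actual implementation:

```rust
// Hypothetical sketch of the suggested guard (the real check would be C code in
// wallog_file_descriptor); the threshold and names are assumptions.
const MAX_WALLOGGED_FILE_SIZE: usize = 8 * 1024 * 1024; // comfortably above the ~4 MiB pgstat file

fn maybe_wallog_file(path: &str, contents: &[u8]) -> bool {
    if contents.len() > MAX_WALLOGGED_FILE_SIZE {
        // Log a warning and skip the write instead of pushing a huge record into the WAL.
        eprintln!(
            "skipping WAL-logging of {path}: {} bytes exceeds limit of {} bytes",
            contents.len(),
            MAX_WALLOGGED_FILE_SIZE
        );
        return false;
    }
    // ... emit the AUX file WAL record as usual ...
    true
}

fn main() {
    let small = vec![0u8; 64 * 1024];
    assert!(maybe_wallog_file("pg_stat/pgstat.stat", &small));
}
```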
I just checked how large the statistics file can be: I created 10000 tables, performed some manipulations with them, and ran vacuum analyze. Is that enough to force generation of a large statistics file?
The size of pgstat.stat is 4MB.
To factor in the issues we've had with capacity/replay scaling, what do you think about having separate stores for "big values" (such as pgstats) and "small values" (such as logical replication state)?
Whatever the implementation on the pageserver evolves to, I think it will be helpful to store the ~< 8KiB items separately from the ~<4MiB items. This doesn't have to be a whole separate API, we just need some recognizable bit that the pageserver can use to store them separately.
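As a sketch of what such a "recognizable bit" could look like (all names here are hypothetical, not existing pageserver APIs): classify each aux file by expected size class when it is written, and let the storage layer keep the two classes in separate structures.

```rust
/// Hypothetical size classes for aux files; just an illustration of the
/// "recognizable bit" suggested above, not an existing pageserver type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum AuxValueClass {
    /// Small, frequently rewritten state, e.g. logical replication slot files (~< 8 KiB).
    Small,
    /// Large, rarely written blobs, e.g. the pgstat file (up to a few MiB).
    Large,
}

/// Classify an aux file by its path so the storage layer can keep the two
/// classes apart (separate keyspaces / directories / image policies).
fn classify_aux_file(path: &str) -> AuxValueClass {
    if path.starts_with("pg_stat/") {
        AuxValueClass::Large
    } else {
        // pg_logical/, pg_replslot/, ...: small replication state.
        AuxValueClass::Small
    }
}

fn main() {
    assert_eq!(classify_aux_file("pg_stat/pgstat.stat"), AuxValueClass::Large);
    assert_eq!(classify_aux_file("pg_logical/replorigin_checkpoint"), AuxValueClass::Small);
}
```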
Cross referencing with https://github.com/neondatabase/neon/pull/6874#pullrequestreview-1901765234 -- since that PR is writing image values regularly for the AuxFilesDirectory structure, it's important that we don't put the multi-megabyte stats value in the same structure, to avoid using unreasonable amounts of storage when logical replication is used on small tenants. The scenario described in that PR was 70,000 writes, so if we were writing a 4MB image every 1000 writes, that would be 280MB of storage for a tenant that could be as small as 30MB -- the rule of thumb I'm using here is that logical replication metadata should never account for the majority of storage in a tenant.
pgstat information is saved only on system shutdown, and unlike snapshot files it is a single key. So even if the compute is restarted every 5 minutes and there are 10k tables, it still could not cause storage bloat.
Please also notice that:
- 10k relations is really an extreme case. It is very unlikely that any client will have such a large number of relations at all. For a typical number of relations (~100), the size of the statistics is a few kilobytes.
- These statistics are naturally versioned: if we have some static RO replica, it should take the statistics relevant at that moment in time. Providing an alternative versioning mechanism is IMHO overkill.
- In any case we have DBDIR, which contains a serialised hash of all database relations. So in case of a large number of relations, storage bloat will be caused not by the statistics but by DBDIR rewriting.
So even if the compute is restarted every 5 minutes and there are 10k tables, it still could not cause storage bloat.
It doesn't matter how often the compute is restarted if its large stats object is written to the same AuxFilesDirectory as all the logical replication keys: every 1024 writes to LR slots will trigger a large (>4MiB) image value being written.
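A rough back-of-the-envelope illustration of that point (the LR write rate here is an assumed example, not a measurement): the image write rate is driven by the number of LR slot writes, not by compute restarts.

```rust
fn main() {
    // Assumed example workload: one logical-replication slot write per second.
    let lr_writes_per_day: u64 = 24 * 60 * 60;
    // Per the discussion: an image of the AuxFilesDirectory is written every 1024 writes,
    // and with a 10k-relation stats blob in it, each image is > 4 MiB.
    let images_per_day = lr_writes_per_day / 1024; // ~84 images
    let image_size_mib: u64 = 4;
    let storage_per_day_mib = images_per_day * image_size_mib; // ~336 MiB/day
    println!("{images_per_day} images/day -> ~{storage_per_day_mib} MiB/day of image values");
    // None of this depends on how often the compute restarts.
}
```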
10k relations is really an extreme case. It is very unlikely that any client will have such a large number of relations at all.
Let's design defensively so that we are not relying on good luck to avoid a situation where someone creates a lot of relations (big stats) and then enables logical replication (frequent updates).
In any case we have DBDIR, which contains a serialised hash of all database relations. So in case of a large number of relations, storage bloat will be caused not by the statistics but by DBDIR rewriting.
dbdir is indeed an issue, but the dbdir key isn't re-written continuously as a result of logical replication slot updates, and each value in the dbdir is only a couple of bytes. The difference is that the AuxFilesDirectory is subject to frequent, continuous writes, so we must not put anything big in that structure because the frequency of writes will lead to repetition of that big object.
Yes, that is true. The size of each hash map element is only 9 bytes. But still, the storage consumed by creating 10k relations is 4.4GB. I do not think that it is unacceptable, but it is 1000x larger than the pgstat file, which is written only once at compute shutdown.
Generally LGTM for the storage part of the change, but AUX file v2 is not yet enabled in all regions, so I don't plan to get this merged for now, unless there is a flag to control whether or not to enable this feature.
I see the Postgres side of the pull request writes pg_stat data during the shutdown checkpoint, so I assume that's not a lot of write traffic and the page server should be able to handle it relatively easily. Please correct me if I'm wrong :)
Yes, pgstat data is written at the shutdown checkpoint. The normal size of this file is about 64kB; for 10k relations it is 4MB. Since computes are suspended relatively rarely, I do not expect any significant extra load on the PS either.
Also, I didn't see pg_stat-related files getting encoded in the aux file v2 mapping? What are the paths of the files produced by pg_stat? We probably need to assign a prefix ID to them (see encode_aux_file_key in aux_file.rs).
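For illustration, a hedged sketch of the kind of prefix-ID assignment being asked about. This is not the actual encode_aux_file_key code in aux_file.rs; the prefix values, hash choice, and key layout below are assumptions.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical sketch of a per-directory prefix ID plus a hash of the file path,
/// in the spirit of encode_aux_file_key; the concrete prefix bytes and key layout
/// in aux_file.rs may differ.
fn encode_aux_file_key_sketch(path: &str) -> [u8; 9] {
    let prefix: u8 = if path.starts_with("pg_logical/") {
        0x01
    } else if path.starts_with("pg_replslot/") {
        0x02
    } else if path.starts_with("pg_stat/") {
        0x03 // new prefix ID that would need to be assigned for pgstat files
    } else {
        0xFF // catch-all for unrecognized paths
    };

    let mut hasher = DefaultHasher::new();
    path.hash(&mut hasher);
    let file_hash = hasher.finish().to_be_bytes();

    let mut key = [0u8; 9];
    key[0] = prefix;
    key[1..].copy_from_slice(&file_hash);
    key
}

fn main() {
    let key = encode_aux_file_key_sketch("pg_stat/pgstat.stat");
    println!("{key:02x?}");
}
```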
Sorry, I missed it. Fixed.