
Aux files RFC

Open kelvich opened this issue 2 years ago • 8 comments

This RFC discusses how to better store extra files in Neon's Postgres setup, like replication info and stats files, which aren't fully covered by the current system. It looks at options like using cloud storage (S3FS), dedicated storage (EBS), shared file systems (EFS), and a centralized service, weighing their good and bad points.

Rendered

kelvich avatar Feb 04 '24 23:02 kelvich

2400 tests run: 2283 passed, 0 failed, 117 skipped (full report)


Flaky tests (2)

Postgres 16

  • test_secondary_mode_eviction: release
  • test_empty_branch_remote_storage_upload: debug

Code coverage (full report)

  • functions: 54.3% (11470 of 21106 functions)
  • lines: 81.6% (64425 of 78972 lines)

The comment gets automatically updated with the latest test results
42ad8df3a9cfb824287845bdd0af9e741fc75747 at 2024-02-08T16:45:22.419Z :recycle:

github-actions[bot] avatar Feb 05 '24 00:02 github-actions[bot]

At various places:

"chain replication"

The PostgreSQL term for this is "cascading replication" IIRC.

MMeent avatar Feb 05 '24 19:02 MMeent

TODO: A detailed list of files that can block start needs to be compiled. TODO: check if snapshot files need to be read at startup.

Walproposer uses a replication slot, but its persistence is not needed, so we can add a hack to keep it in a separate directory. (Removing it completely is also not impossible, though it creates an unlikely corner case where commit_lsn lags behind so much that WAL is recycled before it is sent to at least some available safekeeper, so the compute has to be restarted.) However, StartupReplicationSlots, which loads all slots, runs very early, so without changes it will slow down the start. But then again, the original issue was about bloating of snapshot files, not replslot files.

I'd advocate for 'Logical messages + basebackup' with a proper implementation on the storage side (avoiding full rewrites), to avoid external dependencies and auth issues; but logical decoding on standby seems to be important, unlike chained replication -- I do not see any application of the latter for Neon, since we don't connect replicas to compute directly anyway. So the s3fs option seems the most reasonable here.

Though likely we'd need somewhat of both, actually, as a similar issue should be fixed for RelDirectory/DbDirectory (DBDIR_KEY).

arssher avatar Feb 05 '24 21:02 arssher

Walproposer uses a replication slot, but its persistence is not needed, so we can add a hack to keep it in a separate directory.

The AUX_FILE mechanism takes into account only logical replication slots, so the walproposer slot is not relevant.

knizhnik avatar Feb 06 '24 08:02 knizhnik

The AUX_FILE mechanism takes into account only logical replication slots, so the walproposer slot is not relevant.

ah, true

arssher avatar Feb 06 '24 09:02 arssher

As far as I understand, DFS is still considered "mission impossible" because:

  1. If we have a single device shared by all tenants, then we will have to create thousands of users to provide proper isolation and quotas.
  2. If we create a DFS partition for each tenant, then we will hit the AWS limit on the maximum number of mounted devices.

I am not a big expert here. But it actually seems strange to me that a task which people have been trying to solve since the 60s (the first multiuser OSes) is still not solved. It is also not clear to me how Amazon RDS manages to work with EBS if "it is impossible".

What about a combined approach? Do not create a single DFS partition for all tenants, but do not create per-tenant partitions either. Let's say we create N partitions with K tenants each? Inside one partition we rely on the standard permission/quota mechanism to isolate each tenant's data.
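A toy illustration of that combined scheme (the hash function and the value of N below are arbitrary assumptions, not a concrete proposal):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Purely illustrative: assign each tenant to one of N shared DFS partitions
// by hashing its id, so a node mounts at most N volumes regardless of how
// many tenants it hosts, while quotas/permissions isolate tenants inside
// each partition.
const N_PARTITIONS: u64 = 64; // arbitrary assumption

fn partition_for(tenant_id: &str) -> u64 {
    let mut h = DefaultHasher::new();
    tenant_id.hash(&mut h);
    h.finish() % N_PARTITIONS
}

fn main() {
    println!("partition {}", partition_for("de200bd42b49cc1814412c7e592dd6e9"));
}
```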

I am not sure how authentication is performed in DFS. Do we actually need to create corresponding users on the host system? Are there any differences between VMs and pods? But I am absolutely sure that there should be some solution.

Sharing a single DFS partition between multiple tenants will solve the problem with the limit on the number of mounted devices. It is obvious that tenants on one node should access the same partition. It is less obvious what to do when a tenant is migrated to some other node, but the straightforward solution is just to copy its data from one partition to another.

I am still trying to push the DFS solution because only in this case do we solve the problems with temp/unlogged tables, limits on index size, custom extension storage, ...

knizhnik avatar Feb 07 '24 07:02 knizhnik

On the topic of a separate service:

  • IMO the most pivotal point made above is that the semantics of slot storage are different to what we have for postgres pages today. So:
    • we definitely need a logically separate service
    • then there's a separate question about whether it should be physically separate (run as a separate process, use a separate DB, etc).
  • If we were to build it in the pageserver, I think we'd want to define some special area of the keyspace with its own semantics like "latest value only" for some content, so that after we do our log-structured ingest to L0 layers, we can compact away all but the latest values as soon as we're writing L1 etc. If we taught keyspace partitioning to recognize this special tiny key range, it could have its own simple rules for compaction etc (like, really simple: full image compaction every 5 minutes or something like that).
  • Arpad has suggested building a simple hash table on top of a key range in the pageserver, where the keys act as slots in the hash table to which we can append values, to get away from the current monolithic single-key storage of all the slots. I quite like that idea, and it would be worth building if we were building this generic "store what you want" slot service inside the pageserver (see the sketch after this list).
  • Writing a physically independent service is clearly a valid and workable approach as well. It has the major upside of decoupling development work from the pageserver. My worry would be the operational side of it: we have plans for the pageserver to do things like gracefully migrating tenants around during upgrades to hide downtime, and it would be a shame if a physically independent slot service didn't benefit from that. That's not a decisive downside, but it's worth calling out.
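A minimal sketch of what that hashed-slot layout could look like, assuming a hypothetical reserved prefix in the key space and an arbitrary slot count (this is not the actual pageserver key encoding):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hypothetical reserved prefix marking the "slot" area of the keyspace.
const SLOT_KEY_PREFIX: u64 = 0xAA00_0000_0000_0000;
// Number of hash-table slots carved out of that key range.
const SLOT_COUNT: u64 = 1 << 16;

/// Map an aux-file path (e.g. "pg_logical/snapshots/0-1.snap") to one of the
/// reserved keys. Values appended under the same key form one bucket, so
/// compaction can keep only the latest bucket contents instead of every
/// historical rewrite of a single monolithic blob.
fn slot_key_for(path: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    path.hash(&mut hasher);
    SLOT_KEY_PREFIX | (hasher.finish() % SLOT_COUNT)
}

fn main() {
    println!("{:#x}", slot_key_for("pg_logical/snapshots/0-1.snap"));
}
```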

jcsp avatar Feb 23 '24 16:02 jcsp

IMO the most pivotal point made above is that the semantics of slot storage are different to what we have for postgres pages today.

I still do not completely agree with it. Page access patterns can also be quite different... Yes, it looks like we do not need versioning for slot information (for now?), although snapshots are "naturally" versioned. But I do not think that is an "axiom" (that slots need not be versioned). It is unclear now how to combine slots with branching, but for PITR it seems to be the expected behaviour to restore slot state at a particular point in time.

So I cannot agree with @hlinnaka that storage for the slots is very different from other AUX files and should be implemented in some other way. Yes, we have faced problems with storing snapshots which I had not expected and which cause real risks in production. But now (first of all thanks to your PR) this problem is solved. I do not see a strong necessity to develop one more special service, and only for slots and not for all AUX files...

From my point of view it is very important and correct that we are storing in our KV storage not only relational data but also SLRUs, checkpoints and AUX files. There are many more pros to this approach than cons.

I think we'd want to define some special area of the keyspace with its own semantics like "latest value only" for some content,

I want to note once again that there are good reasons to retrieve not only the last value: at least PITR, and maybe in the future we will find more use cases for it. For statistics data, versioning is obviously useful.

Arpad has suggested building a simple hash table on top of a key range in the pageserver,

The main question is not how to implement it in PS but how to provide durability and consistency with other data. Once the slot state is updated, Postgres vacuum is able to perform updates which affect the main storage. It means that updates of slots and of other Postgres objects stored in the KV storage should be coordinated (2PC???). Also, PS is not intended to be reliable storage, so we have to store this data in S3. But we cannot wait until the write to S3 is acknowledged.

Writing a physically independent service is clearly a valid and workable approach as well.

IMHO it is a bad idea, especially if we are going to use it only for slots.

knizhnik avatar Feb 23 '24 19:02 knizhnik

Here are some more thoughts on AUX files and logical replication.

When we were looking into adding another type of AUX file (replication origin) for incoming logical replication, I started thinking that maybe we need to rethink our whole approach to this problem.

Just to recap, we were under the impression that PostgreSQL needs only a small number of extra files (from dozens to hundreds) in some cases. These files include things like replication slots, logical decoding snapshots, vacuum stats, pages access stats, two-phase commit state files, etc. And, PostgreSQL doesn't really have a concept of an AUX file. That was our idea for naming all the files we haven't stored yet.

It quickly became clear that our guess about the number of AUX files was off. This was kind of surprising, especially with logical replication. It was mainly because PostgreSQL doesn't usually notice when it's creating thousands of small short-lived files in some directory that would later disappear. But our approach to storing all AUX files in one hashmap and rewriting the whole hashmap for every key change led to a huge increase in storage use. Then we started fixing PostgreSQL's edge cases that create a lot of files. It looks like we can handle it for outgoing logical replication, but it feels like a losing battle. Over time, we'll probably run into more and more situations like this.

Why did we put those files in a hashmap? Mainly because those files need list operations and our storage can't do key range scans. PostgreSQL uses readdir() to list these files, and we need to find a way to support something similar. If we had key range scans, we could put each AUX file under its own key and scan a key prefix to list those files, avoiding the issues we've faced.
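A rough sketch of that per-key layout, using a BTreeMap as a stand-in for a storage engine that supports ordered key scans (which the current keyspace does not):

```rust
use std::collections::BTreeMap;

// Toy stand-in for a KV store with ordered keys. If the real storage
// supported key range scans, each AUX file could live under its own key and
// readdir() would become a prefix scan instead of deserializing one
// monolithic hashmap value.
struct AuxStore {
    kv: BTreeMap<String, Vec<u8>>,
}

impl AuxStore {
    fn put(&mut self, path: &str, content: Vec<u8>) {
        self.kv.insert(format!("aux/{path}"), content);
    }

    /// List everything under a directory by scanning its key prefix.
    fn list_dir(&self, dir: &str) -> Vec<&String> {
        let prefix = format!("aux/{dir}/");
        self.kv
            .range(prefix.clone()..)
            .take_while(|(k, _)| k.starts_with(&prefix))
            .map(|(k, _)| k)
            .collect()
    }
}

fn main() {
    let mut store = AuxStore { kv: BTreeMap::new() };
    store.put("pg_logical/snapshots/0-1.snap", vec![1]);
    store.put("pg_logical/snapshots/0-2.snap", vec![2]);
    store.put("pg_replslot/sub1/state", vec![3]);
    assert_eq!(store.list_dir("pg_logical/snapshots").len(), 2);
}
```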

This isn't a new problem. We have a similar issue with listing relations, where PostgreSQL also relies on readdir(), and we use a global hashmap instead. This means anyone creating tables in a loop could cause more issues (it's easier to create a lot of tables than, say, logical decoding snapshots). So, it looks like we need some scalable key range scan operation. Internally, this could mean using some catalog (not a hashmap, because there are better data structures for storing data in external memory), or maybe we could rely more on the filesystem (though we might still need a catalog for files on S3).

So, how do we get key range operations in our storage? We've talked about the maximum number of keys in a few discussions. PostgreSQL has its limits, like the number of tables. On most filesystems, the number of inodes would be a tighter limit. Anyway, we should think about what it would take to handle 10 million keys or more. At first glance, there are plenty of B-trees/LSM-trees out there that could handle this easily. If we can't or don't want to go that route, we need to think about tighter limits (though I doubt we can go much below the 100k - 1M range without affecting some use cases).

kelvich avatar Mar 11 '24 11:03 kelvich

When we were looking into adding another type of AUX file (replication origin) for incoming logical replication,

We are not looking to add another type of AUX file (replication origin) - it is already there and already merged. The problem we are going to solve is something different: Postgres advances the replication origin on each transaction commit. It doesn't need to persist this data on each TX commit because in case of a crash, Postgres will replay WAL and redo all these commits, including replaying the origins.

We can certainly also handle the real origin in the commit WAL record in PS. As I wrote above, there are two questions:

  1. Where to store real origin?
  2. How to pass it to compute?

The most obvious place (where it is already stored) is the pg_logical/replorigin_checkpoint AUX file. But if this file has to be rewritten on each commit, it will have a very negative impact on performance.

Another obvious place, which we periodically flush to disk, is the timeline metadata. It contains disk_consistent_lsn, and in case of a PS crash we will replay all WAL starting from this point. So we can add repl_origin_lsn to the metadata and update it in memory on each commit. Then basebackup can include it in pg_logical/replorigin_checkpoint. Yes, to do it, PS has to know the format of this file. Also, adding a field to the metadata file requires a format change. We have a field for the metadata format version, but until now there are just two versions of metadata, and a lot of Neon code is written under the assumption that there are just the old and new formats of metadata. Definitely at some point we have to rewrite it...

Yet another alternative is not to extend the metadata file, but to maintain the origin LSN somewhere else in memory, and when metadata is flushed to disk, also flush the pg_logical/replorigin_checkpoint AUX file, first updating the origin LSN in it (once again, PS needs to know the format of this file).
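A hedged sketch of the in-memory side common to both alternatives: keep the origin positions in memory as WAL is ingested, and materialize the pg_logical/replorigin_checkpoint contents when metadata is flushed or a basebackup is taken (the types, names and serialization below are made up for illustration; the real on-disk format is Postgres's):

```rust
/// Hypothetical in-memory state kept by the pageserver alongside the
/// timeline metadata.
#[derive(Default)]
struct ReplOriginState {
    /// (origin_id, remote_lsn) pairs advanced by WAL ingest on each commit
    /// record that carries a replication origin; never persisted on its own.
    origins: Vec<(u16, u64)>,
}

impl ReplOriginState {
    /// Called from WAL ingest when a commit record carries an origin.
    fn advance(&mut self, origin_id: u16, remote_lsn: u64) {
        match self.origins.iter_mut().find(|(id, _)| *id == origin_id) {
            Some(entry) => entry.1 = entry.1.max(remote_lsn),
            None => self.origins.push((origin_id, remote_lsn)),
        }
    }

    /// Called at metadata-flush or basebackup time: serialize the state so it
    /// can be shipped to compute as pg_logical/replorigin_checkpoint.
    fn to_checkpoint_bytes(&self) -> Vec<u8> {
        let mut out = Vec::new();
        for (id, lsn) in &self.origins {
            out.extend_from_slice(&id.to_le_bytes());
            out.extend_from_slice(&lsn.to_le_bytes());
        }
        out
    }
}

fn main() {
    let mut state = ReplOriginState::default();
    state.advance(1, 0x16B_9188);
    state.advance(1, 0x16B_9300); // a later commit moves the origin forward
    assert_eq!(state.to_checkpoint_bytes().len(), 10);
}
```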

It looks like we can handle it for outgoing logical replication, but it feels like a losing battle.

I cannot completely agree with you. First of all, there is still just one "massive" kind of file: TX snapshots. Yes, I made a big mistake in not correctly estimating the number of these files. But now, after a number of PRs were pushed, the problem seems to be solved:

  • We use deltas to avoid the quadratic size explosion caused by serialising the whole directory (see the sketch after this list)
  • We limit the number of snapshots at compute
  • We limit the number of subsequent deltas at PS to reduce reconstruction time.
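A toy model of the delta scheme (made-up types, not the actual AUX_FILES_KEY encoding): instead of rewriting the whole directory image on every change, store small deltas and fold them into a fresh full image once the chain grows past a limit, so reconstruction time stays bounded.

```rust
use std::collections::HashMap;

enum AuxRecord {
    FullImage(HashMap<String, Vec<u8>>),
    Delta { path: String, content: Option<Vec<u8>> }, // None = file deleted
}

const MAX_DELTA_CHAIN: usize = 16; // hypothetical limit

/// Replay a full image plus the delta chain into the current directory state.
fn reconstruct(records: &[AuxRecord]) -> HashMap<String, Vec<u8>> {
    let mut dir = HashMap::new();
    for rec in records {
        match rec {
            AuxRecord::FullImage(img) => dir = img.clone(),
            AuxRecord::Delta { path, content } => match content {
                Some(bytes) => {
                    dir.insert(path.clone(), bytes.clone());
                }
                None => {
                    dir.remove(path);
                }
            },
        }
    }
    dir
}

/// Append one change; collapse the chain into a new full image if it got long.
fn append(records: &mut Vec<AuxRecord>, path: String, content: Option<Vec<u8>>) {
    records.push(AuxRecord::Delta { path, content });
    if records.len() > MAX_DELTA_CHAIN {
        let image = reconstruct(records);
        records.clear();
        records.push(AuxRecord::FullImage(image));
    }
}

fn main() {
    let mut records = vec![AuxRecord::FullImage(HashMap::new())];
    for i in 0..100 {
        append(&mut records, format!("pg_logical/snapshots/{i}.snap"), Some(vec![0u8]));
    }
    assert!(records.len() <= MAX_DELTA_CHAIN + 1);
    assert_eq!(reconstruct(&records).len(), 100);
}
```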

We are going to persist some other files: statistics and pg_prewarm. But here the problem is more with the size of the files rather than with their number (they are singletons).

So, how do we get key range operations in our storage?

Well, we have one range operation in our KV storage: drop. But it is not implemented :) Actually, a range scan can be implemented if we include sequence numbers in the AUX file keys; then we can just iterate through all files until we get an error. Yes, it's awkward that we do not have a check_if_key_exists operation and have to catch errors instead. But the main problem with this approach is that extracting all these files can take a significant amount of time. And we have already faced this problem: once the delta approach to storing AUX files was implemented, it took 40 minutes to start up a node with 70k snapshot files. And it is certainly not because we have to add 70k entries to a hash map (I assume that can be done in a few ms). So if we have a range API and store each AUX file as a separate <key,value> pair, then startup will take comparable time if there are several thousands of such files.
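A toy illustration of that probing approach, assuming a hypothetical single-key lookup that returns None for a missing key (its cost is exactly the concern raised above):

```rust
// Store aux files under sequence-numbered keys ("aux/0", "aux/1", ...) and
// enumerate them by probing consecutive sequence numbers until a lookup
// fails. `get` is a hypothetical stand-in for a single-key read.
fn list_aux_files(get: impl Fn(u64) -> Option<Vec<u8>>) -> Vec<Vec<u8>> {
    let mut files = Vec::new();
    for seq in 0.. {
        match get(seq) {
            Some(content) => files.push(content),
            None => break, // the first missing key ends the enumeration
        }
    }
    files
}

fn main() {
    let fake = vec![b"snapshot-a".to_vec(), b"snapshot-b".to_vec()];
    let listed = list_aux_files(|seq| fake.get(seq as usize).cloned());
    assert_eq!(listed.len(), 2);
}
```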

Definitely we can provide a more efficient implementation of range scan than just enumerating and looking up keys in a loop. But I do not think it can radically change the total time. And please notice that we need to include ALL AUX files in the base backup: there is no on-demand download mechanism for AUX files now (unlike SLRUs).

So, the conclusion: from my point of view we have not lost the battle with AUX files, and there is no need to try to invent some other alternative. The problem with a large number of snapshots is solved. The problem with replication origins can be solved.

A key range operation is intended to solve the problem with a large number of files, which doesn't exist anymore. And IMHO it would not be able to fix it in any case. Also, it doesn't help much with the problem of large files (for example, statistics state can be a few MB for 100k relations).

knizhnik avatar Mar 11 '24 15:03 knizhnik

When we persist Aux files we need to take the following into consideration (see https://neondb.slack.com/archives/C03QLRH7PPD/p1711365177901189?thread_ts=1711363664.903439&cid=C03QLRH7PPD)

  • if we support persisting AUX files in the future and we restart an endpoint's compute on a different architecture than the one the AUX files were created on, we need to throw away the AUX files to play it safe, at least those AUX files for which we haven't validated that they're compatible across architectures.

  • This also probably needs to happen for PG upgrades, as those also may change the format of AUX files.

Bodobolero avatar Mar 25 '24 11:03 Bodobolero