
Figuring out our relationship with the filesystem


Currently we handle all our on-disk storage through the KVFileStore and KVDirStore abstractions. They're both basically key->value mappings, where the value is either a file (KVFileStore) or a directory (KVDirStore). They have per-key locking, and try to implement atomic updates when possible.
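Roughly, the shape is something like this (a hypothetical sketch for orientation; names and signatures are illustrative, not the actual code):

```rust
use std::path::{Path, PathBuf};

// Hypothetical sketch of the two stores' shape; illustrative only.
pub struct KVFileStore { root: PathBuf }
pub struct KVDirStore { root: PathBuf }

impl KVFileStore {
    /// Take the per-key lock, then either return the existing value or
    /// atomically install the one produced by `write` (temp file + rename).
    pub fn get_or_set(
        &self,
        key: &str,
        write: impl FnOnce(&mut std::fs::File) -> std::io::Result<()>,
    ) -> std::io::Result<PathBuf> {
        unimplemented!("sketch only")
    }
}

impl KVDirStore {
    /// Same idea, but the value is a directory that `populate` fills in.
    pub fn get_or_set(
        &self,
        key: &str,
        populate: impl FnOnce(&Path) -> std::io::Result<()>,
    ) -> std::io::Result<PathBuf> {
        unimplemented!("sketch only")
    }
}
```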

But these aren't necessarily the best abstractions for what we need, because I wrote them before knowing how we were going to use them. And also, while I thought they should work on Windows, it turns out there were a few details I was missing (see #4) that make them pretty fragile, esp. in the presence of operations like filesystem indexing or AV scanning that can randomly open files. And they don't currently have any support for garbage-collecting old data.

So here's a brain dump about what actual KV stores we've ended up with and what properties each one needs.

  • hash_cache: map artifact hash -> artifact (i.e. wheel/sdist/pybi)

    • Holds: blobs
    • Access pattern: write once, must not have partial writes
    • Cleanup: can discard freely, but can't break ongoing reads, which are definitely incremental
    • Locking is useful to avoid redundant work if multiple posy invocations are running simultaneously
  • metadata_cache: maps artifact hash -> core METADATA for that artifact (useful to skip the dance required to pull it out of a remote zip file, and saves locally built metadata from sdists)

    • same properties as hash_cache, except that we always slurp in the whole file in one shot
  • wheel_cache: maps sdist hash -> directory of wheels that we've built from it

    • Holds: directory of named blobs (wheels)
    • Access pattern: each wheel inside is write once, must not have partial writes
    • Cleanup: can discard freely, but can't break ongoing reads, which involve incremental readdir and file reads
    • Locking is especially useful to avoid redundant work if multiple posy invocations are running simultaneously
  • http_cache: maps request info -> request metadata + previous response

    • Holds: blobs
    • Access pattern: read/modify/write. Currently we do streaming reads. Bodies are mostly simple API pages, so tens to hundreds of kilobytes. In the future might include other larger items, like if we decide to support somepkg @ https://.../somepkg.whl. Must not have partial writes.
    • Cleanup: can discard freely, but can't break ongoing reads -- which are currently incremental, but might be able to do one-shot slurp into buffer
  • EnvForest: maps wheel/pybi hash to unpacked tree, or sdist hash to a directory containing unpacked trees

    • Holds: whole complex directory hierarchies
    • Access pattern: write once
    • Cleanup: can only discard items that aren't currently in use by any running environment (eek -- how do we know if an environment is running?)
  • build_store: maps sdist hash -> build scratch space

    • Holds: whole directory hierarchies
    • Access pattern: arbitrary code runs and mutates whatever it wants
    • Cleanup: allocated per-process, so can just discard when done

https://stackoverflow.com/a/57358387/ has some very smart-sounding comments about what actually works on Windows, and one claim is that deleting a large file on NTFS can itself be non-atomic, and a badly timed crash could leave the file truncated instead. There's also the newer and undocumented (but public) FILE_RENAME_FLAG_POSIX_SEMANTICS which... might do something useful? Might need to experiment to figure out what this thing actually does. [Edit: Turns out you need to read the kernel-level documentation. The answer is that it lets you overwrite a destination file that still has open handles. It doesn't, AFAIK, do anything to help with the case where the source file has open handles.]

I'm worried about data integrity; specifically, we currently trust that hash_cache/metadata_cache/wheel_cache/http_cache/EnvForest are normative, so if a truncated entry ended up there then everything could become wedged permanently until someone manually clears out the caches. That would suck. (On the other hand, lack of durability is fine -- if a value gets lost, or gets truncated but we can detect that it's truncated and discard it, then that's OK; these things can all be reconstructed if needed.)

Blob storage

For the ones that store blobs, we might just want to use a full-fledged transactional store, like sqlite or bdb. Trade-offs:

  • Makes transactional integrity into Someone Else's Problem
  • At least sqlite (in WAL mode) can dramatically reduce the number of fsyncs if you don't need durability (which we don't): one fsync per WAL checkpoint, and checkpoints don't even have to happen on every run
  • Probably suboptimal performance for large files (hash_cache, wheel_cache). For metadata_cache and http_cache sqlite should handle them fine, possibly as BLOBs.
  • Makes insert/modify/delete all safe, but doesn't have locking to prevent redundant work (though this can be done separately)
  • Requires fiddling with SQL or whatever
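For concreteness, the sqlite option could look something like this via the rusqlite crate (schema and pragmas are illustrative, not a decision):

```rust
use rusqlite::Connection;

// Illustrative single-table cache in WAL mode, with syncing relaxed since we
// want integrity but not durability.
fn open_cache(path: &std::path::Path) -> rusqlite::Result<Connection> {
    let conn = Connection::open(path)?;
    conn.pragma_update(None, "journal_mode", "WAL")?;
    // NORMAL = no fsync on every commit; only WAL checkpoints sync.
    conn.pragma_update(None, "synchronous", "NORMAL")?;
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cache (key BLOB PRIMARY KEY, value BLOB NOT NULL)",
        [],
    )?;
    Ok(conn)
}

fn put(conn: &Connection, key: &[u8], value: &[u8]) -> rusqlite::Result<()> {
    conn.execute(
        "INSERT OR REPLACE INTO cache (key, value) VALUES (?1, ?2)",
        rusqlite::params![key, value],
    )?;
    Ok(())
}
```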

The main alternative is to do something like we're doing, with one on-disk file per value, which has a few challenges.

For integrity, files require either fsync and then atomic rename, or else some sort of checksum verification so we can detect and discard corrupted values. Neither is super attractive... fsync can make writes pretty expensive, and on Windows atomic rename can be thwarted by open handles. (I guess you can sleep and retry?) Though Windows does have CreateHardLink so you could at least link the file into place and then worry about deleting the original tmp file later opportunistically. (And this can even overwrite an existing file if you use NtSetInformationFile + FILE_LINK_INFORMATION + FILE_LINK_REPLACE_IF_EXISTS + FILE_LINK_POSIX_SEMANTICS.)

Checksums make writes easy and fast, but then when you open the file again you have to read through the whole thing to validate the checksum, before you know whether you have a file at all. Fast checksums can be very fast (on my laptop even sha256 goes at ~1.6 GB/s according to openssl speed, and I assume crc32c would be even faster), but that's still an extra human-perceptible lag for multi-hundred-megabyte GPU wheels, and extra I/O. (Which might get hidden by caching, if the whole file fits in cache and the OS doesn't activate dropbehind logic for sequential scans and if we're going to read the whole file anyway, like we usually will for artifacts.)
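As a strawman, here's roughly what the fsync-then-rename recipe looks like, with a naive sleep-and-retry loop for the Windows open-handle problem (helper names are hypothetical):

```rust
use std::fs::{self, File};
use std::io::Write;
use std::path::Path;
use std::time::Duration;

// Sketch of write-temp + fsync + atomic rename. A real implementation would
// want a unique temp name per writer, not a fixed ".tmp" suffix.
fn atomic_write(dest: &Path, data: &[u8]) -> std::io::Result<()> {
    let tmp = dest.with_extension("tmp");
    let mut file = File::create(&tmp)?;
    file.write_all(data)?;
    file.sync_all()?; // fsync, so the rename can't expose a truncated file
    drop(file);
    let mut last_err = None;
    for _ in 0..5 {
        match fs::rename(&tmp, dest) {
            // Atomic on POSIX; on Windows this can fail if someone (AV
            // scanner, indexer) has a handle open, hence the retry loop.
            Ok(()) => return Ok(()),
            Err(e) => {
                last_err = Some(e);
                std::thread::sleep(Duration::from_millis(10));
            }
        }
    }
    Err(last_err.unwrap())
}
```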

I guess another option to avoid rename on Windows would be: write the file directly to its final name, fsync, and then put a marker file next to it to record that the main file is complete and trustworthy.
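Sketch of that marker-file protocol (the `.ok` suffix convention is invented here): a value is only trusted once its marker exists, and the marker is only created after the value has been fully written and fsynced:

```rust
use std::fs::File;
use std::io::Write;
use std::path::{Path, PathBuf};

// Marker path for a value file, e.g. "foo.whl" -> "foo.whl.ok".
fn marker_for(path: &Path) -> PathBuf {
    let mut marker = path.as_os_str().to_owned();
    marker.push(".ok");
    PathBuf::from(marker)
}

fn write_with_marker(path: &Path, data: &[u8]) -> std::io::Result<()> {
    let mut file = File::create(path)?;
    file.write_all(data)?;
    file.sync_all()?; // data must be durable before the marker appears
    File::create(marker_for(path))?.sync_all()?;
    Ok(())
}

// Readers only trust `path` if the marker is present.
fn is_complete(path: &Path) -> bool {
    marker_for(path).exists()
}
```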

Or, we can combine them: write the file to disk, fsync, and then commit a record in sqlite saying that it's there and valid.

Finally, there's the question of garbage collection: for files, I think this is actually pretty easy? Unix and Windows both support deleting a file while letting current readers continue (on Windows you need a magic POSIX_SEMANTICS flag, but it's there).
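For reference, the POSIX-semantics delete looks roughly like this in raw Win32 (a sketch; constants and signatures hand-transcribed from the docs, so double-check before trusting it):

```rust
// Delete-with-POSIX-semantics on Windows (requires Win10+ and NTFS): the
// directory entry goes away immediately, but existing handles keep working.
#[cfg(windows)]
pub fn posix_delete(path: &std::path::Path) -> std::io::Result<()> {
    use std::os::windows::ffi::OsStrExt;

    type HANDLE = isize;
    const INVALID_HANDLE_VALUE: HANDLE = -1;
    const DELETE: u32 = 0x0001_0000;
    const FILE_SHARE_ALL: u32 = 0x1 | 0x2 | 0x4; // read | write | delete
    const OPEN_EXISTING: u32 = 3;
    const FILE_FLAG_BACKUP_SEMANTICS: u32 = 0x0200_0000;
    const FILE_DISPOSITION_INFO_EX_CLASS: u32 = 21; // FileDispositionInfoEx
    const FILE_DISPOSITION_FLAG_DELETE: u32 = 0x1;
    const FILE_DISPOSITION_FLAG_POSIX_SEMANTICS: u32 = 0x2;

    #[repr(C)]
    struct FileDispositionInfoEx {
        flags: u32,
    }

    extern "system" {
        fn CreateFileW(name: *const u16, access: u32, share: u32,
                       security: *mut core::ffi::c_void, disposition: u32,
                       flags: u32, template: HANDLE) -> HANDLE;
        fn SetFileInformationByHandle(handle: HANDLE, class: u32,
                                      info: *const core::ffi::c_void,
                                      size: u32) -> i32;
        fn CloseHandle(handle: HANDLE) -> i32;
    }

    let wide: Vec<u16> = path.as_os_str().encode_wide().chain(Some(0)).collect();
    unsafe {
        let handle = CreateFileW(wide.as_ptr(), DELETE, FILE_SHARE_ALL,
                                 std::ptr::null_mut(), OPEN_EXISTING,
                                 FILE_FLAG_BACKUP_SEMANTICS, 0);
        if handle == INVALID_HANDLE_VALUE {
            return Err(std::io::Error::last_os_error());
        }
        let info = FileDispositionInfoEx {
            flags: FILE_DISPOSITION_FLAG_DELETE | FILE_DISPOSITION_FLAG_POSIX_SEMANTICS,
        };
        let ok = SetFileInformationByHandle(
            handle, FILE_DISPOSITION_INFO_EX_CLASS,
            &info as *const _ as *const core::ffi::c_void,
            std::mem::size_of::<FileDispositionInfoEx>() as u32,
        );
        CloseHandle(handle);
        if ok == 0 {
            return Err(std::io::Error::last_os_error());
        }
    }
    Ok(())
}
```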

Tentatively, it seems like we might want to use sqlite for metadata_cache and http_cache, and files for hash_cache and wheel_cache.

Directories

build_store isn't a big deal, because usage is restricted to a single process. We can just create directories and write to them whenever we like. We might specifically want to avoid renaming the directory into place, to avoid Windows issues. If we add multi-threading in the future then we'll probably want some kind of locking. But that's about it.

EnvForest OTOH is... idk, maybe intractable, in two different ways:

  • When unpacking wheels/pybis into it, the only way to guarantee integrity is to fsync every file and directory. Ugh! (Well, I guess on Unix we could also call sync(2), but that's gross too.) ...I guess an alternate Overly Clever solution would be to do direct I/O and regain the performance by aggressively dispatching the I/O through io_uring/IOCP/threads. Maybe if you use threads to aggressively call fsync on lots of files in parallel it ends up being not so bad (see the sketch after this list)? Are modern FS's all clever enough to batch the journal transactions? idk. Maybe we should just give up on integrity here.

    • On the upside (?) though, the write-then-rename code we're currently using is actually unnecessary, because we hold a lock while writing the directory! Which potentially makes Windows support way easier. As long as we're prepared to handle the case where our process dies half-way through writing things out.
  • For GC'ing stuff: we have no reliable way to know whether an entry is in use. I guess when python starts up we could have it take a read-lock on all the entries it's using? But an entry can still be in use even if no python process is running, because it could be lurking in an environment variable in a non-python process (e.g. a shell), and then later it spawns a python that will expect all the entries to be there. I guess in this case the python process could at least detect at startup if anything is missing, and fail noisily? Or if we want to be ridiculously clever, it could invoke posy to fill the cache again before continuing...
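The threads-calling-fsync idea from above, as a quick sketch (whether the filesystem actually batches the resulting journal commits is the open question):

```rust
use std::fs::File;
use std::path::PathBuf;

// Fsync a batch of freshly unpacked files from a small pool of scoped
// threads, so the individual fsyncs can overlap instead of serializing.
fn parallel_fsync(paths: &[PathBuf]) -> std::io::Result<()> {
    if paths.is_empty() {
        return Ok(());
    }
    let n_threads = paths.len().min(8);
    let chunk_size = (paths.len() + n_threads - 1) / n_threads;
    std::thread::scope(|scope| {
        let handles: Vec<_> = paths
            .chunks(chunk_size)
            .map(|batch| {
                scope.spawn(move || -> std::io::Result<()> {
                    for path in batch {
                        // Re-open read-only just to fsync; assumes the
                        // unpack step already wrote and closed the files.
                        File::open(path)?.sync_all()?;
                    }
                    Ok(())
                })
            })
            .collect();
        for handle in handles {
            handle.join().expect("fsync thread panicked")?;
        }
        Ok(())
    })
}
```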

njsmith, Jan 29 '23