filesystem_spec
Wrap cache-file-management operations for CachingFileSystem in thread locks
Tries to address #1126.
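The general pattern is to serialize the cache-metadata operations of the filesystem behind one re-entrant lock so that concurrent FUSE threads cannot corrupt the cache index. A hedged sketch of the idea (method names mirror fsspec's `save_cache`/`load_cache`, but this is an illustration, not the actual diff):

```python
import threading

# Hypothetical sketch: guard all cache-index operations with a single
# re-entrant lock. RLock is used so a locked method may call another
# locked method without deadlocking.

class LockedCacheMixin:
    def __init__(self):
        self._cache_lock = threading.RLock()
        self._index = {}

    def save_cache(self):
        with self._cache_lock:
            # ... atomically write self._index to the on-disk cache file ...
            return dict(self._index)  # snapshot taken under the lock

    def load_cache(self):
        with self._cache_lock:
            # ... re-read the on-disk file and merge into self._index ...
            return dict(self._index)
```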
Is this PR enough to revert https://github.com/datalad/datalad-fuse/pull/83 and get the dandi healthcheck to work without erroring out? If so, do we see any more sensible performance from the healthcheck?
If this is the PR currently being tested on drogon, it seems it did improve the situation over locking at the datalad-fuse level. I see more of the MATLAB processes getting at least somewhat busy, the total load is about 20 instead of 60, and the datalad fusefs
process is quite busy. py-spy top'ing it, though, shows
%Own %Total OwnTime TotalTime Function (filename)
61.00% 75.00% 2.79s 3.24s save_cache (fsspec/implementations/cached.py)
9.00% 9.00% 0.290s 0.290s read (ssl.py)
14.00% 14.00% 0.250s 0.250s exists (genericpath.py)
4.00% 6.00% 0.210s 0.250s atomic_write (fsspec/implementations/cached.py)
7.00% 7.00% 0.200s 0.210s _execute_child (subprocess.py)
1.00% 13.00% 0.170s 0.680s _read_ready__data_received (asyncio/selector_events.py)
1.00% 1.00% 0.090s 0.090s write (asyncio/selector_events.py)
and taking 100 samples of any /cache
file being accessed:
(dandisets) dandi@drogon:/mnt/backup/dandi/dandisets$ for s in `seq 1 100`; do ls -l /proc/3986540/fd/ 2>&1 | grep 'fsspec/cache'; done
lr-x------ 1 dandi dandi 64 Dec 5 11:44 29 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lrwx------ 1 dandi dandi 64 Dec 5 11:44 73 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache-htwu0fk9
lrwx------ 1 dandi dandi 64 Dec 5 11:44 73 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache-htwu0fk9
lrwx------ 1 dandi dandi 64 Dec 5 11:44 73 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache-htwu0fk9
lr-x------ 1 dandi dandi 64 Dec 5 11:44 59 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lr-x------ 1 dandi dandi 64 Dec 5 11:44 15 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lr-x------ 1 dandi dandi 64 Dec 5 11:44 14 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lrwx------ 1 dandi dandi 64 Dec 5 11:44 14 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache-h3r32ds3
lr-x------ 1 dandi dandi 64 Dec 5 11:44 6 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lr-x------ 1 dandi dandi 64 Dec 5 11:44 14 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lr-x------ 1 dandi dandi 64 Dec 5 11:44 14 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lr-x------ 1 dandi dandi 64 Dec 5 11:44 14 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lr-x------ 1 dandi dandi 64 Dec 5 11:44 14 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lr-x------ 1 dandi dandi 64 Dec 5 11:44 14 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lr-x------ 1 dandi dandi 64 Dec 5 11:44 14 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lr-x------ 1 dandi dandi 64 Dec 5 11:44 14 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lr-x------ 1 dandi dandi 64 Dec 5 11:44 14 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lr-x------ 1 dandi dandi 64 Dec 5 11:44 14 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lr-x------ 1 dandi dandi 64 Dec 5 11:44 14 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lr-x------ 1 dandi dandi 64 Dec 5 11:44 14 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lr-x------ 1 dandi dandi 64 Dec 5 11:44 14 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lr-x------ 1 dandi dandi 64 Dec 5 11:44 14 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lr-x------ 1 dandi dandi 64 Dec 5 11:44 14 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lr-x------ 1 dandi dandi 64 Dec 5 11:44 14 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lr-x------ 1 dandi dandi 64 Dec 5 11:44 14 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lr-x------ 1 dandi dandi 64 Dec 5 11:44 14 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lr-x------ 1 dandi dandi 64 Dec 5 11:44 14 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
(dandisets) dandi@drogon:/mnt/backup/dandi/dandisets$ ls -ld /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
-rw------- 1 dandi dandi 92584 Dec 5 11:44 /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
shows that we are spending lots of time writing/re-reading the cache file. Could that perhaps be avoided?
The continuous writing of the cache file is designed for situations where you might have simultaneous access to the cache from multiple processes. If you only ever have the one FUSE process using the cache, I don't suppose there's any good reason to write the cache until program end (if you can guarantee not to crash).
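The single-process, write-once-at-exit idea could be sketched like this (`DeferredCache` is a hypothetical name, not an fsspec class; as noted above, a crash before exit would lose unflushed state):

```python
import atexit
import json

# Sketch: keep the index in memory and write it once at interpreter
# exit instead of on every change.

class DeferredCache:
    def __init__(self, path):
        self.path = path
        self.index = {}
        self.dirty = False
        atexit.register(self.flush)  # one write at program end

    def update(self, key, value):
        self.index[key] = value
        self.dirty = True            # no disk I/O here, unlike save-on-every-change

    def flush(self):
        if self.dirty:
            with open(self.path, "w") as f:
                json.dump(self.index, f)
            self.dirty = False
```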
Thanks @martindurant, so it feels like ideally it could be an option like cache_dump: ["always", "exit", "periodic"], where "always" is the current implementation, "exit" would dump it only atexit (/ destructor call?), and "periodic" would have some thread which looks after the cache and, if it was modified since the last save, saves it (but that would then need some extra setting for how often :-/ ). Maybe we could start with getting "always" and "exit" implemented, @jwodder?
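The three proposed strategies could dispatch roughly like this (`CacheSaver` and its argument names are hypothetical, not an existing fsspec API):

```python
import atexit
import threading
import time

# Sketch of the proposed cache_dump option with "always", "exit",
# and "periodic" strategies for when the cache index is dumped.

class CacheSaver:
    def __init__(self, save_fn, strategy="always", period=30.0):
        if strategy not in ("always", "exit", "periodic"):
            raise ValueError(strategy)
        self.save_fn = save_fn
        self.strategy = strategy
        self.modified = False
        if strategy == "exit":
            atexit.register(self._save_if_modified)  # dump once at exit
        elif strategy == "periodic":
            threading.Thread(
                target=self._loop, args=(period,), daemon=True
            ).start()                                # background saver thread

    def mark_modified(self):
        self.modified = True
        if self.strategy == "always":
            self._save_if_modified()  # current behaviour: dump on every change

    def _save_if_modified(self):
        if self.modified:
            self.save_fn()
            self.modified = False

    def _loop(self, period):
        while True:
            time.sleep(period)
            self._save_if_modified()  # save only if changed since last dump
```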
Exactly. Again, it would be nice to decompose the caching logic into
- what is stored
- where it is stored
- how/often it is updated
- how target files map to local files
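Such a decomposition might look like the following sketch of pluggable interfaces (all names are hypothetical, not an fsspec design; the "what is stored" axis would be the shape of the index dict itself):

```python
from typing import Protocol

# Sketch of decomposing the caching logic along the axes listed above.

class CacheStorage(Protocol):
    """Where the index is stored (local file, DiskCache, remote store, ...)."""
    def load(self) -> dict: ...
    def save(self, index: dict) -> None: ...

class UpdatePolicy(Protocol):
    """How/how often the index is persisted (always / at exit / periodically)."""
    def on_change(self, storage: "CacheStorage", index: dict) -> None: ...

class PathMapper(Protocol):
    """How target files map to local cache files (e.g. a hash of the URL)."""
    def local_path(self, remote_path: str) -> str: ...

class AlwaysSave:
    """Simplest policy: persist on every change, like the current behaviour."""
    def on_change(self, storage, index):
        storage.save(index)
```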
@martindurant that sounds like a good amount of work to overhaul the entire caching layer. Would you consider a PR which would first just augment the life cycle of the cache with "always" and "exit" strategies for when it is dumped?
I can understand hesitating to do the whole job, and the augmentation may be alright. What I worry about is extra keyword arguments that we will find hard to go back on later.
@martindurant before any other major refactoring -- what do you think about this particular PR? IMHO it is "the best we could do" ATM to ensure thread-safety; it seems to work for us in that at least we do not crash (right @jwodder?), and it should not have a noticeable penalty for users (whose run-time performance would be bottlenecked primarily by all those cache reads/writes anyway).
I think it's OK... but still share the worry that the locking is too aggressive. Are the list operations really problematic, or just the file ops (which is what the original issue was about)?
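A finer-grained alternative would lock only the mutating cache-file operations and leave read-only listings lock-free; a sketch (`CacheIndex` is a hypothetical illustration, not the PR's code):

```python
import threading

# Sketch: serialize only the cache-file writes, not read-only listings.

class CacheIndex:
    def __init__(self):
        self._file_lock = threading.Lock()  # guards on-disk index writes only
        self._entries = {}

    def list_entries(self):
        # read-only; under CPython's GIL a plain dict read is safe enough
        # for this sketch, so no lock is taken here
        return sorted(self._entries)

    def write_entry(self, key, value):
        with self._file_lock:               # serialize mutating file ops only
            self._entries[key] = value
            # ... atomic_write of the index file would happen here ...
```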
Hi. I did not look too deeply into the details of the struggles here, but I wanted to take the chance to humbly remind you about GH-895 by @gutzbenj. The patch probably needs a refresh, but the core idea was to start using DiskCache, as it promises to be a rock-solid (thread-safe and process-safe) solution for disk-based caching, without needing to run any kind of cache or database server. It is also tested on Linux, macOS, and Windows; we think this package is one of a kind.
You may be right, but DiskCache should only be optional, as fsspec is very careful not to require any dependencies.
Also, you should be aware that we wish to be able to cache files to remote stores too! It may seem counter intuitive, but some remotes are faster than others, and egress fees often mean that copying into near-remote once is worthwhile if there are to be many reads.
@martindurant: Thank you for sharing those important details we have not been aware of. Happy new year everyone!
So where shall we go from here? A full rewrite of the cache classes seems to be beyond anyone's capacity right now :(
Note that diskcache might be unfit for cases where the cache location is on NFS. https://github.com/grantjenks/python-diskcache/issues/198 was "resolved", but it is not clear to me whether there is an NFS-safe way to use diskcache.
I hope that @jwodder could have a look and maybe see how to refactor/structure the interface to encapsulate caching so that alternative implementations could be provided.