filesystem_spec Wrap cache-file-management operations for CachingFileSystem in thread locks

Tries to address #1126.

Nov 29 '22 21:11 jwodder

is this PR enough to revert https://github.com/datalad/datalad-fuse/pull/83 and get dandi healthcheck work without errorring out? if so -- do we see any more sensible performance from healthcheck?

Dec 02 '22 19:12 yarikoptic

is this PR enough to revert datalad/datalad-fuse#83 and get dandi healthcheck work without errorring out? if so -- do we see any more sensible performance from healthcheck?

if this is a PR which is currently tested on drogon, it seems it did improve situation over when locked at datalad-fuse level. I see more of MATLAB processes getting at least somewhat busy, total load is about 20 not 60, datalad fusefs process is quite busy. py-spy top'ing it though shows

  %Own   %Total  OwnTime  TotalTime  Function (filename)                                                                                                                                                                                                           
 61.00%  75.00%    2.79s     3.24s   save_cache (fsspec/implementations/cached.py)
  9.00%   9.00%   0.290s    0.290s   read (ssl.py)
 14.00%  14.00%   0.250s    0.250s   exists (genericpath.py)
  4.00%   6.00%   0.210s    0.250s   atomic_write (fsspec/implementations/cached.py)
  7.00%   7.00%   0.200s    0.210s   _execute_child (subprocess.py)
  1.00%  13.00%   0.170s    0.680s   _read_ready__data_received (asyncio/selector_events.py)
  1.00%   1.00%   0.090s    0.090s   write (asyncio/selector_events.py)

and getting 100 samples of any /cache file being accessed:

(dandisets) dandi@drogon:/mnt/backup/dandi/dandisets$ for s in `seq 1 100`; do ls -l /proc/3986540/fd/ 2>&1 | grep 'fsspec/cache'; done
lr-x------ 1 dandi dandi 64 Dec  5 11:44 29 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lrwx------ 1 dandi dandi 64 Dec  5 11:44 73 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache-htwu0fk9
lrwx------ 1 dandi dandi 64 Dec  5 11:44 73 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache-htwu0fk9
lrwx------ 1 dandi dandi 64 Dec  5 11:44 73 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache-htwu0fk9
lr-x------ 1 dandi dandi 64 Dec  5 11:44 59 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lr-x------ 1 dandi dandi 64 Dec  5 11:44 15 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lr-x------ 1 dandi dandi 64 Dec  5 11:44 14 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lrwx------ 1 dandi dandi 64 Dec  5 11:44 14 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache-h3r32ds3
lr-x------ 1 dandi dandi 64 Dec  5 11:44 6 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lr-x------ 1 dandi dandi 64 Dec  5 11:44 14 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lr-x------ 1 dandi dandi 64 Dec  5 11:44 14 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lr-x------ 1 dandi dandi 64 Dec  5 11:44 14 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lr-x------ 1 dandi dandi 64 Dec  5 11:44 14 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lr-x------ 1 dandi dandi 64 Dec  5 11:44 14 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lr-x------ 1 dandi dandi 64 Dec  5 11:44 14 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lr-x------ 1 dandi dandi 64 Dec  5 11:44 14 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lr-x------ 1 dandi dandi 64 Dec  5 11:44 14 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lr-x------ 1 dandi dandi 64 Dec  5 11:44 14 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lr-x------ 1 dandi dandi 64 Dec  5 11:44 14 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lr-x------ 1 dandi dandi 64 Dec  5 11:44 14 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lr-x------ 1 dandi dandi 64 Dec  5 11:44 14 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lr-x------ 1 dandi dandi 64 Dec  5 11:44 14 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lr-x------ 1 dandi dandi 64 Dec  5 11:44 14 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lr-x------ 1 dandi dandi 64 Dec  5 11:44 14 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lr-x------ 1 dandi dandi 64 Dec  5 11:44 14 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lr-x------ 1 dandi dandi 64 Dec  5 11:44 14 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lr-x------ 1 dandi dandi 64 Dec  5 11:44 14 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
(dandisets) dandi@drogon:/mnt/backup/dandi/dandisets$ ls -ld /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
-rw------- 1 dandi dandi 92584 Dec  5 11:44 /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache

shows that we are spending lots of time on writing/re-reading cache file. Could that be avoided may be?

Dec 05 '22 16:12 yarikoptic

The continuous writing of the cache file is designed for the situations that you might have simultaneous access to the cache from multiple processes. If you only ever have the one one FUSE process using the cache, I don't suppose there's any good reason to write the cache until program end (if you can guarantee not to crash).

Dec 05 '22 16:12 martindurant

Thanks @martindurant , so it feels like ideally it could be an option like cache_dump : ["always", "exit", "periodic"] where "always" is current implementation, "exit" where it would dump it only atexit (/ deconstructor call?), and "periodic" - have some thread which would look after the cache and if modified since last save -- save it (but then would need some extra setting for how often :-/ ). May be we could start with getting "always", "exit" implemented @jwodder ?

Dec 05 '22 17:12 yarikoptic

Exactly. Again, it would be nice to decompose the caching logic into

what is stored
where it is stored
how/often it is updated
how target files map to local files

Dec 05 '22 17:12 martindurant

@martindurant that sounds like a good amount of work to overhaul entire caching layer. Would you consider PR which would first just augment life cycle of the cache with "always" and "exit" strategies to when it is dumped?

Dec 13 '22 14:12 yarikoptic

I can understand hesitating to do the whole job, and the augmentation may be alright. What I worry about, it extra keyword arguments that we will find had to go back on later.

Dec 13 '22 14:12 martindurant

@martindurant before any other major RF -- what do you think about this particular PR? IMHO it is "the best we could do" ATM to ensure thread-safety, seems to work for us in that at least we do not crash (right @jwodder ?) and should not really have noticeable penalty on users (run time performance of which would be bottlenecked primarily by all those cache read/writes anyways)

Dec 19 '22 17:12 yarikoptic

I think it's OK... but still share the worry that the locking is too aggressive. Are the list operations really problematic, or just the file ops (which is what the original issue was about)?

Dec 22 '22 15:12 martindurant

Hi. I did not look too deep into the details of the struggles here, but I wanted to take the chance to humbly remind you about GH-895 by @gutzbenj. The patch probably needs a refresh, but the core idea was to start using DiskCache, as it promises to be a rock-solid (thread-safe and process-safe) solution to disk-based caching, without needing to run any kind of cache- or database-server. On this matter, also being tested on Linux, macOS and Windows, we think this package is one of its own kind.

Dec 25 '22 23:12 amotl

You may be right, but DiskCache should only be optional, as fsspec is very careful not to require any dependencies.

Also, you should be aware that we wish to be able to cache files to remote stores too! It may seem counter intuitive, but some remotes are faster than others, and egress fees often mean that copying into near-remote once is worthwhile if there are to be many reads.

Dec 29 '22 18:12 martindurant

@martindurant: Thank you for sharing those important details we have not been aware of. Happy new year everyone!

Dec 31 '22 20:12 amotl

So where shall we go from here? A full rewrite of the cache classes seems to beyond anyone's capacity right now :(

Jan 02 '23 14:01 martindurant

note that diskcache might be unfit for cases where location is on NFS. https://github.com/grantjenks/python-diskcache/issues/198 was "resolved" but not clear to me if there is NFS-safe way to use diskcache.

I hope that @jwodder could have a look and may be see how to refactor/structure the interface to encapsulate caching so alternative implementations could be provided.

Jan 04 '23 04:01 yarikoptic

filesystem_spec filesystem_spec copied to clipboard

Wrap cache-file-management operations for CachingFileSystem in thread locks

filesystem_spec
filesystem_spec copied to clipboard