filesystem_spec icon indicating copy to clipboard operation
filesystem_spec copied to clipboard

Wrap cache-file-management operations for CachingFileSystem in thread locks

Open jwodder opened this issue 2 years ago • 7 comments

Tries to address #1126.

jwodder avatar Nov 29 '22 21:11 jwodder

is this PR enough to revert https://github.com/datalad/datalad-fuse/pull/83 and get dandi healthcheck work without errorring out? if so -- do we see any more sensible performance from healthcheck?

yarikoptic avatar Dec 02 '22 19:12 yarikoptic

is this PR enough to revert datalad/datalad-fuse#83 and get dandi healthcheck work without errorring out? if so -- do we see any more sensible performance from healthcheck?

if this is a PR which is currently tested on drogon, it seems it did improve situation over when locked at datalad-fuse level. I see more of MATLAB processes getting at least somewhat busy, total load is about 20 not 60, datalad fusefs process is quite busy. py-spy top'ing it though shows

  %Own   %Total  OwnTime  TotalTime  Function (filename)                                                                                                                                                                                                           
 61.00%  75.00%    2.79s     3.24s   save_cache (fsspec/implementations/cached.py)
  9.00%   9.00%   0.290s    0.290s   read (ssl.py)
 14.00%  14.00%   0.250s    0.250s   exists (genericpath.py)
  4.00%   6.00%   0.210s    0.250s   atomic_write (fsspec/implementations/cached.py)
  7.00%   7.00%   0.200s    0.210s   _execute_child (subprocess.py)
  1.00%  13.00%   0.170s    0.680s   _read_ready__data_received (asyncio/selector_events.py)
  1.00%   1.00%   0.090s    0.090s   write (asyncio/selector_events.py)

and getting 100 samples of any /cache file being accessed:

(dandisets) dandi@drogon:/mnt/backup/dandi/dandisets$ for s in `seq 1 100`; do ls -l /proc/3986540/fd/ 2>&1 | grep 'fsspec/cache'; done
lr-x------ 1 dandi dandi 64 Dec  5 11:44 29 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lrwx------ 1 dandi dandi 64 Dec  5 11:44 73 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache-htwu0fk9
lrwx------ 1 dandi dandi 64 Dec  5 11:44 73 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache-htwu0fk9
lrwx------ 1 dandi dandi 64 Dec  5 11:44 73 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache-htwu0fk9
lr-x------ 1 dandi dandi 64 Dec  5 11:44 59 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lr-x------ 1 dandi dandi 64 Dec  5 11:44 15 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lr-x------ 1 dandi dandi 64 Dec  5 11:44 14 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lrwx------ 1 dandi dandi 64 Dec  5 11:44 14 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache-h3r32ds3
lr-x------ 1 dandi dandi 64 Dec  5 11:44 6 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lr-x------ 1 dandi dandi 64 Dec  5 11:44 14 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lr-x------ 1 dandi dandi 64 Dec  5 11:44 14 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lr-x------ 1 dandi dandi 64 Dec  5 11:44 14 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lr-x------ 1 dandi dandi 64 Dec  5 11:44 14 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lr-x------ 1 dandi dandi 64 Dec  5 11:44 14 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lr-x------ 1 dandi dandi 64 Dec  5 11:44 14 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lr-x------ 1 dandi dandi 64 Dec  5 11:44 14 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lr-x------ 1 dandi dandi 64 Dec  5 11:44 14 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lr-x------ 1 dandi dandi 64 Dec  5 11:44 14 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lr-x------ 1 dandi dandi 64 Dec  5 11:44 14 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lr-x------ 1 dandi dandi 64 Dec  5 11:44 14 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lr-x------ 1 dandi dandi 64 Dec  5 11:44 14 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lr-x------ 1 dandi dandi 64 Dec  5 11:44 14 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lr-x------ 1 dandi dandi 64 Dec  5 11:44 14 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lr-x------ 1 dandi dandi 64 Dec  5 11:44 14 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lr-x------ 1 dandi dandi 64 Dec  5 11:44 14 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lr-x------ 1 dandi dandi 64 Dec  5 11:44 14 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lr-x------ 1 dandi dandi 64 Dec  5 11:44 14 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
(dandisets) dandi@drogon:/mnt/backup/dandi/dandisets$ ls -ld /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
-rw------- 1 dandi dandi 92584 Dec  5 11:44 /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache

shows that we are spending lots of time on writing/re-reading cache file. Could that be avoided may be?

yarikoptic avatar Dec 05 '22 16:12 yarikoptic

The continuous writing of the cache file is designed for the situations that you might have simultaneous access to the cache from multiple processes. If you only ever have the one one FUSE process using the cache, I don't suppose there's any good reason to write the cache until program end (if you can guarantee not to crash).

martindurant avatar Dec 05 '22 16:12 martindurant

Thanks @martindurant , so it feels like ideally it could be an option like cache_dump : ["always", "exit", "periodic"] where "always" is current implementation, "exit" where it would dump it only atexit (/ deconstructor call?), and "periodic" - have some thread which would look after the cache and if modified since last save -- save it (but then would need some extra setting for how often :-/ ). May be we could start with getting "always", "exit" implemented @jwodder ?

yarikoptic avatar Dec 05 '22 17:12 yarikoptic

Exactly. Again, it would be nice to decompose the caching logic into

  • what is stored
  • where it is stored
  • how/often it is updated
  • how target files map to local files

martindurant avatar Dec 05 '22 17:12 martindurant

@martindurant that sounds like a good amount of work to overhaul entire caching layer. Would you consider PR which would first just augment life cycle of the cache with "always" and "exit" strategies to when it is dumped?

yarikoptic avatar Dec 13 '22 14:12 yarikoptic

I can understand hesitating to do the whole job, and the augmentation may be alright. What I worry about, it extra keyword arguments that we will find had to go back on later.

martindurant avatar Dec 13 '22 14:12 martindurant

@martindurant before any other major RF -- what do you think about this particular PR? IMHO it is "the best we could do" ATM to ensure thread-safety, seems to work for us in that at least we do not crash (right @jwodder ?) and should not really have noticeable penalty on users (run time performance of which would be bottlenecked primarily by all those cache read/writes anyways)

yarikoptic avatar Dec 19 '22 17:12 yarikoptic

I think it's OK... but still share the worry that the locking is too aggressive. Are the list operations really problematic, or just the file ops (which is what the original issue was about)?

martindurant avatar Dec 22 '22 15:12 martindurant

Hi. I did not look too deep into the details of the struggles here, but I wanted to take the chance to humbly remind you about GH-895 by @gutzbenj. The patch probably needs a refresh, but the core idea was to start using DiskCache, as it promises to be a rock-solid (thread-safe and process-safe) solution to disk-based caching, without needing to run any kind of cache- or database-server. On this matter, also being tested on Linux, macOS and Windows, we think this package is one of its own kind.

amotl avatar Dec 25 '22 23:12 amotl

You may be right, but DiskCache should only be optional, as fsspec is very careful not to require any dependencies.

Also, you should be aware that we wish to be able to cache files to remote stores too! It may seem counter intuitive, but some remotes are faster than others, and egress fees often mean that copying into near-remote once is worthwhile if there are to be many reads.

martindurant avatar Dec 29 '22 18:12 martindurant

@martindurant: Thank you for sharing those important details we have not been aware of. Happy new year everyone!

amotl avatar Dec 31 '22 20:12 amotl

So where shall we go from here? A full rewrite of the cache classes seems to beyond anyone's capacity right now :(

martindurant avatar Jan 02 '23 14:01 martindurant

note that diskcache might be unfit for cases where location is on NFS. https://github.com/grantjenks/python-diskcache/issues/198 was "resolved" but not clear to me if there is NFS-safe way to use diskcache.

I hope that @jwodder could have a look and may be see how to refactor/structure the interface to encapsulate caching so alternative implementations could be provided.

yarikoptic avatar Jan 04 '23 04:01 yarikoptic