File cache not cleared on program end
I am using fsspec in a Jupyter notebook to remotely access files. I've been using a CachingFileSystem to cache files in a local directory. However, the files I'm accessing are very large. Between runs of my notebook, it doesn't appear that the cache gets cleared. This fills up my machine's local storage very quickly, requiring me to manually delete the cache. Is there a mechanism for this in fsspec, or something I'm missing in my usage of the filesystem? If not, I propose there should be.
One fix could be clearing the existing cache directory when a new CachingFileSystem is instantiated with the same cache directory name.
The cache is supposed to be persistent. You can choose to use a temporary location by not providing a specific path, and that should be cleaned up on program exit.
Otherwise, the caching file systems provide the methods clear_cache and clear_expired_cache (https://github.com/fsspec/filesystem_spec/blob/master/fsspec/implementations/cached.py#L226), which remove all cached files or only those older than some expiry time. You can call either of these yourself, perhaps registered with atexit.
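As a rough illustration of the atexit suggestion, here is a minimal sketch. The helper name clear_cache_at_exit is hypothetical (it is not part of fsspec); it only assumes the object it receives has a clear_cache() method, as the caching file systems mentioned above do.

```python
import atexit


def clear_cache_at_exit(fs):
    """Register fs.clear_cache() to run at interpreter exit.

    Hypothetical helper, not an fsspec API. Returns the registered
    callback so it can be called early or unregistered later.
    """
    def _cleanup():
        try:
            fs.clear_cache()
        except Exception:
            # Best effort only: a failed cleanup should not break shutdown.
            pass

    atexit.register(_cleanup)
    return _cleanup
```

With fsspec installed, usage might look like fs = fsspec.filesystem("filecache", target_protocol="s3", cache_storage="/tmp/my_cache") followed by clear_cache_at_exit(fs); the protocol and path here are placeholders. In a notebook, the handler would only run when the kernel process shuts down normally, so a killed kernel would still leave the cache behind.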
@pl-marasco - the methods should appear in the API reference doc.
Thanks for that insight! I did my best to scour the docs for this info, but didn't find it.
You can choose to use a temporary location by not providing a specific path, and that should be cleaned up on program exit.
I'm not sure this is what's currently happening.
As far as I can tell, when cache_storage is not specified, CachingFileSystem ends up calling mkdtemp, and according to the mkdtemp documentation:
The user of mkdtemp() is responsible for deleting the temporary directory and its contents when done with it.
I'm seeing these directories pile up even after program exit. So as far as I can tell, this is consistent with that statement.
I'm also using a more or less transparent cache, so manually calling methods on CachingFileSystem isn't really an option. I also have multiple processes on the same machine, so using a specific path is not advisable. Is there another option that can be used for cleaning up on program exit?
You may well be right. It would be reasonable to register an atexit or weakref.finalize handler to do the delete (recognising that such efforts can still fail).
Any particular preference between the two options? I'd be happy to open a PR for this. I'm not familiar with this code base or with weakref.finalize, though. If this is the preferred option, I'd appreciate some additional guidance.
Any particular preference between the 2 options?
Not particularly. One could argue that it should be tied to the lifecycle of the cache filesystem instance, meaning use weakref, but then you might see unexpected pauses in execution while garbage collection cleans up directories. When the instances are cached until shutdown time, it will make no difference, so the easiest option will do.
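For anyone unfamiliar with weakref.finalize, here is a minimal standalone sketch of the lifecycle-tied option. TempCacheDir is a hypothetical stand-in class, not fsspec code; it only shows how a directory created with mkdtemp could be removed when its owner is garbage collected. By default, weakref.finalize callbacks that are still pending also run at interpreter exit, so this covers the atexit case too.

```python
import shutil
import tempfile
import weakref


class TempCacheDir:
    """Hypothetical owner of a temporary cache directory (not fsspec code)."""

    def __init__(self):
        self.path = tempfile.mkdtemp()
        # The callback and its arguments must not reference self,
        # otherwise the object could never be collected.
        self._finalizer = weakref.finalize(
            self, shutil.rmtree, self.path, ignore_errors=True
        )

    def close(self):
        # Explicit cleanup; a finalizer runs at most once, so the later
        # garbage-collection or exit-time call becomes a no-op.
        self._finalizer()
```

Passing ignore_errors=True reflects the caveat that such cleanup can still fail, for example if another process holds files open in the directory.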