cloudpathlib
cloudpathlib copied to clipboard
Add ability to turn off local cache
Some use cases (especially if we enable streaming) may not want to keep much of a cache around at all.
We should have an option to turn off the cache entirely, which always will grab things from the server.
Hey. Wondering what the status/thoughts are on this one. Seems like local cache is quite central to all the functionality currently, so might be tricky to implement.
Asking as we have instances where we have write only access to S3 and cache logic is an issue - another, probably a lot less common than streaming, example of where ability to turn off cache would be great.
Speculation: I wonder if perhaps force_overwrite_from_cloud
could be a way to make this work. So the mechanism is: someone has force_overwrite_from_cloud=True
, then those actions would skip trying to read anything to create or refresh the cache before writing. Maybe we could implement that without any significant redesign.
... someone has
force_overwrite_from_cloud=True
, then those actions would skip trying to read anything to create or refresh the cache before writing. ...
@jayqi are you thinking that open
could be rewritten so that force_overwrite_from_cloud=True
leads it to skip any reading and updating cache/stats, etc.? This would not break the API, but would change the functionality. It would also sort out my use case, provided I stick to open
. [Getting write_text
and _bytes
to work too might require allowing one to pass force_overwrite_from_cloud=True
(optionally) at S3Client
(or other equivalent) level and then make it trickle down to open
, and the other relevant methods. This would maybe be needed for consistency, but would also break the API - though maybe with defaults that's OK.]
It's probably not a solution for streaming and cases when the cache itself is an issue rather than the current read required to write/open.
Good additional discussion in #233.
This is still under consideration. There is a set of functionality that depends on the cache, so those are worth enumerating. The cache gets created any time _refresh_cache
gets called, which is the following scenarios:
-
__fspath__
- this is important for interoperability since some libraries use it to do file opening themselves; not sure that we can't not download the file and still make this work. -
_dispatch_to_local_cache_path
and therefore these downstream methods:-
stat
- not an issue since it is implemented by all official clients, but maybe not plugins -
read_bytes
- we could implement this in memory in theClient
and not cache if a variable is set, or implement this viaopen
like we do withwrite_bytes
-
read_text
- we could implement this in memory in theClient
and not cache if a variable is set, or implement this viaopen
like we do withwrite_text
-
-
open
- we could if a setting is set delete the local file when the buffer is closed; we'd need to do this in the read scenario in addition to the write scenario, which is already patched.
Simplest implementation seems to be:
- Send
read_*
throughopen
- in
open
, patchbuffer.close
to delete cache if setting is set on read or write - in
__fspath__
raise an error that local paths cannot be referenced if the cache is off - I think we probably want this as a env var that is read when clients are instantiated so you can set it both in the environment and programatically
A few additional observations:
- Currently it should be the case that when the client is garbarge collected the cache is cleaned if it is a temp dir. It is a little curious this doesn't solve more of the cases of large caches that people report. https://github.com/drivendataorg/cloudpathlib/blob/ce411fe22b138a183b1b17b815c3141a42e481b1/cloudpathlib/client.py#L44-L47
- We could add similar functionality to
CloudPath.__del__
so they are removed from local cache on garbage collection; this may happen more frequently than the client getting garbage collected and give us more leeway - We should definitely add an explicit
remove_from_local_cache
method (name tbd) toCloudPath
and aempty_local_cache
method toClient
so there is an official API to remove the cache files if people want, both for an individual file and for a -
download_to
andupload_from
do not cache files (because presumably you already are tracking where they are on your local file system), but #292 requests that they do
Other options:
- Bound the cache size; if new file puts cache over the limit, delete the least recently accessed file in the cache. This has the downside of incurring some additional overhead we could hit often to do size checking and then looking for least recently accessed, then removing. (Separately tracked in #11)
- Delete from cache just when uploads finish (downside: files downloaded for read actions may stay in the cache)
- Implement the
read_*
methods with streaming if no cache so we only have a local copy when uploading (could be combined with above)
@DayalStrub Changes to options for managing the cache were just released in the latest version. See updated methods here: https://cloudpathlib.drivendata.org/stable/caching/#clearing-the-file-cache