cloudpathlib icon indicating copy to clipboard operation
cloudpathlib copied to clipboard

Add ability to turn off local cache

Open pjbull opened this issue 4 years ago • 3 comments

Some use cases (especially if we enable streaming) may not want to keep much of a cache around at all.

We should have an option to turn off the cache entirely, which always will grab things from the server.

pjbull avatar Aug 17 '20 17:08 pjbull

Hey. Wondering what the status/thoughts are on this one. Seems like local cache is quite central to all the functionality currently, so might be tricky to implement.

Asking as we have instances where we have write only access to S3 and cache logic is an issue - another, probably a lot less common than streaming, example of where ability to turn off cache would be great.

DayalStrub avatar Sep 07 '21 11:09 DayalStrub

Speculation: I wonder if perhaps force_overwrite_from_cloud could be a way to make this work. So the mechanism is: someone has force_overwrite_from_cloud=True, then those actions would skip trying to read anything to create or refresh the cache before writing. Maybe we could implement that without any significant redesign.

jayqi avatar Sep 07 '21 21:09 jayqi

... someone has force_overwrite_from_cloud=True, then those actions would skip trying to read anything to create or refresh the cache before writing. ...

@jayqi are you thinking that open could be rewritten so that force_overwrite_from_cloud=True leads it to skip any reading and updating cache/stats, etc.? This would not break the API, but would change the functionality. It would also sort out my use case, provided I stick to open. [Getting write_text and _bytes to work too might require allowing one to pass force_overwrite_from_cloud=True (optionally) at S3Client (or other equivalent) level and then make it trickle down to open, and the other relevant methods. This would maybe be needed for consistency, but would also break the API - though maybe with defaults that's OK.]

It's probably not a solution for streaming and cases when the cache itself is an issue rather than the current read required to write/open.

DayalStrub avatar Sep 08 '21 20:09 DayalStrub

Good additional discussion in #233.

This is still under consideration. There is a set of functionality that depends on the cache, so those are worth enumerating. The cache gets created any time _refresh_cache gets called, which is the following scenarios:

  • __fspath__ - this is important for interoperability since some libraries use it to do file opening themselves; not sure that we can't not download the file and still make this work.
  • _dispatch_to_local_cache_path and therefore these downstream methods:
    • stat - not an issue since it is implemented by all official clients, but maybe not plugins
    • read_bytes - we could implement this in memory in the Client and not cache if a variable is set, or implement this via open like we do with write_bytes
    • read_text - we could implement this in memory in the Client and not cache if a variable is set, or implement this via open like we do with write_text
  • open - we could if a setting is set delete the local file when the buffer is closed; we'd need to do this in the read scenario in addition to the write scenario, which is already patched.

Simplest implementation seems to be:

  • Send read_* through open
  • in open, patch buffer.close to delete cache if setting is set on read or write
  • in __fspath__ raise an error that local paths cannot be referenced if the cache is off
  • I think we probably want this as a env var that is read when clients are instantiated so you can set it both in the environment and programatically

A few additional observations:

  • Currently it should be the case that when the client is garbarge collected the cache is cleaned if it is a temp dir. It is a little curious this doesn't solve more of the cases of large caches that people report. https://github.com/drivendataorg/cloudpathlib/blob/ce411fe22b138a183b1b17b815c3141a42e481b1/cloudpathlib/client.py#L44-L47
  • We could add similar functionality to CloudPath.__del__ so they are removed from local cache on garbage collection; this may happen more frequently than the client getting garbage collected and give us more leeway
  • We should definitely add an explicit remove_from_local_cache method (name tbd) to CloudPath and a empty_local_cache method to Client so there is an official API to remove the cache files if people want, both for an individual file and for a
  • download_to and upload_from do not cache files (because presumably you already are tracking where they are on your local file system), but #292 requests that they do

Other options:

  • Bound the cache size; if new file puts cache over the limit, delete the least recently accessed file in the cache. This has the downside of incurring some additional overhead we could hit often to do size checking and then looking for least recently accessed, then removing. (Separately tracked in #11)
  • Delete from cache just when uploads finish (downside: files downloaded for read actions may stay in the cache)
  • Implement the read_* methods with streaming if no cache so we only have a local copy when uploading (could be combined with above)

pjbull avatar Dec 19 '22 00:12 pjbull

@DayalStrub Changes to options for managing the cache were just released in the latest version. See updated methods here: https://cloudpathlib.drivendata.org/stable/caching/#clearing-the-file-cache

pjbull avatar Feb 16 '23 02:02 pjbull