
Cache timestamps are wrong while downloading

Open mil-ad opened this issue 5 months ago • 2 comments

When cloudpathlib sees a file for the first time, it starts downloading it to the cache location, and only once the download is complete does it set the file's timestamps to the date read from the cloud version. This means that while the file is being downloaded it has the wrong (current date) timestamps!

This is normally not an issue because cloudpathlib uses a tmpdir, but it also allows setting CLOUDPATHLIB_LOCAL_CACHE_DIR, and this causes issues if you spin up multiple processes that access the same file.

For example, if you're dealing with a large file, downloading it could take a while, and the cached file has the wrong date for that whole period. If you start a second copy of your program before the download finishes, the second instance sees the incomplete cached file with the wrong date and raises cloudpathlib.exceptions.OverwriteNewerLocalError.
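The window described above can be reproduced with plain files. This is only a sketch of the timing, not cloudpathlib's actual download code: `cloud_mtime` and the file name are hypothetical stand-ins, and the "download" is just a local write.

```python
import os
import tempfile
import time

# Hypothetical stand-in for the cloud object's last-modified time (one day ago).
cloud_mtime = int(time.time()) - 24 * 60 * 60

with tempfile.TemporaryDirectory() as cache_dir:
    cached = os.path.join(cache_dir, "large_file.bin")

    # Step 1: the "download" creates the file, so its mtime is roughly now.
    with open(cached, "wb") as f:
        f.write(b"partial contents")
    mtime_during_download = os.stat(cached).st_mtime

    # Step 2: only after the download completes is the cloud timestamp applied.
    os.utime(cached, (cloud_mtime, cloud_mtime))
    mtime_after_download = os.stat(cached).st_mtime

# A second process that stats the file between steps 1 and 2 sees a local
# copy that appears newer than the cloud version.
looked_newer_during_download = mtime_during_download > cloud_mtime
```

Any check that compares the local mtime against the cloud mtime between the two steps will conclude that the local copy was modified after the cloud one.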

Even if the file had the correct date while it was being downloaded by the first process, the second process wouldn't know whether the download had finished.

I think at least the documentation for CLOUDPATHLIB_LOCAL_CACHE_DIR should be updated to mention that it's not thread-safe or process-safe.

mil-ad avatar Jul 09 '25 20:07 mil-ad

Thanks for the detailed debugging and thoughts!

First, it is definitely worth clarifying in our docs what scenarios are safe in parallel. Things should be safe if your unit of work is separate files on cloud storage, which is how we usually structure the workloads.

Second, I'm interested in what behavior you would want to see in the scenario you describe. Do you think the second thread should: (1) fail with a different, clearer error (like CacheDownloadInProgress) so you can handle it, (2) wait for the download from thread 1 to finish and use the cached version, (3) also download a separate copy to a separate, thread-2-specific cache location, or (4) something else?
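To make options (1) and (2) concrete, here is a rough sketch using an advisory lock on a sidecar lock file. Everything in it is an assumption for illustration: `CacheDownloadInProgress` is only the name floated above and does not exist in cloudpathlib, `acquire_download_lock` and `release_download_lock` are hypothetical helpers, and `fcntl.flock` is Unix-only.

```python
import fcntl
import os
import tempfile


class CacheDownloadInProgress(Exception):
    """Hypothetical error for option (1); not an existing cloudpathlib exception."""


def acquire_download_lock(lock_path, wait=False):
    """Take an exclusive advisory lock on a sidecar file and return its descriptor.

    wait=False sketches option (1): raise if someone else holds the lock.
    wait=True sketches option (2): block until the other holder releases it.
    """
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    flags = fcntl.LOCK_EX if wait else fcntl.LOCK_EX | fcntl.LOCK_NB
    try:
        fcntl.flock(fd, flags)
    except BlockingIOError:
        os.close(fd)
        raise CacheDownloadInProgress(lock_path) from None
    return fd


def release_download_lock(fd):
    """Release the lock and close the descriptor."""
    fcntl.flock(fd, fcntl.LOCK_UN)
    os.close(fd)


# Usage: a second attempt fails fast while the first "download" holds the lock.
lock_path = os.path.join(tempfile.mkdtemp(), "large_file.bin.lock")
first = acquire_download_lock(lock_path)
try:
    acquire_download_lock(lock_path)
    second_was_blocked = False
except CacheDownloadInProgress:
    second_was_blocked = True
finally:
    release_download_lock(first)
```

Option (3) would not need a lock at all, at the cost of downloading the same file once per process.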

pjbull avatar Jul 10 '25 16:07 pjbull

Thanks for the reply.

I think you should be able to make sure the file has the right timestamp as it's downloaded, but that wouldn't solve this issue anyway. If you actually want to support multi-process caching, it shouldn't be too hard to implement a readers–writer lock with fcntl, i.e. allow concurrent access to cache files for read-only operations while write operations require exclusive access (solution (2)):

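import fcntl

def write_to_cache(filename, data):
    # Open in append mode so the file is not truncated before the lock is held.
    with open(filename, 'a') as f:
        # Exclusive lock: blocks until all readers and other writers are done.
        fcntl.flock(f.fileno(), fcntl.LOCK_EX)
        f.truncate(0)
        f.write(data)
        ...

def read_from_cache(filename):
    with open(filename, 'r') as f:
        # Shared lock: any number of readers may hold it at the same time.
        fcntl.flock(f.fileno(), fcntl.LOCK_SH)
        data = f.read()
        ...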

This is not bulletproof (for example, fcntl locking has mixed support on NFS drives), but it's better than no locking at all.

mil-ad avatar Jul 11 '25 11:07 mil-ad