S3FileSystem.checksum always returns cached value
Hi,
I encountered this unexpected behavior when using the `checksum` method to monitor changes to a file in a long-running app.
Description
`S3FileSystem.checksum` returns the same value for an object, even after the object has changed.
Steps to reproduce

1. Create an object `s3://MY_BUCKET/MY_OBJECT` on S3.
2. Run:
   ```python
   import s3fs
   fs = s3fs.S3FileSystem()
   fs.checksum("s3://MY_BUCKET/MY_OBJECT")
   ```
3. Overwrite the object with new contents (not using `fs`, but e.g. the AWS CLI).
4. In the same Python process as above, run:
   ```python
   fs.checksum("s3://MY_BUCKET/MY_OBJECT")
   ```
The results from steps 2 and 4 are identical, even though the file has changed.
Cause
It seems that `checksum` uses the default implementation from `fsspec.AbstractFileSystem`, which in turn creates the checksum from the output of `fs.ls(detail=True)`. Since the `S3FileSystem` implementation passes this a default argument of `refresh=False`, the checksum is always computed from cached results.
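To make the failure mode concrete, here is a minimal sketch of that code path, assuming fsspec's `tokenize`-style hashing of listing metadata; it is an illustration of the mechanism, not the actual fsspec source:

```python
from fsspec.utils import tokenize  # fsspec's metadata-hashing helper

def default_checksum(fs, path):
    # Sketch of the generic behaviour described above: the checksum is
    # derived purely from listing metadata. S3FileSystem serves that
    # listing with refresh=False by default, so the same cached entries
    # (and hence the same checksum) come back even after the object
    # changes on S3.
    listing = fs.ls(path, detail=True)  # may be served from the dircache
    return int(tokenize(listing), 16)
```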
Proposed solution
`checksum` should be implemented in `S3FileSystem` in such a way that it always checks the actual state of the object, or at least provides the option to do so.
You can always use `s3.invalidate_cache()` if you know that the state of the directory listings is volatile.
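For concreteness, a minimal example of that workaround (bucket and object names are placeholders):

```python
import s3fs

fs = s3fs.S3FileSystem()
fs.checksum("s3://MY_BUCKET/MY_OBJECT")      # first call, caches the listing

# ... the object is overwritten by some other process ...

fs.invalidate_cache("MY_BUCKET/MY_OBJECT")   # drop the cached listing entry
fs.checksum("s3://MY_BUCKET/MY_OBJECT")      # recomputed from a fresh listing
```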
https://github.com/intake/filesystem_spec/pull/243 is implementing extra options to the listings cache, so that you can have values automatically expire, but still protect against repeatedly fetching the same listing (or simply turn off caching, if you like).
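Assuming those options have landed in your installed fsspec/s3fs versions, usage would look roughly like this (parameter names per that PR, so treat them as provisional):

```python
import s3fs

# Cached listings expire automatically after 10 seconds:
fs = s3fs.S3FileSystem(listings_expiry_time=10)

# Or disable the listings cache entirely:
fs_nocache = s3fs.S3FileSystem(use_listings_cache=False)
```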
Thank you, that's good to know! It would probably still make sense to implement the `checksum` method and at least offer the `refresh` argument, even if the default behaviour would be to use the cache?
You could add `refresh`, which would work the same as for any other listings method (i.e., it invalidates the cache first). This method would just pass it on to `info`.
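A minimal sketch of that suggestion, assuming `info` accepts a `refresh` keyword and hashing the metadata as in fsspec's default `checksum` (an illustration, not the code from the eventual PR):

```python
import s3fs
from fsspec.utils import tokenize

class RefreshableS3FileSystem(s3fs.S3FileSystem):
    def checksum(self, path, refresh=False):
        # refresh=True bypasses the directory-listing cache via info(),
        # so a changed object produces a new checksum; refresh=False
        # keeps the current cached behaviour.
        return int(tokenize(self.info(path, refresh=refresh)), 16)
```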
I made #295 with a proposed implementation. I set `refresh` to `False` by default to conform to the rest of the code, although I don't think there's really any use case for this function with cached values, since cached values by definition have not changed.
The proposed workarounds (using `checksum(path, refresh=True)` or `invalidate_cache(path)`) don't seem to work, at least for a directory. Even if I get a new listing (by invalidating the cache and then calling `ls`), `checksum` always seems to return the original cached value. I have to checksum the returned `ls` listing myself.
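For reference, a sketch of that directory workaround, hashing the fresh listing directly (`MY_DIR` is a placeholder, and `tokenize` stands in for any hash of your choice):

```python
from fsspec.utils import tokenize

fs.invalidate_cache("MY_BUCKET/MY_DIR")            # force a fresh listing
listing = fs.ls("MY_BUCKET/MY_DIR", detail=True)   # re-fetched from S3
digest = tokenize(listing)  # changes when any entry's metadata changes
```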
I don't think a directory has a checksum, does it? In what manner are you changing it between calls?
The use case is that the files in the directory change and so need to be reviewed; hence my workaround. I would expect that if the contents of a directory changed, the cache would be invalidated. If that's not the designed/specified behaviour, it would be fine if the documentation pointed this out.
Right, if the files change via some other process, s3fs doesn't know about this. Please do feel free to propose changes to the documentation to clarify things.