
S3FileSystem.checksum always returns cached value

Open eeroel opened this issue 4 years ago • 8 comments

Hi,

Encountered this unexpected behavior when using the checksum method to monitor changes in a file in a long-running app.

Description

S3FileSystem.checksum returns the same value for an object, even after the object has changed

Reproduce steps

  1. Create object s3://MY_BUCKET/MY_OBJECT on S3
  2. Run
import s3fs
fs = s3fs.S3FileSystem()
fs.checksum("s3://MY_BUCKET/MY_OBJECT")
  3. Overwrite the object with new contents (not using fs, but using e.g. the AWS CLI)
  4. In the same Python process as above, run fs.checksum("s3://MY_BUCKET/MY_OBJECT") again

The results from 2 and 4 are identical, even though the file has changed.

Cause

It seems that checksum uses the default implementation from fsspec.AbstractFileSystem, which in turn computes the checksum from the output of fs.ls(detail=True). Since the S3FileSystem implementation passes this a default argument of refresh=False, the checksum is always computed from cached results.
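To illustrate the mechanism, here is a minimal toy sketch (not real s3fs or fsspec code; all class and method names are hypothetical) of a filesystem whose checksum is derived from a cached listing, showing why the value never changes when the backend is modified out-of-band:

```python
from hashlib import sha256

class ToyCachingFS:
    """Toy stand-in for a filesystem with a listings cache (not real s3fs)."""

    def __init__(self, backend):
        self.backend = backend      # dict: path -> (size, etag)
        self.dircache = {}          # cached ls() results

    def ls(self, path, refresh=False):
        # Mirrors the s3fs behaviour described above: refresh=False by
        # default, so a cached listing is returned if one exists.
        if refresh or path not in self.dircache:
            self.dircache[path] = dict(self.backend)
        return self.dircache[path]

    def checksum(self, path):
        # Mirrors the fsspec default: derived from the (possibly cached) listing.
        listing = self.ls(path)
        return sha256(repr(sorted(listing.items())).encode()).hexdigest()

backend = {"MY_BUCKET/MY_OBJECT": (3, "etag-1")}
fs = ToyCachingFS(backend)
first = fs.checksum("MY_BUCKET")
backend["MY_BUCKET/MY_OBJECT"] = (5, "etag-2")   # object changed out-of-band
second = fs.checksum("MY_BUCKET")
print(first == second)  # True: stale cache, checksums match despite the change
```

The toy reproduces the symptom: the second checksum is computed from the cached listing, so it is identical to the first even though the backend changed.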

Proposed solution

checksum should be implemented in S3FileSystem, in such a way that it always checks the actual state of the object, or at least provides the option to do so.

eeroel avatar Mar 09 '20 13:03 eeroel

You can always use s3.invalidate_cache() if you know that the state of the directory listings is volatile. https://github.com/intake/filesystem_spec/pull/243 is implementing extra options to the listings cache, so that you can have values automatically expire, but still protect against repeatedly fetching the same listing (or simply turn off caching, if you like).

martindurant avatar Mar 09 '20 14:03 martindurant

Thank you, that's good to know! It would probably still make sense to implement the checksum method and at least offer a refresh argument, even if the default behaviour would be to use the cache?

eeroel avatar Mar 09 '20 14:03 eeroel

You could add refresh, which would be the same as for any other listings method (i.e., invalidates the cache first). This method would just pass it on to info.
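A sketch of what that could look like (hypothetical, simplified; not the actual s3fs code or its real info metadata), where checksum simply forwards refresh to info:

```python
from hashlib import sha256

class SketchFS:
    """Hypothetical sketch of checksum with a refresh argument (not real s3fs)."""

    def __init__(self, backend):
        self.backend = backend      # path -> metadata dict
        self._info_cache = {}

    def info(self, path, refresh=False):
        # refresh=True bypasses the cache, like other listings methods.
        if refresh or path not in self._info_cache:
            self._info_cache[path] = dict(self.backend[path])
        return self._info_cache[path]

    def checksum(self, path, refresh=False):
        # The method just passes refresh on to info, as suggested above.
        info = self.info(path, refresh=refresh)
        return sha256(repr(sorted(info.items())).encode()).hexdigest()

backend = {"MY_BUCKET/MY_OBJECT": {"ETag": "etag-1", "size": 3}}
fs = SketchFS(backend)
a = fs.checksum("MY_BUCKET/MY_OBJECT")
backend["MY_BUCKET/MY_OBJECT"]["ETag"] = "etag-2"    # out-of-band change
b = fs.checksum("MY_BUCKET/MY_OBJECT")               # cached: unchanged
c = fs.checksum("MY_BUCKET/MY_OBJECT", refresh=True)  # fresh: picks up change
print(a == b, a != c)  # True True
```

With refresh=False the cached metadata wins; with refresh=True the checksum reflects the object's current state.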

martindurant avatar Mar 09 '20 14:03 martindurant

I made #295 with a proposed implementation. I set refresh to False by default to conform with the rest of the code, although I don't think there's really any use case for this function with cached values, since cached values by definition have not changed...

eeroel avatar Mar 10 '20 07:03 eeroel

The proposed workarounds (using checksum(path, refresh=True) or invalidate_cache(path)) don't seem to work, at least for a directory. Even if I get a new listing (by invalidating the cache and then using ls), checksum always seems to return the original cached value. I have to checksum the returned ls listing myself.
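The "checksum the ls listing" workaround could be sketched like this (listing_checksum is a hypothetical helper, and ToyFS is a toy stand-in so the sketch runs without S3; the real s3fs entry metadata may differ):

```python
from hashlib import sha256

def listing_checksum(fs, path):
    """Hypothetical helper: hash a fresh detailed listing rather than
    trusting fs.checksum, which may reflect a cached value."""
    fs.invalidate_cache(path)             # drop any stale cached listing
    entries = fs.ls(path, detail=True)    # now guaranteed to be fresh
    token = repr(sorted((e["name"], e.get("ETag")) for e in entries))
    return sha256(token.encode()).hexdigest()

class ToyFS:
    """Toy stand-in with s3fs-like ls/invalidate_cache (not real s3fs)."""

    def __init__(self, backend):
        self.backend = backend            # list of entry dicts
        self.dircache = {}

    def invalidate_cache(self, path=None):
        if path is None:
            self.dircache.clear()
        else:
            self.dircache.pop(path, None)

    def ls(self, path, detail=False):
        if path not in self.dircache:
            self.dircache[path] = [dict(e) for e in self.backend]
        return self.dircache[path]

backend = [{"name": "bucket/a", "ETag": "e1"}]
fs = ToyFS(backend)
before = listing_checksum(fs, "bucket")
backend[0]["ETag"] = "e2"                 # file changed out-of-band
after = listing_checksum(fs, "bucket")
print(before != after)  # True: the helper sees the change
```

Because the helper invalidates the cache before listing, it tracks changes to the directory's contents even when fs.checksum on the directory does not.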

timcoote avatar Apr 28 '21 12:04 timcoote

I don't think a directory has a checksum, does it? In what manner are you changing it between calls?

martindurant avatar May 05 '21 14:05 martindurant

The use case is that the files in the directory change, and so need to be reviewed. Hence my workaround. I would expect that if the contents of a directory changed, the cache would be invalidated. If that's not the designed/specified behaviour, it would be fine if the documentation pointed this out.

timcoote avatar May 18 '21 15:05 timcoote

Right, if the files change via some other process, s3fs doesn't know about this. Please do feel free to propose changes to the documentation to clarify things.

martindurant avatar May 18 '21 15:05 martindurant