
local caching of versioned remote files

valentin-iris opened this issue on Apr 20, 2022 · 5 comments

Hi! I want to use s3fs for accessing test files on S3, mainly because of these two neat features:

  1. local caching of files to disk, with change detection: a file is re-downloaded if the local and remote copies differ
  2. file version id support for versioned S3 buckets: the ability to open different versions of the same remote file by their version id

I don't need this for high-frequency use, and the files don't change often. It is mainly for unit/integration test data stored on S3, which changes only when tests and the related test data get updated (hence the versions!).

I got both of the above working separately just fine (sketches of each in isolation below), but I can't seem to get the combination of the two working. That is, I want to be able to cache different versions of the same file locally, but as soon as a filecache is involved, the version id disambiguation is lost.
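Roughly, each feature on its own (a sketch; bucket, key, and version id are placeholders):

```python
import fsspec

# Feature 2 on its own: a version-aware S3 filesystem can open a
# specific object version by its version id.
s3 = fsspec.filesystem("s3", version_aware=True)
with s3.open("my_bucket/my_file.txt", "r", version_id="abc123") as f:
    old_text = f.read()

# Feature 1 on its own: the filecache wrapper keeps local copies and,
# with check_files=True, re-downloads when the remote file changes.
fs = fsspec.filesystem(
    "filecache",
    target_protocol="s3",
    cache_storage="/tmp/aws",
    check_files=True,
)
with fs.open("s3://my_bucket/my_file.txt", "r") as f:
    text = f.read()
```

The combination, however, does not behave as hoped: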

```python
import fsspec

version_id = "..."  # one of the version ids of my_file.txt

fs = fsspec.filesystem(
    "filecache",
    target_protocol="s3",
    cache_storage="/tmp/aws",
    check_files=True,
    version_aware=True,
)
with fs.open("s3://my_bucket/my_file.txt", "r", version_id=version_id) as f:
    text = f.read()
```
No matter which version_id I pass, I always get the most recent version of the file from S3, and that is also the copy that gets cached locally.

What I expect is that I always get the correct file version and the local cache either keeps separate files for each version (preferred) or just updates the local file whenever I request a version different from the cached one.

Is there a way to achieve this with the current state of the libraries, or is it currently not possible? I am using s3fs and fsspec, both at version 2022.3.0.

valentin-iris · Apr 20, 2022

In `fsspec.implementations.cached::CachingFileSystem._open`, the hash of the target file is based on the path alone. For your situation this would need to change to include the open kwargs, and possibly also the hash of the filesystem. That would give you different local files when you specify a different version, but also whenever you change any other option in `open` (e.g., a different readahead size) or in defining the filesystem (e.g., a different number of retries), even when the target file is the same. We could perhaps add (yet another) argument to specify how the hash is made; or keep a list of stored files for each hash and check whether the remote details match any of them.
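To make the intended outcome concrete, here is a hand-rolled sketch of the per-version caching behavior, bypassing the filecache implementation entirely; the helper name, cache layout, and hashing scheme are invented for illustration:

```python
import hashlib
import os

import fsspec


def open_versioned_cached(url, version_id, cache_dir="/tmp/aws", mode="r"):
    # Key the local copy on (path, version_id) rather than the path
    # alone, so each requested version gets its own cache file.
    os.makedirs(cache_dir, exist_ok=True)
    key = hashlib.sha256(f"{url}::{version_id}".encode()).hexdigest()
    local_path = os.path.join(cache_dir, key)
    if not os.path.exists(local_path):
        s3 = fsspec.filesystem("s3", version_aware=True)
        with s3.open(url, "rb", version_id=version_id) as src:
            with open(local_path, "wb") as dst:
                dst.write(src.read())
    return open(local_path, mode)
```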

The caching/metadata story is in general in need of a refresh. I'm not sure when the effort can be found!

martindurant · Apr 20, 2022

Thanks for the clarification and the pointers! I would love to see a refreshed caching/metadata layer that enables such a combination, and I hope the time for it can be found :-)

In the meantime I will try to patch my own version that can accommodate at least the version_id in the file hash. Would you have a few more specific pointers for me on where to get started with that?

Thanks a lot!

valentin-iris · Apr 21, 2022

> Would you have a few more specific pointers for me on where to get started with that?

I think the comment above is all I have: look at how the hash is calculated during `_open`, and include anything version-related from the kwargs. Just printing out the kwargs you see in that function as you pass versions should be enough to see how, as in the sketch below.
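A throwaway sketch of that debugging step, assuming the "filecache" protocol maps to `WholeFileCacheSystem` (it wraps the private `_open`, so expect to adapt it to fsspec internals; bucket and version id are placeholders):

```python
import fsspec
from fsspec.implementations.cached import WholeFileCacheSystem

_original_open = WholeFileCacheSystem._open

def _traced_open(self, path, mode="rb", **kwargs):
    # Show which kwargs reach the cache layer when a version is requested.
    print(f"filecache _open: path={path!r} kwargs={kwargs!r}")
    return _original_open(self, path, mode=mode, **kwargs)

WholeFileCacheSystem._open = _traced_open

# Options for the underlying S3 filesystem go in target_options.
fs = fsspec.filesystem(
    "filecache",
    target_protocol="s3",
    target_options={"version_aware": True},
    cache_storage="/tmp/aws",
    check_files=True,
)
with fs.open("s3://my_bucket/my_file.txt", "r", version_id="abc123") as f:
    f.read()
```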

martindurant · Apr 28, 2022

@valentin-iris How did you resolve this limitation?

uhlajs · Jun 28, 2022

To be honest, I didn't continue with this. We have a team-internal S3Operator where I have implemented that functionality, but I haven't continued on the s3fs front, since we are not using it in our team (yet).

Darkdragon84 · Jul 01, 2022