filesystem_spec

`filecache` and `simplecache` both ignore `version_id` for versioned S3 filesystems

Open AndreiBarsan opened this issue 4 years ago • 11 comments

Caching S3 data with, e.g., simplecache works fine. However, s3fs also supports versioned buckets.

When a version_id is provided to open, using a cache leads to incorrect behavior: the cache appears to drop the version argument and always requests, and subsequently caches, the latest version.

The code below should reproduce the issue, as long as you point it at a versioned bucket. Removing the caching layer avoids the problem.

Possible fixes:

  • make caching aware of S3 version IDs (this may require "hacking" the local disk cache paths to include the version IDs)
  • make caches raise an error when they detect an attempt to wrap a versioned bucket, to protect the user from this surprising behavior
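The second option could be sketched roughly like this (purely illustrative — VersionedTargetError and wrap_with_cache are hypothetical names, not part of fsspec's API):

```python
class VersionedTargetError(ValueError):
    """Raised when a cache layer is asked to wrap a version-aware filesystem."""


def wrap_with_cache(target_fs):
    # Illustrative guard, not fsspec's implementation: refuse version-aware
    # targets until version IDs are factored into the local cache paths.
    if getattr(target_fs, "version_aware", False):
        raise VersionedTargetError(
            "caching a version-aware filesystem would silently serve the "
            "latest version regardless of the requested version_id"
        )
    return target_fs
```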

Please let me know if I can provide any additional information or code!

import random
import time

import fsspec
import numpy as np
import s3fs


def eval_version_caching():
    bucket = "MY_TEST_BUCKET"

    fs = s3fs.S3FileSystem(version_aware=True)
    print(fs.ls(bucket))

    version_dir = f"tmp/dbg-version-eval-{random.randint(0, 100000):010d}"
    data_path = f"{bucket}/{version_dir}/data.npy"

    # 2000 * 2000 * 8 bytes = 32 MB per array
    some_data = np.ones((2000, 2000), dtype=np.float64) * 1.0
    some_other_data = np.ones((2000, 2000), dtype=np.float64) * 42.0

    with fs.open(data_path, "wb") as f_out:
        np.save(f_out, some_data)

    print("Saved first version... Waiting to flush.")
    time.sleep(1.0)

    with fs.open(data_path, "wb") as f_out:
        np.save(f_out, some_other_data)

    print("Saved second version... Waiting to flush.")
    time.sleep(1.0)

    versions = fs.object_version_info(data_path)
    assert 2 == len(versions)

    # Versions are listed in reverse chronological order. In our case, we call the first version "a", and the second
    # version "b".
    id_rev_a = versions[1]["VersionId"]
    id_rev_b = versions[0]["VersionId"]
    print(id_rev_a)
    print(id_rev_b)

    # The cached FS always returns the latest version, no matter what version_id we provide.
    cached_fs = fsspec.filesystem(
        "filecache", target_protocol="s3", target_options={"version_aware": True}, cache_storage="/tmp/s3_file_cache"
    )
    # Using this non-cached filesystem retrieves the correct versions:
    # non_cached_fs = s3fs.S3FileSystem(version_aware=True)
    with cached_fs.open(data_path, "rb", version_id=id_rev_a) as f_in:
        read_data = np.load(f_in)
        print("Revision A:")
        print(read_data.mean())
        rev_a_mean = read_data.mean()

    with cached_fs.open(data_path, "rb", version_id=id_rev_b) as f_in:
        read_data = np.load(f_in)
        print("Revision B:")
        print(read_data.mean())
        rev_b_mean = read_data.mean()

    assert abs(rev_a_mean - rev_b_mean) > 1e-5



def main():
    eval_version_caching()


if __name__ == "__main__":
    main()

AndreiBarsan avatar May 11 '21 21:05 AndreiBarsan

Yes, you are quite right: the cache only uses the filename as input to derive the local path, and (apparently) doesn't pass arguments on. For the specific case of s3, you can embed the version into the filename "bucket/path/key?versionId=..", so that might be enough.

martindurant avatar May 12 '21 12:05 martindurant

Thank you for the response!

But in this case we would have to generate our own version IDs, and couldn't rely on S3's generated ones, right? That is, our full file name would have to be "bucket/path/key?versionId=xyz". Otherwise, a call like cached_fs.open(data_path + f"?version_id={id_rev_b}", "rb") does not seem to automatically pass the version parameter to S3.
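For illustration, parsing such an embedded version out of the path might look like this (split_version is a hypothetical helper, not s3fs's actual path handling):

```python
def split_version(path):
    # Illustrative parser: separate an embedded "?versionId=..." suffix
    # from the object key, returning (key, version_id_or_None).
    if "?versionId=" in path:
        key, _, version_id = path.partition("?versionId=")
        return key, version_id
    return path, None


assert split_version("bucket/path/key?versionId=abc123") == ("bucket/path/key", "abc123")
assert split_version("bucket/path/key") == ("bucket/path/key", None)
```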

AndreiBarsan avatar May 12 '21 12:05 AndreiBarsan

Could _strip_protocol in the S3 filesystem implementation be used to address this, maybe? Similar to how the local filesystem's version applies things like os.path.expanduser?

https://github.com/intake/filesystem_spec/blob/e734622e2b837625d5c8f27477d6968d837f68b8/fsspec/implementations/local.py#L148

AndreiBarsan avatar May 12 '21 13:05 AndreiBarsan

I would have expected any version info embedded in the path to be passed through - but this is not tested, of course. Note the difference between versionId (part of the path) and version_id (keyword argument).

martindurant avatar May 12 '21 13:05 martindurant

Yeah, I think fsspec crashes before we can actually make the request. There seems to be some logic for handling the magic part of the URI (it figures out something special is going on due to the ?) which raises the error in AsyncFileSystem._expand_path. I'll try to debug it for a bit longer but I'm not sure if I'll get the chance to dig deep enough today.

AndreiBarsan avatar May 12 '21 14:05 AndreiBarsan

I appreciate you having any time at all to look into it :)

martindurant avatar May 12 '21 14:05 martindurant

I think I root-caused the bug, but fixing it is more involved, since we need to ensure that both the sync and async implementations work, that no new bugs are introduced, etc.

The idea is that fsspec calls _expand_path, which in turn tries to glob the path. The glob function strips the versionId part when cleaning the path, but forgets to strip it when preparing the pattern it actually globs against. The resulting glob pattern no longer matches the queried path, so the filesystem always reports the file as missing!

(To reiterate: this happens in the caching layer, so disabling that bypasses this whole problem.)
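The mismatch can be shown in isolation with a simplified version of the pattern construction (the real code escapes more characters, but the effect is the same — note that "?" is replaced by ".", a single-character wildcard):

```python
import re

# The listing ("find") returns the clean key, but the regex pattern is
# built from the path that still carries "?versionId=...", so it never
# matches what the listing actually returns.
path = "bucket/key?versionId=abc"
pattern = "^" + path.replace(".", r"\.").replace("?", ".") + "$"

listed_key = "bucket/key"                     # what _find actually returns
assert re.match(pattern, listed_key) is None  # no match -> file "missing"

# Stripping the query part before building the pattern fixes the mismatch.
clean = path.split("?", 1)[0]
clean_pattern = "^" + clean.replace(".", r"\.") + "$"
assert re.match(clean_pattern, listed_key)
```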

A hack that solves my problem (but may break other things) is making the _glob in asyn.py use the magic-less version of the path to construct the glob pattern:

        # Excerpt from AsyncFileSystem._glob in fsspec/asyn.py; `ind` and
        # `ends` are computed earlier in the method. The hack tracks a
        # `non_magic_path` and builds the regex from it instead of `path`.
        if not has_magic(path):
            root = path
            non_magic_path = path
            depth = 1
            if ends:
                path += "/*"
            elif await self._exists(path):
                if not detail:
                    return [path]
                else:
                    return {path: await self._info(path)}
            else:
                if not detail:
                    return []  # glob of non-existent returns empty
                else:
                    return {}
        elif "/" in path[:ind]:
            ind2 = path[:ind].rindex("/")
            non_magic_path = path[:ind]
            root = path[: ind2 + 1]
            depth = None if "**" in path else path[ind2 + 1 :].count("/") + 1
        else:
            root = ""
            non_magic_path = ""
            depth = None if "**" in path else path[ind + 1 :].count("/") + 1

        allpaths = await self._find(root, maxdepth=depth, withdirs=True, detail=True, **kwargs)
        # Escape characters special to python regex, leaving our supported
        # special characters in place.
        # See https://www.gnu.org/software/bash/manual/html_node/Pattern-Matching.html
        # for shell globbing details.
        pattern = (
            "^"
            + (
                non_magic_path.replace("\\", r"\\")
                .replace(".", r"\.")
                .replace("+", r"\+")
                .replace("//", "/")
                .replace("(", r"\(")
                .replace(")", r"\)")
                .replace("|", r"\|")
                .replace("^", r"\^")
                .replace("$", r"\$")
                .replace("{", r"\{")
                .replace("}", r"\}")
                .rstrip("/")
                .replace("?", ".")
            )
            + "$"
        )

AndreiBarsan avatar May 12 '21 14:05 AndreiBarsan

@mariusvniekerk - any thoughts on how glob should work for paths containing version specifiers? I guess it should not apply at all, and cached_fs.open should not be calling _expand_path.

@AndreiBarsan, maybe the simplest fix, which should be done anyway, is to ensure that open passes its kwargs on to the target filesystem, so that you don't need the complex paths at all. That will still mean that different versions overwrite each other in the cache (or you fetch the wrong one) unless we also account for kwargs when deriving the local path.
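Accounting for kwargs when deriving the local path could be sketched like this (local_name is a hypothetical helper; fsspec's actual cache-naming scheme differs):

```python
import hashlib
import json


def local_name(path, **open_kwargs):
    # Hypothetical sketch: fold the open() kwargs (e.g. version_id) into
    # the hash that names the local cache file, so versions cannot collide.
    token = path + json.dumps(open_kwargs, sort_keys=True)
    return hashlib.sha256(token.encode()).hexdigest()
```

With a scheme like this, reads of different versions of the same key would land in different cache files instead of overwriting each other.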

martindurant avatar May 12 '21 14:05 martindurant

If the object exists with that versionId, shouldn't glob just echo it back?

mariusvniekerk avatar May 13 '21 00:05 mariusvniekerk

oh, just ran into this old issue

  • is it still pertinent?
  • is it also pertinent to http URLs to s3 with versionId?

yarikoptic avatar Aug 04 '22 15:08 yarikoptic

I believe this is still the case. If you have an HTTP URL embedding the version, that should be fine.

martindurant avatar Aug 04 '22 15:08 martindurant