
DirFileSystem.ls(path, detail=True) hits assertion error with Azure Blob Storage

Open benlindsay opened this issue 3 years ago • 12 comments

I ran into an error using DirFileSystem with Azure Blob Storage. Here's the minimum code to expose this error:

import fsspec
from fsspec.implementations.dirfs import DirFileSystem

bucket = "my-bucket"

fs = fsspec.filesystem("abfs")
dirfs = DirFileSystem(path=bucket, fs=fs)
with fs.open(f"{bucket}/some-dir/some-file.txt", "w") as f:
    f.write("Hello, fsspec!")

print(dirfs.ls("some-dir", detail=True))
print(dirfs.ls("some-dir", detail=True))

When I run it, the first dirfs.ls() call works fine, but the second hits this AssertionError:

Traceback (most recent call last):
  File "test_dirfs_min.py", line 12, in <module>
    print(dirfs.ls("some-dir", detail=True))
  File "/usr/local/lib/python3.8/site-packages/fsspec/implementations/dirfs.py", line 213, in ls
    entry["name"] = self._relpath(entry["name"])
  File "/usr/local/lib/python3.8/site-packages/fsspec/implementations/dirfs.py", line 41, in _relpath
    assert path.startswith(prefix)
AssertionError

I believe this error happens because the first time dirfs.ls() is called, the entry["name"] = self._relpath(entry["name"]) line modifies the object(s?) cached by the underlying filesystem. Then the second time dirfs.ls() is called, the accidentally-modified cache is returned by this line (quoted here in case that line changes in the future):

ret = self.fs.ls(self._join(path), detail=detail, **kwargs)

Then the paths returned don't start with the prefix, triggering the AssertionError above.

I haven't traced through the caching logic, so there's probably a better way to solve this, but deepcopying the ret object before modifying it fixed the issue for me. See this commit in my fork.
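
For reference, the idea behind the patch is roughly this (a sketch of DirFileSystem.ls with the deep copy added, not the exact code from the commit):

import copy

def ls(self, path, detail=False, **kwargs):
    # Sketch only: deep-copy the wrapped filesystem's listing before
    # rewriting entry["name"], so cached dicts are never mutated in place.
    ret = self.fs.ls(self._join(path), detail=detail, **kwargs)
    if not detail:
        return [self._relpath(name) for name in ret]
    ret = copy.deepcopy(ret)
    for entry in ret:
        entry["name"] = self._relpath(entry["name"])
    return ret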

Here's some slightly more extensive code that fails only at the very end with the current master branch, but runs all the way through with my patch. Note that when testing the "file", "memory", and "abfs" filesystems, the issue only appears when using a DirFileSystem with Azure Blob Storage and detail=True; all other combinations work fine with the current code.

from pathlib import Path

import fsspec
from fsspec.implementations.dirfs import DirFileSystem

bucket = "my-bucket"

options_dicts = [
    {"fs_type": "file", "base_dir": f"{Path(__file__).parent}/{bucket}"},
    {"fs_type": "memory", "base_dir": bucket},
    {"fs_type": "abfs", "base_dir": bucket},
]

for options in options_dicts:
    fs_type = options["fs_type"]
    base_dir = options["base_dir"]
    print(f"Testing {fs_type}:")
    fs = fsspec.filesystem(fs_type, auto_mkdir=True)
    with fs.open(f"{base_dir}/some-dir/some-file.txt", "w") as f:
        f.write("Hello, fsspec!")
    print(fs.ls(f"{base_dir}/some-dir", detail=False))
    print(fs.ls(f"{base_dir}/some-dir", detail=False))
    print(fs.ls(f"{base_dir}/some-dir", detail=True))
    print(fs.ls(f"{base_dir}/some-dir", detail=True))

    dirfs = DirFileSystem(path=base_dir, fs=fs)
    print(dirfs.ls("some-dir", detail=False))
    print(dirfs.ls("some-dir", detail=False))
    print(dirfs.ls("some-dir", detail=True))
    print(dirfs.ls("some-dir", detail=True))

Would you like me to make a PR from my fork or is there a different way we should go about solving this problem?

benlindsay avatar Mar 18 '22 13:03 benlindsay

As an additional note, if you change the last line of the bigger code block above to print(dirfs.ls("some-dir", detail=True, invalidate_cache=True)), the error goes away. Not saying that asking users to know to do that is the right solution, but it's more evidence that it's a caching issue.

benlindsay avatar Mar 18 '22 13:03 benlindsay

@lucmos , do you think this is something to fix in DirFileSystem (like the linked call to deepcopy()), or something for abfs to do?

martindurant avatar Mar 21 '22 14:03 martindurant

Mmmh, I am not sure; I never dug deep into the cache logic of abfs.

My initial idea for DirFileSystem was that it should be a simple wrapper, so in my opinion the dirfs should not modify the cache of the underlying fs. Maybe @efiop has a better overview?

lucmos avatar Mar 21 '22 23:03 lucmos

Definitely want to hear @efiop's thoughts, but I would agree that since it's supposed to be a simple wrapper, it should be DirFileSystem's responsibility not to mess with the filesystem object it wraps. Copying the output seems to me like the most straightforward solution, but I don't know how much it would affect performance for large responses. My naive guess is not much, though.

benlindsay avatar Mar 22 '22 00:03 benlindsay

Guys, sorry for the late response, busy with current events.

dirfs acts as an fs user in this case, and this raises a more general question of whether users of fs.ls (and other methods like info) should know that they shouldn't modify the return values in place and need to copy/deepcopy them first. I think users should not have to worry about this: our caching logic should return a new instance that is not tied to the cache. Caching is our internal implementation detail/optimization that shouldn't affect user workflows.

From a quick glance, it seems like we just need to copy the value from the cache in (and below) https://github.com/fsspec/filesystem_spec/blob/44f7b16f7397edfaf60a237d3d70c87dbeabfe7e/fsspec/spec.py#L332 instead of returning it directly. It doesn't seem like this would have any major performance implications, as the main purpose of this caching is saving on IO, which is far slower than copying a dict anyway.
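
Something along these lines, for instance (a minimal sketch; the real _ls_from_cache also falls back to looking the path up in its parent's listing, which is omitted here):

import copy

def _ls_from_cache(self, path):
    # Sketch only: return a copy of the cached listing so callers can
    # mutate the result without corrupting self.dircache.
    try:
        return copy.deepcopy(self.dircache[path])
    except KeyError:
        return None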

efiop avatar Apr 02 '22 17:04 efiop

Just saw from your GitHub profile that you're in or from Ukraine. I hope you and your loved ones stay safe. My heart goes out to you.

100% agree with your assessment. I just couldn't figure out the caching logic well enough to know whether there was a single point like that where throwing in a copy would take care of all filesystem types.

benlindsay avatar Apr 02 '22 18:04 benlindsay

@benlindsay Thank you! ❤️

Would you be able to give that approach a try, to see if it fixes it for you, please?

efiop avatar Apr 02 '22 23:04 efiop

I must admit to my shame that for some reason I thought you were in Poland. I'm not sure of appropriate wishes for such circumstances; but let's hope for a better future.

martindurant avatar Apr 03 '22 00:04 martindurant

@efiop I gave your suggestion a try with this commit: https://github.com/benlindsay/filesystem_spec/commit/ddbbc4d20e53029f7c8e077e32dbb3e22adcdaab

Unfortunately that didn't work. Looking at the adlfs caching implementation, the _ls_from_cache() method is only used in info() (https://github.com/fsspec/adlfs/blob/master/adlfs/spec.py#L626) and exists() (https://github.com/fsspec/adlfs/blob/master/adlfs/spec.py#L1382), but not in ls(). Instead, adlfs has a separate caching implementation in the _ls() function (invoked by ls()), as seen here: https://github.com/fsspec/adlfs/blob/master/adlfs/spec.py#L805-L820. So while in an ideal world copying inside _ls_from_cache() would fix this problem, implementations don't uniformly use that function as they (probably?) should, which means we either have to get all the implementations to unify their caching logic or move up a level and handle the copying in DirFileSystem. Thoughts?
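
To illustrate why: a filesystem that keeps its own listing cache inside _ls hands cached dicts straight back to the caller, bypassing any copy added in _ls_from_cache. Roughly this shape (a hypothetical sketch, not the actual adlfs code; _fetch_listing is a made-up placeholder):

def _ls(self, path, detail=True, invalidate_cache=False, **kwargs):
    # Hypothetical implementation-level cache, sketched for illustration.
    if invalidate_cache:
        self.dircache.pop(path, None)
    if path not in self.dircache:
        # _fetch_listing stands in for the real listing call
        self.dircache[path] = self._fetch_listing(path)
    entries = self.dircache[path]  # cached dicts handed out directly
    # A caller that rewrites entry["name"] (as DirFileSystem.ls does)
    # corrupts the cache for every subsequent call.
    return entries if detail else [e["name"] for e in entries]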

benlindsay avatar Apr 04 '22 15:04 benlindsay

@benlindsay Thank you so much for trying it out! 🙏

Regarding the proper solution, I think we indeed need to fix it in _ls_from_cache and in that particular place in adlfs. I wouldn't feel comfortable merging a dirfs hack, for an issue that is easy to work around in user code and that we will need to properly fix anyway (1 PR for hack vs 2 PRs for a proper solution).

efiop avatar Apr 05 '22 20:04 efiop

Getting the same problem with gcsfs.GCSFileSystem

willofferfit avatar Aug 20 '23 09:08 willofferfit

On current fsspec/gcsfs main, this seems to work ok:

In [1]: import fsspec
   ...: from fsspec.implementations.dirfs import DirFileSystem
   ...:
   ...: bucket = "mdtemp"
   ...:
   ...: fs = fsspec.filesystem("gcs")
   ...: dirfs = DirFileSystem(path=bucket, fs=fs)
   ...: with fs.open(f"{bucket}/some-dir/some-file.txt", "w") as f:
   ...:     f.write("Hello, fsspec!")
   ...:
   ...: print(dirfs.ls("some-dir", detail=True))
   ...: print(dirfs.ls("some-dir", detail=True))
[{'kind': 'storage#object', 'id': 'mdtemp/some-dir/some-file.txt/1692987500890222', 'selfLink': 'https://www.googleapis.com/storage/v1/b/mdtemp/o/some-dir%2Fsome-file.txt', 'mediaLink': 'https://storage.googleapis.com/download/storage/v1/b/mdtemp/o/some-dir%2Fsome-file.txt?generation=1692987500890222&alt=media', 'name': 'some-dir/some-file.txt', 'bucket': 'mdtemp', 'generation': '1692987500890222', 'metageneration': '1', 'contentType': 'application/octet-stream', 'storageClass': 'STANDARD', 'size': 14, 'md5Hash': 'MPUFhlkusB6i8Lu9W9zmoA==', 'crc32c': 'kQVipA==', 'etag': 'CO6Q5vS1+IADEAE=', 'timeCreated': '2023-08-25T18:18:20.924Z', 'updated': '2023-08-25T18:18:20.924Z', 'timeStorageClassUpdated': '2023-08-25T18:18:20.924Z', 'type': 'file', 'mtime': datetime.datetime(2023, 8, 25, 18, 18, 20, 924000, tzinfo=datetime.timezone.utc), 'ctime': datetime.datetime(2023, 8, 25, 18, 18, 20, 924000, tzinfo=datetime.timezone.utc)}]
[{'kind': 'storage#object', 'id': 'mdtemp/some-dir/some-file.txt/1692987500890222', 'selfLink': 'https://www.googleapis.com/storage/v1/b/mdtemp/o/some-dir%2Fsome-file.txt', 'mediaLink': 'https://storage.googleapis.com/download/storage/v1/b/mdtemp/o/some-dir%2Fsome-file.txt?generation=1692987500890222&alt=media', 'name': 'some-dir/some-file.txt', 'bucket': 'mdtemp', 'generation': '1692987500890222', 'metageneration': '1', 'contentType': 'application/octet-stream', 'storageClass': 'STANDARD', 'size': 14, 'md5Hash': 'MPUFhlkusB6i8Lu9W9zmoA==', 'crc32c': 'kQVipA==', 'etag': 'CO6Q5vS1+IADEAE=', 'timeCreated': '2023-08-25T18:18:20.924Z', 'updated': '2023-08-25T18:18:20.924Z', 'timeStorageClassUpdated': '2023-08-25T18:18:20.924Z', 'type': 'file', 'mtime': datetime.datetime(2023, 8, 25, 18, 18, 20, 924000, tzinfo=datetime.timezone.utc), 'ctime': datetime.datetime(2023, 8, 25, 18, 18, 20, 924000, tzinfo=datetime.timezone.utc)}]

martindurant avatar Aug 25 '23 18:08 martindurant