`AzureBlobFileSystem.ls(path)` returns inconsistent results

Open sugibuchi opened this issue 1 year ago • 2 comments

(Revised 2023/03/30)

Versions:

  • adlfs==2023.1.0
  • fsspec==2023.3.0

(This problem seems to have started with adlfs==2022.11.0)

Summary

Suppose we have the following files in the directory my_container/path/to.

path/
  to/
    dummy.txt
    dummy.txt.1

For this example, AzureBlobFileSystem.ls(path) returns the following inconsistent results:

  • my_container/path/to/: dummy.txt, dummy.txt.1 (OK)
  • my_container/path/to/dummy.txt: (empty) (KO)
  • my_container/path/to/dummy.txt.1: dummy.txt.1 (OK)
  • my_container/path/to/dummy.txt.: dummy.txt.1 (KO)

POC

from adlfs.spec import AzureBlobFileSystem
from azure.identity.aio import AzureCliCredential

fs = AzureBlobFileSystem(account_name="my_account", credential=AzureCliCredential())

fs.ls("my_container/path/to/", invalidate_cache=True)
> ['my_container/path/to/dummy.txt']

fs.ls("my_container/path/to/dummy.txt", invalidate_cache=True)
> ['my_container/path/to/dummy.txt']

Now we create an empty file dummy.txt.1 in the same directory. Then AzureBlobFileSystem starts returning inconsistent results.

fs.ls("my_container/path/to/", invalidate_cache=True)
> ['my_container/path/to/dummy.txt', 'my_container/path/to/dummy.txt.1']   # OK

fs.ls("my_container/path/to/dummy.txt", invalidate_cache=True)
> []  # KO

fs.ls("my_container/path/to/dummy.txt.1", invalidate_cache=True)
> ['my_container/path/to/dummy.txt.1']  # OK

fs.ls("my_container/path/to/dummy.txt.", invalidate_cache=True)
> ['my_container/path/to/dummy.txt.1']  # KO

This problem is probably linked to #406.

sugibuchi avatar Mar 29 '23 20:03 sugibuchi

I have encountered the same behavior

daavoo avatar Mar 30 '23 07:03 daavoo

This problem comes from the handling of target_path in the _details(...) method.

https://github.com/fsspec/adlfs/blob/2023.1.0/adlfs/spec.py#L919-L933

        if target_path:
            if (
                len(output) == 1
                and output[0]["type"] == "file"
                and not self.version_aware
            ):
                # This handles the case where path is a file passed to ls
                return output
            output = await filter_blobs(
                output,
                target_path,
                delimiter,
                version_id=version_id,
                versions=versions,
            )

This code applies utils.filter_blobs to output, but utils.filter_blobs assumes that target_path is a directory path.

https://github.com/fsspec/adlfs/blob/2023.1.0/adlfs/utils.py#L34

This is the reason why ls("my_container/path/to/dummy.txt") in the example returns an empty result.

In addition, if output contains only one entry and that entry is a file, this code returns output as is, without checking that the entry's name equals target_path. As a result, we get a nonempty result for both ls("my_container/path/to/dummy.txt.1") (OK) and ls("my_container/path/to/dummy.txt.") (KO).
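
The two effects described above can be reproduced without Azure in a small simulation. This is a minimal sketch, not adlfs code. It assumes that the blob listing is a plain name-prefix match and that filtering keeps only entries under target_path + "/". The names list_by_prefix, filter_as_directory, and simulated_ls are hypothetical and exist only for illustration.

```python
# Simplified, hypothetical model of the logic quoted above; this is NOT adlfs code.

BLOBS = [
    "my_container/path/to/dummy.txt",
    "my_container/path/to/dummy.txt.1",
]


def list_by_prefix(prefix):
    """Assumption: the listing returns every blob whose name starts with prefix."""
    return [{"name": name, "type": "file"} for name in BLOBS if name.startswith(prefix)]


def filter_as_directory(entries, target_path):
    """Assumed stand-in for filter_blobs: keep only entries inside target_path/."""
    directory = target_path.rstrip("/") + "/"
    return [entry for entry in entries if entry["name"].startswith(directory)]


def simulated_ls(target_path):
    output = list_by_prefix(target_path)
    if len(output) == 1 and output[0]["type"] == "file":
        # Single-file shortcut: returned without comparing the name to target_path.
        return [entry["name"] for entry in output]
    # Otherwise target_path is treated as a directory.
    return [entry["name"] for entry in filter_as_directory(output, target_path)]


if __name__ == "__main__":
    for path in (
        "my_container/path/to/",
        "my_container/path/to/dummy.txt",
        "my_container/path/to/dummy.txt.1",
        "my_container/path/to/dummy.txt.",
    ):
        print(path, simulated_ls(path))
```

Under these assumptions, the prefix my_container/path/to/dummy.txt matches both files, so the single-file shortcut is skipped and the directory filter drops everything. The prefix my_container/path/to/dummy.txt. matches exactly one file, so the shortcut returns it even though no such path exists.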

I believe that _details(...) is probably not the best place to apply this filtering (target_path of this method is not documented and is used only by _ls_blobs(...)). I think the filtering of retrieved entries should be done in _ls_blobs(...).
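
One possible shape for such filtering is sketched below. It is only an illustration under stated assumptions: select_ls_entries is a hypothetical helper, not part of adlfs, and it ignores versions, delimiters, and directory marker entries. It assumes each entry is a dict with "name" and "type" keys, as in the snippet quoted above.

```python
def select_ls_entries(entries, path):
    """Hypothetical filter: pick what ls(path) should return from raw listing entries.

    If a file is named exactly like path, return only that file.
    Otherwise treat path as a directory and return the entries under it.
    """
    target = path.rstrip("/")
    exact = [e for e in entries if e["name"] == target and e["type"] == "file"]
    if exact:
        return exact
    directory = target + "/"
    return [e for e in entries if e["name"].startswith(directory)]
```

With the two files from the example, this filter would return dummy.txt for my_container/path/to/dummy.txt and nothing for my_container/path/to/dummy.txt., regardless of how many entries the prefix listing returned.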

sugibuchi avatar Mar 30 '23 08:03 sugibuchi