adlfs
`AzureBlobFileSystem.ls(path)` returns inconsistent results
(Revised 2023/03/30)
Versions:
adlfs==2023.1.0
fsspec==2023.3.0
(This problem seems to have started with adlfs==2022.11.0)
Summary
Suppose we have the following files in directory my_container/path/to.
path/
    to/
        dummy.txt
        dummy.txt.1
For this example, AzureBlobFileSystem.ls(path) returns the following inconsistent results:
my_container/path/to/: dummy.txt, dummy.txt.1 (OK)
my_container/path/to/dummy.txt: (empty) (KO)
my_container/path/to/dummy.txt.1: dummy.txt.1 (OK)
my_container/path/to/dummy.txt.: dummy.txt.1 (KO)
POC
from adlfs.spec import AzureBlobFileSystem
from azure.identity.aio import AzureCliCredential
fs = AzureBlobFileSystem(account_name="my_account", credential=AzureCliCredential())
fs.ls("my_container/path/to/", invalidate_cache=True)
> ['my_container/path/to/dummy.txt']
fs.ls("my_container/path/to/dummy.txt", invalidate_cache=True)
> ['my_container/path/to/dummy.txt']
Now we create an empty file dummy.txt.1 in the same directory. Then AzureBlobFileSystem starts returning inconsistent results.
fs.ls("my_container/path/to/", invalidate_cache=True)
> ['my_container/path/to/dummy.txt', 'my_container/path/to/dummy.txt.1'] # OK
fs.ls("my_container/path/to/dummy.txt", invalidate_cache=True)
> [] # KO
fs.ls("my_container/path/to/dummy.txt.1", invalidate_cache=True)
> ['my_container/path/to/dummy.txt.1'] # OK
fs.ls("my_container/path/to/dummy.txt.", invalidate_cache=True)
> ['my_container/path/to/dummy.txt.1'] # KO
This problem is probably linked to #406.
I have encountered the same behavior
This problem comes from the handling of target_path in the _details(...) method.
https://github.com/fsspec/adlfs/blob/2023.1.0/adlfs/spec.py#L919-L933
if target_path:
    if (
        len(output) == 1
        and output[0]["type"] == "file"
        and not self.version_aware
    ):
        # This handles the case where path is a file passed to ls
        return output
    output = await filter_blobs(
        output,
        target_path,
        delimiter,
        version_id=version_id,
        versions=versions,
    )
This code applies utils.filter_blobs to output, but utils.filter_blobs assumes that target_path is a directory path.
https://github.com/fsspec/adlfs/blob/2023.1.0/adlfs/utils.py#L34
This is the reason why ls("my_container/path/to/dummy.txt") in the example returns an empty result.
In addition, if output contains only one entry and that entry is a file, this code returns output as is, without checking that the entry's name actually equals the requested path. As a result, we get a non-empty result for both ls("my_container/path/to/dummy.txt.1") (OK) and ls("my_container/path/to/dummy.txt.") (KO).
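To make the two code paths concrete, here is a simplified model that reproduces the four results from the POC. It is not the adlfs code: ls_model, list_by_prefix and filter_as_directory are hypothetical names, and the sketch assumes that the entries reaching _details(...) were retrieved by matching blob names against the path as a prefix.

```python
# Simplified model of the behavior described above; NOT the adlfs implementation.
# All function names here are hypothetical.

BLOBS = [
    "my_container/path/to/dummy.txt",
    "my_container/path/to/dummy.txt.1",
]


def list_by_prefix(names, prefix):
    """Assumption: the listing returns every blob whose name starts with prefix."""
    return [n for n in names if n.startswith(prefix)]


def filter_as_directory(names, target_path, delimiter="/"):
    """Keep only entries located under target_path, treated as a directory."""
    directory = target_path.strip(delimiter) + delimiter
    return [n for n in names if n.startswith(directory)]


def ls_model(names, path):
    output = list_by_prefix(names, path)
    if len(output) == 1:
        # Single-entry shortcut: returned without comparing the name to path.
        return output
    return filter_as_directory(output, path)


for path in (
    "my_container/path/to/",
    "my_container/path/to/dummy.txt",
    "my_container/path/to/dummy.txt.1",
    "my_container/path/to/dummy.txt.",
):
    print(repr(path), "->", ls_model(BLOBS, path))
```

In this model, "dummy.txt" matches two blobs by prefix, so it falls through to the directory filter and loses both, while "dummy.txt." matches exactly one blob and is returned by the shortcut even though no file of that name exists.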
I believe that _details(...) is probably not the best place to apply this filtering (the target_path parameter of this method is not documented and is used only by _ls_blobs(...)). I think the filtering of retrieved entries should be done in _ls_blobs(...) instead.
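One possible shape for such a filter, as a rough sketch only: keep an entry if its name equals the requested path (the path is a file) or if it lies under the path treated as a directory. filter_entries is a hypothetical helper, not an existing adlfs function, and it works on plain name strings rather than the dict entries adlfs uses.

```python
# Hypothetical helper illustrating the suggested filtering; not part of adlfs.

def filter_entries(names, target_path, delimiter="/"):
    """Keep exact matches (file) and children of target_path (directory)."""
    target = target_path.rstrip(delimiter)
    return [
        n
        for n in names
        if n == target or n.startswith(target + delimiter)
    ]


blobs = [
    "my_container/path/to/dummy.txt",
    "my_container/path/to/dummy.txt.1",
]

print(filter_entries(blobs, "my_container/path/to/"))
print(filter_entries(blobs, "my_container/path/to/dummy.txt"))
print(filter_entries(blobs, "my_container/path/to/dummy.txt.1"))
print(filter_entries(blobs, "my_container/path/to/dummy.txt."))
```

With this rule the single-entry shortcut is no longer needed: "dummy.txt" returns only itself, and "dummy.txt." returns nothing because it is neither an exact name nor a directory prefix.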