adlfs
`AzureBlobFileSystem.ls(path)` returns inconsistent results
(Revised 2023/03/30)
Versions:
- adlfs==2023.1.0
- fsspec==2023.3.0

(This problem seemingly started with adlfs==2022.11.0.)
Summary
Suppose we have the following files in directory `my_container/path/to`.

path/
    to/
        dummy.txt
        dummy.txt.1
For this example, `AzureBlobFileSystem.ls(path)` returns the following inconsistent results:
- `my_container/path/to/`: `dummy.txt`, `dummy.txt.1` (OK)
- `my_container/path/to/dummy.txt`: (empty) (KO)
- `my_container/path/to/dummy.txt.1`: `dummy.txt.1` (OK)
- `my_container/path/to/dummy.txt.`: `dummy.txt.1` (KO)
POC
from adlfs.spec import AzureBlobFileSystem
from azure.identity.aio import AzureCliCredential
fs = AzureBlobFileSystem(account_name="my_account", credential=AzureCliCredential())
fs.ls("my_container/path/to/", invalidate_cache=True)
> ['my_container/path/to/dummy.txt']
fs.ls("my_container/path/to/dummy.txt", invalidate_cache=True)
> ['my_container/path/to/dummy.txt']
Now we create an empty file `dummy.txt.1` in the same directory. Then `AzureBlobFileSystem` starts returning inconsistent results.
fs.ls("my_container/path/to/", invalidate_cache=True)
> ['my_container/path/to/dummy.txt', 'my_container/path/to/dummy.txt.1'] # OK
fs.ls("my_container/path/to/dummy.txt", invalidate_cache=True)
> [] # KO
fs.ls("my_container/path/to/dummy.txt.1", invalidate_cache=True)
> ['my_container/path/to/dummy.txt.1'] # OK
fs.ls("my_container/path/to/dummy.txt.", invalidate_cache=True)
> ['my_container/path/to/dummy.txt.1'] # KO
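The check performed by hand above can be automated. The following is a minimal sketch, not part of adlfs: `find_inconsistent_ls` is a hypothetical helper that works with any object exposing an fsspec-style `ls(path, detail=...)`, and `ReportedBehaviourFS` is a stand-in that merely hard-codes the outputs reported in this issue so the helper can be tried without an Azure account.

```python
def find_inconsistent_ls(fs, directory, **ls_kwargs):
    """Return (path, listing) pairs for files whose own ls() is not [path].

    Hypothetical helper: `fs` only needs an fsspec-style ls() method.
    Extra keyword arguments (e.g. invalidate_cache=True) are passed through.
    """
    problems = []
    for entry in fs.ls(directory, detail=True, **ls_kwargs):
        if entry["type"] != "file":
            continue  # only files are expected to list as themselves
        listing = fs.ls(entry["name"], detail=False, **ls_kwargs)
        if listing != [entry["name"]]:
            problems.append((entry["name"], listing))
    return problems


class ReportedBehaviourFS:
    """Stand-in that replays the ls() outputs reported above (not adlfs itself)."""

    _names = {
        "my_container/path/to": [
            "my_container/path/to/dummy.txt",
            "my_container/path/to/dummy.txt.1",
        ],
        "my_container/path/to/dummy.txt": [],
        "my_container/path/to/dummy.txt.1": ["my_container/path/to/dummy.txt.1"],
    }

    def ls(self, path, detail=False, **kwargs):
        names = self._names[path.rstrip("/")]
        if detail:
            return [{"name": n, "type": "file"} for n in names]
        return names


problems = find_inconsistent_ls(
    ReportedBehaviourFS(), "my_container/path/to/", invalidate_cache=True
)
print(problems)
```

Against a real `AzureBlobFileSystem`, the same helper could be called with the `fs` object from the POC; an empty result would mean every file lists as itself.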
This problem is probably linked to #406.
I have encountered the same behavior
This problem comes from the handling of `target_path` in the `_details(...)` method.
https://github.com/fsspec/adlfs/blob/2023.1.0/adlfs/spec.py#L919-L933
if target_path:
    if (
        len(output) == 1
        and output[0]["type"] == "file"
        and not self.version_aware
    ):
        # This handles the case where path is a file passed to ls
        return output
    output = await filter_blobs(
        output,
        target_path,
        delimiter,
        version_id=version_id,
        versions=versions,
    )
This code applies `utils.filter_blobs` to `output`. But `utils.filter_blobs` assumes that `target_path` is a directory path.
https://github.com/fsspec/adlfs/blob/2023.1.0/adlfs/utils.py#L34
This is the reason why `ls("my_container/path/to/dummy.txt")` in the example returns an empty result.
In addition, if `output` contains only one entry and that entry is a file, this code returns `output` as is, without checking that the entry's name matches `target_path`. As a result, we get a nonempty result for both `ls("my_container/path/to/dummy.txt.1")` (OK) and `ls("my_container/path/to/dummy.txt.")` (KO).
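The two branches described above can be modelled in a few lines. This is a simplified sketch, not the adlfs implementation: it assumes that blobs are fetched by name prefix, that the file system is not version aware, and that the directory filter keeps only names starting with `target_path` plus a delimiter. The names `list_by_prefix`, `filter_as_directory`, and `modelled_ls` are hypothetical.

```python
BLOB_NAMES = [
    "my_container/path/to/dummy.txt",
    "my_container/path/to/dummy.txt.1",
]


def list_by_prefix(names, prefix):
    # Assumption: the listing call returns every blob whose name starts with prefix.
    return [{"name": n, "type": "file"} for n in names if n.startswith(prefix)]


def filter_as_directory(entries, target_path, delimiter="/"):
    # Assumption: the filter treats target_path as a directory in all cases.
    directory = target_path.strip(delimiter) + delimiter
    return [e for e in entries if e["name"].startswith(directory)]


def modelled_ls(names, target_path):
    output = list_by_prefix(names, target_path)
    if len(output) == 1 and output[0]["type"] == "file":
        # Single-file shortcut: returned without comparing the name to target_path.
        return [e["name"] for e in output]
    return [e["name"] for e in filter_as_directory(output, target_path)]


for path in (
    "my_container/path/to/",
    "my_container/path/to/dummy.txt",
    "my_container/path/to/dummy.txt.1",
    "my_container/path/to/dummy.txt.",
):
    print(path, "->", modelled_ls(BLOB_NAMES, path))
```

Under these assumptions the model reproduces all four results from the POC: the prefix `dummy.txt` matches two blobs and both are then filtered out, while the prefix `dummy.txt.` matches exactly one blob and takes the shortcut.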
I believe that `_details(...)` is probably not the best place to apply this filtering (the `target_path` parameter of this method is undocumented and is used only by `_ls_blobs(...)`). I think the filtering of retrieved entries should be done in `_ls_blobs(...)`.
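To make the suggestion concrete, here is one way such a filter might look. It is only a sketch under assumptions: `filter_listing` is a hypothetical helper, not an existing adlfs function, and it assumes entries are dicts with `name` and `type` keys as in the snippet quoted above. The idea is to prefer an exact file match and otherwise treat `target_path` as a directory.

```python
def filter_listing(entries, target_path, delimiter="/"):
    """Keep the exact file match if there is one, else entries under the directory.

    Hypothetical helper illustrating the proposed behaviour.
    """
    if not target_path.endswith(delimiter):
        exact = [
            e for e in entries
            if e["name"] == target_path and e["type"] == "file"
        ]
        if exact:
            return exact
    directory = target_path.rstrip(delimiter) + delimiter
    return [e for e in entries if e["name"].startswith(directory)]


entries = [
    {"name": "my_container/path/to/dummy.txt", "type": "file"},
    {"name": "my_container/path/to/dummy.txt.1", "type": "file"},
]

for path in (
    "my_container/path/to/",
    "my_container/path/to/dummy.txt",
    "my_container/path/to/dummy.txt.1",
    "my_container/path/to/dummy.txt.",
):
    print(path, "->", [e["name"] for e in filter_listing(entries, path)])
```

With this rule, `dummy.txt` lists as itself even when `dummy.txt.1` exists, and the partial name `dummy.txt.` yields an empty listing. Whether a missing path should instead raise an error is a separate design question that this sketch leaves open.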