adlfs
adlfs copied to clipboard
AzureBlobFileSystem.expand_path(...) with recursive=True returns wrong result
What happened:
I first observed it when I get file with recursive=True argument. For example I have following file on Azure Datalake Gen2:
data/tmp/1.txt
And I want to download it to local folder /tmp/abc/1.txt
, before I call get, the file /tmp/abc/1.txt
is not existed on local disk.
I use recursive=True, because I think even it did not make too much sense, but it shouldn't break anything. But after I run fs.get(rpath="data/tmp/1.txt", lpath="/tmp/abc/1.txt", recursive=True, overwrite=True)
, I got following folder struct in my local folder:
/tmp/abc/1.txt/1.txt
get(...)
created one extra layer of directory.
After debug, I found the root cause is in AzureBlobFileSystem.expand_path(...)
, when I call fs.expand_path("data/tmp/1.txt", recursive=True)
, it will return the following array:
[
"data/tmp/",
"data/tmp/1.txt"
]
and fs.expand_path("data/tmp/1.txt", recursive=False)
returns correct array: ["data/tmp/1.txt"]
This don't seems correct, I will expect expand_path
returns same value on a file what ever recursive
is True
or False
.
I found the issue and want to make a bug fix, the issue should be here:
AzureBlobFileSystem::_expand_path(...)
elif recursive:
rec = set(
line 1387 await self._find(p, withdirs=True, with_parent=with_parent)
)
withdirs
should be False
.
When I try to write test case for it by add test case in tests/test_spec.py
, I found strange thing. AzureBlobFileSystem.expand_path(...)
works all fine in test case with or without this fix.
That means the behavior is different in test case compare with when we actually use adlfs to access an Azure Storage Account. I guess that's because of the test case is not use a real Azure Storage Account. But I don't have any proof.
Can you help me to have a better understand about the behavior?
What you expected to happen:
-
AzureBlobFileSystem.expand_path(...)
should works well with real Azure Storage Account - The behavior in test case should consistent with real Azure Storage Account
Minimal Complete Verifiable Example:
Use adlfs to get file from real Azure Storage account, with argument recursive=True
, or expand_path for a file with recursive=True
argument
`fs.expand_path("data/tmp/1.txt", recursive=True)`
# or
`fs.get(rpath="data/tmp/1.txt", lpath="/tmp/abc/1.txt", recursive=True, overwrite=True)`
And try same thing in test case without change code, you should able to observe different result from fs.expand_path
Anything else we need to know?: N/A
Environment:
- Dask version: N/A
- Python version: 3.7/3.8
- Operating System: OSX, Linux
- Install method (conda, pip, source): all same result
I find the different behavior for unittest env vs real Azure Storage Account, is the function ContainerClient.list_blobs returns different values. For example:
- In unittest,
container_client.list_blobs(name_starts_with="data/tmp/1.txt/")
return empty list - with real Azure Storage Account,
container_client.list_blobs(name_starts_with="data/tmp/1.txt/")
return list with 1 element ["container/data/tmp/1.txt"]
Not sure which is correct, for me it looks like the real Azure Storage Account's return value not make too much sense. But however, we should align to it even it's not perfect, right?