adlfs icon indicating copy to clipboard operation
adlfs copied to clipboard

AzureBlobFileSystem.expand_path(...) with recursive=True returns wrong result

Open scott-zhou opened this issue 3 years ago • 1 comments

What happened: I first observed it when I get file with recursive=True argument. For example I have following file on Azure Datalake Gen2: data/tmp/1.txt And I want to download it to local folder /tmp/abc/1.txt, before I call get, the file /tmp/abc/1.txt is not existed on local disk.

I use recursive=True, because I think even it did not make too much sense, but it shouldn't break anything. But after I run fs.get(rpath="data/tmp/1.txt", lpath="/tmp/abc/1.txt", recursive=True, overwrite=True), I got following folder struct in my local folder: /tmp/abc/1.txt/1.txt get(...) created one extra layer of directory.

After debug, I found the root cause is in AzureBlobFileSystem.expand_path(...), when I call fs.expand_path("data/tmp/1.txt", recursive=True), it will return the following array:

[
    "data/tmp/",
    "data/tmp/1.txt"
]

and fs.expand_path("data/tmp/1.txt", recursive=False) returns correct array: ["data/tmp/1.txt"]

This don't seems correct, I will expect expand_path returns same value on a file what ever recursive is True or False.

I found the issue and want to make a bug fix, the issue should be here:

    AzureBlobFileSystem::_expand_path(...)
                elif recursive:
                    rec = set(
    line 1387           await self._find(p, withdirs=True, with_parent=with_parent)
                    )

withdirs should be False.

When I try to write test case for it by add test case in tests/test_spec.py, I found strange thing. AzureBlobFileSystem.expand_path(...) works all fine in test case with or without this fix.

That means the behavior is different in test case compare with when we actually use adlfs to access an Azure Storage Account. I guess that's because of the test case is not use a real Azure Storage Account. But I don't have any proof.

Can you help me to have a better understand about the behavior?

What you expected to happen:

  • AzureBlobFileSystem.expand_path(...) should works well with real Azure Storage Account
  • The behavior in test case should consistent with real Azure Storage Account

Minimal Complete Verifiable Example: Use adlfs to get file from real Azure Storage account, with argument recursive=True, or expand_path for a file with recursive=True argument

`fs.expand_path("data/tmp/1.txt", recursive=True)`
# or 
`fs.get(rpath="data/tmp/1.txt", lpath="/tmp/abc/1.txt", recursive=True, overwrite=True)`

And try same thing in test case without change code, you should able to observe different result from fs.expand_path

Anything else we need to know?: N/A

Environment:

  • Dask version: N/A
  • Python version: 3.7/3.8
  • Operating System: OSX, Linux
  • Install method (conda, pip, source): all same result

scott-zhou avatar May 28 '21 20:05 scott-zhou

I find the different behavior for unittest env vs real Azure Storage Account, is the function ContainerClient.list_blobs returns different values. For example:

  • In unittest, container_client.list_blobs(name_starts_with="data/tmp/1.txt/") return empty list
  • with real Azure Storage Account, container_client.list_blobs(name_starts_with="data/tmp/1.txt/") return list with 1 element ["container/data/tmp/1.txt"]

Not sure which is correct, for me it looks like the real Azure Storage Account's return value not make too much sense. But however, we should align to it even it's not perfect, right?

scott-zhou avatar Jun 01 '21 08:06 scott-zhou