
`drop_storage()` leaves empty folders dangling on ADLS Gen2

Open jorritsandbrink opened this issue 1 year ago • 4 comments

dlt version

0.3.25

Describe the problem

The drop_storage() method on the FilesystemClient class does not remove all folders when used against an Azure Data Lake Storage Gen2 account. I encountered this bug when running the test_successful_load test:

pytest tests/load/filesystem/test_filesystem_client.py::test_successful_load

This test calls drop_storage() as a final step to clean up test data, but in doing so, not all folders get removed. Here's a screenshot of the remaining folders of one of the test runs:

[screenshot: remaining folders after the first test run]

And here's a screenshot of another test run:

[screenshot: remaining folders after the second test run]

As you can see, the first test run ends up with three remaining folders, while the second ends up with two. The result differs each time I run the test, so it seems to depend on runtime circumstances.

Under the hood, drop_storage() simply calls the rm() method of adlfs, the fsspec implementation for Azure storage:

    def drop_storage(self) -> None:
        if self.is_storage_initialized():
            self.fs_client.rm(self.dataset_path, recursive=True)

Using a try-except, I can make the underlying error surface:

    def drop_storage(self) -> None:
        if self.is_storage_initialized():
            print('self.dataset_path:', self.dataset_path)
            try:
                self.fs_client.rm(self.dataset_path, recursive=True)
            except Exception as e:
                print('EXCEPTION:', e)

[screenshot: printed exceptions from the instrumented drop_storage()]

In the screenshot you can see that for this specific test run, 4 out of 12 calls to drop_storage() throw an error, while the other 8 run just fine.

The issue seems to be a known bug in the adlfs implementation of rm(): https://github.com/fsspec/adlfs/issues/389. Someone is currently working on a fix for it, so hopefully this issue gets solved at the root soon. If not, perhaps something can be done on the dlt side. Either way, I'm logging this as a bug here for visibility.
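
Until the upstream fix lands, one possible dlt-side mitigation is to retry the recursive delete and verify the path is actually gone before giving up. A minimal sketch, not the actual dlt implementation; the function name, retry parameters, and the `fs_client`/`dataset_path` arguments are assumptions mirroring the snippet above:

```python
import time


def drop_storage_with_retry(fs_client, dataset_path, attempts=3, delay=1.0):
    """Retry recursive deletion until the path no longer exists.

    Sketch of a workaround for intermittent adlfs rm() failures on
    ADLS Gen2 (fsspec/adlfs#389), where empty folders can survive
    the first deletion pass.
    """
    last_error = None
    for _ in range(attempts):
        try:
            if fs_client.exists(dataset_path):
                fs_client.rm(dataset_path, recursive=True)
            if not fs_client.exists(dataset_path):
                return  # fully removed
        except Exception as e:  # adlfs raises on partially deleted trees
            last_error = e
        time.sleep(delay)
    raise RuntimeError(f"could not fully remove {dataset_path}") from last_error
```

The verification via exists() matters here: as the screenshots show, rm() can return (or raise) while leftover directories remain, so checking success by absence rather than by a clean return is the safer contract.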

Expected behavior

No response

Steps to reproduce

On Windows, Python 3.8, Git Bash:

  1. Clone the dlt repo
  2. Check out the master or devel branch
  3. Run poetry install -E az
  4. Run poetry shell
  5. Configure tests/.dlt/config.toml:

         ACTIVE_DESTINATIONS = ["filesystem"]
         ALL_FILESYSTEM_DRIVERS = ["az"]

         [destination.filesystem.credentials]
         azure_storage_account_name = "synapseadslstorage"

         [tests]
         bucket_url_az="az://synapseadsldata/_test"

  6. Configure tests/.dlt/secret.toml:

         [destination.filesystem.credentials]
         azure_storage_account_key = "ACCOUNT_KEY_HERE"

  7. Run pytest tests/load/filesystem/test_filesystem_client.py::test_successful_load

Operating system

Windows

Runtime environment

Local

Python version

3.8

dlt data source

No response

dlt destination

Filesystem & buckets

Other deployment details

No response

Additional information

I did not test this on a regular Azure Blob Storage Account, only on Azure Data Lake Storage Gen2.

jorritsandbrink avatar Dec 27 '23 20:12 jorritsandbrink

According to the linked issue this is now fixed, though while working on the filesystem state sync it seemed to me that it is not.

sh-rp avatar Apr 17 '24 13:04 sh-rp

@sh-rp We might need to increase the adlfs version requirement in pyproject.toml. Right now it's resolving to adlfs version 2023.8.0 which doesn't contain the fix.

jorritsandbrink avatar Apr 17 '24 13:04 jorritsandbrink
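
A guard in the test setup could make such a version mismatch fail loudly instead of surfacing as flaky cleanup. A sketch of the comparison logic; the exact adlfs release that first contains the fix is an assumption (check the linked issue before pinning), and the guard itself is hypothetical, not part of dlt:

```python
def version_at_least(installed: str, required: str) -> bool:
    """Numeric comparison of dotted CalVer strings like adlfs' '2023.8.0'."""
    parse = lambda v: tuple(int(part) for part in v.split("."))
    return parse(installed) >= parse(required)


# Usage at test-session setup (assumes adlfs is importable; "2023.9.0"
# is a placeholder for the first release containing the rm() fix):
# import adlfs
# assert version_at_least(adlfs.__version__, "2023.9.0"), (
#     "adlfs too old: rm() on ADLS Gen2 may leave empty folders"
# )
```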

@jorritsandbrink @sh-rp It's OK to increase the version in the lock file, but not OK to increase the minimum version in deps. There are a lot of environments with a fixed fsspec version, and I already had problems putting anything newer than this in.

but yeah - let's do that and see if our tests are passing

rudolfix avatar Apr 17 '24 14:04 rudolfix

I upgraded it last week to test it out, and it did not solve the issue I was having while working on the filesystem state sync. I'm not 100% sure, though, that it's the same thing you mention here.

sh-rp avatar Apr 22 '24 10:04 sh-rp

the problem was caused by fixtures that were not fired for all tests that created pipelines

rudolfix avatar Jun 27 '24 16:06 rudolfix