
`drop_storage()` leaves empty folders dangling on ADLS Gen2

Open jorritsandbrink opened this issue 1 year ago • 4 comments

dlt version

0.3.25

Describe the problem

The drop_storage() method on the FilesystemClient class does not remove all folders when used against an Azure Data Lake Storage Gen2 account. I encountered this bug when running the test_successful_load test:

pytest tests/load/filesystem/test_filesystem_client.py::test_successful_load

This test calls drop_storage() as a final step to clean up test data, but in doing so, not all folders get removed. Here's a screenshot of the remaining folders of one of the test runs:

[screenshot: remaining folders after the first test run]

And here's a screenshot of another test run:

[screenshot: remaining folders after the second test run]

As you can see, the first test run ends up with three remaining folders, while the second ends up with two. The result differs each time I run the test, so it seems to depend on runtime circumstances.

Under the hood, drop_storage() simply calls the rm() method of adlfs, the fsspec implementation for Azure storage:

    def drop_storage(self) -> None:
        if self.is_storage_initialized():
            self.fs_client.rm(self.dataset_path, recursive=True)

Using a try-except, I can make the underlying error surface:

    def drop_storage(self) -> None:
        if self.is_storage_initialized():
            print('self.dataset_path:', self.dataset_path)
            try:
                self.fs_client.rm(self.dataset_path, recursive=True)
            except Exception as e:
                print('EXCEPTION:', e)

[screenshot: printed exceptions from the instrumented drop_storage()]

In the screenshot you can see that for this specific test run, 4 out of 12 calls to drop_storage() throw an error, while the other 8 run just fine.

The issue seems to be a known bug in the adlfs implementation of rm(): https://github.com/fsspec/adlfs/issues/389. Someone is currently working on a fix for it, so hopefully this issue gets solved at the root soon. If not, perhaps something can be done on the dlt side. Either way, I'm logging this as a bug here for visibility.
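
Until the upstream fix lands, one possible dlt-side mitigation is to retry the recursive delete and verify the path is actually gone before giving up. A minimal sketch, not the actual dlt implementation; the function name, retry parameters, and the `fs_client`/`dataset_path` arguments are assumptions mirroring the snippet above:

```python
import time


def drop_storage_with_retry(fs_client, dataset_path, attempts=3, delay=1.0):
    """Retry recursive deletion until the path no longer exists.

    Sketch of a workaround for intermittent adlfs rm() failures on
    ADLS Gen2 (fsspec/adlfs#389), where empty folders can survive
    the first deletion pass.
    """
    last_error = None
    for _ in range(attempts):
        try:
            if fs_client.exists(dataset_path):
                fs_client.rm(dataset_path, recursive=True)
            if not fs_client.exists(dataset_path):
                return  # fully removed
        except Exception as e:  # adlfs raises on partially deleted trees
            last_error = e
        time.sleep(delay)
    raise RuntimeError(f"could not fully remove {dataset_path}") from last_error
```

The verification via exists() matters here: as the screenshots show, rm() can return (or raise) while leftover directories remain, so checking success by absence rather than by a clean return is the safer contract.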

Expected behavior

No response

Steps to reproduce

On Windows, Python 3.8, Git Bash:

  1. Clone the dlt repo
  2. Check out the master or devel branch
  3. Run poetry install -E az
  4. Run poetry shell
  5. Configure tests/.dlt/config.toml:

         ACTIVE_DESTINATIONS = ["filesystem"]
         ALL_FILESYSTEM_DRIVERS = ["az"]

         [destination.filesystem.credentials]
         azure_storage_account_name = "synapseadslstorage"

         [tests]
         bucket_url_az="az://synapseadsldata/_test"

  6. Configure tests/.dlt/secret.toml:

         [destination.filesystem.credentials]
         azure_storage_account_key = "ACCOUNT_KEY_HERE"

  7. Run pytest tests/load/filesystem/test_filesystem_client.py::test_successful_load

Operating system

Windows

Runtime environment

Local

Python version

3.8

dlt data source

No response

dlt destination

Filesystem & buckets

Other deployment details

No response

Additional information

I did not test this on a regular Azure Blob Storage Account, only on Azure Data Lake Storage Gen2.

jorritsandbrink avatar Dec 27 '23 20:12 jorritsandbrink

According to the linked issue this is now fixed, though while working on the filesystem state sync it seemed to me that it is not.

sh-rp avatar Apr 17 '24 13:04 sh-rp

@sh-rp We might need to increase the adlfs version requirement in pyproject.toml. Right now it's resolving to adlfs version 2023.8.0 which doesn't contain the fix.

jorritsandbrink avatar Apr 17 '24 13:04 jorritsandbrink
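
A guard in the test setup could make such a version mismatch fail loudly instead of surfacing as flaky cleanup. A sketch of the comparison logic; the exact adlfs release that first contains the fix is an assumption (check the linked issue before pinning), and the guard itself is hypothetical, not part of dlt:

```python
def version_at_least(installed: str, required: str) -> bool:
    """Numeric comparison of dotted CalVer strings like adlfs' '2023.8.0'."""
    parse = lambda v: tuple(int(part) for part in v.split("."))
    return parse(installed) >= parse(required)


# Usage at test-session setup (assumes adlfs is importable; "2023.9.0"
# is a placeholder for the first release containing the rm() fix):
# import adlfs
# assert version_at_least(adlfs.__version__, "2023.9.0"), (
#     "adlfs too old: rm() on ADLS Gen2 may leave empty folders"
# )
```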

@jorritsandbrink @sh-rp It's OK to increase the version in the lock file, but not OK to increase the minimum version in deps. There are a lot of environments with a fixed fsspec version, and I already had problems putting anything newer than this in.

but yeah - let's do that and see if our tests are passing

rudolfix avatar Apr 17 '24 14:04 rudolfix

I upgraded it last week to test it out, and it did not solve the issue I was having while working on the filesystem state sync. I'm not 100% sure, though, that it's the same thing you mention here.

sh-rp avatar Apr 22 '24 10:04 sh-rp

the problem was caused by fixtures that were not fired for all tests that created pipelines

rudolfix avatar Jun 27 '24 16:06 rudolfix