dlt
`drop_storage()` leaves empty folders dangling on ADLS Gen2
dlt version
0.3.25
Describe the problem
The `drop_storage()` method on the `FilesystemClient` class does not remove all folders when using an Azure Data Lake Storage Account Gen2. I encountered this bug when running the `test_successful_load` test:
```shell
pytest tests/load/filesystem/test_filesystem_client.py::test_successful_load
```
This test calls `drop_storage()` as a final step to clean up test data, but in doing so, not all folders get removed. Here's a screenshot of the remaining folders from one of the test runs:
And here's a screenshot of another test run:
As you can see, the first test run ends up with three remaining folders, while the second test run ends up with two. I've observed that the result differs each time I run the test, so it seems to depend on runtime circumstances.
Under the hood, `drop_storage()` simply calls the `rm()` method of the adlfs implementation for fsspec:

```python
def drop_storage(self) -> None:
    if self.is_storage_initialized():
        self.fs_client.rm(self.dataset_path, recursive=True)
```
Using a try-except, I can make the underlying error surface:

```python
def drop_storage(self) -> None:
    if self.is_storage_initialized():
        print("self.dataset_path:", self.dataset_path)
        try:
            self.fs_client.rm(self.dataset_path, recursive=True)
        except Exception as e:
            print("EXCEPTION:", e)
```
In the screenshot you can see that for this specific test run, 4 out of 12 calls to `drop_storage()` throw an error, while the other 8 run just fine.
The issue seems to be a known bug in the adlfs implementation of `rm()`: https://github.com/fsspec/adlfs/issues/389. Someone is currently working on a fix for it, so hopefully this issue gets solved at the root soon. If not, perhaps something can be done on the dlt side. Either way, I'm logging this as a bug here for visibility.
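Until the upstream fix lands, one possible mitigation on the dlt side would be to retry the removal until the path is verifiably gone. The sketch below is hypothetical, not dlt code; `drop_storage_with_retry` and its parameters are illustrative names, and it only assumes the standard fsspec `exists()`/`rm()` methods:

```python
# Hypothetical defensive workaround (not part of dlt): retry rm() and
# re-check until the dataset path is fully gone, since the adlfs race
# (https://github.com/fsspec/adlfs/issues/389) can leave empty
# directory stubs behind.
import time


def drop_storage_with_retry(fs_client, dataset_path, attempts=3, delay=1.0):
    """Call fs_client.rm() repeatedly until dataset_path no longer exists."""
    for attempt in range(attempts):
        try:
            if fs_client.exists(dataset_path):
                fs_client.rm(dataset_path, recursive=True)
            if not fs_client.exists(dataset_path):
                return  # fully removed
        except FileNotFoundError:
            return  # already gone
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries, surface the underlying error
        time.sleep(delay)
    raise RuntimeError(f"could not fully remove {dataset_path}")
```

This trades extra `exists()` calls for robustness; a cheaper variant could instead list and delete only the leftover directory stubs after the first `rm()`.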
Expected behavior
No response
Steps to reproduce
On Windows, Python 3.8, Git Bash:
- Clone the `dlt` repo
- Use the `master` or `devel` branch
- Run `poetry install -E az`
- Run `poetry shell`
- Configure `tests/.dlt/config.toml`:
  ```toml
  ACTIVE_DESTINATIONS = ["filesystem"]
  ALL_FILESYSTEM_DRIVERS = ["az"]

  [destination.filesystem.credentials]
  azure_storage_account_name = "synapseadslstorage"

  [tests]
  bucket_url_az = "az://synapseadsldata/_test"
  ```
- Configure `tests/.dlt/secrets.toml`:
  ```toml
  [destination.filesystem.credentials]
  azure_storage_account_key = "ACCOUNT_KEY_HERE"
  ```
- Run `pytest tests/load/filesystem/test_filesystem_client.py::test_successful_load`
Operating system
Windows
Runtime environment
Local
Python version
3.8
dlt data source
No response
dlt destination
Filesystem & buckets
Other deployment details
No response
Additional information
I did not test this on a regular Azure Blob Storage Account, only on Azure Data Lake Storage Gen2.
According to the linked issue this is now fixed; however, while working on the filesystem state sync, it seemed to me that it is not.
@sh-rp We might need to increase the adlfs version requirement in pyproject.toml. Right now it's resolving to adlfs version 2023.8.0 which doesn't contain the fix.
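(For anyone checking which adlfs version their own environment resolved to, the stdlib `importlib.metadata` module can report it at runtime; `installed_adlfs_version` is just an illustrative helper name:)

```python
# Report the resolved adlfs version, so you can tell whether the
# environment predates the rm() fix (e.g. 2023.8.0 does not contain it).
from importlib.metadata import PackageNotFoundError, version


def installed_adlfs_version():
    """Return the installed adlfs version string, or None if not installed."""
    try:
        return version("adlfs")
    except PackageNotFoundError:
        return None


print(installed_adlfs_version())
```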
@jorritsandbrink @sh-rp It's OK to increase the version in the lock file, but not OK to increase the minimum version in the deps. There are a lot of environments with a fixed fsspec version, and I've already had problems putting anything newer than this in.
But yeah, let's do that and see if our tests pass.
I upgraded it last week to test it out, and it did not solve the issue I was having while working on the filesystem state sync. I'm not 100% sure, though, that it's the same thing you mention here.
The problem was caused by fixtures that were not fired for all tests that created pipelines.