
Support DataLakeServiceClient for Azure Gen 2 Storage

Open KeerthiYandaOS opened this issue 2 years ago • 5 comments

Currently, adlfs uses BlobServiceClient for both ADLS Gen 2 and Blob storage. The blob client uses the Blob APIs underneath, but there is currently no option to use the Data Lake APIs. Could we accept a client instance as input, so that the user can pass either a BlobServiceClient or a DataLakeServiceClient depending on the use case?
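Not part of adlfs today, but the request might look roughly like this (a hedged sketch; the `make_filesystem` helper and the type-name dispatch are purely illustrative, not an existing adlfs API):

```python
def make_filesystem(service_client):
    """Illustrative sketch: dispatch on the client the caller passed in.

    Both class names come from the azure-storage SDKs
    (azure.storage.blob / azure.storage.filedatalake), but this helper
    itself is hypothetical, not part of adlfs.
    """
    kind = type(service_client).__name__
    if kind == "DataLakeServiceClient":
        return ("datalake", service_client)  # use the Data Lake (dfs) APIs
    if kind == "BlobServiceClient":
        return ("blob", service_client)      # use the Blob APIs, as today
    raise TypeError(f"unsupported client type: {kind}")
```

The user would build and configure the client themselves (credentials, account URL) and hand it to the filesystem, which then chooses the appropriate API family internally.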

KeerthiYandaOS avatar Feb 23 '23 23:02 KeerthiYandaOS

Thanks for opening this issue. A couple questions:

  1. Can you share an example of an operation that's enabled by the data lake APIs that isn't possible (or is maybe slower) with the Blob Storage APIs? Just trying to understand why a user might want this.
  2. Do you have a suggestion for how this might be implemented, and what the user-facing API would be?

It looks like historically this library had two implementations: one for Data Lake (Gen 1) and one for Blob Storage. API-wise, would we want to keep these separate? Or would we want a single AzureBlobFileSystem with a keyword that controls the underlying Azure client we use?

TomAugspurger avatar Feb 24 '23 14:02 TomAugspurger

Not OP here, but the initial sales pitch for Data Lake Gen2 when it launched was that it understands file-system structure. The name is a bit unfortunate, because it suggests the product is related to Data Lake Gen1, which it really isn't to any significant degree. I've always considered it "blob storage with first-class folder structure".

With azure-storage-file-datalake, listing the contents of a directory is very fast, and you can expand the file tree one level at a time. If I recall correctly, you need to do a prefix/glob match with BlobServiceClient (this may no longer be true; it's been years since I worked with it). Depending on the use case, that can be very slow. For datasets with many partitions, it makes a big difference if you mostly access them with partition filters, since listing blobs is (or used to be) so slow. You also get atomic, cheap renames of folders, which I can't imagine is easy to achieve with the Blob API.
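The single-level listing described above can be sketched like this (client setup omitted; the two helper functions are illustrative, but `get_paths()` and `walk_blobs()` are the relevant SDK methods):

```python
def list_dir_datalake(filesystem_client, path):
    """One directory level via the Data Lake API: a single get_paths() call."""
    # recursive=False returns only the immediate children of `path`
    return [p.name for p in filesystem_client.get_paths(path=path, recursive=False)]

def list_dir_blob(container_client, prefix):
    """One 'directory' level via the Blob API: delimiter-based prefix matching."""
    # walk_blobs groups the flat blob namespace into one hierarchy level per call
    return [b.name for b in container_client.walk_blobs(
        name_starts_with=prefix, delimiter="/")]
```

On a hierarchical-namespace (Gen2) account, the Data Lake call is a true directory listing rather than a scan over a flat namespace, which is where the speed difference on heavily partitioned datasets comes from.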

Data Lake Gen2 also supports a number of things I don't think are relevant to this project, like setting ACL/RBAC on folders, which aren't supported by blob storage. People who use adlfs may already be using azure-storage-file-datalake for those tasks before or after writing data, so they may have a configured client instance available, but that seems like a fairly weak reason to take on the complexity of supporting both clients.

kaaveland avatar Mar 02 '23 20:03 kaaveland

@efiop @hayesgb this would increase the speed a lot, please take a look :)

WaterKnight1998 avatar Aug 01 '23 12:08 WaterKnight1998

ADLS Gen2 storage supports requesting just the blobs in a specific directory via the get_paths() method. This might benefit #388.

daviewales avatar Mar 12 '24 01:03 daviewales

Also, renaming directories (essentially a move with recurse=True) is a case that's currently not possible, even on a Data Lake Gen2 account.
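For reference, the Data Lake SDK exposes this as a single server-side call on a directory client, roughly as follows (client setup omitted; the `move_directory` wrapper is illustrative, while `rename_directory` is the actual SDK method on DataLakeDirectoryClient):

```python
def move_directory(directory_client, new_path):
    """Server-side directory rename via the Data Lake API.

    `new_path` must include the target filesystem, e.g. "myfs/new/dir".
    With the flat Blob API, the equivalent operation would mean copying
    and deleting every blob under the old prefix.
    """
    return directory_client.rename_directory(new_name=new_path)
```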

aersam avatar Mar 14 '24 12:03 aersam