Support writes to Azure Storage
Description
The delta-rs Azure Storage backend currently doesn't support write operations: https://github.com/delta-io/delta-rs/blob/50a53cad084f2aa65ed0d6c0f1f5ad620a785530/rust/src/storage/azure.rs#L1
As a result, Delta Tables cannot be written to Azure Storage.
Several methods need to be implemented: https://github.com/delta-io/delta-rs/blob/50a53cad084f2aa65ed0d6c0f1f5ad620a785530/rust/src/storage/azure.rs#L192 https://github.com/delta-io/delta-rs/blob/50a53cad084f2aa65ed0d6c0f1f5ad620a785530/rust/src/storage/azure.rs#L196 https://github.com/delta-io/delta-rs/blob/50a53cad084f2aa65ed0d6c0f1f5ad620a785530/rust/src/storage/azure.rs#L200
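For context, these are the write-side methods of the storage backend trait. A rough sketch of their shape is below; signatures are simplified and may not match the actual trait in `rust/src/storage/mod.rs` exactly:

```rust
// Sketch of the write-side methods the Azure backend still needs. `StorageError`
// is a stand-in for delta-rs's own error type; signatures are simplified and may
// not match the real StorageBackend trait exactly.
use async_trait::async_trait;

#[derive(Debug)]
pub struct StorageError(pub String);

#[async_trait]
pub trait StorageBackendWrites {
    /// Upload `obj_bytes` to `path`, overwriting any existing object.
    async fn put_obj(&self, path: &str, obj_bytes: &[u8]) -> Result<(), StorageError>;

    /// Atomically rename `src` to `dst`, failing if `dst` already exists.
    /// The Delta commit protocol relies on this for write atomicity.
    async fn rename_obj_noreplace(&self, src: &str, dst: &str) -> Result<(), StorageError>;

    /// Delete the object at `path`.
    async fn delete_obj(&self, path: &str) -> Result<(), StorageError>;
}
```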
The delta-rs Azure Storage backend currently uses the Blob API to access the storage account (`ContainerClient`), which does not support atomic rename (`rename_obj_noreplace`). We need to use the ADLS Gen2 API, which does support atomic rename.
Both APIs can be used in parallel with ADLS Gen2 storage accounts, but we may want to simply switch all operations over to the ADLS Gen2 API unless there is an advantage to using the Blob API for some operations.
The other operations required by the delta-rs Azure Storage backend are: https://github.com/delta-io/delta-rs/blob/50a53cad084f2aa65ed0d6c0f1f5ad620a785530/rust/src/storage/azure.rs#L122 https://github.com/delta-io/delta-rs/blob/50a53cad084f2aa65ed0d6c0f1f5ad620a785530/rust/src/storage/azure.rs#L141 https://github.com/delta-io/delta-rs/blob/50a53cad084f2aa65ed0d6c0f1f5ad620a785530/rust/src/storage/azure.rs#L157
For atomic rename to work, the actual Azure Storage Account resource being used must have hierarchical namespaces enabled, which is what makes it an "ADLS Gen2" storage account.
Apologies for any remaining confusion; this is a bit tricky to explain.
Also, the Azure Rust SDK doesn't yet support write operations via the ADLS Gen2 API, but work is underway.
Use Case
Write Delta Tables to Azure Storage
Related Issue(s)
None
I am considering looking into this once (if :)) #486 gets merged. My assumption would be that `delete_obj` and `put_obj` could be implemented without the need for hierarchical namespaces enabled. However, I am still a bit confused about what constitutes Gen2... As far as I understand, the `*.dfs.core.windows.net` endpoint is available on all recent (v2) storage accounts, and if HNS is enabled, it becomes ADLS Gen2 and the file_system APIs / directories become available? If HNS is a requirement, #60 should then likely be addressed as well somewhere.
Thanks @roeap, that would be awesome. Your assumptions above are all correct. Only HNS supports atomic renames, which will obviate the need for the lock service that we have to use for S3.
@roeap some things to be aware of:
- The Azure Rust SDK doesn't have any crate releases, so it can't be included in a delta-rs crate release yet (hopefully soon)
- The Azure Rust SDK doesn't have the ADLS Gen2 atomic rename operation yet, but I've been working up to adding it. I just added a bunch of operations, and `rename_file_if_not_exists` is next.
Great to have a collaborator in making delta-rs work with Azure!
PR for #60 is #492
@thovoll, thanks for the pointers, and excited to see that the azure-sdk-for-rust repo seems to have seen increased activity recently - I'm guessing pushing for a release of the crates... hopefully we can bring the Azure support throughout the Arrow Rust data ecosystem on par with S3 :).
Keep track of ADLS Gen2 support in the Azure Rust SDK here: https://github.com/Azure/azure-sdk-for-rust/issues/496
Do we know if the atomic rename in the Gen2 API is a cheap pointer change or will actually require a data copy behind the scenes? If it's the latter, we might be better off using simple put-if-absent semantics instead. @blogle started a good discussion around this redesign in this slack thread for the GCS backend.
Yes, file rename is a cheap and atomic operation in ADLS Gen2.
We now have the required operation in the Azure Rust SDK that enables us to implement `rename_obj_noreplace` in the delta-rs Azure Storage backend: https://github.com/Azure/azure-sdk-for-rust/pull/506
We should be able to implement `put_obj` and `delete_obj` using the Blob API, just like `head_obj`, `get_obj`, and `list_objs` are implemented. Initially, only `rename_obj_noreplace` will need to use the ADLS Gen2 API. To reiterate, we should be able to use both the Blob API and the ADLS Gen2 API in parallel. We can decide later if we want to exclusively use the ADLS Gen2 API (as of now, not all required operations are implemented there).
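To illustrate the "both APIs in parallel" idea, here is a rough sketch of a backend that holds one client per API and routes each operation to whichever API supports it. The client type names are placeholders, not actual Azure Rust SDK types:

```rust
// Hypothetical sketch: route plain reads/writes to the Blob API and the atomic
// rename to the ADLS Gen2 API. `BlobContainerClient` and `DataLakeFilesystemClient`
// are placeholder types, not real Azure Rust SDK names.
struct BlobContainerClient;      // put/get/head/list/delete via the Blob API
struct DataLakeFilesystemClient; // atomic rename via the ADLS Gen2 (dfs) API

pub struct AdlsGen2Backend {
    blob: BlobContainerClient,
    dfs: DataLakeFilesystemClient,
}

impl AdlsGen2Backend {
    pub async fn put_obj(&self, _path: &str, _bytes: &[u8]) -> Result<(), String> {
        // A regular (non-atomic) Blob upload is sufficient for put_obj.
        Ok(())
    }

    pub async fn rename_obj_noreplace(&self, _src: &str, _dst: &str) -> Result<(), String> {
        // Only this operation needs the Gen2 API: rename_file_if_not_exists
        // gives the atomic, fail-if-exists semantics the Delta log requires.
        Ok(())
    }
}
```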
I also know #486 is ready to merge now, so we will be able to start working on supporting writes to Azure Storage soon.
Is this 3rd party Azure SDK implementation found here - https://crates.io/crates/azure_sdk_for_rust - the same one? It has crates already released.
I think these crates are actually the foundation for the development going on under the Azure GitHub org right now. If you go to the repo, you'll see a notice that they migrated. Looking at the efforts in the new repo, they are working hard on stabilizing APIs and hopefully will soon cut a 0.1 release.
@roeap @houqp what do you think about the auth methods we should support?
After #486 we have:
- Static auth: shared key, shared access signature (SAS), and connection string (shared key)
- Auto-refreshing auth based on DefaultAzureCredential
None of these should be used in production applications in my opinion, for various reasons. The static methods present key rotation and authz challenges. DefaultAzureCredential is just a convenience abstraction with non-deterministic behavior.
I propose we use managed identity credentials (AMI). The downside is that they are hard to use during development, but we can support `az cli` credentials for this and activate that path using a suitable switch in delta-rs, maybe an environment variable.
Making these work at runtime requires an identity in Azure (which the key based methods don't), but this is the right direction since it better supports key rotation and authz.
I realize we should probably align this with how we do things for AWS and GCP, but I wanted to present this unadulterated argument first.
Any thoughts?
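As a strawman for the "suitable switch" proposed above, credential selection could look roughly like this; the environment variable name and credential kinds are invented for illustration only:

```rust
// Strawman only: default to managed identity, fall back to Azure CLI credentials
// when an (invented) environment variable is set for local development. The
// credential kinds stand in for whatever the Azure Rust SDK ends up exposing.
enum AzureCredentialKind {
    ManagedIdentity, // AMI, intended for production
    AzureCli,        // az cli login, intended for development
}

fn select_credential() -> AzureCredentialKind {
    match std::env::var("AZURE_STORAGE_USE_CLI_CREDENTIAL") {
        Ok(v) if v == "1" || v.eq_ignore_ascii_case("true") => AzureCredentialKind::AzureCli,
        _ => AzureCredentialKind::ManagedIdentity,
    }
}
```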
Thanks for starting this discussion @thovoll. I think we should definitely have first-class support for AMI due to production requirements. @thovoll does the Azure SDK support the concept of a credential chain? Basically, the way it works for AWS is that the client will automatically identify the best auth method based on provided environment variables and credential config files across different languages and platforms: see https://docs.aws.amazon.com/sdk-for-java/v1/developer-guide/credentials.html and https://github.com/rusoto/rusoto/blob/master/AWS-CREDENTIALS.md. If we have something similar provided by the Azure SDK out of the box, that would be ideal. If not, we can implement a temporary workaround within delta-rs to simulate something similar. For example, go for AMI if it's available; if not, try other routes that are easier to set up in development environments.
The DefaultCredential we use is a credential chain. Only authorization via key or SAS is not part of this chain. I am not sure about priorities in the chained credential, but I assume this can either be configured or achieved with a simple custom credential.
Correct, the default credential chain from the Azure Rust SDK can be configured, but it's still rudimentary, also in terms of the methods it supports. There's no custom chain in the Azure Rust SDK yet, but the Azure .Net SDK, for example, has more complete default and custom credential chains, which will be implemented in the Azure Rust SDK at some point.
Regardless, my argument is that credential chains are convenient but shouldn't be used for production code. I'd rather explicitly state which credentials I want to use than fall through a hidden switch statement based on environmental conditions. The potential 'decoy' log entries alone are distracting.
Granted, if we want to support all the flavors it would be a little tedious to implement a deterministic approach - but I'd probably still prefer it. Since delta-rs is a library we should probably support all the credential types (but without using a chain).
The AWS docs that @houqp linked recommend using the default chain but also mention the other approaches:
- Use the default credential provider chain (recommended).
- Use a specific credential provider or provider chain (or create your own).
- Supply the credentials yourself. These can be root account credentials, IAM credentials, or temporary credentials retrieved from AWS STS.
The Azure docs at least hint at the tradeoff:
Note: DefaultAzureCredential is intended to simplify getting started with the SDK by handling common scenarios with reasonable default behaviors. Developers who want more control or whose scenario isn't served by the default settings should use other credential types.
What I'm arguing here is that we should want more control. An open question is "where is the right place?" Maybe delta-rs, maybe leave it up to consumers like kafka-delta-ingest.
As for a concrete long-term proposal (long-term because the Azure Rust SDK doesn't fully support this yet):
- Take `TokenCredential` as a parameter in `delta-rs` and pass it to the `DataLakeClient` (not possible yet, but here is what that looks like in the Azure .Net SDK).
- This still allows `delta-rs` users to specify a chain, since `DefaultAzureCredential` is a `TokenCredential`. But it doesn't allow `delta-rs` users to use other methods like shared keys or connection strings (which the Azure SDK supports but we would be hiding from `delta-rs` users, indicating our industrial-strength intent).
- Maybe we should support SAS, but this method is intended for "limited delegated access" (potentially a scenario that `delta-rs` would be used in).
- Also need to revisit how we can auto-refresh the credential.
- In `kafka-delta-ingest`, don't pass `DefaultAzureCredential` to `delta-rs` but implement a more opinionated and deterministic approach. Only support AMI (for prod) and CLI (for dev), possibly controlled by a command line argument or environment variable.
Again, this is a long-term proposal, but we can still discuss this in principle, I think. If we agree, we can plot a path towards this future across all 3 projects (`kafka-delta-ingest`, `delta-rs`, and `azure-sdk-for-rust`).
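To make the long-term shape concrete, here is a sketch of a backend constructor that accepts any token credential; the trait shown is a stand-in for azure_core's `TokenCredential`, and all names here are illustrative:

```rust
// Long-term sketch: the backend takes a TokenCredential-like object, so callers
// can pass either a chained credential or a single explicit one. The trait is a
// stand-in for azure_core's TokenCredential, not the real definition.
use std::sync::Arc;
use async_trait::async_trait;

#[async_trait]
pub trait TokenCredentialLike: Send + Sync {
    /// Return a bearer token valid for the given resource.
    async fn get_token(&self, resource: &str) -> Result<String, String>;
}

pub struct AzureStorageBackend {
    credential: Arc<dyn TokenCredentialLike>,
    // The DataLakeClient would be built from `credential` once the SDK supports it.
}

impl AzureStorageBackend {
    pub fn new(credential: Arc<dyn TokenCredentialLike>) -> Self {
        Self { credential }
    }
}
```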
As for a concrete short-term proposal:
- Obtain a token (via `DefaultAzureCredential`) in `delta-rs`.
- Pass the token to `DataLakeClient` (this is the only method supported by the Azure Rust SDK at the moment).
- This is the quickest way to enable writes to Azure Storage.
- Improve from there, i.e. re-implement auto-refreshing credentials and everything else mentioned above over time.
I'll raise a draft PR for easier discussion.
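A minimal sketch of that short-term flow, assuming the azure_identity default credential chain; module paths, type names, and the storage resource URI are assumptions about the Azure Rust SDK, not confirmed API:

```rust
// Short-term sketch: fetch a bearer token once via the default credential chain
// and hand the raw token string to the Data Lake client. All names and the
// resource URI below are assumptions about the Azure Rust SDK.
use azure_identity::token_credentials::{DefaultCredential, TokenCredential};

async fn storage_bearer_token() -> Result<String, Box<dyn std::error::Error>> {
    let credential = DefaultCredential::default();
    // "https://storage.azure.com/" is the usual resource for Azure Storage tokens.
    let response = credential.get_token("https://storage.azure.com/").await?;
    Ok(response.token.secret().to_owned())
}
```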
Ha, I see what I was missing now. In AWS, it's common to use the default credential chain in production, but it looks like that's not the case in Azure. That said, for S3, we do have an option to let users explicitly provide a preconfigured client when creating the storage backend instance: https://github.com/delta-io/delta-rs/blob/cba4e3d316c56c65dc1cb660dc7e6d380ddd3a6b/rust/src/storage/s3/mod.rs#L413.
It's definitely doable and I'm sure many people use credential chains in Azure as well. Irrespective of how common it is, what do you think about the argument otherwise?
For S3, I think you meant this one? https://github.com/delta-io/delta-rs/blob/cba4e3d316c56c65dc1cb660dc7e6d380ddd3a6b/rust/src/storage/s3/mod.rs#L425-L429
I'll take a closer look and try to keep it close.
Do we support auto-refreshing credentials in S3 and GCP?
Yeah, sorry @thovoll, `new_with` is the method that I was trying to use as an example. Auto-refresh of credentials should be supported by the cloud SDK if possible.
The way we implemented the Azure credentials is actually heavily inspired by the internals of the rusoto S3 crate 🙂.
Does the AWS and GCP SDK support auto-refresh?
Yes, I remember the AWS SDK has an auto-refreshing credential provider built in.
I'm looking into how to best implement auto-refresh.
Ok, so auto-refresh in the Azure Rust SDK will take some time; I filed an issue here: https://github.com/Azure/azure-sdk-for-rust/issues/543
In the meantime, we'll have to implement auto-refresh in `delta-rs`.
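A sketch of what an interim auto-refresh wrapper inside delta-rs could look like, caching the token and refreshing it shortly before expiry; everything here is illustrative, not existing delta-rs code:

```rust
// Illustrative token cache that refreshes a few minutes before expiry. The
// `fetch_token` closure would call into the Azure credential; nothing here is
// existing delta-rs or Azure SDK code.
use std::time::{Duration, Instant};
use tokio::sync::Mutex;

struct CachedToken {
    token: String,
    expires_at: Instant,
}

pub struct RefreshingToken<F> {
    fetch_token: F, // async closure returning (token, lifetime)
    cached: Mutex<Option<CachedToken>>,
}

impl<F, Fut> RefreshingToken<F>
where
    F: Fn() -> Fut,
    Fut: std::future::Future<Output = Result<(String, Duration), String>>,
{
    pub fn new(fetch_token: F) -> Self {
        Self { fetch_token, cached: Mutex::new(None) }
    }

    /// Return the cached token, refreshing it if it expires within 5 minutes.
    pub async fn get(&self) -> Result<String, String> {
        let mut guard = self.cached.lock().await;
        let needs_refresh = match &*guard {
            Some(c) => c.expires_at <= Instant::now() + Duration::from_secs(300),
            None => true,
        };
        if needs_refresh {
            let (token, lifetime) = (self.fetch_token)().await?;
            *guard = Some(CachedToken {
                token,
                expires_at: Instant::now() + lifetime,
            });
        }
        Ok(guard.as_ref().unwrap().token.clone())
    }
}
```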
@thovoll - this is done by now, right?