delta-rs icon indicating copy to clipboard operation
delta-rs copied to clipboard

Support writes to Azure Storage

Open thovoll opened this issue 3 years ago • 26 comments

Description

The delta-rs Azure Storage backend currently doesn't support write operations: https://github.com/delta-io/delta-rs/blob/50a53cad084f2aa65ed0d6c0f1f5ad620a785530/rust/src/storage/azure.rs#L1 As a result, Delta Tables can not be written to Azure Storage.

Several methods need to be implemented: https://github.com/delta-io/delta-rs/blob/50a53cad084f2aa65ed0d6c0f1f5ad620a785530/rust/src/storage/azure.rs#L192 https://github.com/delta-io/delta-rs/blob/50a53cad084f2aa65ed0d6c0f1f5ad620a785530/rust/src/storage/azure.rs#L196 https://github.com/delta-io/delta-rs/blob/50a53cad084f2aa65ed0d6c0f1f5ad620a785530/rust/src/storage/azure.rs#L200

The delta-rs Azure Storage backend currently uses the Blob API to access the storage account (ContainerClient), which does not support atomic rename (rename_obj_noreplace). We need to use the ADLS Gen2 API, which does support atomic rename.

Both APIs can be used in parallel with ADLS Gen2 storage accounts, but we may want to simply switch all operations over to the ADLS Gen2 API unless there is an advantage to using the Blob API for some operations.

The other operations required by the delta-rs Azure Storage backend are: https://github.com/delta-io/delta-rs/blob/50a53cad084f2aa65ed0d6c0f1f5ad620a785530/rust/src/storage/azure.rs#L122 https://github.com/delta-io/delta-rs/blob/50a53cad084f2aa65ed0d6c0f1f5ad620a785530/rust/src/storage/azure.rs#L141 https://github.com/delta-io/delta-rs/blob/50a53cad084f2aa65ed0d6c0f1f5ad620a785530/rust/src/storage/azure.rs#L157

For atomic rename to work, the actual Azure Storage Account resource being used must have hierarchical namespaces enabled, which is what makes it an "ADLS Gen2" storage account.

Apologies for any remaining confusion, this is a bit tricky to explain.

Also, the Azure Rust SDK doesn't yet support write operations via the ADLS Gen API, but work is underway.

Use Case Write Delta Tables to Azure Storage

Related Issue(s) None

thovoll avatar Nov 05 '21 14:11 thovoll

I am considering looking into this once (if :)) #486 gets merged. MY assumption would be that delete_obj and put_obj could be implemented without the need for hierarchical namespaces enabled. However I am still a bit confused what constitutes gen2... As far as I understand, the *.dfs.core.windows.net endpoint is available on all recent (v2) storage accounts, if HNS is enabled, it becomes ADLS Gen 2, and the file_system APIs / directories become available? If HNS is a requirement, likely #60 should then be addressed as well somewhere.

roeap avatar Nov 09 '21 12:11 roeap

Thanks @roeap, that would be awesome. Your assumptions above are all correct. Only HNS supports atomic renames, which will obviate the need for the lock service that we have to use for S3.

thovoll avatar Nov 09 '21 14:11 thovoll

@roeap some things to be aware of:

  • The Azure Rust SDK doesn't have any crate releases so can't be included in a delta-rs crate release yet (hopefully soon)
  • The Azure Rust SDK doesn't have the ADLS Gen2 atomic rename operation yet, but I've been working up to adding it. I just added a bunch of operations and rename_file_if_not_exists is next.

Great to have a collaborator in making delta-rs work with Azure!

thovoll avatar Nov 09 '21 14:11 thovoll

PR for #60 is #492

thovoll avatar Nov 09 '21 15:11 thovoll

@thovoll, thanks for the pointers, and excited to see that the azure-sdk-for-rust repo seems to see some increased activity recently - I'm guessing pushing for a release of the crates... hopefully we can bring the azure support throughout the arrow-rust-data-ecosystem en par with S3 :).

roeap avatar Nov 09 '21 15:11 roeap

Keep track of ADLS Gen2 support in the Azure Rust SDK here: https://github.com/Azure/azure-sdk-for-rust/issues/496

thovoll avatar Nov 10 '21 04:11 thovoll

do we know if the atomic rename in gen2 api is a cheap pointer change or will actually require data copy behind the scene? If it's the latter, we might be better off using simple put if absent semantic instead. @blogle started a good discussion around this redesign in this slack thread for the GCS backend.

houqp avatar Nov 15 '21 07:11 houqp

Yes, file rename is a cheap and atomic operation in ADLS Gen2.

thovoll avatar Nov 15 '21 14:11 thovoll

We now have the required operation in the Azure Rust SDK that enables us to implement rename_obj_noreplace in the delta-rs Azure Storage backend: https://github.com/Azure/azure-sdk-for-rust/pull/506

We should be able to implement put_obj and delete_obj using the Blob API, just like head_obj, get_obj, and list_objs are implemented. Initially, only rename_obj_noreplace will need to use the ADLS Gen2 API. To reiterate, we should be able to use both Blob API and ADLS Gen2 API in parallel. We can decide later if we want to exclusively use the ADLS Gen2 API (as of now, not all required operations are implemented there).

I also know #486 is ready to merge now, so we will be able to start working on supporting writes to Azure Storage soon.

thovoll avatar Nov 15 '21 14:11 thovoll

Is this 3rd party Azure SDK implementation found here - https://crates.io/crates/azure_sdk_for_rust - the same one? It has crates already released.

andrei-ionescu avatar Nov 30 '21 23:11 andrei-ionescu

I think these crates are actually the foundation for the development going on under the Azure github org right now. If you go to the repo, you'll see a notice that they migrated. Looking at the efforts in the new repo, they are working hard on stabilizing APIs and hopefully soon cutting a 0.1 release.

roeap avatar Dec 01 '21 10:12 roeap

@roeap @houqp what do you think about the auth methods we should support?

After #486 we have:

  • Static auth: shared key, shared access signature (SAS), and connection string (shared key)
  • Auto-refreshing auth based on DefaultAzureCredential

None of these should be used in production applications in my opinion, for various reasons. The static methods present key rotation and authz challenges. DefaultAzureCredential is just a convenience abstraction with non-deterministic behavior.

I propose we use managed identity credentials (AMI). The downside is that they are hard to use during development, but we can support az cli credentials for this and activate that path using a suitable switch in delta-rs, maybe an environment variable.

Making these work at runtime requires an identity in Azure (which the key based methods don't), but this is the right direction since it better supports key rotation and authz.

I realize we should probably align this with how we do things for AWS and GCP, but I wanted to present this unadulterated argument first.

Any thoughts?

thovoll avatar Dec 08 '21 20:12 thovoll

Thanks for starting this discussion @thovoll . I think we should definitely have first class support for AMI due to production requirements. @thovoll does Azure sdk support the concept of credential chain? Basically the way it works for AWS is the client will automatically identify the best auth method based on provided environment variable and credential config file across different languages and platforms: see https://docs.aws.amazon.com/sdk-for-java/v1/developer-guide/credentials.html and https://github.com/rusoto/rusoto/blob/master/AWS-CREDENTIALS.md. If we have something similar provided by the azure sdk out of the box, that would be ideal. If not, we can implement a temp workaround within delta-rs to simulate something similar. For example, go for AMI if it's available, if not try other routes that are easier to prepare for development environments.

houqp avatar Dec 09 '21 03:12 houqp

The DefaultCredential we use is an credential chain. Only authorization via KEY or or SAS is not part of this chain. I am not sure about priorities in the chained credential, but assume this can be either configured or achieved with a simple custom credential.

roeap avatar Dec 09 '21 08:12 roeap

Correct, default credential chain from the Azure Rust SDK can be configured but it's still rudimentary, also in terms of the methods it supports. There's no custom chain in the Azure Rust SDK yet but the Azure .Net SDK for example has more complete default and custom credential chains which will be implemented in the Azure Rust SDK at some point.

Regardless, my argument is that credential chains are convenient but shouldn't be used for production code. I'd rather explicitly state what credentials I want to use than falling through a hidden switch statement based on environmental conditions. The potential 'decoy' log entries alone are distracting.

Granted, if we want to support all the flavors it would be a little tedious to implement a deterministic approach - but I'd probably still prefer it. Since delta-rs is a library we should probably support all the credential types (but without using a chain).

The AWS docs that @houqp linked recommend using the default chain but also mention the other approaches:

  • Use the default credential provider chain (recommended).
  • Use a specific credential provider or provider chain (or create your own).
  • Supply the credentials yourself. These can be root account credentials, IAM credentials, or temporary credentials retrieved from AWS STS.

The Azure docs at least hint at the tradeoff:

Note: DefaultAzureCredential is intended to simplify getting started with the SDK by handling common scenarios with reasonable default behaviors. Developers who want more control or whose scenario isn't served by the default settings should use other credential types.

What I'm arguing here is that we should want more control. An open question is "where is the right place?" Maybe delta-rs, maybe leave it up to consumers like kafka-delta-ingest.

thovoll avatar Dec 09 '21 14:12 thovoll

As for a concrete long-term proposal (long-term because the Azure Rust SDK doesn't fully support this yet):

  • Take TokenCredential as a parameter in delta-rs and pass it to the DataLakeClient (not possible yet but here is what that looks like in the Azure .Net SDK).
  • This still allows delta-rs users to specify a chain since DefaultAzureCredential is a TokenCredential. But it doesn't allow delta-rs users to use other methods like shared keys or connection strings (which the Azure SDK supports but we would be hiding from delta-rs users, indicating our industrial strength intent).
  • Maybe we should support SAS but this method is intended for "limited delegated access" (potentially a scenario that delta-rs would be used in).
  • Also need to revisit how we can auto-refresh the credential.
  • In kafka-delta-ingest, don't pass DefaultAzureCredential to delta-rs but implement a more opinionated and deterministic approach. Only support AMI (for prod) and CLI (for dev), possibly controlled by a command line argument or environment variable.

Again, this is a long-term proposal but we can still discuss this in principle I think. If we agree, we can plot a path towards this future across all 3 projects (kafka-delta-ingest, delta-rs, and azure-sdk-for-rust).

thovoll avatar Dec 09 '21 15:12 thovoll

As for a concrete short-term proposal:

  • Obtain a token (via DefaultAzureCredential) in delta-rs
  • Pass the token to DataLakeClient (this is the only method supported by the Azure Rust SDK at the moment)
  • This is the quickest way to enable writes to Azure Storage.
  • Improve from there, i.e. re-implement auto-refreshing credentials and everything else mentioned above over time.

I'll raise a draft PR for easier discussion.

thovoll avatar Dec 09 '21 15:12 thovoll

Ha, I see where I am missing now. In AWS, it's common to use default credential chain in production, but looks like it's not the case in Azure. That said, for s3, we do have an option to let users explicitly provide a preconfigured client when creating the storage backend instance: https://github.com/delta-io/delta-rs/blob/cba4e3d316c56c65dc1cb660dc7e6d380ddd3a6b/rust/src/storage/s3/mod.rs#L413.

houqp avatar Dec 09 '21 18:12 houqp

It's definitely doable and I'm sure many people use credential chains in Azure as well. Irrespective of how common it is, what do you think about the argument otherwise?

For S3, I think you meant this one? https://github.com/delta-io/delta-rs/blob/cba4e3d316c56c65dc1cb660dc7e6d380ddd3a6b/rust/src/storage/s3/mod.rs#L425-L429

I'll take a closer look and try to keep it close.

Do we support auto-refreshing credentials in S3 and GCP?

thovoll avatar Dec 09 '21 22:12 thovoll

Yeah, sorry @thovoll , new_with is the method that I was trying to use as an example. Auto refresh of credential should be supported by cloud SDK if possible.

houqp avatar Dec 10 '21 06:12 houqp

The way we implemented for the Azure credentials is actually heavily inspired by the internals of the rusto s3 crate 🙂.

roeap avatar Dec 10 '21 07:12 roeap

Does the AWS and GCP SDK support auto-refresh?

thovoll avatar Dec 10 '21 17:12 thovoll

Yes, I remember the AWS SDK has an auto refreshing credential provider builtin.

houqp avatar Dec 11 '21 06:12 houqp

I'm looking into how to best implement auto-refresh.

thovoll avatar Dec 13 '21 15:12 thovoll

Ok, so auto-refresh in the Azure Rust SDK will take some time, filed an issue here: https://github.com/Azure/azure-sdk-for-rust/issues/543

In the meantime, we'll have to implement auto-refresh in delta-rs.

thovoll avatar Dec 16 '21 18:12 thovoll

@thovoll - this is done by now, right?

roeap avatar May 15 '22 17:05 roeap