duckdb_azure

Support for specifying token directly

Open aersam opened this issue 2 years ago • 17 comments

Hi there

Thanks for this cool extension, that will enable lots of use cases for us.

If you acquire the token outside duckdb, it would be nice to be able to do something like this:

SET azure_storage_bearer_token = '<your_token>';

This is especially useful if you use Managed Identity / Interactive Browser Credentials or the like.

aersam avatar Nov 06 '23 08:11 aersam

Hi @aersam, thanks for reporting. There are some changes coming up to how duckdb manages credentials; when that gets merged, I will look into adding this to it.

samansmink avatar Nov 06 '23 14:11 samansmink

It would be nice to have this. OneLake, which is based on Azure, uses tokens by default, so today DuckDB can't use it directly :(

djouallah avatar Jan 28 '24 02:01 djouallah

Hello,

Just wondering if the issue is still open, now that the extension is capable of handling some credential types? If yes, would you mind explaining the workflow a bit? I do not understand the idea of the bearer token: you would have to renew it manually each time it expires, no?

quentingodeau avatar Feb 23 '24 07:02 quentingodeau

Yes, you have to renew it manually. The main use case is when you already have a token in Python or similar and want to use it, e.g. you could have a token from a user context in a Python backend and want to pass that on. In such cases the lifetime is not an issue: your Python library handles renewal, and just before executing something you update the duckdb variable.
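
A minimal sketch of that last step, assuming the azure_storage_bearer_token setting proposed at the top of this issue existed (it is only a request here, not something the extension provides). The helper name build_set_token_sql is hypothetical; it only builds the statement and doubles single quotes so the token cannot break out of the SQL string literal:

```python
def build_set_token_sql(token: str, setting: str = "azure_storage_bearer_token") -> str:
    """Build a SET statement for the (proposed, not yet existing) bearer-token setting."""
    if not token:
        raise ValueError("token must not be empty")
    # Double any single quote so the token stays inside the SQL string literal.
    escaped = token.replace("'", "''")
    return f"SET {setting} = '{escaped}';"


# The token would normally come from the Python auth library or the request context.
statement = build_set_token_sql("eyJ0eXAi.example.token")
print(statement)
```

The caller would run the resulting statement on its DuckDB connection right before each query that needs storage access.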

aersam avatar Feb 23 '24 09:02 aersam

Ok, one more question: the token comes from an SPN, a managed identity, a workload identity or an env variable, no? Why not pass this information to duckdb as a secret and let it get a new token for you? (I can take a look at implementing your request, I think it is not very complex, but I wonder whether it is a common use case or a really specific one.)

quentingodeau avatar Feb 23 '24 20:02 quentingodeau

Yeah, I agree with @quentingodeau. The implementation would be something along the lines of:

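#include <azure/core/credentials/credentials.hpp>
#include <string>
#include <utility>

// Sketch: returns a pre-acquired token as-is; it is never refreshed.
class RawTokenCredential : public Azure::Core::Credentials::TokenCredential {
public:
	RawTokenCredential(std::string token, Azure::DateTime expires_on)
	    : Azure::Core::Credentials::TokenCredential("RawTokenCredential") {
		raw_token.Token = std::move(token);
		raw_token.ExpiresOn = expires_on;
	}
	Azure::Core::Credentials::AccessToken GetToken(
	    Azure::Core::Credentials::TokenRequestContext const& tokenRequestContext,
	    Azure::Core::Context const& context) const override {
	    return raw_token;
	}
	Azure::Core::Credentials::AccessToken raw_token;
};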

But it is a little hacky and probably not desirable if one of the other credential provider methods can be used. Note that the Azure SDK does not provide this RawTokenCredential, so to me that feels like a hint that this is not a common path.

samansmink avatar Feb 26 '24 10:02 samansmink

Not very common, but sometimes required. I'd say it's just the more low-level approach for advanced use cases

aersam avatar Feb 26 '24 18:02 aersam

Also there are so many ways to use Microsoft's Entra ID that I don't think you want to handle every edge case

aersam avatar Feb 26 '24 18:02 aersam

it is common, for example today, I can't write to Fabric OneLake using DuckDB

djouallah avatar Feb 26 '24 22:02 djouallah

@djouallah do you know how Fabric authenticates? Does it use app registration?

quentingodeau avatar Feb 28 '24 12:02 quentingodeau

yes https://stackoverflow.com/questions/76794202/authentication-not-granted-for-service-principal-token-in-ms-fabric-api-using-py

djouallah avatar Feb 28 '24 22:02 djouallah

Just chiming in here, this is also standard usage at our company. Basically we do something analogous to DeviceCodeCredential and then store the results in a custom class. The code is very similar to what samansmink suggested above, except it also keeps the refresh_token and refreshes the access token whenever needed.

The goal is to authenticate with a username/password, without having to either re-authenticate constantly or having to store username/password somewhere. Creating a service principal or managed identity per user is too difficult to manage/govern.

I'm not up to the task of writing it in duck/C++ myself; we previously used Python and adlfs to authenticate this way. But if I can help with anything, e.g. testing, I'd be happy to do so.

j-r77 avatar Apr 16 '24 12:04 j-r77

Sorry, I have been away a bit. I will try to see if I can find a way to automate some testing on this. But just for info, I may deprioritize this PR in order to add the write capability first.

quentingodeau avatar Apr 16 '24 22:04 quentingodeau

Ok, but good that it's still on the radar. I'm currently missing support for user-assigned managed identities in duckdb, which I could work around with direct token support.

aersam avatar Jul 03 '24 06:07 aersam

It looks like that could be a small change, so likely something I could contribute a PR for.

In my case I hit a couple of issues with the current auth setup in the extension:

  • Azure Synapse uses a non-standard way to get access tokens: code needs to call mssparkutils.credentials.getToken("Storage").
  • In #63, one possibility for auth failure is that Az CLI gives tokens for the wrong user identity (I use multiple user identities on my machine)

Those feel like a long tail of edge cases, so likely not worth built-in support, but it would be nice to unblock them by allowing custom access-token generation.

Re: 'that feels like a hint that this is not a common path' -- in my experience it is actually fairly common to derive custom classes from TokenCredential to abstract away non-standard auth mechanisms from the Azure SDK. For instance, in Python, auth on Azure Synapse can be done like this:

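import sys
from typing import Any, Optional
from azure.core.credentials import AccessToken, TokenCredential

# mssparkutils is available in Azure Synapse notebooks.
class StorageCredential(TokenCredential):
    def get_token(self, *scopes: str, claims: Optional[str] = None, tenant_id: Optional[str] = None, **kwargs: Any) -> AccessToken:
        # sys.maxsize as expires_on: the SDK never treats this token as expired.
        return AccessToken(mssparkutils.credentials.getToken("Storage"), sys.maxsize)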

Couple of potential issues:

  • Ideally tokens would be refreshed to avoid auth failures when tokens expire. This is typically achieved through some form of callback. Not sure if this is feasible in duckdb. Alternative might be for the caller to update the token on a timer.
  • Token expiration needs to be provided in Azure::Core::Credentials::AccessToken via the ExpiresOn field. One option could be to parse that from the exp claim in access tokens. Another could be to have the client provide that.
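
The "parse it from the exp claim" option can be sketched as follows, in Python for brevity. This assumes the access token is a JWT with a numeric exp claim; the function name token_expiry is hypothetical, and the signature is deliberately not verified since the value is only used as an expiry hint:

```python
import base64
import json
from datetime import datetime, timezone


def token_expiry(access_token: str) -> datetime:
    """Return the expiry time from a JWT's `exp` claim, without verifying the signature."""
    parts = access_token.split(".")
    if len(parts) != 3:
        raise ValueError("not a JWT: expected three dot-separated segments")
    payload_b64 = parts[1]
    # JWT segments are base64url without padding; restore it before decoding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    claims = json.loads(base64.urlsafe_b64decode(payload_b64))
    # `exp` is expressed in seconds since the Unix epoch.
    return datetime.fromtimestamp(claims["exp"], tz=timezone.utc)


def _b64url(data: dict) -> str:
    """Encode a dict as an unpadded base64url JSON segment (demo helper)."""
    return base64.urlsafe_b64encode(json.dumps(data).encode()).rstrip(b"=").decode()


# Demo with a hand-built, unsigned token; real tokens come from the identity provider.
demo_token = ".".join([_b64url({"alg": "none"}), _b64url({"exp": 1700000000}), "sig"])
print(token_expiry(demo_token).isoformat())
```

If the token turns out not to be a parseable JWT, the fallback would be the other option: have the client supply the expiry explicitly.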

Is there a preference on how to solve those?

mmaitre314 avatar Jul 10 '24 12:07 mmaitre314

@mmaitre314 we currently don't have a mechanism in duckdb to handle token expiry (yet) so that would probably be a place to start on this.

Otherwise I think we can just add this and document the fact that manual secret refreshing is required. That way this can work as a workaround until we have proper secret expiration
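
Until secret expiration exists, the manual refresh could live on the caller's side. A minimal Python sketch, assuming the caller supplies a fetch function that returns a token and its expiry as Unix seconds; the RefreshingToken class and its parameters are hypothetical and not part of DuckDB or the Azure SDK:

```python
import time
from typing import Callable, Optional, Tuple


class RefreshingToken:
    """Cache a token and fetch a new one shortly before it expires."""

    def __init__(
        self,
        fetch: Callable[[], Tuple[str, float]],
        margin_seconds: float = 300.0,
        clock: Callable[[], float] = time.time,
    ) -> None:
        self._fetch = fetch            # returns (token, expires_at as Unix seconds)
        self._margin = margin_seconds  # refresh this long before the actual expiry
        self._clock = clock            # injectable so the logic can be tested
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> Tuple[str, bool]:
        """Return (token, refreshed); refreshed is True when a new token was fetched."""
        if self._token is None or self._clock() >= self._expires_at - self._margin:
            self._token, self._expires_at = self._fetch()
            return self._token, True
        return self._token, False
```

Whenever get() reports a refresh, the caller would push the new credential into DuckDB again before running the next query.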

samansmink avatar Jul 10 '24 13:07 samansmink

One workaround which works with the extension as-is, albeit a convoluted one:

  • Start with an Entra access token (from device code, managed identity, etc.)
  • Exchange it for a user-delegation Storage key (similar to regular Storage keys, but tied to Entra auth and temporary)
  • Generate a user-delegation SAS from the key
  • Wrap the SAS in a connection string
  • Set the connection string as DuckDB secret

User-delegation keys/SAS can live for up to 7 days, and it looks like DuckDB allows refreshing them using CREATE OR REPLACE SECRET.

Python sample code using a mix of Managed Identity and Interactive Browser credentials:

import duckdb
from datetime import datetime, timezone, timedelta
from azure.identity import ChainedTokenCredential, ManagedIdentityCredential, InteractiveBrowserCredential
from azure.storage.blob import BlobServiceClient, generate_container_sas

tenant_id='11111111-2222-3333-4444-555555555555'
account_name = "myaccount"
container_name = "mycontainer"
blob_path = "path/to/blobs/*.parquet"

credential = ChainedTokenCredential(ManagedIdentityCredential(), InteractiveBrowserCredential(tenant_id=tenant_id))

def create_user_delegation_sas() -> str:

    start_time = datetime.now(timezone.utc)
    expiry_time = start_time + timedelta(days=1)

    client = BlobServiceClient(f"https://{account_name}.blob.core.windows.net", credential=credential)

    return generate_container_sas(
        account_name = account_name,
        container_name = container_name,
        user_delegation_key = client.get_user_delegation_key(key_start_time=start_time, key_expiry_time=expiry_time),
        resource_types = "sco",
        permission = "rl",
        start = start_time,
        expiry = expiry_time,
    )

duckdb.sql(f"""
    CREATE OR REPLACE SECRET {account_name} (
        TYPE AZURE,
        CONNECTION_STRING 'DefaultEndpointsProtocol=https;AccountName={account_name};EndpointSuffix=core.windows.net;SharedAccessSignature={create_user_delegation_sas()}',
        SCOPE 'az://{account_name}.blob.core.windows.net/'
    )
    """)

duckdb.sql(f"SELECT COUNT(*) FROM 'az://{account_name}.blob.core.windows.net/{container_name}/{blob_path}'")

mmaitre314 avatar Jul 11 '24 14:07 mmaitre314