duckdb_azure
Support for specifying token directly
Hi there
Thanks for this cool extension, it will enable lots of use cases for us.
If you acquire the token outside DuckDB, it would be nice to be able to do something like this:
SET azure_storage_bearer_token = '<your_token>';
This is especially useful if you use Managed Identity, Interactive Browser Credentials, or the like.
Hi @aersam, thanks for reporting. There are some changes coming up to how DuckDB manages credentials; once that gets merged, I will look into adding this to it.
It would be nice to have this. OneLake, which is based on Azure, uses tokens by default, so today DuckDB can't use it directly :(
Hello,
Just wondering if this issue is still open, now that the extension can handle some credential types. If yes, would you mind explaining the workflow a bit? I do not understand the idea of the bearer token: you would have to renew it manually each time it expires, no?
Yes, you have to renew it manually. The main use case is when you already have a token in Python or similar and want to use it, e.g. you could have a token from a user context in a Python backend and want to pass that along. In such cases the lifetime is not an issue: your Python library handles renewal, and just before executing something you update the DuckDB variable.
OK, one more question: the token comes from an SPN, a managed identity, a workload identity, or an environment variable, no? Why not pass this information to DuckDB as a secret and let it get a new token for you? (I can take a look at implementing your request. I don't think it is very complex, but I wonder whether this is a common use case or a really specific one.)
Yeah, I agree with @quentingodeau. The implementation would be something along the lines of:
class RawTokenCredential : public Azure::Core::Credentials::TokenCredential {
public:
    RawTokenCredential(const string& token_name) : Azure::Core::Credentials::TokenCredential(token_name) {
    }

    Azure::Core::Credentials::AccessToken GetToken(
        Azure::Core::Credentials::TokenRequestContext const& tokenRequestContext,
        Azure::Core::Context const& context) const override {
        return raw_token;
    };

    Azure::Core::Credentials::AccessToken raw_token;
};
But it is a little hacky and probably not desirable if one of the other credential provider methods can be used. Note that the Azure SDK does not provide this RawTokenCredential, so to me that feels like a hint that this is not a common path.
Not very common, but sometimes required. I'd say it's just the more low-level approach for advanced use cases
Also there are so many ways to use Microsoft's Entra ID that I don't think you want to handle every edge case
It is common. For example, today I can't write to Fabric OneLake using DuckDB.
@djouallah do you know how Fabric authenticates? Does it use app registration?
yes https://stackoverflow.com/questions/76794202/authentication-not-granted-for-service-principal-token-in-ms-fabric-api-using-py
Just chiming in here, this is also standard usage at our company. Basically we do something analogous to DeviceCodeCredential and then store the results in a custom class. The code is very similar to what samansmink suggested above, except it also keeps the refresh_token and refreshes the access token whenever needed.
The goal is to authenticate with a username/password, without having to either re-authenticate constantly or having to store username/password somewhere. Creating a service principal or managed identity per user is too difficult to manage/govern.
I'm not up to the task of writing it in DuckDB/C++ myself; we previously used Python and adlfs to authenticate this way. But if I can help with anything, e.g. testing, I'd be happy to do so.
Sorry, I have been away for a bit. I will try to see if I can find a way to automate some testing on this. But just for info, I may deprioritize this PR in order to add write capability first.
OK, but good that it's still on the radar. I'm currently missing support for user-assigned managed identities in DuckDB, which I could work around with direct token support.
It looks like that could be a small change, so likely something I could contribute a PR for.
In my case I hit a couple of issues with the current auth setup in the extension:
- Azure Synapse uses a non-standard way to get access tokens: code needs to call mssparkutils.credentials.getToken("Storage").
- In #63, one possibility for auth failure is that Az CLI gives tokens for the wrong user identity (I use multiple user identities on my machine).
Those feel like a long tail of edge cases, so likely not worth built-in support, but it would be nice to unblock them by allowing custom access-token generation.
Re: 'that feels like a hint that this is not a common path' -- in my experience it is actually fairly common to derive custom classes from TokenCredential to abstract away non-standard auth mechanisms from the Azure SDK. For instance, in Python, auth on Azure Synapse can be done like this:
import sys
from typing import Any, Optional
from azure.core.credentials import AccessToken, TokenCredential

class StorageCredential(TokenCredential):
    def get_token(self, *scopes: str, claims: Optional[str] = None, tenant_id: Optional[str] = None, **kwargs: Any) -> AccessToken:
        # mssparkutils is provided by the Synapse runtime; sys.maxsize marks the token as never expiring
        return AccessToken(mssparkutils.credentials.getToken("Storage"), sys.maxsize)
A couple of potential issues:
- Ideally, tokens would be refreshed to avoid auth failures when they expire. This is typically achieved through some form of callback. Not sure if this is feasible in DuckDB. An alternative might be for the caller to update the token on a timer.
- Token expiration needs to be provided in Azure::Core::Credentials::AccessToken via the ExpiresOn field. One option could be to parse that from the exp claim in access tokens. Another could be to have the client provide that.
Is there a preference on how to solve those?
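For illustration, the option of reading the expiry from the exp claim could look roughly like this on the caller side in Python. This is a minimal sketch, assuming the access token is a JWT (Entra access tokens typically are, though clients are generally advised to treat them as opaque). The helper names token_expiry and needs_refresh are hypothetical, and the token signature is not verified here.

```python
import base64
import json
from datetime import datetime, timedelta, timezone


def token_expiry(access_token: str) -> datetime:
    """Return the expiry time read from the `exp` claim of a JWT (no signature check)."""
    # A JWT has three base64url segments: header.payload.signature
    payload_b64 = access_token.split(".")[1]
    # base64url encoding strips padding; restore it before decoding
    payload_b64 += "=" * (-len(payload_b64) % 4)
    claims = json.loads(base64.urlsafe_b64decode(payload_b64))
    # `exp` is a Unix timestamp in seconds
    return datetime.fromtimestamp(claims["exp"], tz=timezone.utc)


def needs_refresh(access_token: str, margin: timedelta = timedelta(minutes=5)) -> bool:
    """True when the token expires within `margin` (hypothetical helper)."""
    return datetime.now(timezone.utc) >= token_expiry(access_token) - margin
```

A caller that updates the token on a timer could check needs_refresh(token) before each query and fetch a new token only when it returns True.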
@mmaitre314 we currently don't have a mechanism in duckdb to handle token expiry (yet) so that would probably be a place to start on this.
Otherwise I think we can just add this and document the fact that manual secret refreshing is required. That way this can work as a workaround until we have proper secret expiration
One workaround which works with the extension as-is, albeit a convoluted one:
- Start with an Entra access token (from device code, managed identity, etc.)
- Exchange it for a user-delegation Storage key (similar to regular Storage keys, but tied to Entra auth and temporary)
- Generate a user-delegation SAS from the key
- Wrap the SAS in a connection string
- Set the connection string as DuckDB secret
User-delegation keys/SAS can live for up to 7 days, and it looks like DuckDB allows refreshing them using CREATE OR REPLACE SECRET.
Python sample code using a mix of Managed Identity and Interactive Browser credentials:
import duckdb
from datetime import datetime, timezone, timedelta
from azure.identity import ChainedTokenCredential, ManagedIdentityCredential, InteractiveBrowserCredential
from azure.storage.blob import BlobServiceClient, generate_container_sas

tenant_id = '11111111-2222-3333-4444-555555555555'
account_name = "myaccount"
container_name = "mycontainer"
blob_path = "path/to/blobs/*.parquet"

credential = ChainedTokenCredential(ManagedIdentityCredential(), InteractiveBrowserCredential(tenant_id=tenant_id))

def create_user_delegation_sas() -> str:
    start_time = datetime.now(timezone.utc)
    expiry_time = start_time + timedelta(days=1)
    client = BlobServiceClient(f"https://{account_name}.blob.core.windows.net", credential=credential)
    return generate_container_sas(
        account_name = account_name,
        container_name = container_name,
        user_delegation_key = client.get_user_delegation_key(key_start_time=start_time, key_expiry_time=expiry_time),
        resource_types = "sco",
        permission = "rl",
        start = start_time,
        expiry = expiry_time,
    )

duckdb.sql(f"""
    CREATE OR REPLACE SECRET {account_name} (
        TYPE AZURE,
        CONNECTION_STRING 'DefaultEndpointsProtocol=https;AccountName={account_name};EndpointSuffix=core.windows.net;SharedAccessSignature={create_user_delegation_sas()}',
        SCOPE 'az://{account_name}.blob.core.windows.net/'
    )
""")

duckdb.sql(f"SELECT COUNT(*) FROM 'az://{account_name}.blob.core.windows.net/{container_name}/{blob_path}'")
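Since the secret has to be refreshed manually, one way to avoid expired SAS tokens in a long-running process is to track the expiry and re-create the secret shortly before it lapses. A minimal sketch of that idea follows. SecretRefresher is a hypothetical helper, not part of DuckDB or the Azure SDK, and it assumes the caller supplies a create_secret callback that runs the CREATE OR REPLACE SECRET statement and returns the SAS expiry time.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable, Optional


class SecretRefresher:
    """Re-runs a secret-creation callback shortly before the SAS expires (hypothetical helper)."""

    def __init__(self, create_secret: Callable[[], datetime], margin: timedelta = timedelta(hours=1)):
        # create_secret is assumed to issue CREATE OR REPLACE SECRET and return the SAS expiry time
        self._create_secret = create_secret
        self._margin = margin
        self._expires_on: Optional[datetime] = None

    def ensure_fresh(self, now: Optional[datetime] = None) -> bool:
        """Create or refresh the secret if needed; return True when the callback was run."""
        now = now or datetime.now(timezone.utc)
        if self._expires_on is None or now >= self._expires_on - self._margin:
            self._expires_on = self._create_secret()
            return True
        return False
```

Calling ensure_fresh() before each query keeps the refresh on the caller's side, in line with the manual-refresh approach discussed above.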