dbt-databricks
Support for Azure authentication mechanisms.
Describe the feature
Beyond simple PAT tokens, supporting some Azure AD (AAD) based authentication would be great.
Additional context
dbt-sqlserver is a good example of how to get valid auth tokens, and databricks-sql-connector already supports taking auth_token arguments.
Who will this benefit?
Users trying to use AAD based SSO or other features.
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.
I'm a big supporter of this feature, and I'd love to help out, as I originally "assisted" with adding AAD auth to dbt-sqlserver. 👀: @JCZuurmond
Awesome to hear that, Anders! At least for Databricks it would be pretty straightforward if we had an easy way to pull in all the auth code from dbt-sqlserver. Whichever AAD auth mode the user selects, they eventually get a token back. That token would just be used in place of the password with the Databricks connection, and if the workspace is AAD enabled the connection will succeed.
In dbt-sqlserver, we're doing exactly what you say: basically, use the Azure Python SDK's azure-identity package to get a token, then send it out when connecting with pyodbc. Here's the helper functions that do the bulk of the work right now. There's some opportunity to clean this up, and potentially publish it as a single class in a standalone PyPI package that both dbt-sqlserver and dbt-databricks could make use of?
For more info, see our guide on how to authenticate using AAD and dbt.
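To make the "whichever auth mode, you end up with a token" idea concrete, here is a minimal sketch, not the actual dbt-sqlserver helpers. It assumes the azure-identity package is installed. The mode names ("cli", "msi", "environment", "auto") and the function name get_aad_token are made up for illustration. The resource ID is the Databricks one quoted elsewhere in this thread, turned into a scope by appending "/.default".

```python
from typing import Any, Callable, Dict, Optional

# Azure AD resource ID for Azure Databricks, as quoted in this thread.
DATABRICKS_RESOURCE_ID = "2ff814a6-3304-4ab8-85cb-cd0e6f879c1d"
# azure-identity expects a scope; for a resource ID that is "<id>/.default".
DATABRICKS_SCOPE = f"{DATABRICKS_RESOURCE_ID}/.default"


# azure.identity is imported lazily so the module loads without the package.
def _cli_credential() -> Any:
    from azure.identity import AzureCliCredential
    return AzureCliCredential()


def _msi_credential() -> Any:
    from azure.identity import ManagedIdentityCredential
    return ManagedIdentityCredential()


def _environment_credential() -> Any:
    from azure.identity import EnvironmentCredential
    return EnvironmentCredential()


def _auto_credential() -> Any:
    from azure.identity import DefaultAzureCredential
    return DefaultAzureCredential()


# Hypothetical mode names; a real adapter would pick its own.
CREDENTIAL_FACTORIES: Dict[str, Callable[[], Any]] = {
    "cli": _cli_credential,
    "msi": _msi_credential,
    "environment": _environment_credential,
    "auto": _auto_credential,
}


def get_aad_token(
    mode: str = "auto",
    factories: Optional[Dict[str, Callable[[], Any]]] = None,
) -> str:
    """Return an AAD access token for Databricks using the selected mode."""
    factories = CREDENTIAL_FACTORIES if factories is None else factories
    try:
        factory = factories[mode]
    except KeyError:
        raise ValueError(
            f"unknown auth mode {mode!r}; expected one of {sorted(factories)}"
        ) from None
    # Every azure-identity credential exposes get_token(*scopes).
    return factory().get_token(DATABRICKS_SCOPE).token
```

The returned string would then be passed wherever the PAT goes today. Keeping the mode-to-credential mapping in one dictionary is what would make it easy to lift into a shared package.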
The hardest pill for me to swallow is drawing a dependency on the Azure CLI. As Scott Henderson writes:
to use Azure, you typically need to use the az command line utility. This authenticates you to Azure, and allows you management access to practically all their APIs. It’s a wonderful, functional tool, written in Python and provided as open source.
However, if you want to install this command - beware! It’s an absolute monster, weighing in at over a gigabyte in its current incarnation. This problem has been known about since a bug was raised in 2018, and as a user back then it was nowhere near as bad - maybe a few hundred meg at that stage.
The root cause is the Azure Python APIs, which are horrendously bloated. Microsoft’s backward compatibility is legendary, of course, but what has happened in the Python API is that each incompatible change has caused an in-API code fork to occur - exploring the repo is an Inception-like experience, with each subdirectory looking much the same as the others, all alike.
Plus, in order to support these old APIs, Microsoft has taken to packaging an entire python runtime with the utility, to ensure it runs correctly. And all the python bytecode cache. The only thing missing is the kitchen sink.
@ueshin this issue is almost a year old now. It seems like the most bitter pill to swallow is the Azure CLI (yikes).
For those authenticating using Azure, you can use Az CLI to get a valid token and then use that in the regular config:
- Sign in with Az CLI:
  az login
- Fetch an access token for Databricks (the ID is static for all Databricks workspaces):
  aad_token_response=$(az account get-access-token --resource 2ff814a6-3304-4ab8-85cb-cd0e6f879c1d)
- Parse the access token (you need jq installed):
  aad_token=$(jq .accessToken -r <<< "$aad_token_response")
- Store the token in an environment variable:
  export DATABRICKS_AAD_TOKEN=$aad_token
Then in your profiles.yml you can authenticate using the stored token:
  token: "{{ env_var('DATABRICKS_AAD_TOKEN') }}"
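If you'd rather drive the same steps from a script, something like the following should work. It is only a sketch: it assumes az is on your PATH and that you have already run az login. The function names are made up for illustration. It relies on az account get-access-token printing JSON with an accessToken field by default, which is the same field jq reads above.

```python
import json
import os
import subprocess
from typing import List

# Resource ID for Azure Databricks, same value as in the az command above.
DATABRICKS_RESOURCE_ID = "2ff814a6-3304-4ab8-85cb-cd0e6f879c1d"


def parse_access_token(az_output: str) -> str:
    """Extract the token from the JSON printed by az (replaces the jq step)."""
    return json.loads(az_output)["accessToken"]


def fetch_token_via_az_cli() -> str:
    """Call the Azure CLI and return the Databricks access token."""
    result = subprocess.run(
        ["az", "account", "get-access-token", "--resource", DATABRICKS_RESOURCE_ID],
        check=True,           # raise if az exits with a non-zero status
        capture_output=True,  # collect stdout so the JSON can be parsed
        text=True,
    )
    return parse_access_token(result.stdout)


def run_dbt_with_token(dbt_args: List[str]) -> None:
    """Run dbt with DATABRICKS_AAD_TOKEN set only for the child process."""
    env = dict(os.environ)
    env["DATABRICKS_AAD_TOKEN"] = fetch_token_via_az_cli()
    subprocess.run(["dbt", *dbt_args], check=True, env=env)
```

Calling run_dbt_with_token(["debug"]) would then pick the token up through the env_var lookup in profiles.yml, without exporting it into your shell.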
But as discussed above, it would be nicer to use the azure-identity package to retrieve a token automatically so that you can also use this in setups with managed identity, service principals etc. without having to use the CLI.
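As a rough idea of what that could look like for a service principal, without the CLI: the sketch below assumes azure-identity is installed and that the tenant ID, client ID and secret live in the AZURE_TENANT_ID, AZURE_CLIENT_ID and AZURE_CLIENT_SECRET environment variables. The function names are made up. It simply fills the same DATABRICKS_AAD_TOKEN variable that profiles.yml already reads.

```python
from typing import Any, Dict, Mapping

# Scope built from the Databricks resource ID used in the az command above.
DATABRICKS_SCOPE = "2ff814a6-3304-4ab8-85cb-cd0e6f879c1d/.default"


def service_principal_credential(env: Mapping[str, str]) -> Any:
    """Build a service principal credential from environment-style settings."""
    # Imported lazily so this module can be loaded without azure-identity.
    from azure.identity import ClientSecretCredential

    return ClientSecretCredential(
        tenant_id=env["AZURE_TENANT_ID"],
        client_id=env["AZURE_CLIENT_ID"],
        client_secret=env["AZURE_CLIENT_SECRET"],
    )


def dbt_environment(credential: Any, base_env: Mapping[str, str]) -> Dict[str, str]:
    """Return a copy of base_env with DATABRICKS_AAD_TOKEN filled in.

    `credential` can be any azure-identity credential (managed identity,
    service principal, ...), since they all expose get_token(*scopes).
    """
    env = dict(base_env)  # copy, so the caller's mapping is not mutated
    env["DATABRICKS_AAD_TOKEN"] = credential.get_token(DATABRICKS_SCOPE).token
    return env
```

You would pass the result as env= to subprocess.run(["dbt", "run"], ...). Swapping in ManagedIdentityCredential() would cover the managed identity case the same way. The token is short-lived, so this fetches a fresh one per dbt invocation rather than storing it.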
Support for Azure AD OAuth has been added in #327.
Just a note on @sdebruyn's answer (very useful, thanks)
az account get-access-token --resource $DATABRICKS_RESOURCE_ID --query accessToken -otsv
also works, rather than hopping over to jq