Feature: support Azure AD token-based authentication for Azure storage account
Is your feature request related to a problem? Please describe.
Currently, GX supports storing metadata in an Azure storage account and uses a connection string for authentication. It would be great if we supported Azure AD token-based authentication, since it is the preferred method.
Describe the solution you'd like
Implement another authentication provider so that users can choose which authentication method to use.
Additional context
If a user is using an Azure AD token via an SPN to access data from a data lake on Azure, it is cumbersome to maintain an Azure storage account connection string just to configure GX. It would be streamlined if we gave the user an option to use the same SPN.
I would like to work on this.
@geekwhocodes That's great. Please let us know how we can assist or if you get stuck.
@rdodev Sure, I will come up with an approach and will post here if I need assistance. Can you assign this to me?
@geekwhocodes sure. If not me, someone in the team will. We're happy to help and enable community contributions.
https://github.com/great-expectations/great_expectations/blob/d553b82eacafe3c8619ef08acd8cbc61c2cee052/great_expectations/data_context/store/tuple_store_backend.py#L1018
DefaultAzureCredential already supports different authentication methods; however, it expects the values to be in environment variables. I will go ahead and change the method signature of this function
https://github.com/great-expectations/great_expectations/blob/d553b82eacafe3c8619ef08acd8cbc61c2cee052/great_expectations/data_context/store/tuple_store_backend.py#L971 to accept the following parameters:
- azure_ad_tenant_id
- azure_ad_client_id
- azure_ad_client_secret
Instantiate the credential provider with these parameters while keeping DefaultAzureCredential as a fallback.
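For illustration only (not the actual GX code), these three parameters map naturally onto azure.identity.ClientSecretCredential, with DefaultAzureCredential as the fallback; the helper name and account URL below are placeholders:

from azure.identity import ClientSecretCredential, DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

def build_credential(azure_ad_tenant_id=None, azure_ad_client_id=None, azure_ad_client_secret=None):
    # Prefer an explicit service principal (SPN) when all three values are provided
    if azure_ad_tenant_id and azure_ad_client_id and azure_ad_client_secret:
        return ClientSecretCredential(
            tenant_id=azure_ad_tenant_id,
            client_id=azure_ad_client_id,
            client_secret=azure_ad_client_secret,
        )
    # Otherwise fall back to DefaultAzureCredential (env vars, managed identity, CLI, ...)
    return DefaultAzureCredential()

credential = build_credential("<tenant-id>", "<client-id>", "<client-secret>")
client = BlobServiceClient(account_url="https://myaccount.blob.core.windows.net", credential=credential)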
@geekwhocodes thanks for this and apologies for the delay.
@talagluck would you mind providing Ganesh with feedback about his suggested approach?
@rdodev just following up on this.
cc: @Shinnnyshinshin
@rdodev @Shinnnyshinshin
I tried configuring SiteBuilder with the following configuration, but it didn't work (as expected).
Actual error message:
TupleAzureBlobStoreBackend.__init__() got an unexpected keyword argument 'AZURE_TENANT_ID'
I agree that with DefaultAzureCredential people can use other authentication methods, but it would be great if we allowed users to provide the required config programmatically (while initializing the context).
Use case: we are using GX in Databricks and would like to use the SPN authentication mechanism and pass the config while initializing the GX context. If we opt for the Azure Key Vault secret manager, we cannot pass secrets via the config object.
Proposed solution: implement different authentication types so users can select one and pass the required config parameters. Example:
"store_backend": {
    "class_name": "TupleAzureBlobStoreBackend",
    "container": "documents",
    "prefix": "gx/spn",
    "auth": {
        "class_name": "AuthProvider",
        "param1": "",
        "param2": ""
    }
}
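As a rough sketch of how such an auth block could be consumed (the provider class name and parameter keys below are hypothetical, not an existing GX API):

from typing import Optional
from azure.identity import ClientSecretCredential, DefaultAzureCredential

def credential_from_auth_config(auth: Optional[dict]):
    # No auth block: keep today's fallback behaviour
    if auth is None:
        return DefaultAzureCredential()
    # Hypothetical SPN provider selected via class_name
    if auth.get("class_name") == "ServicePrincipalAuthProvider":
        return ClientSecretCredential(
            tenant_id=auth["tenant_id"],
            client_id=auth["client_id"],
            client_secret=auth["client_secret"],
        )
    raise ValueError(f"Unsupported auth provider: {auth.get('class_name')}")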
Hi @geekwhocodes - thanks for this and apologies for delays.
Could you please open a PR, even just a basic draft with your proposed approach? That will enable us to get this looked at by the right team. Thanks!
Thanks @talagluck for giving the go-ahead. I'll work on the initial implementation.
The reason why token-based authentication is preferred over SAS-based authentication is that connecting to ADLS Gen2 with the access key completely overrides the ACLs set on the folders, effectively granting anyone who runs the notebook access to any data that is available on the account. However, DefaultAzureCredential will probably not work for people using Azure Synapse or Fabric, and I doubt it will work very well for most cases in Databricks either. There is a reason (credential pass-through) Databricks ships this class:
com.databricks.backend.daemon.data.client.adl.AdlGen2CredentialContextTokenProvider
Wouldn't it be better to implement a TokenProvider base class that can be extended based on the context in which GE is run?
For the Spark CDM SDK we implemented the following:
from pyspark.sql import SparkSession
# TokenProvider base class from the CDM SDK (see link below)
from cdm.utilities.network.token_provider import TokenProvider

class SynapseTokenProvider(TokenProvider):
    def __init__(self, sc: SparkSession, linkedService: str) -> None:
        super().__init__()
        self.sc = sc
        self.linkedService = linkedService

    def get_token(self) -> str:
        # Ask the Synapse TokenLibrary (via the JVM gateway) for a token
        # for the linked service and format it as a Bearer header value
        token_library = (
            self.sc._jvm.com.microsoft.azure.synapse.tokenlibrary.TokenLibrary
        )
        return f"Bearer {token_library.getConnectionString(self.linkedService)}"
The code above actually obtains a valid JWT token for the managed identity that is running the notebook. The TokenProvider base class is provided by the CDM SDK: https://github.com/microsoft/CDM/blob/master/objectModel/Python/cdm/utilities/network/token_provider.py
Tests that give insight into how it's used: https://github.com/microsoft/CDM/blob/master/objectModel/Python/tests/storage/test_adls.py#L20
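For context, using the provider above from a Synapse notebook would look roughly like this (assuming spark is the active SparkSession and "LS_ADLS" is the name of your linked service):

provider = SynapseTokenProvider(spark, "LS_ADLS")
token = provider.get_token()  # "Bearer eyJ..." for the notebook's managed identity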
So I tried using the configuration with DefaultAzureCredential, but that failed in Azure Synapse. What did work, however, is creating a mount with a linked service configured to use the MSI:
# Mount the ADLS Gen2 container through the linked service (MSI auth)
mssparkutils.fs.mount(
    f"abfss://mycontainer@{datalake}.dfs.core.windows.net",
    "/mnt_ge",
    {"LinkedService": "LS_ADLS"}
)
# Resolve the local path of the mount
root = mssparkutils.fs.getMountPath("/mnt_ge")
# root will look like "/synfs/10/mnt_ge"
My store config then looks like this:
"az_checkpoint_store": {
"module_name": "great_expectations.data_context.store",
"class_name": "CheckpointStore",
"store_backend": {
"class_name": "TupleFilesystemStoreBackend",
"base_directory": f"{root}/ge/the/path/to/checkpoints",
},
},
In my case I had the storage account networking restricted, so I had to add a private endpoint to the dfs AND the blob endpoints for this to work. And don't forget to unmount when the job is finished, as the mount doesn't get removed automatically.
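For reference, the cleanup in Synapse is a one-liner (same mount point as above):

mssparkutils.fs.unmount("/mnt_ge")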