
feature: support Azure AD token-based authentication for Azure storage account

Open geekwhocodes opened this issue 2 years ago • 12 comments

Is your feature request related to a problem? Please describe. Currently, GX supports storing metadata in an Azure storage account and uses a connection string for authentication. It would be great to support Azure AD token-based authentication, since it is the preferred method.

Describe the solution you'd like Implement another authentication provider where users can choose which authentication method to use.

Additional context If a user is using an Azure AD token via an SPN (service principal) to access data from a data lake on Azure, it is cumbersome to also maintain an Azure storage account connection string just to configure GX. It would be more streamlined if we gave the user an option to use the same SPN.

geekwhocodes avatar Dec 08 '22 18:12 geekwhocodes

I would like to work on this.

geekwhocodes avatar Dec 08 '22 18:12 geekwhocodes

@geekwhocodes That's great. Please let us know how we can assist or if you get stuck.

rdodev avatar Dec 08 '22 18:12 rdodev

@rdodev Sure, I will come up with an approach and will post here if I need assistance. Can you assign this to me?

geekwhocodes avatar Dec 08 '22 18:12 geekwhocodes

@geekwhocodes sure. If not me, someone in the team will. We're happy to help and enable community contributions.

rdodev avatar Dec 08 '22 18:12 rdodev

https://github.com/great-expectations/great_expectations/blob/d553b82eacafe3c8619ef08acd8cbc61c2cee052/great_expectations/data_context/store/tuple_store_backend.py#L1018

DefaultAzureCredential: this credential provider already supports different authentication methods. However, it expects values to be in environment variables. I will go ahead and change the signature of this function https://github.com/great-expectations/great_expectations/blob/d553b82eacafe3c8619ef08acd8cbc61c2cee052/great_expectations/data_context/store/tuple_store_backend.py#L971 to accept the following parameters:

  1. azure_ad_tenant_id
  2. azure_ad_client_id
  3. azure_ad_client_secret

Instantiate the credential provider with these parameters while keeping DefaultAzureCredential as a fallback (a sketch follows below).
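A minimal sketch of how that fallback could look, assuming the azure-identity and azure-storage-blob packages; the build_credential helper and the wiring shown in the comments are hypothetical, not existing GX code:

from azure.identity import ClientSecretCredential, DefaultAzureCredential
from azure.storage.blob import BlobServiceClient


def build_credential(tenant_id=None, client_id=None, client_secret=None):
    """Use an explicit service principal when all three values are given,
    otherwise fall back to DefaultAzureCredential (environment variables,
    managed identity, Azure CLI, ...)."""
    if tenant_id and client_id and client_secret:
        return ClientSecretCredential(
            tenant_id=tenant_id,
            client_id=client_id,
            client_secret=client_secret,
        )
    return DefaultAzureCredential()


# Hypothetical wiring inside TupleAzureBlobStoreBackend:
# credential = build_credential(azure_ad_tenant_id, azure_ad_client_id, azure_ad_client_secret)
# container_client = BlobServiceClient(
#     account_url="https://<account>.blob.core.windows.net",
#     credential=credential,
# ).get_container_client(container)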

geekwhocodes avatar Dec 15 '22 07:12 geekwhocodes

@geekwhocodes thanks for this and apologies for the delay.

@talagluck would you mind providing Ganesh with feedback about his suggested approach?

rdodev avatar Dec 19 '22 13:12 rdodev

@rdodev just following up on this.

cc: @Shinnnyshinshin

geekwhocodes avatar Jan 21 '23 16:01 geekwhocodes

@rdodev @Shinnnyshinshin I tried configuring SiteBuilder with the following configuration, but it didn't work (as expected).

Actual error message: TupleAzureBlobStoreBackend.__init__() got an unexpected keyword argument 'AZURE_TENANT_ID'

I agree that with DefaultAzureCredential people can use other authentication methods, but it would be great if we allowed users to provide the required config programmatically (while initializing the context).

Use case: we are using GX in Databricks, and we would like to use the SPN authentication mechanism and pass the config while initializing the GX context. If we opt for the Azure Key Vault secret manager, we cannot pass secrets via the config object.

Proposed solution: implement different authentication types so users can select one and pass the required config parameters. Example:

    "store_backend": {
        "class_name": "TupleAzureBlobStoreBackend",
        "container": "documents",
        "prefix": "gx/spn",
        "auth" :{
            "class_name": "AuthProvider"
            "param1": "",
            "param2":""
        }
    }
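To make that concrete, here is a rough sketch of how a store backend might resolve such an auth block into an azure-identity credential; the class_name value and parameter names below are hypothetical, not an existing GX API:

from typing import Optional

from azure.identity import ClientSecretCredential, DefaultAzureCredential


def credential_from_auth_config(auth: Optional[dict]):
    """Map the proposed 'auth' block to an azure-identity credential."""
    if not auth:
        return None  # keep today's connection-string behaviour
    if auth.get("class_name") == "ServicePrincipalAuthProvider":
        return ClientSecretCredential(
            tenant_id=auth["tenant_id"],
            client_id=auth["client_id"],
            client_secret=auth["client_secret"],
        )
    # Anything else falls back to environment variables / managed identity.
    return DefaultAzureCredential()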

g-raskar avatar Feb 07 '23 08:02 g-raskar

Hi @geekwhocodes - thanks for this and apologies for delays.

Could you please open a PR, even just a basic draft with your proposed approach? That will enable us to get this looked at by the right team. Thanks!

talagluck avatar Mar 29 '23 10:03 talagluck

Thanks @talagluck for giving the go-ahead. I'll work on the initial implementation.

geekwhocodes avatar Mar 31 '23 22:03 geekwhocodes

The reason why token-based authentication is preferred over SAS-based authentication is that connecting to ADLS Gen2 with the access key completely overrides the ACLs set on the folders, effectively granting anyone who runs the notebook access to any data that is available on the account. However, DefaultAzureCredential will probably not work for people using Azure Synapse or Fabric, and I doubt it will work very well for most cases in Databricks either. There is a reason (credential pass-through) Databricks ships this class:

com.databricks.backend.daemon.data.client.adl.AdlGen2CredentialContextTokenProvider

Wouldn't it be better to implement a TokenProvider base class that can be extended based on the context in which GE is run?

For the Spark CDM SDK we implemented the following:

from pyspark.sql import SparkSession
# TokenProvider base class from the CDM SDK (see link below)
from cdm.utilities.network.token_provider import TokenProvider


class SynapseTokenProvider(TokenProvider):
    def __init__(self, sc: SparkSession, linkedService: str) -> None:
        super().__init__()
        self.sc = sc
        self.linkedService = linkedService

    def get_token(self) -> str:
        # Synapse's TokenLibrary issues a token for the linked service using
        # the managed identity that is running the notebook.
        token_library = (
            self.sc._jvm.com.microsoft.azure.synapse.tokenlibrary.TokenLibrary
        )

        return f"Bearer {token_library.getConnectionString(self.linkedService)}"

The code above actually obtains a valid JWT for the managed identity that is running the notebook. The TokenProvider base class is provided by the CDM SDK: https://github.com/microsoft/CDM/blob/master/objectModel/Python/cdm/utilities/network/token_provider.py

Tests that give insight into how it's used: https://github.com/microsoft/CDM/blob/master/objectModel/Python/tests/storage/test_adls.py#L20
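As an illustration of how such a provider could be consumed on the GX side (this is not existing GX or CDM code; the adapter class below is hypothetical), a thin wrapper can turn a get_token()-style provider into the credential protocol that the Azure SDK blob client accepts:

import time

from azure.core.credentials import AccessToken
from azure.storage.blob import BlobServiceClient


class TokenProviderCredential:
    """Adapts an object exposing get_token() -> 'Bearer <jwt>' to the
    credential protocol used by the Azure SDK clients."""

    def __init__(self, provider) -> None:
        self.provider = provider

    def get_token(self, *scopes, **kwargs) -> AccessToken:
        raw = self.provider.get_token().removeprefix("Bearer ").strip()
        # The provider does not expose an expiry, so assume a short lifetime
        # and let the SDK ask for a fresh token when it runs out.
        return AccessToken(raw, int(time.time()) + 3600)


# Hypothetical usage with the SynapseTokenProvider from above:
# service = BlobServiceClient(
#     account_url="https://<account>.blob.core.windows.net",
#     credential=TokenProviderCredential(SynapseTokenProvider(spark, "LS_ADLS")),
# )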

dgcaron avatar Sep 12 '23 08:09 dgcaron

So I tried using the configuration with DefaultAzureCredential, but that failed in Azure Synapse. What did work, however, is creating a mount with a linked service configured to use the MSI:

from notebookutils import mssparkutils  # available in Synapse notebooks

# "datalake" holds the name of the ADLS Gen2 storage account
mssparkutils.fs.mount(
    f"abfss://mycontainer@{datalake}.dfs.core.windows.net",
    "/mnt_ge",
    {"LinkedService": "LS_ADLS"},
)

root = mssparkutils.fs.getMountPath("/mnt_ge")
# root will look like "/synfs/10/mnt_ge"

My store config then looks like this:

 "az_checkpoint_store": {
                        "module_name": "great_expectations.data_context.store",
                        "class_name": "CheckpointStore",
                        "store_backend": {
                            "class_name": "TupleFilesystemStoreBackend",
                            "base_directory": f"{root}/ge/the/path/to/checkpoints",
                        },
                    },

In my case I had the storage account networking restricted, so I had to add a private endpoint to the dfs AND the blob endpoints for this to work. Also, don't forget to unmount when the job is finished, as the mount doesn't get removed automatically.
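For reference, the cleanup is a single call:

# remove the mount created above once the job is done
mssparkutils.fs.unmount("/mnt_ge")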

dgcaron avatar Jan 08 '24 20:01 dgcaron