delta-rs
Azure Data Lake Storage (ADLS2) Support for Service Principal Auth
Description
Using the delta-rs Python connector (latest version, 0.5.7), it does not seem possible to authenticate using a Service Principal (client_id and client_secret). Both AZURE_STORAGE_ACCOUNT_NAME and AZURE_STORAGE_ACCOUNT_KEY are required to be set; if they are not provided, authentication does not fall back to DefaultAzureCredential (see related issues) but fails with the following error.
Code executed:
storage_options = {
    'account_name': os.getenv('AZURE_STORAGE_ACCOUNT_NAME'),
    'client_id': os.getenv('AZURE_CLIENT_ID'),
    'client_secret': os.getenv('AZURE_CLIENT_SECRET')
}
delta = DeltaTable(
    table_uri=f"adls2://{storage_options.get('account_name')}/curated/" + dataset_path,
    storage_options=storage_options,
)
dataFrames = delta.to_pyarrow_table().to_pandas()
Stacktrace:
File "C:/Users/tahitimath/workspace/dataapi/src/main.py", line 22, in books_table_update
result = DeltaReader.get_delta_table(dataset.dataset_path, dataset.engine)
File "C:\Users\user\workspace\dataapi\src\databricks\delta_reader.py", line 91, in get_delta_table
storage_options=storage_options)
File "C:\Users\tahitimath\.conda\envs\py37_api\lib\site-packages\deltalake\table.py", line 91, in __init__
table_uri, version=version, storage_options=storage_options
**deltalake.PyDeltaTableError: Failed to read delta log object: Azure config error: AZURE_STORAGE_ACCOUNT_KEY must be set**
Would it be possible to clarify in the docs how to get alternative Azure authentication methods working?
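In the meantime, a defensive pre-check can make the missing-credential case fail with a clearer message than the Azure config error above. This is a hypothetical helper, not part of the deltalake API; the variable names mirror the snippet in this report:

```python
import os

# Hypothetical helper: fail fast if any credential env var is unset,
# instead of relying on the less descriptive Azure config error.
REQUIRED = ("AZURE_STORAGE_ACCOUNT_NAME", "AZURE_CLIENT_ID", "AZURE_CLIENT_SECRET")

def build_storage_options():
    missing = [name for name in REQUIRED if not os.getenv(name)]
    if missing:
        raise EnvironmentError(f"Missing Azure credentials: {', '.join(missing)}")
    return {
        "account_name": os.environ["AZURE_STORAGE_ACCOUNT_NAME"],
        "client_id": os.environ["AZURE_CLIENT_ID"],
        "client_secret": os.environ["AZURE_CLIENT_SECRET"],
    }
```

The returned dict can then be passed straight to DeltaTable(..., storage_options=...).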
Hi @tahiti-math - due to how the Azure SDK evolved, service principal auth was intermittently unavailable. This is also the case for the current Python release. We recently re-enabled service principal credentials, but no Python release has been made since then. Are you able to build from current master to give it a try?
The next release will have that capability added again - good point about updating the docs, though ...
@roeap I have built it and it seems to be working. However, I see this error on azure_core.
When is the next Python release planned?
@tahiti-math - looking at the URL it seems you are running the azurite storage emulator, is that correct?
If so, the emulator does not support the Gen2 APIs required for the deltalake crate to work. Why that would translate to a 404 - not sure, but I would not expect that to work.
As for the next planned release date - @houqp @wjones127 @fvaleye do you have an opinion on that?
@roeap I am running against an actual ADLSv2 account on Azure, not a storage emulator. Adding a print of the stacktrace in the Rust function gives me the following:
FYI, the path does exist and data is returned correctly.
@houqp @wjones127 @fvaleye @roeap Any news on the 0.5.8 python release?
I'd like to finish #625 in the next few days (I think I'm almost done), and then we should plan to release.
@tahiti-math - I may have discovered a bug in the Azure SDK that we do not hit in our test cases. I need to investigate a bit more and submit a PR to the azure repo, but hopefully this will be resolved before the next release.
Is this linked to the comment I posted earlier with "PathNotFound"?
@tahiti-math - the new python bindings (0.5.8) are released now. Could you check if those work for you?
I confirm service principal auth is now working again. However, I am still seeing this ERROR log message (which does not seem to impact the result).
Sorry for the long silence ... we have been doing a bit of testing around our storages - can you build off main and see if that works?
I tested locally and was able to connect with a service principal, but my table setup is likely far less complex, since I used more or less a hello-world table.
@roeap I have tried to use the delta-rs python connector (latest version, 0.5.8) to access an Azure Data Lake Storage (ADLS2) account with a Service Principal from an Azure Machine Learning compute instance. Unfortunately, storage_options does not seem to recognize the account_name. My code looks as follows:
storage_options = {
    'account_name': '<my_storage_account>',
    'client_id': '<service_principal_id>',
    'client_secret': '<service_principal_secret>'
}
delta = DeltaTable(table_uri="adls2://<my_storage_account>/raw/<path_to_delta_table>",
                   storage_options=storage_options)
and I get the following error message:
File "/azureml-envs/azureml_de79154eb031e1b0e163402d84a7cc57/lib/python3.8/site-packages/deltalake/table.py", line 90, in __init__
self._table = RawDeltaTable(
deltalake.PyDeltaTableError: Azure config error: account name must be provided
I tried a few things with the syntax, but I cannot get it to work. I can access the storage account via other options, for example as a registered Datastore with the same authentication, and I have previously used the data lake with a different Delta Lake library. Could you take a look and see if the error is on my side or if something is not working as intended?
I have found my error: I was not adding the authentication variables as environment variables. It is now working, even though I get the same error message as @tahiti-math, which does not seem to impact the results.
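For anyone hitting the same "account name must be provided" error, the fix described above amounts to exporting the credentials as environment variables before creating the DeltaTable. This is a sketch based on that comment; the AZURE_TENANT_ID line is an assumption, since service principal flows typically also require a tenant id:

```python
import os

# Sketch of the fix: make the service principal credentials available as
# environment variables rather than only passing them in storage_options.
os.environ["AZURE_STORAGE_ACCOUNT_NAME"] = "<my_storage_account>"
os.environ["AZURE_CLIENT_ID"] = "<service_principal_id>"
os.environ["AZURE_CLIENT_SECRET"] = "<service_principal_secret>"
os.environ["AZURE_TENANT_ID"] = "<tenant_id>"  # assumption: usually required for SP auth
```

In a real deployment these would be set in the process environment (e.g. the compute instance configuration) rather than hard-coded.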
I have tested this auth method and it is failing again: @tahiti-math @NilsHahnBayer @roeap @wjones127
Could you share a code example?