GH-46833:[Python][Azure] Adding ConfigureClientSecretCredential to AzureFileSystem
Rationale for this change
ClientSecretCredential is another method to authenticate to Azure resources using Service Principle (SP). We work on AzureML, where there are two main types of compute ComputeInstance (personal VM) and ComputeCluster. In context of AzureML you can work interactively on ComputeInstance, where pyarrow defaults toDefaultAzureCredential when reading/writing data over abfss:// protocol. However when running AzureML "jobs" which is run non-interactively execution, the context (network) is different and DefaultAzureCredential doesn't work. An alternative method for non-interactive job is to use Service Principle i.e ClientSecretCredential. This way we can create AzureFileSystem object and give it pyarrow dataset
What changes are included in this PR?
Adding ConfigureClientSecretCredential to AzureFilesystem, which is already implemented at the C++, but hasn't been propagated to python library. This pull request just propagates C++ method to pyarrow
Are these changes tested?
Yes, test have been included in the relevant file
Are there any user-facing changes?
Docs of AzureFileSystem have been updated in the code. It would be amazing for those changes to surface on the website, which I'm happy to try to help with if needed
- GitHub Issue: #46833
Thanks for opening a pull request!
If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose
Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.
Then could you also rename the pull request title in the following format?
GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}
or
MINOR: [${COMPONENT}] ${SUMMARY}
See also:
@raulcd, that was it ! thanks for helping
What should I do next?
Thanks! great work!
What should I do next?
Can you update the title of the PR as suggested on this comment:
https://github.com/apache/arrow/pull/46837#issuecomment-2980414235
(basically add the original GH-46833: issue)
And if you could update the description to match the template:
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?
I'll proceed to ping some other committers to review once this is done and CI has finished successfully.
will do, quick question. I've noticed that linting is failing the test. which linter should I use, nothing seems to be setup in the pyproject.toml. Just cython-lint ?
thanks
We use pre-commit. As the error suggests you can try:
If you are seeing this message in CI, reproduce locally with: `pre-commit run --all-files`.
To run `pre-commit` as part of git workflow, use `pre-commit install`.
All changes made by hooks:
diff --git a/python/pyarrow/_azurefs.pyx b/python/pyarrow/_azurefs.pyx
index d5d0660f7f..2131824b85 100644
--- a/python/pyarrow/_azurefs.pyx
+++ b/python/pyarrow/_azurefs.pyx
@@ -122,7 +122,8 @@ cdef class AzureFileSystem(FileSystem):
if (tenant_id or client_id or client_secret):
if not (tenant_id and client_id and client_secret):
- raise ValueError("All of tenant_id, client_id, and client_secret must be provided for ClientSecretCredential.")
+ raise ValueError(
+ "All of tenant_id, client_id, and client_secret must be provided for ClientSecretCredential.")
options.ConfigureClientSecretCredential(
tobytes(tenant_id), tobytes(client_id), tobytes(client_secret)
)
There seems to be some line length issues too:
Python Format............................................................Failed
- hook id: autopep8
- files were modified by this hook
Python Lint..............................................................Failed
- hook id: flake8
- exit code: 1
python/pyarrow/tests/test_fs.py:1488:89: E501 line too long (108 > 88 characters)
python/pyarrow/tests/test_fs.py:1495:89: E501 line too long (108 > 88 characters)
python/pyarrow/tests/test_fs.py:1502:89: E501 line too long (108 > 88 characters)
:warning: GitHub issue #46833 has been automatically assigned in GitHub to PR creator.
@raulcd I've updated PR description, let me know if more information is required. thanks
@raulcd all done, let me know if I need todo anything else.
p.s I figure out how to run pre-commit just for python part 😆
@raulcd I've changed if/else logic a little, just to have a more clear separation between ClientSecret and ManagedIdentity Azure credentials.
I've pushed it as a separate commit, so that it's easier to see, but let me know if you need all changes squashed as a single commit.
cheers
BTW no need to squash, we squash on merge with a single commit per PR, you can push as many commits as necessary on the PR. Thanks!
I think i have the logic you've described
If there's no client_id fail early without requiring more checks If there's no tenant and no client_secret --> ConfigureManagedIdentityCredential If there's tenant and client_secret --> ConfigureClientSecretCredential Otherwise fail
Let me know how it comes out
Hey @raulcd I can see the error due to docs string in AzureFileSystem class. I just wanted to clarify how you want it? If you want thing alphabetical, then I'll need to rearrange parameters in the constructor, or I can leave params untoched and make doc string this order ?
Thanks
Parameters
----------
account_name : str
Azure Blob Storage account name. This is the globally unique identifier for the
storage account.
account_key : str, default None
Account key of the storage account. If sas_token and account_key are None the
default credential will be used. The parameters account_key and sas_token are
mutually exclusive.
blob_storage_authority : str, default None
hostname[:port] of the Blob Service. Defaults to `.blob.core.windows.net`. Useful
for connecting to a local emulator, like Azurite.
dfs_storage_authority : str, default None
hostname[:port] of the Data Lake Gen 2 Service. Defaults to
`.dfs.core.windows.net`. Useful for connecting to a local emulator, like Azurite.
blob_storage_scheme : str, default None
Either `http` or `https`. Defaults to `https`. Useful for connecting to a local
emulator, like Azurite.
dfs_storage_scheme : str, default None
Either `http` or `https`. Defaults to `https`. Useful for connecting to a local
emulator, like Azurite.
sas_token : str, default None
SAS token for the storage account, used as an alternative to account_key. If sas_token
and account_key are None the default credential will be used. The parameters
account_key and sas_token are mutually exclusive.
tenant_id : str, default None
Tenant ID for Azure Active Directory authentication. Must be provided together with
`client_id` and `client_secret` to use ClientSecretCredential.
client_id : str, default None
The client ID (Application ID) for Azure Active Directory authentication.
Its interpretation depends on the credential type being used:
- For `ClientSecretCredential`: It is the Application (client) ID of your
registered Azure AD application (Service Principal). It must be provided
together with `tenant_id` and `client_secret` to use ClientSecretCredential.
- For `ManagedIdentityCredential`: It is the client ID of a specific
user-assigned managed identity. This is only necessary if you are using a
user-assigned managed identity and need to explicitly specify which one
(e.g., if the resource has multiple user-assigned identities). For
system-assigned managed identities, this parameter is typically not required.
client_secret : str, default None
Client secret for Azure Active Directory authentication. Must be provided together
with `tenant_id` and `client_id` to use ClientSecretCredential.
As the ones failing are key-only arguments and this won't change the API, I would update the order of the parameters on the __init__ function to match the ones on the docstring. Basically keeping alphabetical order for consistency as the driving decision (this is what we try to do in general unless there are API breaking issues)
This is exciting ! thanks so much team 🙌
Thanks for your work on the PR!
After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit 953947a78157090ff94b14bc027cbcf520a5b2a6.
There were no benchmark performance regressions. 🎉
The full Conbench report has more details.