arrow icon indicating copy to clipboard operation
arrow copied to clipboard

GH-46833:[Python][Azure] Adding ConfigureClientSecretCredential to AzureFileSystem

Open KirillTsyganov opened this issue 6 months ago • 8 comments

Rationale for this change

ClientSecretCredential is another method to authenticate to Azure resources using Service Principle (SP). We work on AzureML, where there are two main types of compute ComputeInstance (personal VM) and ComputeCluster. In context of AzureML you can work interactively on ComputeInstance, where pyarrow defaults toDefaultAzureCredential when reading/writing data over abfss:// protocol. However when running AzureML "jobs" which is run non-interactively execution, the context (network) is different and DefaultAzureCredential doesn't work. An alternative method for non-interactive job is to use Service Principle i.e ClientSecretCredential. This way we can create AzureFileSystem object and give it pyarrow dataset

What changes are included in this PR?

Adding ConfigureClientSecretCredential to AzureFilesystem, which is already implemented at the C++, but hasn't been propagated to python library. This pull request just propagates C++ method to pyarrow

Are these changes tested?

Yes, test have been included in the relevant file

Are there any user-facing changes?

Docs of AzureFileSystem have been updated in the code. It would be amazing for those changes to surface on the website, which I'm happy to try to help with if needed

  • GitHub Issue: #46833

KirillTsyganov avatar Jun 17 '25 13:06 KirillTsyganov

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

github-actions[bot] avatar Jun 17 '25 13:06 github-actions[bot]

@raulcd, that was it ! thanks for helping

What should I do next?

KirillTsyganov avatar Jun 18 '25 07:06 KirillTsyganov

Thanks! great work!

What should I do next?

Can you update the title of the PR as suggested on this comment:

https://github.com/apache/arrow/pull/46837#issuecomment-2980414235

(basically add the original GH-46833: issue)

And if you could update the description to match the template:

Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?

I'll proceed to ping some other committers to review once this is done and CI has finished successfully.

raulcd avatar Jun 18 '25 07:06 raulcd

will do, quick question. I've noticed that linting is failing the test. which linter should I use, nothing seems to be setup in the pyproject.toml. Just cython-lint ?

thanks

KirillTsyganov avatar Jun 18 '25 07:06 KirillTsyganov

We use pre-commit. As the error suggests you can try:

If you are seeing this message in CI, reproduce locally with: `pre-commit run --all-files`.
To run `pre-commit` as part of git workflow, use `pre-commit install`.
All changes made by hooks:
 diff --git a/python/pyarrow/_azurefs.pyx b/python/pyarrow/_azurefs.pyx
index d5d0660f7f..2131824b85 100644
--- a/python/pyarrow/_azurefs.pyx
+++ b/python/pyarrow/_azurefs.pyx
@@ -122,7 +122,8 @@ cdef class AzureFileSystem(FileSystem):
 
         if (tenant_id or client_id or client_secret):
             if not (tenant_id and client_id and client_secret):
-                raise ValueError("All of tenant_id, client_id, and client_secret must be provided for ClientSecretCredential.")
+                raise ValueError(
+                    "All of tenant_id, client_id, and client_secret must be provided for ClientSecretCredential.")
             options.ConfigureClientSecretCredential(
                 tobytes(tenant_id), tobytes(client_id), tobytes(client_secret)
             )

There seems to be some line length issues too:

 Python Format............................................................Failed
- hook id: autopep8
- files were modified by this hook
Python Lint..............................................................Failed
- hook id: flake8
- exit code: 1

python/pyarrow/tests/test_fs.py:1488:89: E501 line too long (108 > 88 characters)
python/pyarrow/tests/test_fs.py:1495:89: E501 line too long (108 > 88 characters)
python/pyarrow/tests/test_fs.py:1502:89: E501 line too long (108 > 88 characters)

raulcd avatar Jun 18 '25 08:06 raulcd

:warning: GitHub issue #46833 has been automatically assigned in GitHub to PR creator.

github-actions[bot] avatar Jun 18 '25 08:06 github-actions[bot]

@raulcd I've updated PR description, let me know if more information is required. thanks

KirillTsyganov avatar Jun 18 '25 08:06 KirillTsyganov

@raulcd all done, let me know if I need todo anything else.

p.s I figure out how to run pre-commit just for python part 😆

Screen Shot 2025-06-25 at 11 30 47 pm

KirillTsyganov avatar Jun 25 '25 13:06 KirillTsyganov

@raulcd I've changed if/else logic a little, just to have a more clear separation between ClientSecret and ManagedIdentity Azure credentials.

I've pushed it as a separate commit, so that it's easier to see, but let me know if you need all changes squashed as a single commit.

cheers

KirillTsyganov avatar Jun 26 '25 00:06 KirillTsyganov

BTW no need to squash, we squash on merge with a single commit per PR, you can push as many commits as necessary on the PR. Thanks!

raulcd avatar Jun 26 '25 06:06 raulcd

I think i have the logic you've described

If there's no client_id fail early without requiring more checks If there's no tenant and no client_secret --> ConfigureManagedIdentityCredential If there's tenant and client_secret --> ConfigureClientSecretCredential Otherwise fail

Let me know how it comes out

KirillTsyganov avatar Jun 26 '25 07:06 KirillTsyganov

Hey @raulcd I can see the error due to docs string in AzureFileSystem class. I just wanted to clarify how you want it? If you want thing alphabetical, then I'll need to rearrange parameters in the constructor, or I can leave params untoched and make doc string this order ?

Thanks

    Parameters
    ----------
    account_name : str
        Azure Blob Storage account name. This is the globally unique identifier for the 
        storage account.
    account_key : str, default None
        Account key of the storage account. If sas_token and account_key are None the 
        default credential will be used. The parameters account_key and sas_token are
        mutually exclusive.
    blob_storage_authority : str, default None
        hostname[:port] of the Blob Service. Defaults to `.blob.core.windows.net`. Useful
        for connecting to a local emulator, like Azurite.
    dfs_storage_authority : str, default None
        hostname[:port] of the Data Lake Gen 2 Service. Defaults to
        `.dfs.core.windows.net`. Useful for connecting to a local emulator, like Azurite.
    blob_storage_scheme : str, default None
        Either `http` or `https`. Defaults to `https`. Useful for connecting to a local 
        emulator, like Azurite.
    dfs_storage_scheme : str, default None
        Either `http` or `https`. Defaults to `https`. Useful for connecting to a local
        emulator, like Azurite.
    sas_token : str, default None
        SAS token for the storage account, used as an alternative to account_key. If sas_token
        and account_key are None the default credential will be used. The parameters
        account_key and sas_token are mutually exclusive.
    tenant_id : str, default None
        Tenant ID for Azure Active Directory authentication. Must be provided together with
        `client_id` and `client_secret` to use ClientSecretCredential.
    client_id : str, default None
        The client ID (Application ID) for Azure Active Directory authentication.
        Its interpretation depends on the credential type being used:
        - For `ClientSecretCredential`: It is the Application (client) ID of your
          registered Azure AD application (Service Principal). It must be provided
          together with `tenant_id` and `client_secret` to use ClientSecretCredential.
        - For `ManagedIdentityCredential`: It is the client ID of a specific
          user-assigned managed identity. This is only necessary if you are using a
          user-assigned managed identity and need to explicitly specify which one
          (e.g., if the resource has multiple user-assigned identities). For
          system-assigned managed identities, this parameter is typically not required.
    client_secret : str, default None
        Client secret for Azure Active Directory authentication. Must be provided together
        with `tenant_id` and `client_id` to use ClientSecretCredential.

KirillTsyganov avatar Jun 27 '25 10:06 KirillTsyganov

As the ones failing are key-only arguments and this won't change the API, I would update the order of the parameters on the __init__ function to match the ones on the docstring. Basically keeping alphabetical order for consistency as the driving decision (this is what we try to do in general unless there are API breaking issues)

raulcd avatar Jun 27 '25 11:06 raulcd

This is exciting ! thanks so much team 🙌

KirillTsyganov avatar Jun 30 '25 07:06 KirillTsyganov

Thanks for your work on the PR!

raulcd avatar Jun 30 '25 08:06 raulcd

After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit 953947a78157090ff94b14bc027cbcf520a5b2a6.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.