
Allow identity based authentication on Azure Batch

Open abhi18av opened this issue 1 year ago • 11 comments

This PR addresses the credential-less authentication mechanism against Azure Batch and related services.

abhi18av avatar Aug 22 '22 08:08 abhi18av

@pditommaso , I'd appreciate your input on design decisions here.

I have been studying the authentication mechanisms we can provide as an alternative to the use of storage/batch account keys, and I zeroed in on the service-principal based setup, which essentially:

  1. Requires the creation of a service principal
  2. Relies on authentication creds (client-id, client-secret, tenant-id)
  3. Can easily be integrated into the current codebase
  4. Needs only minimal changes to an NF-generated pool

However, the downside is that the authentication creds would still need to be available within the task environment for azcopy to log in as a service principal via azcopy login --service-principal --application-id $CLIENT_ID --tenant-id $TENANT_ID, with AZCOPY_SPA_CLIENT_SECRET available in the env.

Therefore, the effective change via this mechanism is that we would authenticate the per-node/task azcopy CLI for uploads without storage keys/SAS tokens, but we'd still rely upon the authentication creds (client-id, client-secret, tenant-id) being made available for every azcopy invocation.
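For illustration, the per-task login described above would look roughly like the following sketch (the variable names are placeholders injected into the task environment, not values Nextflow currently provides; the bucket URL is hypothetical):

```shell
# AZCOPY_SPA_CLIENT_SECRET is read implicitly by azcopy during login
export AZCOPY_SPA_CLIENT_SECRET="$CLIENT_SECRET"

# Authenticate azcopy as the service principal before any copy operation
azcopy login --service-principal \
    --application-id "$CLIENT_ID" \
    --tenant-id "$TENANT_ID"

# Subsequent transfers are authorized via the cached service-principal token
azcopy copy .command.out "https://<account>.blob.core.windows.net/<container>/workdir/"
```

This is exactly the exposure described above: all three creds plus the secret must reach every node.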


However, I think the central piece to address is authenticating the azcopy instance running per node/task in the Batch service, and there are basically three modes of authentication we can explore:

  1. Service Principal
  2. User-assigned Identity
  3. Managed Identity

The user-assigned identity requires certain metadata (client ID, object ID, or resource ID) to be made available to an Azure resource, but beyond this it is effectively the same as a managed identity ("managed" meaning managed by the Azure platform).

I'm currently working on the Managed Identity route and evaluating any corner cases:

  • Would only require us to run azcopy login --identity within .command.run
  • Requires certain properties to be enabled on the Batch account (such as Identity = SystemAssigned and Storage account authentication = "BatchAccountManagedIdentity")
  • Requires the Managed Identity to have sufficient permissions
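A minimal sketch of what the .command.run addition could look like under this mode (assuming azcopy is on the node's PATH, as in NF-generated pools; the container/account names are placeholders):

```shell
# Managed identity: no secrets needed; the node's identity token is obtained
# automatically from the Azure Instance Metadata Service.
azcopy login --identity

# Uploads/downloads are then authorized via the node's own identity
azcopy copy "$PWD/.command.log" "https://<account>.blob.core.windows.net/<container>/logs/"
```

The key difference from the service-principal mode is that nothing secret ever has to be distributed to the nodes.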

Managed Identity can be used with all the major services that make sense in a Nextflow (and Tower) context, such as

  • Monitoring
  • Logging
  • Managed Disk
  • Azure ARC Kubernetes
  • Key vault

Despite the extra expectations on the Batch account setup, this direction via user-assigned / managed identity seems security friendly - what do you think?

abhi18av avatar Aug 24 '22 05:08 abhi18av

However, the downside is that the authentication creds would still need to be available within the task environment for azcopy to log in as a service principal via azcopy login --service-principal --application-id $CLIENT_ID --tenant-id $TENANT_ID, with AZCOPY_SPA_CLIENT_SECRET available in the env.

Ouch, this is ugly. The problem is also the credentials for the Blob storage on the Nextflow side, which currently require the account key or SAS token:

https://github.com/nextflow-io/nextflow/blob/6f087fecade0293978af096c465dce9ea1e75f84/plugins/nf-azure/src/main/nextflow/cloud/azure/nio/AzFileSystemProvider.groovy#L202-L203

I think the best way to go would be to generate a SAS token when the pipeline is launched. This should also solve the problem for the azcopy copy tool. Have you considered this possibility?

pditommaso avatar Aug 24 '22 08:08 pditommaso

I think the best way to go would be to generate a SAS token when the pipeline is launched. This should also solve the problem for the azcopy copy tool. Have you considered this possibility?

Paolo, yes, this is another possibility. The tradeoff is that we get into the business of managing SAS tokens during a pipeline run cycle, and the mechanism would not be usable where SAS tokens are disabled (which is a recommended best practice), so we might inevitably run into the need for credential-less authentication anyway.
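To illustrate the token-management burden mentioned above: a launcher that mints SAS tokens would also need to track their expiry over the course of a run. A small sketch in Python, stdlib only; the `se` (signed expiry) query parameter is a standard part of SAS tokens, while the function itself and the refresh margin are hypothetical:

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlparse, parse_qs

def sas_needs_refresh(sas_url: str, margin_minutes: int = 30) -> bool:
    """Return True when the SAS token embedded in `sas_url` expires
    within `margin_minutes`, i.e. the launcher should mint a new one."""
    query = parse_qs(urlparse(sas_url).query)
    # 'se' is the SAS signed-expiry field, e.g. 2022-08-24T17:00:00Z
    # (real tokens may also use date-only or fractional-second forms)
    expiry = datetime.strptime(query["se"][0], "%Y-%m-%dT%H:%M:%SZ") \
                     .replace(tzinfo=timezone.utc)
    return expiry - datetime.now(timezone.utc) < timedelta(minutes=margin_minutes)

# An already-expired token clearly needs refreshing
url = "https://acct.blob.core.windows.net/data?sv=2021-06-08&se=2020-01-01T00%3A00%3A00Z&sig=xxx"
print(sas_needs_refresh(url))  # True
```

Long-lived tokens avoid the refresh logic but widen the exposure window, which is part of why disabling SAS entirely is recommended practice.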

I've had good success with the managed identity (credential-less) solution, which relies on the fact that any VM in a Batch pool is an Azure resource and can natively authenticate with Azure AD (if configured as such) - I feel I'm close 🤞 😊

abhi18av avatar Aug 24 '22 16:08 abhi18av

Bingo!

abhinav@0239xxxxxxx2684bb1xxxxxxxx000:/mnt/batch/tasks/shared/bin$ ./azcopy login --identity
INFO: azcopy: A newer version 10.16.0 is available to download

INFO: Login with identity succeeded.
INFO: azcopy: A newer version 10.16.0 is available to download

And azcopy login --identity is all we'd need to run in the .command.run script to allow any interaction with the Blob storage 🤝

abhi18av avatar Aug 24 '22 17:08 abhi18av

What about Blob storage client run by nextflow itself?

pditommaso avatar Aug 24 '22 17:08 pditommaso

What about Blob storage client run by nextflow itself?

Do you mean the head job? If so, then we need to bring this library https://github.com/AzureAD/microsoft-authentication-library-for-java into the codebase to allow the head job to source the authentication from the VM it's running on - I'm evaluating this now for our use cases.
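For reference, sourcing a token from the VM the head job runs on boils down to a call against the Azure Instance Metadata Service (IMDS), which is what MSAL and azure-identity do under the hood; this only works from inside an Azure VM/Batch node:

```shell
# 169.254.169.254 is the fixed, non-routable IMDS address; the Metadata
# header is mandatory and prevents the call being proxied from outside.
curl -s -H "Metadata: true" \
  "http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&resource=https://storage.azure.com/"
# Response is JSON containing an access_token scoped to Blob storage
```

Outside Azure (the hybrid case) this endpoint is unreachable, which is why the head job needs a different credential source there.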

abhi18av avatar Aug 24 '22 18:08 abhi18av

Yes, but it looks to me that it requires too much customization that's hard to test.

Above all, once Batch is authenticated, nextflow is able to automatically create a SAS token for the Blob storage.

https://github.com/nextflow-io/nextflow/blob/27636f9d27c56d286e1e6173e66d97c971c5d6d3/plugins/nf-azure/src/main/nextflow/cloud/azure/batch/AzBatchExecutor.groovy#L97-L98

Therefore keep the blob storage auth on the SAS token and focus on finalising the Batch authentication

pditommaso avatar Aug 29 '22 10:08 pditommaso

Ouch, that was not accurate - the SAS token is created using the storage client

https://github.com/nextflow-io/nextflow/blob/27636f9d27c56d286e1e6173e66d97c971c5d6d3/plugins/nf-azure/src/main/nextflow/cloud/azure/batch/AzHelper.groovy#L64-L64

which is created with the storage keys :/

https://github.com/nextflow-io/nextflow/blob/27636f9d27c56d286e1e6173e66d97c971c5d6d3/plugins/nf-azure/src/main/nextflow/cloud/azure/nio/AzFileSystemProvider.groovy#L205-L208

pditommaso avatar Aug 29 '22 10:08 pditommaso

@pditommaso , yes, that's why I'm trying to find a user-friendly, credential-less auth mechanism to allow users to run Nextflow

Case-1: Entirely within the Azure infra (relying upon Identity) - this path is straightforward now.

Case-2: Hybrid mode, with the head job not running on Azure infra ↔️ this is where I'm iterating a bit

UPDATE:

@pditommaso , how about the following scenarios for Case-2:

  1. The service principal is used for authentication and for initiating the local <-> Azure access

  2. If the Batch pool is configured with a managed identity, then rely upon the managed identity (no credentials needed there)

  3. Otherwise, rely upon the service principal credentials for any authentication.
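The preference order above (managed identity when the pool supports it, service principal otherwise, keys only as a last resort) could be expressed as a simple selection routine - a sketch only, with hypothetical names, not the actual nf-azure config keys:

```python
from typing import Optional

def pick_auth_mode(pool_has_managed_identity: bool,
                   service_principal: Optional[dict],
                   account_key: Optional[str]) -> str:
    """Prefer credential-less auth, then service principal, then keys."""
    if pool_has_managed_identity:
        return "managed-identity"    # no credentials to distribute or rotate
    if service_principal:
        return "service-principal"   # client-id/secret/tenant-id, rotatable
    if account_key:
        return "account-key"         # legacy fallback
    raise ValueError("no usable Azure credentials configured")

print(pick_auth_mode(False, {"client_id": "x"}, None))  # service-principal
```

The point of the ordering is that each step down the list widens the secret-handling surface.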

Managing and rotating a service principal and its associated creds is much more scalable than relying upon the Batch/Storage keys.

Also, a user who has to interact with the Blob service (via azcopy) is likely to already have a service principal, given the tool's design for data transfers.

What do you think?

abhi18av avatar Aug 30 '22 06:08 abhi18av

@pditommaso @jorgeaguileraseqera , this PR is ready to be reviewed from my side - I'd appreciate any feedback please 🙏

  • Service principal should have the following permissions

    • Contributor
    • Storage Blob Data Contributor
    • Storage Blob Data Owner
  • Config level changes

    • azure.activeDirectory config
    • storage and batch accountKey are no longer required
workDir = "az://rnaseq-nf/rnaseq-nf-workdir"

process {
    executor = 'azurebatch'
    container = 'quay.io/nextflow/rnaseq-nf:latest'
}

params {
    outdir = "az://rnaseq-nf/rnaseq-nf-publishdir"
}

process.executor = 'azurebatch'
process.queue =  "$AZURE_BATCH_POOL_NAME"


azure {

    activeDirectory {
        servicePrincipalId = "bXXXbXXX-6XXc-X2xx-8xx4-f02e6xxxxxc" // OR "$AZURE_CLIENT_ID"
        servicePrincipalSecret = "BXXQ~UuxxxO-kXXXXX1yjG5agXXXXXLweXXXX" // OR "$AZURE_CLIENT_SECRET"
        tenantId = "XX51xxxx-XXXX-4XX8-XXXX-1XXXXXXeb" // OR "$AZURE_TENANT_ID"
    }

    storage {
        accountName = "$AZURE_STORAGE_ACCOUNT_NAME"
    }

    batch {
        location = "westeurope"
        accountName = "$AZURE_BATCH_ACCOUNT_NAME"

        deleteJobsOnCompletion = false

        pools {
             "$AZURE_BATCH_POOL_NAME" {
                vmType = 'STANDARD_D4_V3'
                privileged = false
            }
        }
    }
}

  • Changes to log entries - logs regarding creation of the Blob and Batch clients using Azure Active Directory identity:
...
Sept-14 07:52:53.852 [main] DEBUG n.c.azure.nio.AzFileSystemProvider - Creating Azure Blob storage client using Service Principal credentials

...
Sept-14 07:53:03.274 [Thread-5] INFO  c.a.identity.ClientSecretCredential - Azure Identity => getToken() result for scopes [https://storage.azure.com/.default]: SUCCESS
Sept-14 07:53:03.276 [Thread-5] INFO  c.a.c.i.AccessTokenCache - Acquired a new access token.
...
Sept-14 07:53:13.175 [Task submitter] DEBUG n.cloud.azure.batch.AzBatchService - [AZURE BATCH] Creating Azure Batch client using Service Principal credentials


Tested scenarios

Case-1: Same blob container for work/results ✅

Case-2: Different blob container for work/results ✅

Improvements (TBD)

  • Disable logging from the underlying library
Sept-14 07:53:14.694 [reactor-http-kqueue-3] WARN  c.a.c.u.s.SerializerEncoding - 'Content-Type' not found. Returning default encoding: JSON

abhi18av avatar Sep 14 '22 06:09 abhi18av

I don't know these APIs very well but I didn't see any issues with the design.

bentsherman avatar Sep 14 '22 15:09 bentsherman

@pditommaso , I've added the doc entries in the PR now.

abhi18av avatar Sep 27 '22 18:09 abhi18av

:warning: 7 God Classes were detected by Lift in this project. Visit the Lift web console for more details.

sonatype-lift[bot] avatar Sep 27 '22 18:09 sonatype-lift[bot]

@pditommaso , shall we take this merge forward?

abhi18av avatar Oct 24 '22 10:10 abhi18av

Let's merge this 🚀

pditommaso avatar Nov 04 '22 09:11 pditommaso


(celebration GIF via Justin on GIPHY)

abhi18av avatar Nov 04 '22 13:11 abhi18av