
Not authenticated to use blobs outside of Azure blob container working directory when using Azure Entra

Open adamrtalbot opened this issue 1 year ago • 16 comments

Related to #5448 and #5444, but both of those issues refer to using Fusion; this one refers to using azcopy.

They are likely to be solved by the same method, since they have the same underlying challenge: how to pass authentication to the worker node (Batch) from Nextflow.

I seem to be able to recreate the issue without Fusion.

> nextflow run seqeralabs/nf-canary -r main --remoteFile az://igenomes/atacseq_samplesheet_custom.csv --run TEST_STAGE_REMOTE -w az://scidev-useast -c azure.config
N E X T F L O W  ~  version 24.10.3
NOTE: Your local project version looks outdated - a different revision is available in the remote repository [c818260035]
Launching `https://github.com/seqeralabs/nf-canary` [magical_noyce] DSL2 - revision: 2ad4214f51 [main]
Uploading local `bin` scripts folder to az://scidev-useast/tmp/cf/bcc6a54f6a9dd33780a5251d956439/bin
[69/6f65a5] Submitted process > NF_CANARY:TEST_STAGE_REMOTE (1)
ERROR ~ Error executing process > 'NF_CANARY:TEST_STAGE_REMOTE (1)'

Caused by:
  Process `NF_CANARY:TEST_STAGE_REMOTE (1)` terminated with an error exit status (1)


Command executed:

  cat atacseq_samplesheet_custom.csv

Command exit status:
  1

Command output:
  (empty)

Work dir:
  az://scidev-useast/69/6f65a5549f7a3b2357312b12a28996

Container:
  docker.io/library/ubuntu:23.10

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`

 -- Check '.nextflow.log' file for details
Execution cancelled -- Finishing pending tasks before exit

azure.config:

process.executor = 'azurebatch'

fusion {
    enabled = false
}

azure {

    storage {
        accountName = 'seqeralabs'
    }

    batch {
        location = 'eastus'
        accountName = 'seqeralabs'
        copyToolInstallMode = 'node'
        autoPoolMode = true
        allowPoolCreation = true
        deletePoolsOnCompletion = false
    }

    activeDirectory {
        servicePrincipalId = 'redacted'
        servicePrincipalSecret = 'redacted'
        tenantId = 'redacted'
    }
}

And with an access key:

To reiterate what's been said above, the error appears to stem from generateContainerSasWithActiveDirectory, which is only generating a relevant key for the working container and nothing else. Generating an account level SAS seems tricky (according to @alberto-miranda).
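For illustration, here is a minimal sketch of what the access-key variant of the storage scope might look like. It assumes the `azure.storage.accountKey` option from the Nextflow Azure documentation. The key value is a placeholder, and this is not the configuration used in the run above. Later comments in this thread report that key-based setups could read from multiple containers, which is the behaviour missing from the Entra path.

```groovy
// Sketch only: placeholder values, not the config used in the run above.
azure {
    storage {
        accountName = 'seqeralabs'
        // Assumption: with an account key, access is not limited to the
        // working-directory container, unlike the container-scoped SAS
        // generated when authenticating with Entra.
        accountKey = '<storage-account-key>'
    }
}
```

This only helps where storage account keys are permitted; it does not address the Entra-only case this issue is about.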

Originally posted by @adamrtalbot in https://github.com/nextflow-io/nextflow/issues/5444#issuecomment-2590156438

adamrtalbot avatar Jan 14 '25 15:01 adamrtalbot

@alberto-miranda here is a method we could tell nextflow to pass the details to the worker task, this could help with #5444 and #5448.

It's pretty crude right now.

adamrtalbot avatar Jan 14 '25 16:01 adamrtalbot

> @alberto-miranda here is a method we could tell nextflow to pass the details to the worker task, this could help with #5444 and #5448.
>
> It's pretty crude right now.

Apologies for the delay, but it is great that we are finally moving forward with this 😄. I'm happy to support this on the Fusion side of things, so let's sync!

alberto-miranda avatar Feb 04 '25 05:02 alberto-miranda

I wrote a couple of PRs to support authenticating with Managed Identities in Fusion v2.4 and the upcoming v2.5. System-wide Managed Identities work out of the box; user-assigned ones require a single environment variable to be injected into worker nodes. So we should be set if we can make Nextflow:

  1. Automatically assign a system-wide MI to pool nodes; or
  2. Automatically assign a user-assigned MI to pool nodes and inject an environment variable

(I personally prefer option 1)

alberto-miranda avatar Feb 05 '25 15:02 alberto-miranda

Likely both should be supported

pditommaso avatar Feb 05 '25 15:02 pditommaso

I'm not sure option 1 is supported by Azure Batch.

Option 2 is implemented as https://github.com/nextflow-io/nextflow/pull/5670

adamrtalbot avatar Feb 05 '25 19:02 adamrtalbot

I think option 1 means that Nextflow creates an MI automatically and uses it if the user doesn't provide one.

pditommaso avatar Feb 06 '25 13:02 pditommaso

I interpreted it as system-assigned vs user-assigned: https://learn.microsoft.com/en-us/entra/identity/managed-identities-azure-resources/overview#managed-identity-types

@alberto-miranda is this what you meant?

Fusion should support both anyway.

adamrtalbot avatar Feb 06 '25 14:02 adamrtalbot

> I interpreted it as system-assigned vs user-assigned: https://learn.microsoft.com/en-us/entra/identity/managed-identities-azure-resources/overview#managed-identity-types
>
> @alberto-miranda is this what you meant?

Yeah exactly, the naming is fairly misleading. The major difference for us is that for a System-wide MI we don't need any extra info in the worker nodes, whereas for a User-assigned MI we would need a Client ID or Resource ID to validate against.

As far as I understand both types can be configured by users in Azure and, if they do so, Nextflow already has a mechanism to let them choose one or the other in nextflow.config (https://www.nextflow.io/docs/latest/azure.html#managed-identities). The only piece missing, I believe, would be to propagate this information to worker nodes (which is covered by Adam's effort in #5670).
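As a rough sketch of the existing mechanism mentioned above, the managed-identities section of the Nextflow Azure documentation describes two forms of the `azure.managedIdentity` scope. The client ID below is a placeholder. As far as this thread indicates, these settings control how Nextflow itself authenticates; they do not by themselves propagate anything to worker nodes.

```groovy
// Sketch based on the managed-identities section of the Nextflow Azure docs.
// Use one of the two forms, not both.

// System-assigned managed identity: no extra identifier is needed.
azure {
    managedIdentity {
        system = true
    }
}

// User-assigned managed identity: selected by its client ID (placeholder).
// azure {
//     managedIdentity {
//         clientId = '<user-assigned-managed-identity-client-id>'
//     }
// }
```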

It's a slightly different story if we want Nextflow to automatically create these MIs and attach them to nodes from a pool: in this case users would not provide anything in nextflow.config (besides maybe their wish to use MIs) and Nextflow would take care of everything behind the scenes.

> Fusion should support both anyway.

Fusion will be ready to support both as soon as https://github.com/seqeralabs/fusion/pull/716 and https://github.com/seqeralabs/fusion/pull/718 are merged.

alberto-miranda avatar Feb 06 '25 15:02 alberto-miranda

> The only piece missing, I believe, would be to propagate this information to worker nodes (which is covered by Adam's effort in https://github.com/nextflow-io/nextflow/pull/5670).

The worker nodes do not support system-assigned identities, just user-assigned ones. I believe @swampie couldn't work out how to attach a managed identity to a node pool programmatically :(

adamrtalbot avatar Feb 06 '25 21:02 adamrtalbot

Gpt:

Azure Batch worker nodes can utilize managed identities, but they support only user-assigned managed identities, not system-assigned ones. This means you need to create a user-assigned managed identity and associate it with your Batch pool to enable your compute nodes to securely access other Azure resources without managing credentials.

The system-assigned managed identity created for a Batch account is intended solely for accessing Azure Key Vault for customer-managed keys and is not available on compute nodes

pditommaso avatar Feb 06 '25 21:02 pditommaso

> Gpt:
>
> Azure Batch worker nodes can utilize managed identities, but they support only user-assigned managed identities, not system-assigned ones. This means you need to create a user-assigned managed identity and associate it with your Batch pool to enable your compute nodes to securely access other Azure resources without managing credentials.
>
> The system-assigned managed identity created for a Batch account is intended solely for accessing Azure Key Vault for customer-managed keys and is not available on compute nodes

From the Java SDK for Azure it appears to be possible to create a User-assigned MI with new UserAssignedIdentities() and assign it to a pool with BatchAccount.DefinitionStages.WithIdentity.

EDIT: I asked Claude.ai and the response I got was similar (not verified, not tested):

@Grab(group='com.microsoft.azure', module='azure-batch', version='9.0.0')

import com.microsoft.azure.batch.*
import com.microsoft.azure.batch.auth.*
import com.microsoft.azure.batch.protocol.models.*

def configureBatchPoolWithManagedIdentity() {
    // Azure Batch account details (placeholders)
    def batchAccountName = "your-batch-account"
    def batchAccountKey = "your-batch-account-key"
    def region = "eastus"  // Azure region of the Batch account
    def batchAccountUrl = "https://${batchAccountName}.${region}.batch.azure.com"
    
    // Create batch credentials
    def credentials = new BatchSharedKeyCredentials(
        batchAccountUrl,
        batchAccountName,
        batchAccountKey
    )
    
    // Create batch client
    def batchClient = BatchClient.open(credentials)
    
    // User-assigned managed identity details
    def userAssignedIdentityId = "/subscriptions/<subscription-id>/resourcegroups/<resource-group>/providers/Microsoft.ManagedIdentity/userAssignedIdentities/<identity-name>"
    
    // Create identity reference
    def identityReference = new UserAssignedIdentities()
    identityReference.resourceId = userAssignedIdentityId
    
    // Create pool identity configuration
    def poolIdentityConfig = new PoolIdentityConfiguration()
        .withType(PoolIdentityType.USER_ASSIGNED)
        .withUserAssignedIdentities([
            (userAssignedIdentityId): identityReference
        ])
    
    // Create pool specification
    def poolSpec = new PoolAddParameter()
        .withId("pool-with-managed-identity")
        .withVmSize("Standard_D2s_v3")
        .withTargetDedicatedNodes(2)
        .withIdentity(poolIdentityConfig)
        // Configure the pool's virtual machine configuration
        .withVirtualMachineConfiguration(
            new VirtualMachineConfiguration()
                .withImageReference(
                    new ImageReference()
                        .withPublisher("microsoft-azure-batch")
                        .withOffer("ubuntu-server-container")
                        .withSku("20-04-lts")
                        .withVersion("latest")
                )
                .withNodeAgentSkuId("batch.node.ubuntu 20.04")
        )
    
    try {
        // Create the pool
        batchClient.poolOperations().createPool(poolSpec)
        println "Successfully created pool with managed identity"
    } catch (BatchErrorException e) {
        println "Error creating pool: ${e.getMessage()}"
    } finally {
        batchClient.close()
    }
}

// Execute the configuration
configureBatchPoolWithManagedIdentity()

alberto-miranda avatar Feb 07 '25 08:02 alberto-miranda

Resolved by #6118

bentsherman avatar Jun 16 '25 17:06 bentsherman

@bentsherman I've tested it without Fusion using the latest Edge version, and it still doesn't seem to work. Moreover, it still seems to require azure.storage.accountKey when I use it like so:

azure {
  managedIdentity {
    clientId = azure_config["userAssignedManagedIdentityClientId"]
  }
  storage {
    accountName = azure_config["storageAccountName"]
  }
  batch {
    location = 'eastus'
    accountName = azure_config["batchAccountName"]
    poolIdentityClientId = azure_config["userAssignedManagedIdentityClientId"]

    allowPoolCreation = true
    deleteJobsOnCompletion = true
    copyToolInstallMode = 'node'
    pools {
      test_private_mount {
        vmType = 'Standard_D2as_v4'
        virtualNetwork = azure_config["virtualNetwork"]
      }
    }
  }
}

Looking at the code in #6118, it seems that the change applies only to Fusion (https://github.com/nextflow-io/nextflow/blob/1a4c3987a2b09e02bedaa1b7da1d80f65efcaaea/plugins/nf-azure/src/main/nextflow/cloud/azure/batch/AzBatchService.groovy#L561).

Given that, was this closed incorrectly perhaps? Thank you

dinvlad avatar Jun 20 '25 19:06 dinvlad

Correct, this is Fusion only so this issue shouldn't be closed.

adamrtalbot avatar Jun 23 '25 09:06 adamrtalbot

Hi,

I wanted to point out that this issue will become more prevalent now that you need to use Entra, because Low Priority VMs are going to be deprecated and you need to switch to User Subscription Batch Accounts.

We have been using keys until now and have been able to use multiple containers, but now Nextflow creates a SAS key for the working directory and tries to use it for all the containers, which does not work.

Anybody that needs to migrate will have this issue.

https://learn.microsoft.com/en-us/azure/batch/batch-spot-vms

luanjot avatar Aug 18 '25 14:08 luanjot

@luanjot yes, this has impacted us after the move over. Our IT security policy means we are not able to use storage account access keys.

mark-liddell avatar Dec 01 '25 10:12 mark-liddell