Add "wait" and "retry" deployment options

Open rshariy opened this issue 5 years ago • 93 comments

ARM template deployment often fails with errors like:

"Another operation is in progress on the selected item. If there is an in-progress operation, please retry after it has finished."

"BMSUserErrorObjectLocked","message":"Another operation is in progress on the selected item."

Just to clarify: this is not a dependency issue. An ARM deployment may fail if, for example, you try to add a VM to an RSV while another VM is being added at the same time: for a few seconds the RSV will not accept new clients, and as a result your deployment will fail.

Would like to have an option to pause a deployment and/or retry it, maybe by introducing "wait" and "retry" deployment conditions, e.g.:

resource blob 'Microsoft.Storage/storageAccounts/blobServices/containers@2019-06-01' = {
    wait: 30
    retry: 5
    name: '${stg.name}/default/logs'
}

rshariy avatar Nov 26 '20 01:11 rshariy

Understood. This is something we have been considering, but haven't scheduled the work yet. If you (or others) have other examples that you have run into, it would be great to capture those here.

I know RBAC replication (and replication delays in general) are another place where something like this would be helpful.

alex-frankel avatar Nov 30 '20 21:11 alex-frankel

I know RBAC replication (and replication delays in general) are another place where something like this would be helpful.

@alex-frankel I'm assuming this is something we're planning on also addressing in the underlying platform? This feels like a leaky abstraction, not something that the end-user should have to deal with by adding delays.

anthony-c-martin avatar Dec 02 '20 17:12 anthony-c-martin

This feels like a leaky abstraction, not something that the end-user should have to deal with by adding delays.

Agreed. @bmoore-msft and I were also discussing this yesterday. Ideally, ARM will co-locate all the calls end-to-end so a user never has to think about this. Not sure if/when that will be possible, and this may be a necessary evil in the meantime.

alex-frankel avatar Dec 02 '20 17:12 alex-frankel

The OP doesn't sound like replication (feels like concurrency), though I could see that you could potentially address both with something like retry. The problem in this case (or either case, really) is indefinite postponement. This feels like a problem with the RP: common operations returning frequent 400s instead of maybe 429.

The challenge with this workaround is that not only does the user have to fail and then implement a non-deterministic workaround (that's expensive on the service), it will mask problems across ARM, RPs and user code.

@rshariy - have you raised this issue with the RSV team? It doesn't appear to be an uncommon problem and seems like it should be addressed by the RSV... either it shouldn't happen or we're not helping customers figure out how to effectively use RSV.

bmoore-msft avatar Dec 02 '20 22:12 bmoore-msft

@bmoore-msft I raised a similar issue with the Azure Firewall product team about a year ago - the only solution we found is to use a PowerShell function to check the Azure FW status (make sure it is not "updating") before kicking off a new ARM deployment to the FW.

Just logged ticket 120120226003381 about the RSV issue - let's see what MS support will come up with.

rshariy avatar Dec 02 '20 22:12 rshariy

it will mask problems across ARM, RPs and user code.

This point is what makes us cautious about implementing something like this. We have some potential solutions to deal with the replication delay in particular that we will explore before introducing a wait.

@rshariy - please let us know the resolution of the case.

alex-frankel avatar Dec 03 '20 00:12 alex-frankel

I have a main template that looks like this:

module kv 'keyvault.bicep' = {
  name: 'kvSmoketestDeploy'
  scope: rg
  params: {
    keyVaultName: keyVaultName
    enableSoftDelete: false
  }
}

module kvaccpol 'keyvaultaccesspolicy.bicep' = {
  name: 'kvAccPolSmoketestDeploy'
  scope: rg
  params: {
    keyVaultName: keyVaultName
    action: 'add'
    objectId: objectId
    access: keyVaultAccessPolicyAccess
  }
}

When that runs, the deployment breaks with:

{
   "error": {
     "code": "ParentResourceNotFound",
     "message": "Can not perform requested operation on nested resource. Parent resource 'kv-kvaccpoltest' not found."
   }
} (Code:NotFound)

Running the deployment again deploys the policy.

Agazoth avatar Mar 31 '21 05:03 Agazoth
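
One possible reading of the ParentResourceNotFound error above, which the thread does not confirm, is that both modules receive keyVaultName as a plain parameter, so Bicep sees no link between them and deploys them in parallel. Under that assumption, a minimal sketch of the second module with an explicit dependency would be:

```bicep
// Sketch only: assumes the failure comes from kv and kvaccpol running in
// parallel because nothing in the template links the two modules.
module kvaccpol 'keyvaultaccesspolicy.bicep' = {
  name: 'kvAccPolSmoketestDeploy'
  scope: rg
  params: {
    keyVaultName: keyVaultName
    action: 'add'
    objectId: objectId
    access: keyVaultAccessPolicyAccess
  }
  // Explicit dependency: start only after the Key Vault module has completed.
  dependsOn: [
    kv
  ]
}
```

If keyvault.bicep exposed the vault name as an output (a hypothetical change, not shown in the post), passing that output into kvaccpol would create the same ordering implicitly. Neither variant helps if the real cause is a replication delay rather than ordering.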

I ran into a scenario where I'd like a wait. There's not much code to show: I deploy a Function App, then want to output the default key for use in API Management. The problem is that the function app takes some time to spin up before the app keys are present...

resource functionApp 'Microsoft.Web/sites@2020-06-01' = {
  name: functionAppName
  location: location
  kind: 'functionapp'
...

output functionappdefaultkey string = listKeys('${functionApp.id}/host/default', functionApp.apiVersion).functionKeys.default

The workaround is to run the initial deployment of the function app twice.

eja-git avatar Apr 14 '21 21:04 eja-git

@eja-git this isn't a "wait" scenario, it's a bug in the deployment engine job scheduling... the listKeys job is scheduled too early... so that's the fix for your particular scenario.

bmoore-msft avatar Apr 19 '21 15:04 bmoore-msft

Hi,

I've logged the following issue https://github.com/projectkudu/kudu/issues/3312#issuecomment-870741730 that could also benefit from the wait option during a deployment.

Best Regards Pieter

Pietervanhove avatar Jul 01 '21 11:07 Pietervanhove

I am trying to simplify firewall rule collection deployment by using loadTextContent and then looping over each variable. Each workload-x.json contains all the properties for a rule collection.

var workloads = [
  json(loadTextContent('./workload-1.json'))
  json(loadTextContent('./workload-2.json'))
  json(loadTextContent('./workload-3.json'))
]

resource afwPolicy 'Microsoft.Network/firewallPolicies@2021-02-01' existing = {
  name: 'bicepRules'
}

resource collectionGroups 'Microsoft.Network/firewallPolicies/ruleCollectionGroups@2021-02-01' = [for workload in workloads: {
  name: workload.name
  parent: afwPolicy
  properties: workload.properties
}]

Here is the error I get:

Rule Collection Group workload-2 can not be updated because Parent Firewall Policy bicepRules is in Updating state from previous operation

I am sure that a short delay between deployments would help us loop through the whole array.

azMantas avatar Oct 01 '21 09:10 azMantas

Only one Rule Collection Group can be updated at a time with Azure Firewall Policy. Since the update refreshes all of the connected Azure Firewall instances, the amount of time it takes to update is non-deterministic. Therefore you will need to serialize the deployment using the batchSize decorator.

Can you try:

@batchSize(1)
resource collectionGroups 'Microsoft.Network/firewallPolicies/ruleCollectionGroups@2021-02-01' = [for workload in workloads: {
  name: workload.name
  parent: afwPolicy
  properties: workload.properties
}]

SenthuranSivananthan avatar Oct 01 '21 10:10 SenthuranSivananthan

I have two scenarios that come to mind from recent experience.

An overarching enterprise management-level policy is applied to a resource that has just been created and that I reference in the next resource/module, causing the "Another operation" error. A retry would be useful here, as I have no control or influence over the policies.

I have also faced situations where a newly created resource is not available when referenced immediately afterwards, which I assume is a replication/caching issue, as the next run works flawlessly.

SQLDBAWithABeard avatar Oct 01 '21 11:10 SQLDBAWithABeard

My scenario includes creating a Cosmos Account, this typically takes a few minutes and sometimes up to 10 minutes. In this case I am unable to use the resource output to set the connection string for use in subsequent modules, e.g. passing it into keyVault and functionAppSettings.

wsucoug69 avatar Nov 08 '21 15:11 wsucoug69

My scenario includes creating a Cosmos Account, this typically takes a few minutes and sometimes up to 10 minutes.

@markjbrown - do you mind taking a look at this one? I'd expect the Cosmos Account not to report complete until it is fully provisioned. @zapadoody -- do you happen to have the code sample of the repro and a correlation ID when the error occurred?

alex-frankel avatar Nov 08 '21 16:11 alex-frankel

For run-time deployment errors you should raise a support ticket as they are best equipped to diagnose specific errors with an activity id.

However, I am happy to look at an existing Bicep file to see if there are any issues.

I do have a sample on how to output the endpoint and key from a Cosmos account and input into appSettings for an App Service here if that helps.

https://github.com/Azure/azure-quickstart-templates/blob/master/quickstarts/microsoft.documentdb/cosmosdb-webapp/main.bicep

markjbrown avatar Nov 08 '21 16:11 markjbrown

here's my cosmosAccount.bicep

param location string
param cosmosAccountName string
param cosmosDefaultConsistencyPolicy string 
param cosmosPrimaryRegion string
param cosmosSecondaryRegion string

var lowerCosmosAcctName = toLower(cosmosAccountName)
var locations = [
  {
    locationName: cosmosPrimaryRegion
    failoverPriority: 0
    isZoneRedundant: false
  }
  {
    locationName: cosmosSecondaryRegion
    failoverPriority: 1
    isZoneRedundant: false
  }
]

resource cosmosAccountResource 'Microsoft.DocumentDB/databaseAccounts@2021-06-15' = {
  name: lowerCosmosAcctName
  kind: 'GlobalDocumentDB'
  location: location
  properties: {
    locations: locations
    databaseAccountOfferType: 'Standard'
    enableAutomaticFailover: true
    consistencyPolicy: {
      defaultConsistencyLevel: cosmosDefaultConsistencyPolicy
    }
  }
}


output cosmosAccountResourceName string = cosmosAccountResource.name

here's the KeyVault.bicep

param location string 
param keyVaultName string
param productionPrincipalId string
param productionTenantId string
param stagingPrincipalId string
param stagingTenantId string

@secure()
param cosmosPrimaryConnectionString string

@secure()
param cosmosSecondaryConnectionString string

@secure()
param serviceStorageConnectionString string

@secure()
param appStorageConnectionString string


resource keyVault 'Microsoft.KeyVault/vaults@2019-09-01' = {
  name: keyVaultName
  location: location
  properties: {
    enabledForDeployment: true
    enabledForTemplateDeployment: true
    enabledForDiskEncryption: true
    tenantId: productionTenantId
    accessPolicies: [
      {
        tenantId: productionTenantId
        objectId: productionPrincipalId
        permissions: {
          secrets: [
            'get'
            'list'
          ]
        }
      }
      {
        tenantId: stagingTenantId
        objectId: stagingPrincipalId
        permissions: {
          secrets: [
            'get'
            'list'
          ]
        }
      }
    ]
    sku: {
      name: 'standard'
      family: 'A'
    }
  }  
}

resource cosmosPrimaryConnectionStringSecret 'Microsoft.KeyVault/vaults/secrets@2019-09-01' = {
  name: '${keyVaultName}/cosmosPrimaryConnectionString'
  properties: {
    value: cosmosPrimaryConnectionString
  }
  dependsOn:[
    keyVault
  ]
}

resource cosmosSecondaryConnectionStringSecret 'Microsoft.KeyVault/vaults/secrets@2019-09-01' = {
  name: '${keyVaultName}/cosmosSecondaryConnectionString'
  properties: {
    value: cosmosSecondaryConnectionString
  }
  dependsOn:[
    keyVault
  ]
}

resource serviceStorageConnectionStringSecret 'Microsoft.KeyVault/vaults/secrets@2019-09-01' = {
  name: '${keyVaultName}/dbConnectionString'
  properties: {
    value: serviceStorageConnectionString
  }
  dependsOn:[
    keyVault
  ]
}

resource appStorageConnectionStringSecret 'Microsoft.KeyVault/vaults/secrets@2019-09-01' = {
  name: '${keyVaultName}/appStorageConnectionString'
  properties: {
    value: appStorageConnectionString
  }
  dependsOn:[
    keyVault
  ]
}

output appStorageConnectionStringUri string = appStorageConnectionStringSecret.properties.secretUri
output serviceStorageConnectionStringUri string = serviceStorageConnectionStringSecret.properties.secretUri
output cosmosPrimaryConnectionStringUri string = cosmosPrimaryConnectionStringSecret.properties.secretUri
output cosmosSecondaryConnectionStringUri string = cosmosSecondaryConnectionStringSecret.properties.secretUri

and here's the main.bicep

/// cosmos db account, database and container module
module cosmosAccountMod '../cosmosAccount.bicep' = {
  name: 'cosmosAccount-${environmentName}-${buildNumber}'
  params: {
    cosmosAccountName: cosmosAccountName
    cosmosDefaultConsistencyPolicy: cosmosDefaultConsistencyPolicy
    cosmosPrimaryRegion: cosmosPrimaryRegion
    cosmosSecondaryRegion: cosmosSecondaryRegion
    location: location
  }
}

module cosmosDatabaseMod '../cosmosDbContainer.bicep' = {
  name: 'cosmosDBContainer-${environmentName}-${buildNumber}'
  params: {
    cosmosAccountName: cosmosAccountMod.outputs.cosmosAccountResourceName
    cosmosContainerName: cosmosContainerName
    cosmosDatabaseName: cosmosDatabaseName
    cosmosThroughput: cosmosThroughput
  }
  dependsOn: [
    cosmosAccountMod
  ]
}

// storage account module - storage for the tenants application 
module appStorageAccountMod '../storageAccount.bicep' = {
  name: 'appStorageAcctName-${environmentName}-${buildNumber}'
  params: {
    storageAcctName: appStorageAcctName
    storageSkuName: appStorageAcctSku
    location: location
  }
}

// app insights module
module appInsightsMod '../appInsights.bicep' = {
  name: 'appInsightsName-${environmentName}-${buildNumber}'
  params: {
    name: appInsightsName
    resourceGroupLocation: location
  }
}

// app service plan module
module appServicePlanMod '../appServicePlan.bicep' = {
  name: 'appServicePlan-${environmentName}-${buildNumber}'
  params: {
    appSvcPlanSku: appSvcPlanSku
    appSvcPlanTier: appSvcPlanTier
    appSvcPlanName: appSvcPlanName
    appPlanLocation: location
  }
}

// function app module
module functionAppMod '../functionApp.bicep' = {
  name: 'functionApp-${environmentName}-${buildNumber}'
  params: {
    appSvcPlanName: appSvcPlanName
    functionAppName: functionAppName
    location: location
  }
  dependsOn: [
    appStorageAccountMod
    appServicePlanMod
    cosmosAccountMod
  ]
}

// service storage account module - storage for the function app 
module serviceStorageAccountMod '../storageAccount.bicep' = {
  name: 'serviceStorageAcctName-${environmentName}-${buildNumber}'
  params: {
    storageAcctName: serviceStorageAcctName
    storageSkuName: serviceStorageAcctSku
    location: location
  }
}

// key vault module
module keyVaultMod '../keyVault.bicep' = {
  name: 'keyVaultName-${environmentName}-${buildNumber}'
  params: {
    keyVaultName: keyVaultName
    location: location
    cosmosPrimaryConnectionString: listConnectionStrings(resourceId('Microsoft.DocumentDB/databaseAccounts', cosmosAccountName), '2020-04-01').connectionStrings[0].connectionString
    cosmosSecondaryConnectionString: listConnectionStrings(resourceId('Microsoft.DocumentDB/databaseAccounts', cosmosAccountName), '2020-04-01').connectionStrings[1].connectionString
    productionPrincipalId: functionAppMod.outputs.productionPrincipalId
    productionTenantId: functionAppMod.outputs.productionTenantId
    stagingPrincipalId: functionAppMod.outputs.stagingPrincipalId
    stagingTenantId: functionAppMod.outputs.stagingTenantId
    serviceStorageConnectionString: serviceStorageAccountMod.outputs.storageAccountConnectionString
    appStorageConnectionString: appStorageAccountMod.outputs.storageAccountConnectionString
  }
  dependsOn:[
    functionAppMod
    cosmosAccountMod
    cosmosDatabaseMod
  ]
}

// function app settings module
module functionAppSettingMod '../functionAppSettings.bicep' = {
  name: 'functionAppSettings-${environmentName}-${buildNumber}'
  params: {
    appInsightsKey: appInsightsMod.outputs.appInsightsKey
    cosmosConnectionStringUri: keyVaultMod.outputs.cosmosPrimaryConnectionStringUri
    appStorageConnectionStringUri: keyVaultMod.outputs.appStorageConnectionStringUri
    serviceStorageConnectionStringUri: keyVaultMod.outputs.serviceStorageConnectionStringUri
    functionAppName: functionAppMod.outputs.prodSlotFunctionAppName
    functionAppStagingName: functionAppMod.outputs.stagingSlotFunctionAppName
  }
  dependsOn:[
    functionAppMod
    appInsightsMod
    cosmosAccountMod
    keyVaultMod
  ]
}

wsucoug69 avatar Nov 09 '21 14:11 wsucoug69

Also, to clarify: previously I was using the output in cosmosAccount.bicep but changed to the query approach to try and get away from the error. Thanks for the tip on raising the support ticket.

wsucoug69 avatar Nov 09 '21 14:11 wsucoug69

For run-time deployment errors you should raise a support ticket as they are best equipped to diagnose specific errors with an activity id.

However, I am happy to look at an existing Bicep file to see if there are any issues.

I do have a sample on how to output the endpoint and key from a Cosmos account and input into appSettings for an App Service here if that helps.

https://github.com/Azure/azure-quickstart-templates/blob/master/quickstarts/microsoft.documentdb/cosmosdb-webapp/main.bicep

@alex-frankel Can you take a look at that? It seems the dependsOn is being fulfilled by the ack of the started and/or accepted responses rather than by succeeded.

wsucoug69 avatar Nov 09 '21 20:11 wsucoug69

My scenario includes creating a Cosmos Account, this typically takes a few minutes and sometimes up to 10 minutes.

@markjbrown - do you mind taking a look at this one? I'd expect the Cosmos Account not to report complete until it is fully provisioned. @zapadoody -- do you happen to have the code sample of the repro and a correlation ID when the error occurred?

@alex-frankel any thoughts on the Bicep here? Also, I have opened a support case for this; if you need that ref #, let me know and I can send it directly.

wsucoug69 avatar Nov 10 '21 14:11 wsucoug69

The problem is this listConnectionStrings function. I've never seen it before. I tried testing in an ARM template and it doesn't work (not sure why the template didn't fail validation).

If you want to output the endpoint and keys, use the syntax below. To turn them into a connection string, just concat them together with "AccountEndpoint=" and ";AccountKey="

"[reference(resourceId('Microsoft.DocumentDB/databaseAccounts', variables('cosmosAccountName'))).documentEndpoint]"
"[listKeys(resourceId('Microsoft.DocumentDB/databaseAccounts', variables('cosmosAccountName')), '2021-04-15').primaryMasterKey]"

markjbrown avatar Nov 10 '21 18:11 markjbrown
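
For readers working in Bicep rather than ARM JSON, the two expressions in the comment above might translate roughly as follows. This is a sketch, not code from the thread: the symbolic name cosmosAccount and the variable cosmosConnectionString are illustrative, and it assumes the account already exists when this file is deployed.

```bicep
// Sketch: cosmosAccountName is assumed to be supplied by the caller.
param cosmosAccountName string

// Reference the already-deployed account instead of redeclaring it.
resource cosmosAccount 'Microsoft.DocumentDB/databaseAccounts@2021-04-15' existing = {
  name: cosmosAccountName
}

// Endpoint and primary key joined in the "AccountEndpoint=...;AccountKey=..." form.
// Illustrative name; pass this to a @secure() module parameter rather than an output.
var cosmosConnectionString = 'AccountEndpoint=${cosmosAccount.properties.documentEndpoint};AccountKey=${cosmosAccount.listKeys().primaryMasterKey}'
```

Calling listKeys() on the symbolic name lets Bicep derive the resource ID and API version from the resource declaration, so they do not need to be repeated by hand.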

The problem is this listConnectionStrings function. I've never seen it before. I tried testing in an ARM template and it doesn't work (not sure why the template didn't fail validation).

If you want to output the endpoint and keys, use the syntax below. To turn them into a connection string, just concat them together with "AccountEndpoint=" and ";AccountKey="

"[reference(resourceId('Microsoft.DocumentDB/databaseAccounts', variables('cosmosAccountName'))).documentEndpoint]"
"[listKeys(resourceId('Microsoft.DocumentDB/databaseAccounts', variables('cosmosAccountName')), '2021-04-15').primaryMasterKey]"

@markjbrown apologies thank you for the assistance!!!

wsucoug69 avatar Nov 11 '21 15:11 wsucoug69

@zapadoody did this resolve your issue now?

brwilkinson avatar Nov 18 '21 04:11 brwilkinson

I think the most obvious reason why we need this is when you assign a role to an identity with: Microsoft.Authorization/roleAssignments and then do something with the role and identity in the same template, like with Microsoft.Resources/deploymentScripts for instance, or using something from a keyvault which it just got permissions from. This is not really nice to work with right now as it's almost guaranteed to fail at the first deployment, when the permissions are not set yet.

erwinkramer avatar Jan 25 '22 17:01 erwinkramer

I think the most obvious reason why we need this is when you assign a role to an identity with: Microsoft.Authorization/roleAssignments and then do something with the role and identity in the same template, like with Microsoft.Resources/deploymentScripts for instance, or using something from a keyvault which it just got permissions from. This is not really nice to work with right now as it's almost guaranteed to fail at the first deployment, when the permissions are not set yet.

In the role assignment template, try setting principalType to ServicePrincipal. It works like a charm in my environment.

azMantas avatar Jan 25 '22 20:01 azMantas

I think the most obvious reason why we need this is when you assign a role to an identity with: Microsoft.Authorization/roleAssignments and then do something with the role and identity in the same template, like with Microsoft.Resources/deploymentScripts for instance, or using something from a keyvault which it just got permissions from. This is not really nice to work with right now as it's almost guaranteed to fail at the first deployment, when the permissions are not set yet.

In the role assignment template, try setting principalType to ServicePrincipal. It works like a charm in my environment.

Does that guarantee anything? Setting roles, even manually, does not guarantee instant assignment of a role, as Microsoft itself documents; see https://docs.microsoft.com/en-us/azure/role-based-access-control/troubleshooting#role-assignment-changes-are-not-being-detected. In the worst case it takes 30 minutes, and I've seen it take over 5 minutes myself. I'm not saying that you're wrong in your scenario, just that not all scenarios will be instant with RBAC assignments.

erwinkramer avatar Jan 25 '22 20:01 erwinkramer

@erwinkramer is correct, there are 2 problems with replication in this RBAC scenario

  1. the MSI replicating through AAD/Azure so that a role can be assigned
  2. the roleAssignment replicating through Azure so it takes effect

The principalType property solves the first but not the second.
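
As a rough illustration of the principalType suggestion, a role assignment at resource group scope could look like the sketch below. The two parameter names are hypothetical and the API version is one I am assuming is acceptable for your environment; as noted above, this only addresses the first replication problem, not the second.

```bicep
// Sketch: both parameters are hypothetical names, not taken from the thread.
param principalId string       // object ID of the managed identity
param roleDefinitionId string  // full resource ID of the role definition

resource roleAssignment 'Microsoft.Authorization/roleAssignments@2022-04-01' = {
  // Deterministic GUID so that redeployments target the same assignment.
  name: guid(resourceGroup().id, principalId, roleDefinitionId)
  properties: {
    principalId: principalId
    roleDefinitionId: roleDefinitionId
    // The setting suggested earlier in the thread for newly created identities.
    principalType: 'ServicePrincipal'
  }
}
```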

In worst cases it takes 30 minutes, and I've seen it take over 5 minutes myself. I'm not saying that you're wrong in your scenario, just saying that not all scenarios will be instant with RBAC assignments.

This is the challenge with wait/retry in general... When do you know that you should wait, and how long do you wait for? We've talked about something like "wait until I can GET this resource" but that still has replication and fanout issues...

We understand the pain, and there are some workarounds (e.g. serial deployment of resources) - the current guidance from leadership is to solve the root cause.

bmoore-msft avatar Jan 25 '22 20:01 bmoore-msft

For policy as well ... When you create an initiative definition then an initiative assignment > Error > Wait a bit between both > success

RK6183 avatar Mar 23 '22 13:03 RK6183

For policy as well ... When you create an initiative definition then an initiative assignment > Error > Wait a bit between both > success

The Azure CLI 'wait' command may be used to wait until the resource is provisioned with the 'Succeeded' state:

az deployment mg create --name deploymentName
az deployment mg wait --name deploymentName --created --management-group-id mgmtName

azMantas avatar Mar 28 '22 18:03 azMantas

To add a comment here, I'm not sure why we are trying to find workarounds for a situation the resource provider should address. If the resource provider doesn't support concurrent operations, then serializing should be fine. However, if resource A returns the operation as complete but is still doing something (e.g. replication), then why is the Resource Provider signaling to ARM that the operation is completed and ready for any other operation?

milope avatar May 12 '22 00:05 milope