terraform-provider-azapi

AML workspace outbound rules remove newly created rules

Open kimyen opened this issue 10 months ago • 5 comments

Brief description of the problem

  • When using Microsoft.MachineLearningServices/workspaces/outboundRules@2023-10-01 as documented here to create multiple outbound rules, the behavior is non-deterministic and destructive.
    • Destructive: when multiple rules are created in the same plan, one can observe in the Azure portal, while the rules are being created (~20 minutes), that a previously created rule is deleted as a new one is created.
    • Non-deterministic: it is not predictable which rules will be deleted. For example, a plan that creates 6 FQDN rules and 2 Private Endpoint rules results in all 8 being created, after which 5 FQDN rules and 1 Private Endpoint rule are deleted, leaving 1 FQDN rule and 1 Private Endpoint rule. When 2 more FQDN rules and 1 more Private Endpoint rule are then added, 1 of the 2 new FQDN rules is deleted after creation; the existing FQDN rule remains, while the existing Private Endpoint rule is deleted. It is not clear what determines whether a rule is deleted.
  • terraform apply reports that X rules were created successfully, and terraform state list and terraform state show correctly show the rules created. However, the Azure portal shows only the rules that have not been deleted. A subsequent terraform plan with no code changes shows that the deleted rules need to be re-created.

How to reproduce

Step 1: Add the following to your workspace .tf file:

resource "azapi_resource" "conda_anaconda_outbound_rules" {
  type = "Microsoft.MachineLearningServices/workspaces/outboundRules@2023-10-01"
  name = "conda-anaconda-org"
  parent_id = azurerm_machine_learning_workspace.aml_workspace.id
  body = jsonencode({
    properties = {
      category = "UserDefined"
      status = "Active"
      type = "FQDN"
      destination = "conda.anaconda.org"
    }
  })
}

resource "azapi_resource" "repo_anaconda_outbound_rules" {
  type = "Microsoft.MachineLearningServices/workspaces/outboundRules@2023-10-01"
  name = "repo-anaconda-org"
  parent_id = azurerm_machine_learning_workspace.aml_workspace.id
  body = jsonencode({
    properties = {
      category = "UserDefined"
      status = "Active"
      type = "FQDN"
      destination = "repo.anaconda.org"
    }
  })
}
  • [Optional] Run terraform plan; it should show 2 FQDN rules to be created

Step 2: Run terraform apply; after ~20 minutes, it should succeed

  • [Optional] Run terraform state list; both rules should be present in the state
  • [Optional] Run terraform state show on each rule; each should have full details

Step 3: Run terraform plan; it shows that 1 of the 2 rules needs to be created

  • [Optional] In the Azure portal, only 1 of the 2 FQDN rules shows up

Other setup

  • AML workspace networking has public access disabled
  • AML workspace outbound config is set to Allow only approved outbound

Desired resolution

  • After terraform apply runs to completion, all outbound rules are created and visible in the Azure portal. A subsequent terraform plan with no code changes results in no changes.

kimyen, Apr 19 '24 04:04

Hi @kimyen ,

Thank you for taking the time to open this issue, and apologies for the late response.

Thanks for the details; I was able to reproduce this issue. It seems that this API only works if the outbound rules are created one by one.

The azapi_resource supports a locks field, which lets the user specify a list of ARM resource IDs that are used to prevent azapi resources from being created, modified, or deleted at the same time.

But I also noticed that there is an API bug (https://github.com/Azure/azure-rest-api-specs/issues/28982) which makes azapi v1.13.x crash. I have two workarounds for this case; I hope they help.

Workaround 1. (Recommended)

  1. Use azapi v1.12.1 to deploy the following config; you can upgrade to the latest version once the bug fix is released.
resource "azurerm_machine_learning_workspace" "example" {
  name                    = "acctesthenglu562"
  location                = azurerm_resource_group.example.location
  resource_group_name     = azurerm_resource_group.example.name
  application_insights_id = azurerm_application_insights.example.id
  key_vault_id            = azurerm_key_vault.example.id
  storage_account_id      = azurerm_storage_account.example.id

  identity {
    type = "SystemAssigned"
  }
  public_network_access_enabled = true
  managed_network {
    isolation_mode  = "AllowOnlyApprovedOutbound"
  }
}

resource "azapi_resource" "example" {
  count = 3
  type = "Microsoft.MachineLearningServices/workspaces/outboundRules@2023-10-01"
  name = "test2${count.index}"
  parent_id = azurerm_machine_learning_workspace.example.id
  body = jsonencode({
    properties = {
      category = "UserDefined"
      status = "Active"
      type = "FQDN"
      destination = "conda.anaconda${count.index}.org"
    }
  })
  locks = [azurerm_machine_learning_workspace.example.id]
}
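Workaround 1 relies on running azapi v1.12.1 specifically. A minimal sketch of pinning that version follows, assuming the provider is installed from the public Terraform Registry under the Azure/azapi source address; adjust the constraint once a fixed release is available.

```hcl
terraform {
  required_providers {
    azapi = {
      # Assumption: the provider comes from the public registry.
      source = "Azure/azapi"
      # Exact pin to the version named in Workaround 1; v1.13.x is
      # reported above to crash on this API.
      version = "1.12.1"
    }
  }
}
```

If a dependency lock file already records a different azapi version, terraform init -upgrade is typically needed for the new constraint to take effect.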

Workaround 2. If you prefer the dynamic properties that v1.13.x provides, you could use azapi_resource_action to bypass the bug. Note, however, that the action resource doesn't track the resource's state.

data "azapi_resource_id" "outboundRules" {
  count     = 3
  type      = "Microsoft.MachineLearningServices/workspaces/outboundRules@2023-10-01"
  name      = "test2${count.index}"
  parent_id = azurerm_machine_learning_workspace.example.id
}

resource "azapi_resource_action" "outboundRules" {
  count       = 3
  type        = "Microsoft.MachineLearningServices/workspaces/outboundRules@2023-10-01"
  resource_id = data.azapi_resource_id.outboundRules[count.index].id
  method      = "PUT"
  locks       = [azurerm_machine_learning_workspace.example.id]
  body = {
    properties = {
      category    = "UserDefined"
      status      = "Active"
      type        = "FQDN"
      destination = "repo.anaconda.org${count.index}"
    }
  }
}

ms-henglu, May 06 '24 05:05

@ms-henglu Do you have any additional updates on this?

I did try the first workaround you listed (the second won't work for us as we need Terraform to monitor the state). It worked, partially. I was able to create the FQDNs and PEs outbound, but I had to force them to be created only one at a time.

In addition, if you later add/modify the azapi rules, terraform still sometimes deletes previously created outbound rules. This makes it extremely difficult to add new rules.

Chaseshak, Jun 18 '24 15:06

@ms-henglu Do you have any additional updates on this?

I did try the first workaround you listed (the second won't work for us as we need Terraform to monitor the state). It worked, partially. I was able to create the FQDNs and PEs outbound, but I had to force them to be created only one at a time.

In addition, if you later add/modify the azapi rules, terraform still sometimes deletes previously created outbound rules. This makes it extremely difficult to add new rules.

@Chaseshak how did you force them to be created only one at a time?

krupakar1329, Jun 19 '24 06:06

Hi @krupakar1329 - You could use the locks field to force the rules to be created only one at a time. Please see the comment above.
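As an illustration of that suggestion, here is a sketch of the two rules from the original report with the locks field added. It assumes azapi v1.12.x (hence the jsonencode body) and reuses the resource names from the report; it is not a confirmed fix for the later deletions described in this thread.

```hcl
resource "azapi_resource" "conda_anaconda_outbound_rules" {
  type      = "Microsoft.MachineLearningServices/workspaces/outboundRules@2023-10-01"
  name      = "conda-anaconda-org"
  parent_id = azurerm_machine_learning_workspace.aml_workspace.id
  body = jsonencode({
    properties = {
      category    = "UserDefined"
      status      = "Active"
      type        = "FQDN"
      destination = "conda.anaconda.org"
    }
  })
  # Both rules lock on the same workspace ID, so azapi should not
  # create, modify, or delete them at the same time.
  locks = [azurerm_machine_learning_workspace.aml_workspace.id]
}

resource "azapi_resource" "repo_anaconda_outbound_rules" {
  type      = "Microsoft.MachineLearningServices/workspaces/outboundRules@2023-10-01"
  name      = "repo-anaconda-org"
  parent_id = azurerm_machine_learning_workspace.aml_workspace.id
  body = jsonencode({
    properties = {
      category    = "UserDefined"
      status      = "Active"
      type        = "FQDN"
      destination = "repo.anaconda.org"
    }
  })
  # Same lock ID as the rule above.
  locks = [azurerm_machine_learning_workspace.aml_workspace.id]
}
```

Other generic ways to serialize operations in Terraform are a depends_on chain between the rules or running terraform apply -parallelism=1; neither was tested against this API in this thread.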

ms-henglu, Jun 20 '24 08:06

Hi @Chaseshak,

In addition, if you later add/modify the azapi rules, terraform still sometimes deletes previously created outbound rules. This makes it extremely difficult to add new rules.

Could you please provide some details so that I can reproduce it? Thanks!

ms-henglu, Jun 20 '24 08:06