
Refreshing project involving machine learning services managed network settings causes resources to be replaced

Open pierskarsenbarg opened this issue 10 months ago • 11 comments

What happened?

When creating a MachineLearningServices.Workspace, it's possible to create the managed network outbound rules as a separate resource (and apparently customers are encouraged to do so by Microsoft). However, when you do this and run pulumi refresh, the refresh updates the workspace resource with the values set in the network settings rule resource and then deletes the network settings rule resource.

Then when the update happens, Pulumi recreates the settingsrule resource (because it no longer exists in Azure) and clears out the managed settings rule in the workspace and then the same thing happens all over again.

Example

Code can be found here: https://github.com/pulumi/customer-engineering/tree/master/customer-support/5019 (private repo)

Output of pulumi about

CLI
Version      3.113.0
Go Version   go1.22.2
Go Compiler  gc

Plugins
NAME          VERSION
azure-native  2.27.0
dotnet        unknown

Host
OS       darwin
Version  14.4.1
Arch     arm64

This project is written in dotnet: executable='/usr/local/share/dotnet/dotnet' version='8.0.201'

Current Stack: pierskarsenbarg/azure-ml-csharp/dev

TYPE                                                      URN
pulumi:pulumi:Stack                                       urn:pulumi:dev::azure-ml-csharp::pulumi:pulumi:Stack::azure-ml-csharp-dev
pulumi:providers:azure-native                             urn:pulumi:dev::azure-ml-csharp::pulumi:providers:azure-native::default_2_27_0
azure-native:resources:ResourceGroup                      urn:pulumi:dev::azure-ml-csharp::azure-native:resources:ResourceGroup::pl-cs-ml-rg
azure-native:storage:StorageAccount                       urn:pulumi:dev::azure-ml-csharp::azure-native:storage:StorageAccount::mlsa
azure-native:operationalinsights:Workspace                urn:pulumi:dev::azure-ml-csharp::azure-native:operationalinsights:Workspace::appInsights
azure-native:insights:Component                           urn:pulumi:dev::azure-ml-csharp::azure-native:insights:Component::component
azure-native:machinelearningservices/v20231001:Workspace  urn:pulumi:dev::azure-ml-csharp::azure-native:machinelearningservices/v20231001:Workspace::mlworkspace


Found no pending operations associated with dev

Backend
Name           pulumi.com
URL            https://app.pulumi.com/pierskarsenbarg
User           pierskarsenbarg
Organizations  pierskarsenbarg, karsenbarg, team-ce, demo
Token type     personal

Dependencies:
NAME                VERSION
Pulumi              3.61.0
Pulumi.AzureNative  2.27.0

Pulumi locates its logs in /var/folders/x8/cdd9j87s607fwpy0q62mfmmw0000gn/T/ by default

Additional context

If you look at the state exports that I've taken before and after the refresh, you can see that before the refresh the outboundRules input is empty and the network settings rule resource is present: https://github.com/pulumi/customer-engineering/blob/60d8ce2e2835bab670bce807e276934186ebb1fd/customer-support/5019/stack-before-refresh.json#L421-L479

but after the refresh the ML workspace has the network settings in its outboundRules section: https://github.com/pulumi/customer-engineering/blob/60d8ce2e2835bab670bce807e276934186ebb1fd/customer-support/5019/stack-after-refresh.json#L324-L334 and the network settings rule resource is gone.

I've also added https://github.com/pulumi/customer-engineering/blob/master/customer-support/5019/out.txt, which contains the verbose logs. If you search for HTTP/2.0 404 Not Found you can see that it's the rules that can't be found.

Contributing

Vote on this issue by adding a 👍 reaction. To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already).

pierskarsenbarg avatar Apr 18 '24 17:04 pierskarsenbarg

This looks like a similar problem to https://github.com/pulumi/pulumi-azure-native/issues/611 and https://github.com/pulumi/pulumi-azure-native/issues/1112 - where certain sub-property collections are also represented as standalone resources.

One oddity here is that the standalone resource gets removed on refresh. We'd need to identify why this isn't returned correctly by the API as there might be something else going on. Otherwise, the existing mechanism should be sufficient to handle maintaining the existing property transparently if it wasn't originally set.

danielrbradley avatar Apr 19 '24 08:04 danielrbradley

It looks like there's an issue with this resource's GET function. Running the repro myself showed this warning in the diagnostics:

Updating (dev)

View in Browser (Ctrl+O): https://app.pulumi.com/daniel-pulumi-corp/sdf/dev/updates/43

     Type                                                                Name         Status              Info
     pulumi:pulumi:Stack                                                 sdf-dev
 +   ├─ azure-native:resources:ResourceGroup                             pl-cs-ml-rg  created (0.66s)
 +   ├─ azure-native:storage:StorageAccount                              mlsa         created (23s)
 +   ├─ azure-native:operationalinsights:Workspace                       appInsights  created (31s)
 +   ├─ azure-native:keyvault:Vault                                      vault        created (34s)
 +   ├─ azure-native:insights:Component                                  component    created (1s)
 +   ├─ azure-native:machinelearningservices/v20231001:Workspace         mlworkspace  created (33s)
 +   └─ azure-native:machinelearningservices:ManagedNetworkSettingsRule  mlsrule      created (32s)       1 warning

Diagnostics:
  azure-native:machinelearningservices:ManagedNetworkSettingsRule (mlsrule):
    warning: Failed to read resource after Create. Please report this issue.
        Verbose logs contain more information, see https://www.pulumi.com/docs/support/troubleshooting/#verbose-logging.

Here's the repro, modified to include the missing Vault:

using Pulumi;
using Pulumi.AzureNative.Resources;
using Pulumi.AzureNative.Storage;
using OperationalInsights = Pulumi.AzureNative.OperationalInsights;
using Pulumi.AzureNative.OperationalInsights.Inputs;
using Insights = Pulumi.AzureNative.Insights;
using Pulumi.AzureNative.KeyVault;
using MLS = Pulumi.AzureNative.MachineLearningServices;
using Pulumi.AzureNative.MachineLearningServices.Inputs;
using System;

return await Pulumi.Deployment.RunAsync(() =>
{
    var resourceGroup = new ResourceGroup("pl-cs-ml-rg", new ResourceGroupArgs
    {
        Location = "uksouth"
    });

    var storageAccount = new StorageAccount("mlsa", new Pulumi.AzureNative.Storage.StorageAccountArgs
    {
        ResourceGroupName = resourceGroup.Name,
        Sku = new Pulumi.AzureNative.Storage.Inputs.SkuArgs
        {
            Name = Pulumi.AzureNative.Storage.SkuName.Standard_LRS
        },
        Kind = Kind.StorageV2
    });

    var appInsights = new OperationalInsights.Workspace("appInsights", new()
    {
        Features = new WorkspaceFeaturesArgs
        {
            EnableLogAccessUsingOnlyResourcePermissions = true
        },
        PublicNetworkAccessForIngestion = OperationalInsights.PublicNetworkAccessType.Enabled,
        PublicNetworkAccessForQuery = OperationalInsights.PublicNetworkAccessType.Enabled,
        ResourceGroupName = resourceGroup.Name,
        RetentionInDays = 30,
        Sku = new WorkspaceSkuArgs
        {
            Name = OperationalInsights.WorkspaceSkuNameEnum.PerGB2018
        },
        WorkspaceCapping = new WorkspaceCappingArgs
        {
            DailyQuotaGb = -1
        },
        WorkspaceName = "pkmlcsopworkspace"
    });


    var insightsComponent = new Insights.Component("component", new()
    {
        ApplicationType = Insights.ApplicationType.Web,
        FlowType = "Redfield",
        IngestionMode = Insights.IngestionMode.LogAnalytics,
        Kind = "web",
        PublicNetworkAccessForIngestion = Insights.PublicNetworkAccessType.Enabled,
        PublicNetworkAccessForQuery = Insights.PublicNetworkAccessType.Enabled,
        RequestSource = "IbizaMachineLearningExtension",
        ResourceGroupName = resourceGroup.Name,
        ResourceName = "pkmlcsinworkspace",
        RetentionInDays = 90,
        WorkspaceResourceId = appInsights.Id
    });

    var vault = new Vault("vault", new VaultArgs
    {
        ResourceGroupName = resourceGroup.Name,
        Properties = new Pulumi.AzureNative.KeyVault.Inputs.VaultPropertiesArgs
        {
            TenantId = "xxxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxx",
            Sku = new Pulumi.AzureNative.KeyVault.Inputs.SkuArgs
            {
                Family = "A",
                Name = Pulumi.AzureNative.KeyVault.SkuName.Standard,
            },
        }
    });

    var mlWorkspace = new MLS.V20231001.Workspace("mlworkspace", new()
    {
        ApplicationInsights = insightsComponent.Id,
        FriendlyName = "pk-ml-cs-workspace",
        Description = "",
        DiscoveryUrl = "https://uksouth.api.azureml.ms/discovery",
        HbiWorkspace = false,
        Identity = new MLS.V20231001.Inputs.ManagedServiceIdentityArgs
        {
            Type = MLS.V20231001.ManagedServiceIdentityType.SystemAssigned
        },
        KeyVault = vault.Id,
        ManagedNetwork = new MLS.V20231001.Inputs.ManagedNetworkSettingsArgs
        {
            IsolationMode = "AllowInternetOutbound",
            OutboundRules = new InputMap<object>()
        },
        PublicNetworkAccess = MLS.V20231001.PublicNetworkAccess.Enabled,
        ResourceGroupName = resourceGroup.Name,
        Sku = new MLS.V20231001.Inputs.SkuArgs
        {
            Name = "Basic",
            Tier = MLS.V20231001.SkuTier.Basic
        },
        StorageAccount = storageAccount.Id,
        WorkspaceName = "pk-ml-cs-workspace"
    });

    var rule = new MLS.ManagedNetworkSettingsRule("mlsrule", new()
    {
        ResourceGroupName = resourceGroup.Name,
        WorkspaceName = mlWorkspace.Name,
        Properties = new PrivateEndpointOutboundRuleArgs
        {
            Destination = new PrivateEndpointDestinationArgs
            {
                SparkEnabled = false,
                ServiceResourceId = storageAccount.Id,
                SubresourceTarget = "blob"
            },
            Category = MLS.RuleCategory.UserDefined,
            Type = "PrivateEndpoint"
        },
    });
});

Running this with verbose logs shows that it's attempting to perform the GET against https://management.azure.com/pk-ml-cs-workspace-2/outboundRules/mlsrule?api-version=2023-10-01.

According to the specification, this endpoint should be /subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.MachineLearningServices/workspaces/{workspaceName}/outboundRules/{ruleName}.

Running this under a debugger shows that the URL is being constructed from the resource's id field, which for all other resources starts with /subscriptions/0282681f-7a9e-424b-80b2-96babd57a8a1/resourceGroups/pl-cs-ml-rg20c310d8 etc, but this resource is starting with just the id of the ml workspace.

Side note: I don't seem able to delete this rule resource within the portal either - nothing happens when I click the delete button.

danielrbradley avatar Apr 19 '24 13:04 danielrbradley

No, you have to delete the ml workspace to delete the rule in the portal.

pierskarsenbarg avatar Apr 19 '24 13:04 pierskarsenbarg

When performing the create, we calculate the id based on the specification and send the PUT request correctly.

The Azure response contains an id field which, if present, we take to be the authoritative source of the canonical Azure identifier for the resource:

https://github.com/pulumi/pulumi-azure-native/blob/d422a4264e04244bd42a188b798e4c72fec7fa73/provider/pkg/provider/provider.go#L909-L911

However, on this API, Azure is returning an identifier which is missing the subscription and resource group context:

Original ID: "/subscriptions/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/resourceGroups/pl-cs-ml-rgxxxxxx/providers/Microsoft.MachineLearningServices/workspaces/pk-ml-cs-workspace-2/outboundRules/mlsrule"
Azure's ID: "pk-ml-cs-workspace-2/outboundRules/mlsrule"

This incorrect identifier is then used for subsequent requests, which then fail.
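
To make the mismatch concrete, here is a minimal Go sketch of the two identifiers and of one possible defensive check. It is not the provider's actual code: the function names ruleID, isFullyQualified and chooseID are hypothetical, and the "/subscriptions/" prefix test is an assumption that only fits subscription-scoped resources such as this one.

```go
package main

import (
	"fmt"
	"strings"
)

// ruleID builds the fully qualified ID from the path template in the
// specification quoted in this thread. (Hypothetical helper.)
func ruleID(subscriptionID, resourceGroup, workspace, rule string) string {
	return fmt.Sprintf(
		"/subscriptions/%s/resourceGroups/%s/providers/Microsoft.MachineLearningServices/workspaces/%s/outboundRules/%s",
		subscriptionID, resourceGroup, workspace, rule)
}

// isFullyQualified reports whether id looks like a subscription-scoped ID.
// Assumption: a case-insensitive "/subscriptions/" prefix is a good enough test here.
func isFullyQualified(id string) bool {
	return strings.HasPrefix(strings.ToLower(id), "/subscriptions/")
}

// chooseID prefers the ID returned by Azure, but falls back to the locally
// computed one when the returned value is empty or lacks the subscription scope.
func chooseID(computed, returned string) string {
	if isFullyQualified(returned) {
		return returned
	}
	return computed
}

func main() {
	computed := ruleID("xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx", "pl-cs-ml-rgxxxxxx", "pk-ml-cs-workspace-2", "mlsrule")
	returned := "pk-ml-cs-workspace-2/outboundRules/mlsrule" // the shape Azure sent back
	fmt.Println(chooseID(computed, returned))
}
```

With the truncated value from this thread, chooseID returns the computed ID, which is the path that subsequent GET requests need.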

danielrbradley avatar Apr 19 '24 13:04 danielrbradley

It knows it's attached to the workspace because it attaches the rule to the workspace, which is weird. I thought it was like that vnet/subnet issue from before.

pierskarsenbarg avatar Apr 19 '24 13:04 pierskarsenbarg

Manually fixing the id field in the state (exporting, editing then importing) and then performing a refresh shows only the diff of the Workspace resource:

     Type                                                                          Name         Plan       Info
     pulumi:pulumi:Stack                                                           sdf-dev
     ├─ azure-native:resources:ResourceGroup                                       pl-cs-ml-rg
     ├─ azure-native:machinelearningservices/v20231001:ManagedNetworkSettingsRule  mlsrule
     ├─ azure-native:keyvault:Vault                                                vault
 ~   ├─ azure-native:machinelearningservices/v20231001:Workspace                   mlworkspace  update     [diff: ~managedNetwork,systemData]
     ├─ azure-native:storage:StorageAccount                                        mlsa
     ├─ azure-native:operationalinsights:Workspace                                 appInsights
     └─ azure-native:insights:Component                                            component

Resources:
    ~ 1 to update
    7 unchanged

Therefore there are two separate issues at play here:

  1. Azure's API is broken and returning the wrong id - causing subsequent operations to the rule to fail.
  2. We need to introduce the parent-child awareness to the Workspace to ensure it doesn't overwrite when all rules are managed externally (like VNets).

danielrbradley avatar Apr 19 '24 13:04 danielrbradley

As it stands, I would advise against using the standalone rule resource until it's fixed by Azure as it's not currently workable given the broken API. At that point, it would be worth considering the enhancement to make it easier to manage rules just as external resources similar to VNets/Subnets.

danielrbradley avatar Apr 19 '24 13:04 danielrbradley

Yeah, I've already advised them not to, but they were only doing it because MS told them to.

I don't know how keen they are to start pushing MS to fix it, but I'll find out.

pierskarsenbarg avatar Apr 19 '24 13:04 pierskarsenbarg

Ok, marking as blocked as there's no reason to start work on the ability to use standalone rule resources until we can successfully manage rules as resources without errors from the API.

danielrbradley avatar Apr 19 '24 13:04 danielrbradley

@danielrbradley which API endpoint is returning the wrong id? We are engaging with MS on this from our side.

Werner-Swart-83 avatar Apr 25 '24 13:04 Werner-Swart-83

@Werner-Swart-83 the incorrect ID is returned by PUT /subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.MachineLearningServices/workspaces/{workspaceName}/outboundRules/{ruleName}. After awaiting the result of that request, the response includes an id field with the value {workspaceName}/outboundRules/{ruleName}.

e.g. PUT /subscriptions/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/resourceGroups/pl-cs-ml-rgxxxxxx/providers/Microsoft.MachineLearningServices/workspaces/pk-ml-cs-workspace-2/outboundRules/mlsrule
responds with { "id": "pk-ml-cs-workspace-2/outboundRules/mlsrule", ... }

danielrbradley avatar Apr 25 '24 16:04 danielrbradley

@danielrbradley MS has confirmed that the fix for that endpoint has been applied to WestEurope.

Werner-Swart-83 avatar May 13 '24 14:05 Werner-Swart-83

That's great! Thank you for the update @Werner-Swart-83

danielrbradley avatar May 13 '24 15:05 danielrbradley

@danielrbradley Is this issue unblocked then?

mikhailshilkov avatar Jun 06 '24 12:06 mikhailshilkov

@danielrbradley Is this issue unblocked then?

This might be entirely fixed now by Microsoft's change. We should re-test and check if it still has an issue around being managed as part of a parent resource or if that was an incorrect assessment in the early comments.

If we can confirm it's fixed and the repro in https://github.com/pulumi/pulumi-azure-native/issues/3225#issuecomment-2066546278 now passes, then we can close out this ticket.

danielrbradley avatar Jun 06 '24 12:06 danielrbradley

Great! Marked this as needs-triage to re-test it.

mikhailshilkov avatar Jun 06 '24 14:06 mikhailshilkov

On re-testing now, I see the following refresh:


     Type                                                                          Name         Plan       Info
     pulumi:pulumi:Stack                                                           sdf-dev
 ~   ├─ azure-native:machinelearningservices/v20231001:Workspace                   mlworkspace  update     [diff: ~managedNetwork,systemData]
     ├─ azure-native:resources:ResourceGroup                                       pl-cs-ml-rg
     ├─ azure-native:machinelearningservices/v20231001:ManagedNetworkSettingsRule  mlsrule
     ├─ azure-native:operationalinsights:Workspace                                 appInsights
     ├─ azure-native:insights:Component                                            component
     ├─ azure-native:storage:StorageAccount                                        mlsa
     └─ azure-native:keyvault:Vault                                                vault

Resources:
    ~ 1 to update
    7 unchanged

The deletion of the ManagedNetworkSettingsRule no longer appears, so I'll close this as complete.

danielrbradley avatar Jun 07 '24 12:06 danielrbradley