azure-service-operator

Bug: TrafficManagerProfile and DnsZonesCNAMERecord resources are not resynced after a "Conflict" error

Open masteredd opened this issue 7 months ago • 4 comments

Describe the bug

When I create a TrafficManagerProfile and a DnsZonesCNAMERecord that points to it at the same time, both resources are created successfully.

But after some time both resources get stuck with Conflict errors:

TrafficManagerProfile:

Reason: Conflict, Severity: Error, RetryClassification: RetrySlow, Cause:
Conflicting changes were detected when processing the request. This can happen when there are multiple requests trying to update one profile at the same time. Please retry your request.: PUT https://management.azure.com/subscriptions/xxx/resourceGroups/test-stg/providers/Microsoft.Network/trafficmanagerprofiles/xxx
--------------------------------------------------------------------------------
RESPONSE 409: 409 Conflict
ERROR CODE: Conflict
--------------------------------------------------------------------------------
{
  "error": {
    "code": "Conflict",
    "message": "Conflicting changes were detected when processing the request. This can happen when there are multiple requests trying to update one profile at the same time. Please retry your request."
  }
}
--------------------------------------------------------------------------------

DnsZonesCNAMERecord

Reason: Conflict, Severity: Error, RetryClassification: RetrySlow, Cause:
Another operation is pending for requested object. Operation group '/operations/groups/id/|subscriptions|xxx|resourceGroups|xxx|dnsZones|test.com|CNAME|test' already has 1 operations like '/operations/type/UpsertAliasRecordSet/id/xxx' queued.: PUT https://management.azure.com/subscriptions/xxx/resourceGroups/test-stg/providers/Microsoft.Network/dnszones/test.com/CNAME/test
--------------------------------------------------------------------------------
RESPONSE 409: 409 Conflict
ERROR CODE: Conflict
--------------------------------------------------------------------------------
{
  "code": "Conflict",
  "message": "Another operation is pending for requested object. Operation group '\/operations\/groups\/id\/|subscriptions|xxx|resourceGroups|test-stg|dnsZones|test.com|CNAME|test-azure' already has 1 operations like '\/operations\/type\/UpsertAliasRecordSet\/id\/xxx' queued."
}
--------------------------------------------------------------------------------

If I restart the operator, both resources return to a valid state. Recreating them also fixes the issue for a while, until the conflict occurs again.

Azure Service Operator Version: v2.12.0

Expected behavior

TrafficManagerProfile and DnsZonesCNAMERecord resources must be re-synced after Conflict error

To Reproduce

operator's values for sync operations:

azureSyncPeriod: 10m
maxConcurrentReconciles: 10

TrafficManagerProfile

apiVersion: network.azure.com/v1api20220401storage
kind: TrafficManagerProfile
metadata:
  name: test
  namespace: test
spec:
  azureName: test
  dnsConfig:
    relativeName: test
    ttl: 30
  location: global
  monitorConfig:
    port: 80
    protocol: TCP
  originalVersion: v1api20220401
  owner:
    armId: >-
      /subscriptions/xxx/resourceGroups/xxx
  trafficRoutingMethod: Performance

DnsZonesCNAMERecord

apiVersion: network.azure.com/v1api20180501storage
kind: DnsZonesCNAMERecord
metadata:
  name: test
  namespace: test
spec:
  TTL: 30
  azureName: test
  originalVersion: v1api20180501
  owner:
    armId: >-
      /subscriptions/xxx/resourceGroups/xxx/providers/Microsoft.Network/dnszones/test.com
  targetResource:
    reference:
      armId: >-
        /subscriptions/xxx/resourceGroups/xxx/providers/Microsoft.Network/trafficManagerProfiles/test

Additional context

Slack thread https://kubernetes.slack.com/archives/C046DEVLAQM/p1743417179497179

masteredd, Apr 02 '25 10:04

I think the plan here is to make all Conflicts retryable.

matthchr, Apr 03 '25 22:04

I've got a fix for error classification in #4671, but it's worth noting that this won't prevent the conflicts from occurring; it will prevent ASO from stalling when they do.
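To illustrate the idea of treating a Conflict as retryable rather than fatal, here is a minimal sketch in Python. It is not the change made in #4671 and does not use ASO's real types: `Classification`, `classify`, and the rule "HTTP 409 or error code `Conflict` means retry" are hypothetical names and assumptions chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Classification:
    retryable: bool  # True: requeue and try again; False: stop until the spec changes
    reason: str      # error code to surface on the resource's status


def classify(status_code: int, error_code: str) -> Classification:
    """Decide whether a failed PUT should be retried (hypothetical rule set)."""
    # Assumption: a 409 / "Conflict" response is transient, because Azure is
    # still processing another operation on the same object.
    if status_code == 409 or error_code == "Conflict":
        return Classification(retryable=True, reason=error_code)
    # Everything else is treated as fatal in this sketch.
    return Classification(retryable=False, reason=error_code)


if __name__ == "__main__":
    print(classify(409, "Conflict"))
    print(classify(400, "BadRequest"))
```

With a rule like this, both 409 responses quoted in the bug report would be requeued instead of leaving the resources stuck, although the conflicts themselves would still occur.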

theunrepentantgeek, Apr 04 '25 03:04

@matthchr, do we need to implement a PreReconciliationChecker for TrafficManagerProfile and DnsZonesCNAMERecord (and potentially other related components) to check ProvisioningState (as we do for ManagedCluster) before attempting a PUT?

(@masteredd - we've seen other resources where any attempt to PUT an update will be simply rejected if there's a concurrent update; ManagedCluster is one. For most Azure resources this doesn't matter, but I'm wondering if it does for these ones.)
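The kind of check being asked about could look roughly like the sketch below, written in Python rather than against ASO's actual PreReconciliationChecker interface. The function name `should_attempt_put`, the set of terminal states, and the assumption that these resources report a usable provisioning state at all are illustrative guesses, not confirmed behaviour.

```python
from typing import Optional

# Assumption: states in which Azure has finished working on the resource,
# whether or not the last operation succeeded.
TERMINAL_STATES = {"succeeded", "failed", "canceled"}


def should_attempt_put(provisioning_state: Optional[str]) -> bool:
    """Return True when no operation appears to be in flight on the resource."""
    # A resource that has not been created yet has no state, so the PUT may proceed.
    if provisioning_state is None:
        return True
    # Compare case-insensitively; any non-terminal state (for example
    # "Updating") means the PUT should be postponed to a later reconcile.
    return provisioning_state.lower() in TERMINAL_STATES


if __name__ == "__main__":
    for state in (None, "Succeeded", "Updating"):
        print(state, should_attempt_put(state))
```

A check like this only avoids sending a PUT that is likely to be rejected; it does not replace retrying when a conflict is reported anyway.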

theunrepentantgeek, Apr 04 '25 03:04

Hitting a conflict isn't itself that bad. IIRC we added the checker for ManagedCluster either due to a bug in AKS RP where hitting the conflict caused the AgentPool to get into a bad state, or throttling (or both).

In this case, it's not clear to me that checking the resource in question is sufficient, as the issue also seems to be related to ongoing operations on the linked resource?

matthchr, Apr 04 '25 17:04

This was closed as part of #4671

matthchr, Jul 14 '25 21:07