azure-service-operator
Bug: TrafficManagerProfile and DnsZonesCNAMERecord resources aren't resynced after "Conflict"
Describe the bug
When I create a TrafficManagerProfile and a DnsZonesCNAMERecord pointing at it at the same time, both resources are created.
But after some time both resources get stuck with Conflict errors:
TrafficManagerProfile:
Reason: Conflict, Severity: Error, RetryClassification: RetrySlow, Cause:
Conflicting changes were detected when processing the request. This can happen when there are multiple requests trying to update one profile at the same time. Please retry your request.: PUT https://management.azure.com/subscriptions/xxx/resourceGroups/test-stg/providers/Microsoft.Network/trafficmanagerprofiles/xxx
--------------------------------------------------------------------------------
RESPONSE 409: 409 Conflict
ERROR CODE: Conflict
--------------------------------------------------------------------------------
{
  "error": {
    "code": "Conflict",
    "message": "Conflicting changes were detected when processing the request. This can happen when there are multiple requests trying to update one profile at the same time. Please retry your request."
  }
}
--------------------------------------------------------------------------------
DnsZonesCNAMERecord
Reason: Conflict, Severity: Error, RetryClassification: RetrySlow, Cause:
Another operation is pending for requested object. Operation group '/operations/groups/id/|subscriptions|xxx|resourceGroups|xxx|dnsZones|test.com|CNAME|test' already has 1 operations like '/operations/type/UpsertAliasRecordSet/id/xxx' queued.: PUT https://management.azure.com/subscriptions/xxx/resourceGroups/test-stg/providers/Microsoft.Network/dnszones/test.com/CNAME/test
--------------------------------------------------------------------------------
RESPONSE 409: 409 Conflict
ERROR CODE: Conflict
--------------------------------------------------------------------------------
{
  "code": "Conflict",
  "message": "Another operation is pending for requested object. Operation group '\/operations\/groups\/id\/|subscriptions|xxx|resourceGroups|test-stg|dnsZones|test.com|CNAME|test-azure' already has 1 operations like '\/operations\/type\/UpsertAliasRecordSet\/id\/xxx' queued."
}
--------------------------------------------------------------------------------
If I restart the operator, both resources return to a valid state. Recreating them also fixes the issue for a while, until the conflict occurs again.
Azure Service Operator Version: v2.12.0
Expected behavior
TrafficManagerProfile and DnsZonesCNAMERecord resources must be re-synced after a Conflict error.
To Reproduce
operator's values for sync operations:
azureSyncPeriod: 10m
maxConcurrentReconciles: 10
TrafficManagerProfile
apiVersion: network.azure.com/v1api20220401storage
kind: TrafficManagerProfile
metadata:
  name: test
  namespace: test
spec:
  azureName: test
  dnsConfig:
    relativeName: test
    ttl: 30
  location: global
  monitorConfig:
    port: 80
    protocol: TCP
  originalVersion: v1api20220401
  owner:
    armId: >-
      /subscriptions/xxx/resourceGroups/xxx
  trafficRoutingMethod: Performance
DnsZonesCNAMERecord
apiVersion: network.azure.com/v1api20180501storage
kind: DnsZonesCNAMERecord
metadata:
  name: test
  namespace: test
spec:
  TTL: 30
  azureName: test
  originalVersion: v1api20180501
  owner:
    armId: >-
      /subscriptions/xxx/resourceGroups/xxx/providers/Microsoft.Network/dnszones/test.com
  targetResource:
    reference:
      armId: >-
        /subscriptions/xxx/resourceGroups/xxx/providers/Microsoft.Network/trafficManagerProfiles/test
Additional context
Slack thread https://kubernetes.slack.com/archives/C046DEVLAQM/p1743417179497179
I think the plan here is to make all Conflicts retryable.
I've got a fix for error classification in #4671, but it's worth noting that this won't prevent the conflicts from occurring; it will only prevent ASO from stalling when they do.
@matthchr, do we need to implement a PreReconciliationChecker for TrafficManagerProfile and DnsZonesCNAMERecord (and potentially other related components) to check ProvisioningState (as we do for ManagedCluster) before attempting a PUT?
(@masteredd - we've seen other resources where any attempt to PUT an update will be simply rejected if there's a concurrent update; ManagedCluster is one. For most Azure resources this doesn't matter, but I'm wondering if it does for these ones.)
Hitting a conflict isn't itself that bad. IIRC we added the checker for ManagedCluster either due to a bug in AKS RP where hitting the conflict caused the AgentPool to get into a bad state, or throttling (or both).
In this case, it's not clear to me that checking the resource in question is sufficient, as the issue also seems to be related to ongoing operations on the linked resource.
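For discussion, a pre-PUT check that also looks at linked resources might be sketched as follows. This is purely illustrative and does not use ASO's actual PreReconciliationChecker interface: `resourceState`, `inProgress`, and `blockedBy` are hypothetical names, and the set of terminal provisioning states is an assumption. Whether these resource types expose a state that reflects the queued operation is not established in this thread.

```go
package main

import "strings"

// resourceState is a hypothetical snapshot of a resource's last known status.
type resourceState struct {
	ARMID             string
	ProvisioningState string
}

// inProgress reports whether a provisioning state looks non-terminal.
// The terminal states listed here are an assumption; an empty (unknown)
// state is treated as not blocking.
func inProgress(state string) bool {
	switch strings.ToLower(state) {
	case "", "succeeded", "failed", "canceled":
		return false
	}
	return true
}

// blockedBy returns the ARM ID of the first resource (the target itself or
// one it links to) that appears to have an operation in flight, or "" if
// the PUT can proceed.
func blockedBy(target resourceState, linked ...resourceState) string {
	all := append([]resourceState{target}, linked...)
	for _, r := range all {
		if inProgress(r.ProvisioningState) {
			return r.ARMID
		}
	}
	return ""
}
```

For the case in this issue, the DnsZonesCNAMERecord would be the target and the TrafficManagerProfile it references would be passed as a linked resource, so an in-flight update on either one would postpone the PUT.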
This was closed as part of #4671