azure-service-operator Reconcile should perform a diff with Azure rather than relying on a spec hash

Relying on a spec has means that we can't notice changes to the underlying Azure resource and fix them, which is one of the big selling points of the operator pattern in the first place.

May 18 '21 21:05 matthchr

Today we calculate a specHash on reconcile and save it as an annotation. The act of putting something in annotations triggers a subsequent reconcile (although the subsequent reconcile doesn't do much because we realize that the spec hasn't changed).

We want to make sure to minimize the work we're doing in the operator. If we start blindly issuing a GET to Azure for every reconcile loop we're probably going to end up getting throttled.

Here are some considerations:

There are going to be properties on the CRD that we don't mirror to Azure. For example things like labels and/or annotations may not be mirrored to Azure. Do we want to diff with Azure every time these are updated? There may be things that somewhat frequently update labels or annotations that may end up driving a lot of reconcile load (and subsequent GET load on Azure).
An operator running an older version of the operator code in a cluster with newer versions of the CRD. As long as the operator is operating on older versions of the CRDs there's no issue, but if it ends up running on a newer version of the CRD it may see updates to properties it doesn't understand (which will then not actually be present in the API version that operator is reconciling on), causing us to attempt to reconcile a resource that we can't tell has changed. @babbageclunk as he's investigating this scenario.

On the other hand, avoiding reconciles we should've taken leaves us much in the same situation as with specHash, where something has changed in Azure and the customer may be trying to force a reconciliation but be unable to (because we're being clever and ignoring annotation updates or something)... so the problem cuts both ways.

Jun 04 '21 19:06 matthchr

Hi, I'm quite interested in this issue, because I'm evaluating Crossplane in conjunction with Azure Service Operator and Crossplane's provider-azure for productive usage.

Currently, I am investigating the reconciliation behaviour of both operators. In short: if I deploy a ResourceGroup via ASO v2.0.0-alpha.3 and then delete it via the portal it won't get recreated. However, if I deploy it via provider-azure it will be recreated quickly.

To show this behaviour, I created the following ResourceGroup manually:

apiVersion: microsoft.resources.azure.com/v1alpha1api20200601
kind: ResourceGroup
metadata:
  name: aso-sample-rg
  namespace: test
spec:
  location: germanywestcentral

To get a better understanding I watched the logs of the azureserviceoperator-controller-manager - and didn't see any when I deleted the resource group via the portal.

kubectl logs -n azureserviceoperator-system azureserviceoperator-controller-manager-86b7779c55-gt6rt -c manager -f

To trigger a reconciliation and hopefully see some log output I updated the ResourceGroup with a label:

apiVersion: microsoft.resources.azure.com/v1alpha1api20200601
kind: ResourceGroup
metadata:
  name: aso-sample-rg
  namespace: test
  labels:
    test: asdf
spec:
  location: germanywestcentral

I got the following log output:

I1115 11:00:17.970212       1 http.go:97] controller-runtime/webhook/webhooks "msg"="received request" "webhook"="/mutate-microsoft-resources-azure-com-v1alpha1api20200601-resourcegroup" "UID"="1ccd80b8-b561-4fae-8cbc-477e74280db1" "kind"={"group":"microsoft.resources.azure.com","version":"v1alpha1api20200601","kind":"ResourceGroup"} "resource"={"group":"microsoft.resources.azure.com","version":"v1alpha1api20200601","resource":"resourcegroups"}
I1115 11:00:17.971869       1 http.go:136] controller-runtime/webhook/webhooks "msg"="wrote response" "code"=200 "reason"="" "webhook"="/mutate-microsoft-resources-azure-com-v1alpha1api20200601-resourcegroup" "UID"="1ccd80b8-b561-4fae-8cbc-477e74280db1" "allowed"=true
I1115 11:00:17.976834       1 generic_controller.go:232] controllers/ResourceGroupController "msg"="Reconcile invoked" "azureName"="aso-sample-rg" "name"="aso-sample-rg" "namespace"="test" "generation"=1 "kind"="*v1alpha1api20200601.ResourceGroup" "resourceVersion"="2832050"
I1115 11:00:17.978195       1 azure_generic_arm_reconciler.go:335] controllers/ResourceGroupController "msg"="DetermineCreateOrUpdateAction" "azureName"="aso-sample-rg" "name"="aso-sample-rg" "namespace"="test" "condition"="Condition [Ready], Status = \"True\", ObservedGeneration = 1, Severity = \"\", Reason = \"Succeeded\", Message = \"\", LastTransitionTime = \"2021-11-15 10:43:16 +0000 UTC\"" "hasChanged"=false "pollerID"="" "resumeToken"=""
I1115 11:00:17.978321       1 azure_generic_arm_reconciler.go:344] controllers/ResourceGroupController "msg"="Nothing to do. Spec has not changed and resource has terminal Ready condition: \"Condition [Ready], Status = \\\"True\\\", ObservedGeneration = 1, Severity = \\\"\\\", Reason = \\\"Succeeded\\\", Message = \\\"\\\", LastTransitionTime = \\\"2021-11-15 10:43:16 +0000 UTC\\\"\"." "azureName"="aso-sample-rg" "name"="aso-sample-rg" "namespace"="test" 
I1115 11:00:17.979516       1 azure_generic_arm_reconciler.go:150] controllers/ResourceGroupController "msg"="Reconciling resource" "azureName"="aso-sample-rg" "name"="aso-sample-rg" "namespace"="test" "action"="NoAction"

The important part are the two last lines where the ResourceGroupController says:

Nothing to do. Spec has not changed and resource has terminal Ready condition
"action"="NoAction"

So, it's not only that the reconciliation does not happen automatically. Even if I force the ResourceGroup to be reconciled it won't check what's actually deployed in Azure, and report in its status that the resource is Ready.

This contradicts the Kubernetes Resource Model which states:

The reported observed state is truth. Controllers are expected to reconcile observed and desired state and repair discrepancies, and Kubernetes avoids maintaining opaque, internal persistent state. Resource status must be reconstructable by observation.

@matthchr I saw that you closed #1263 in favour of the current issue. Still, I wanted to share my findings and thoughts.

At the time of writing, the current issue is in the marked as TODO for the CRD Code Generation project. Are there any plans to tackle this problem soon?

Nov 15 '21 16:11 jomora

Even if I force the ResourceGroup to be reconciled it won't check what's actually deployed in Azure, and report in its status that the resource is Ready.

This is correct. Today the only way to get either ASOv1 or ASOv2 to do what you want is to clear the resource-sig.azure.com annotation and then force a re-reconcile. As you correctly point out, this violates the standard expectation of operators and we need to fix it.

At the time of writing, the current issue is in the marked as TODO for the CRD Code Generation project. Are there any plans to tackle this problem soon?

It depends on what you mean by soon. We currently have this item slated for the beta1 milestone which is post Jan 2022 (no hard date set yet). Actually doing what this issue suggests (an intelligent diff) is difficult to do generically and will take some doing. With that said, it's possible that we can come up with an intermediate solution which gets us closer to where we should be while also being easier - something like just always re-PUT-ing the resource during a subsequent reconcile even the resource hasn't changed. This is inefficient from a resource request throttling perspective, but then we could implement the diffing this issue is talking about and resolve that...

I will bring this up in our planning meeting this week and see what we think about moving at least part of this forward a milestone or two.

Nov 15 '21 18:11 matthchr

I will bring this up in our planning meeting this week and see what we think about moving at least part of this forward a milestone or two

@matthchr do you have some new regarding this topic?

Dec 17 '21 14:12 jomora

@jomora - Was on vacation, sorry for the late update.

I've opened a PR to stop using specHash here: #2022. While this isn't 100% ideal I think it's probably better than we currently have. Assuming that the other maintainers agree with this change, it should be merged for the next release. Once merged, the operator will re-sync with Azure every ~15m by default (configurable).

Jan 04 '22 18:01 matthchr

I agree that PR #2022 is a good start on improving our behaviour. I've made a few comments and have approved the PR.

Jan 10 '22 00:01 theunrepentantgeek

We still want to do this.

Jul 04 '22 22:07 theunrepentantgeek

Some examples of where this is done in other operators:

Nov 03 '22 18:11 matthchr

Closing out as we no longer do a spec hash; see #2811 for follow up.

Mar 27 '23 23:03 theunrepentantgeek

azure-service-operator azure-service-operator copied to clipboard

Reconcile should perform a diff with Azure rather than relying on a spec hash

azure-service-operator
azure-service-operator copied to clipboard