
Bug: #4466 introduced a breaking change

Open reespozzi opened this issue 5 months ago • 12 comments

Describe the bug

https://github.com/Azure/azure-service-operator/pull/4466 was added recently, but it causes issues for continuous deployment strategies with Flux and is not identified as a breaking change or clearly described in the release notes - it is simply marked as "allow multiple ASO pods".

It looks like #4466 was needed because of https://github.com/Azure/azure-service-operator/pull/4445, which is also something that should be called out in the release notes, as it's also giving us issues when the new pod spins up and tries to elect a new leader, as well as fundamentally changing how ASO is deployed - "supports HA" is very different from "enforces HA".

Observed a panic" panic="runtime error: invalid memory address or nil pointer dereference" panicGoValue="\"invalid memory address or nil pointer dereference\"" stacktrace=<

It seems you need the deployment strategy set to Recreate for the HA work (which is enabled by default) to function properly, but this is not possible for us using Flux, which is trying to use rollingUpdate as the strategy, so this is blocking our upgrade.

Azure Service Operator Version: 2.10.0 upgrading to 2.13.0

Expected behavior

We should be able to upgrade as usual

To Reproduce

Flux or a similar tool (a Kustomization in our case) gives this error when applying the update:

azureserviceoperator-controller-manager dry-run failed (Invalid): Deployment.apps "azureserviceoperator-controller-manager" is invalid: spec.strategy.rollingUpdate: Forbidden: may not be specified when strategy `type` is 'Recreate'...
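For illustration, the combination the API server rejects in that dry run looks roughly like this (a sketch only, not the exact chart output - the rollingUpdate values shown are the old defaults):

spec:
  strategy:
    type: Recreate        # set by the new ASO manifests
    rollingUpdate:        # still present on the applied/live object
      maxUnavailable: 0
      maxSurge: 1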

Could this be made optional based on something set in values? At the very least, please update the release notes so others are aware and the change is more clearly described.

reespozzi avatar Jun 03 '25 12:06 reespozzi

cc @matthchr / @bingikarthik - is it possible to put HA behind a feature flag so that consuming it (and the associated changes) isn't mandatory? Otherwise this is likely to block us from upgrading to the latest ASO versions.

reespozzi avatar Jun 03 '25 14:06 reespozzi

Can you give us the stack trace for that panic? Independently of whatever decision we make on HA, I'd like to fix that too.

theunrepentantgeek avatar Jun 04 '25 03:06 theunrepentantgeek


Sure @theunrepentantgeek, here is the full error; I can't get much more right now since I reverted this change to keep things working:

I0603 13:31:47.502361       1 manager.go:340] "Acquiring leader lock..." logger="controllers"
I0603 13:31:47.502440       1 leaderelection.go:257] attempting to acquire leader lease azureserviceoperator-system/controllers-leader-election-azinfra-generated...
E0603 13:33:11.669879       1 panic.go:262] "Observed a panic" panic="runtime error: invalid memory address or nil pointer dereference" panicGoValue="\"invalid memory address or nil pointer dereference\"" stacktrace=<
	goroutine 45 [running]:
	k8s.io/apimachinery/pkg/util/runtime.logPanic({0x7b35c78, 0xb58d700}, {0x657b6c0, 0xb4c94b0})
		/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:107 +0xbc
	k8s.io/apimachinery/pkg/util/runtime.handleCrash({0x7b35c78, 0xb58d700}, {0x657b6c0, 0xb4c94b0}, {0xb58d700, 0x0, 0x43b305?})
		/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:82 +0x5e
	k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc0004628c0?})
		/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:59 +0x108
	panic({0x657b6c0?, 0xb4c94b0?})
		/usr/local/go/src/runtime/panic.go:791 +0x132
	github.com/Azure/azure-service-operator/v2/internal/crdmanagement.NewLeaderElector.func2()
		/workspace/internal/crdmanagement/manager.go:101 +0x53
	k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run(0xc00032ef00, {0x7b35dd0, 0xc000a22910})
		/go/pkg/mod/k8s.io/[email protected]/tools/leaderelection/leaderelection.go:216 +0x13a
	created by github.com/Azure/azure-service-operator/v2/internal/crdmanagement.(*Manager).applyCRDs in goroutine 1
		/workspace/internal/crdmanagement/manager.go:341 +0x199
 >
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x58d73d3]

reespozzi avatar Jun 04 '25 11:06 reespozzi

Also having this issue when upgrading from v2.10 to v2.13. I am going to try changing this directly in Argo CD to see if I can get around it.

DPatrickBoyd avatar Jun 06 '25 20:06 DPatrickBoyd


I manually changed the deployment strategy to:

strategy:
  type: Recreate

and then it was able to finish the sync.

DPatrickBoyd avatar Jun 06 '25 20:06 DPatrickBoyd

I've merged a fix for the panic - and @matthchr is going to follow up on the Helm issues.

theunrepentantgeek avatar Jun 09 '25 21:06 theunrepentantgeek

Thanks for that @theunrepentantgeek. @matthchr will that be updated within this thread, or is there somewhere else I need to keep an eye on for updates?

reespozzi avatar Jun 10 '25 08:06 reespozzi

I'll update in this thread (probably later today)

matthchr avatar Jun 10 '25 18:06 matthchr

OK, so there are multiple points brought up here; I'm going to try to tackle them one by one.

Item 1:

https://github.com/Azure/azure-service-operator/pull/4466 was added recently but ... is not clearly described in the release notes - simply marked as "allow multiple ASO pods"

The release notes part of this is a fair critique: the release notes entry for this change didn't do a great job saying what it was, and the PR title is misleading. We should have highlighted this more prominently given that it changes the deployment topology, both in terms of number of pods as well as in terms of the deployment mode.

Item 2:

causes issues for continuous deployment strategies with flux, and is not identified as a breaking change

In terms of "is not identified as a breaking change": as far as the Helm chart and kustomizations we maintain are concerned, it wasn't a breaking change. It didn't break the default Helm chart configuration (we test upgrades from old to new on this), nor did it break the default kustomize configuration, because we don't set spec.strategy.rollingUpdate - that's something you must be setting yourselves, as I don't think Flux does it automatically. See this comment from the Flux maintainers, who said:

Flux doesn't make any controlling changes to the manifests that it applies, it is unlikely that the issue is in Flux but more likely that it is an issue in your chart.

My recommendation: Determine what is setting spec.strategy.rollingUpdate and don't set it for this deployment, because this deployment doesn't use rollingUpdate.
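As a starting point, you can inspect what the live object currently has for its strategy (this assumes the default deployment name and namespace):

kubectl get deployment azureserviceoperator-controller-manager \
  -n azureserviceoperator-system -o jsonpath='{.spec.strategy}'

If rollingUpdate shows up there but isn't in anything you apply, it's worth checking your own overlays and patches for whatever sets it.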

Item 3:

"supports HA" is very different to "enforces HA"

We aren't enforcing HA even in 2.13, because you can always set replicas back to 1 if you want. You just always have leader election enabled (which was already true in most modes) and always have spec.strategy.type = Recreate. See below for more on this.

My recommendation: If you really just want to run a single replica, I can talk with the other ASO engineers and see what they think about allowing the mode to be set back to Rolling IFF replicas is 1. The mode cannot be rolling if replicas is > 1 (see below), as it does not function correctly in all cases.
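For reference, scaling back to a single replica can be done directly with kubectl (assuming the default deployment name and namespace; note that a GitOps tool will revert this unless the replica count is also changed wherever your manifests are rendered, e.g. Helm values or a kustomize patch):

kubectl scale deployment azureserviceoperator-controller-manager \
  -n azureserviceoperator-system --replicas=1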

Item 4: spec.strategy.type changed from Rolling to Recreate.

We didn't go into much detail on why we did this, so let me do that now. For "uptime", as far as the operator is concerned, there are really two things: the webhooks and the control loop. The control loop has always been protected by leader election, and leader election always takes some time, so there's always some downtime for the control loop regardless of what the strategy is. Whether the strategy is Rolling or Recreate doesn't really matter for the control loop. That's OK, though, because downtime for the control loop isn't a big deal - it runs intermittently anyway.

Where it matters is for the webhooks. In the old replicas=1 configuration, the defaults were maxUnavailable=0 and maxSurge=1 with a Rolling strategy: a new pod would get spun up and start serving webhook traffic before the old pod was terminated. This was great because it (mostly) ensured webhook uptime.
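To make the two topologies concrete, here is a rough sketch of the relevant Deployment fields before and after the change (illustrative only, based on the defaults described above, not the exact chart output):

# old (replicas=1) configuration
spec:
  replicas: 1
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1

# new (2.13+) configuration
spec:
  replicas: 2
  strategy:
    type: Recreate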

In the new replicas=2 configuration, there's a problem. If we stayed on rollingUpdate, there's a period of time where we have nOld=1 and nNew=1. When this happens, there's a 50% chance that the nNew pod takes the lease and begins writing the new storage version to etcd - a storage version that the nOld webhooks don't know how to convert from/to. Given that the nNew operator loop relies on the webhooks to convert from the old storage version to the new one in order to run, this causes problems on both sides: clients see roughly 50% of their requests fail, and the operator itself sees roughly 50% of its requests fail, because those requests hit the old webhook pod rather than the new one.

All this boils down to one reality: in order for conversion webhooks to function properly, they have to be upgraded at the same time as, or before, the operator itself. If old conversion webhooks linger after the control loop is already running the new version, it may cause issues for both the operator and clients.
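For context, this is roughly what the conversion wiring on a CRD looks like - every version of a resource is converted through a single webhook service backed by the operator pods, so whichever pod answers has to understand the storage version currently being written (names and paths below are illustrative, not the exact ASO manifests):

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
spec:
  conversion:
    strategy: Webhook
    webhook:
      conversionReviewVersions: ["v1"]
      clientConfig:
        service:
          namespace: azureserviceoperator-system      # illustrative
          name: azureserviceoperator-webhook-service  # illustrative
          path: /convert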

This left us with an unfortunate choice: we could either enable multiple replicas by default, which makes the control loop and webhooks more resilient in steady state to intermittent pod or node failures, but at the cost of possibly seeing weird issues during the upgrade; or we could leave replicas=1 and be more resilient during upgrade but less resilient in steady state. We chose to be more resilient in steady state by default, figuring that was better than resilience during upgrades.

There's a theoretical world where we could have the best of both, by splitting the webhooks out into their own deployment and upgrading them first. This is a bit annoying to do in Helm but can be done; however, it has other consequences with regard to caches shared between the webhooks and the operator. We didn't pursue it at the time because it felt like overkill in pursuit of webhook uptime, but if you have significant concerns about the current behavior w.r.t. webhook uptime, we'd love to hear them.

matthchr avatar Jun 10 '25 22:06 matthchr


Thanks for your in-depth reply. This was effectively a breaking change for our implementation, because the strategy has to be rollingUpdate in our case or the deployment fails, and we received the panic shown above when moving down to 1 replica - so it appeared we had no options to get around it. I take your point about HA not being enforced, but IMO it's not far off, since you must also have leader election and the Recreate strategy, and it was quietly added as a new default when it sounded like an optional add-on. I think it should be possible to set this based on consumer configuration where doing so isn't breaking for the application.

I believe the spec.strategy.rollingUpdate is coming from the default AKS settings (https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#strategy) - we haven't defined it in code anywhere, so I'm not sure why it was causing a conflict for us when we tried to patch the strategy to Recreate. Maybe that's a mix of Flux not liking the change when reconciling the Kustomization; I'll look into that.
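For reference, when a Deployment's strategy is RollingUpdate and no rollingUpdate block is given, the API server fills in documented defaults, so the field can appear on the live object even though nothing in your manifests sets it (values below are the standard Kubernetes defaults, shown for illustration):

spec:
  strategy:
    type: RollingUpdate      # the default when no strategy is specified
    rollingUpdate:
      maxUnavailable: 25%    # defaulted by the API server
      maxSurge: 25%          # defaulted by the API server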

On Item 3, it's not a hard requirement for us AFAIK to only use 1 pod, but we were met with the panic shown above when we tried. When we upgrade again this might be different since https://github.com/Azure/azure-service-operator/pull/4762 is added, so we might be better off next time round. If not, we might have to ask about:

My recommendation: If you really just want to run a single replica, I can talk with the other ASO engineers and see what they think about allowing the mode to be set back to Rolling IFF replicas is 1. The mode cannot be rolling if replicas is > 1 (see below), as it does not function correctly in all cases.

What you've said for Item 4 makes sense, and thanks for the extra detail. We use ASO quite heavily for developer workflows, and I don't think the change in behaviour here is terrible, provided we can find a way to keep consuming the latest version.

reespozzi avatar Jun 11 '25 09:06 reespozzi

On Item 3, it's not a hard requirement for us AFAIK to only use 1 pod, but we were met with the panic shown above when we tried. When we upgrade again this might be different since https://github.com/Azure/azure-service-operator/pull/4762 is added, so we might be better off next time round. If not, we might have to ask about:

I played around with this some more: if you try to kubectl edit deployment -n azureserviceoperator-system and just change strategy.type, without also clearing the strategy.rollingUpdate section that was defaulted there, you'll get an error. Helm doesn't have this issue because it just applies a completely new YAML (not using server-side apply, AFAIK) that doesn't contain strategy.rollingUpdate.

I think the problem here comes from server-side apply (which Flux uses) not "managing" fields that are no longer in the applied resource. See for example this Argo CD issue. Though that should ideally only happen if you're migrating from client-side apply to server-side apply as part of this update, which I'm not sure you're doing... There's also a PR from Flux that might be related.

Things you can try:

  1. Check what the fieldManager is for the strategy.rollingUpdate field and what version of Flux you're using, to see if you can understand why Flux isn't clearing the old field. Assuming the field is being managed by Flux, it should be cleared when Flux does server-side apply, based on my understanding of how server-side apply works.
  2. Manually clear the old field with kubectl (you may need to change strategy.type too) - see the sketch below.
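A rough sketch of both steps (assuming the default deployment name and namespace; the JSON patch is just one way to do it):

# 1. inspect metadata.managedFields to see which manager owns spec.strategy.rollingUpdate
kubectl get deployment azureserviceoperator-controller-manager \
  -n azureserviceoperator-system --show-managed-fields -o yaml

# 2. switch the strategy type and clear the leftover rollingUpdate block in one patch
kubectl patch deployment azureserviceoperator-controller-manager \
  -n azureserviceoperator-system --type=json -p='[
    {"op": "replace", "path": "/spec/strategy/type", "value": "Recreate"},
    {"op": "remove", "path": "/spec/strategy/rollingUpdate"}
  ]'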

If 1 doesn't work I'd be curious to see the managedFields section of your object...

matthchr avatar Jun 11 '25 16:06 matthchr

When we upgrade again this might be different since https://github.com/Azure/azure-service-operator/pull/4762 is added, so we might be better off next time round.

You can test this now using our experimental release.

theunrepentantgeek avatar Jun 11 '25 23:06 theunrepentantgeek