
AzureMachinePool UX: stays in Updating state until CNI is installed

Open CecileRobertMichon opened this issue 3 years ago • 31 comments

/kind bug


What steps did you take and what happened:

Create a cluster with "machinepool" flavor following quickstart instructions:

export WORKER_MACHINE_COUNT=1
clusterctl generate cluster test-mp --flavor machinepool | kubectl apply -f -

Notice that the VMSS becomes ready and the AzureMachinePoolMachines are in a Succeeded state, but the AzureMachinePool status stays stuck in Updating:

➜  cluster-api-provider-azure git:(main) kubectl get azuremachinepool                              
NAME           REPLICAS   READY   STATE
test-mp-mp-0                      Updating
➜  cluster-api-provider-azure git:(main) kubectl get azuremachinepoolmachines
NAME             VERSION   READY   STATE
test-mp-mp-0-0   v1.24.5           Succeeded

This repros with v1.5.1.

What's interesting is that this seemingly does not reproduce in our e2e tests, which test release-1.5: https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api-provider-azure#capz-periodic-e2e-full-v1beta1 (I double-checked that the test waits for the MachinePool ready replicas to equal the spec replicas, which would time out in the scenario above).
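To spot-check that condition manually (the MachinePool name is assumed to match the AzureMachinePool shown above), comparing the spec replicas with the ready replicas in status is enough:

kubectl get machinepool test-mp-mp-0 -o jsonpath='{.spec.replicas} {.status.readyReplicas}{"\n"}'

In the stuck state above, readyReplicas never catches up to spec.replicas.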

What did you expect to happen:

Anything else you would like to add:

Environment:

  • cluster-api-provider-azure version: v1.5.1
  • Kubernetes version: (use kubectl version):
  • OS (e.g. from /etc/os-release):

CecileRobertMichon avatar Oct 12 '22 00:10 CecileRobertMichon

/assign

mboersma avatar Oct 12 '22 17:10 mboersma

I tried this with main and make tilt-up + the machinepool flavor from Tilt, and it behaved correctly within a few minutes:

% k get azuremachinepool
NAME                     REPLICAS   READY   STATE
machinepool-27094-mp-0   2          true    Succeeded
% k get azuremachinepoolmachines
NAME                       VERSION   READY   STATE
machinepool-27094-mp-0-0   v1.23.9   true    Succeeded
machinepool-27094-mp-0-1   v1.23.9   true    Succeeded

I'll try again specifically with v1.5.1 and the quickstart route.

mboersma avatar Oct 12 '22 17:10 mboersma

I can repro by following the quick start:

% clusterctl init --infrastructure azure
Fetching providers
Installing cert-manager Version="v1.9.1"
Waiting for cert-manager to be available...
Installing Provider="cluster-api" Version="v1.2.3" TargetNamespace="capi-system"
Installing Provider="bootstrap-kubeadm" Version="v1.2.3" TargetNamespace="capi-kubeadm-bootstrap-system"
Installing Provider="control-plane-kubeadm" Version="v1.2.3" TargetNamespace="capi-kubeadm-control-plane-system"
Installing Provider="infrastructure-azure" Version="v1.5.2" TargetNamespace="capz-system"
...
% k get azuremachinepool        
NAME           REPLICAS   READY   STATE
test-mp-mp-0                      Updating
% k get azuremachinepoolmachines
NAME             VERSION   READY   STATE
test-mp-mp-0-0   v1.24.5           Succeeded

Edit: I think this failed because I hadn't followed through with installing the Calico CNI on the workload cluster. In further testing, that seems to be the key.
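For reference, the missing CNI is easy to confirm from the workload cluster (cluster name test-mp assumed from the quick start steps above); the nodes stay NotReady until a CNI is running:

clusterctl get kubeconfig test-mp > test-mp.kubeconfig
kubectl --kubeconfig=./test-mp.kubeconfig get nodes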

mboersma avatar Oct 12 '22 19:10 mboersma

have you tried with tilt + v1.5.1 tag? Just to know if this is a tilt vs. clusterctl or v1.5.1 vs main branch difference

CecileRobertMichon avatar Oct 12 '22 22:10 CecileRobertMichon

Machinepool works just fine using make tilt-up in CAPZ with the v1.5.1 tag. Seems to be a clusterctl- or Quick Start-related issue, rather than a change in our code.

mboersma avatar Oct 13 '22 14:10 mboersma

The template generated by clusterctl generate cluster test-mp --flavor machinepool is basically identical to that generated by clicking the "machinepool" link in CAPZ Tilt. I just wanted to rule that out as a difference. I'll use the "known working" cluster template for further testing regardless.

mboersma avatar Oct 13 '22 15:10 mboersma

I'm seeing this behavior (AzureMachinePoolMachines come up but the AzureMachinePool stays stuck at "updating") if I don't install Calico as recommended for Azure in the Quick Start. Once I install the manifest and Calico starts running, both AMP resource types soon move to READY=true and STATE=Succeeded.
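For completeness, the step I had skipped is roughly the following (the manifest URL is the one the quick start pointed to at the time and may have moved since; substitute your workload cluster's kubeconfig):

kubectl --kubeconfig=./test-mp.kubeconfig apply -f https://raw.githubusercontent.com/kubernetes-sigs/cluster-api-provider-azure/main/templates/addons/calico.yaml

Once the calico-node pods come up, the nodes flip to Ready and the AMP reconciles to Succeeded.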

Maybe there's a more informative status we could apply to an AMP in this case?

mboersma avatar Oct 13 '22 19:10 mboersma

This is my experience too: without a working CNI the nodes never become ready, and so the AMP gets stuck.
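As a quick illustration (node name and kubeconfig path are placeholders), describing one of the stuck nodes shows why; the kubelet keeps the Ready condition false until a CNI plugin is initialized:

kubectl --kubeconfig=<workload-kubeconfig> describe node <node-name>

The Ready condition's message typically says something along the lines of "container runtime network not ready ... cni plugin not initialized".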

primeroz avatar Dec 29 '22 11:12 primeroz

@mboersma - will this be fixed by, or is it already fixed in, any of your PRs? People shouldn't have to install Calico specifically to make it work (e.g. versus Azure CNI), and if we require a CNI provider (even if not Calico), we should definitely document that.

dtzar avatar Jan 04 '23 00:01 dtzar

/milestone v1.8

CecileRobertMichon avatar Mar 10 '23 20:03 CecileRobertMichon

@CecileRobertMichon: The provided milestone is not valid for this repository. Milestones in this repository: [next, v1.9]

Use /milestone clear to clear the milestone.

In response to this:

/milestone v1.8

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Mar 10 '23 20:03 k8s-ci-robot

/milestone v1.9

mboersma avatar Mar 16 '23 16:03 mboersma

/milestone v1.11

mboersma avatar Jul 20 '23 16:07 mboersma

/milestone next

willie-yao avatar Aug 17 '23 16:08 willie-yao

@willie-yao: You must be a member of the kubernetes-sigs/cluster-api-provider-azure-maintainers GitHub team to set the milestone. If you believe you should be able to issue the /milestone command, please contact your Cluster API Provider Azure Maintainers and have them propose you as an additional delegate for this responsibility.

In response to this:

/milestone next

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Aug 17 '23 16:08 k8s-ci-robot

/unassign /milestone next

I haven't made any progress on this unfortunately and I'm not likely to for this release cycle.

mboersma avatar Aug 17 '23 16:08 mboersma

/milestone next

Jont828 avatar Nov 02 '23 16:11 Jont828

@Jont828: You must be a member of the kubernetes-sigs/cluster-api-provider-azure-maintainers GitHub team to set the milestone. If you believe you should be able to issue the /milestone command, please contact your Cluster API Provider Azure Maintainers and have them propose you as an additional delegate for this responsibility.

In response to this:

/milestone next

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Nov 02 '23 16:11 k8s-ci-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Feb 14 '24 18:02 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Mar 15 '24 19:03 k8s-triage-robot

/remove-lifecycle rotten

willie-yao avatar Mar 19 '24 17:03 willie-yao

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jun 17 '24 18:06 k8s-triage-robot

/remove-lifecycle stale

willie-yao avatar Jun 17 '24 23:06 willie-yao