cluster-api-provider-azure
AzureMachinePool UX: stays in Updating state until CNI is installed
/kind bug
What steps did you take and what happened:
Create a cluster with the "machinepool" flavor following the quickstart instructions:
export WORKER_MACHINE_COUNT=1
clusterctl generate cluster test-mp --flavor machinepool | kubectl apply -f -
Notice that the VMSS becomes ready and the MachinePoolMachines are in the Succeeded state, but the AzureMachinePool status stays stuck in Updating:
➜ cluster-api-provider-azure git:(main) kubectl get azuremachinepool
NAME           REPLICAS   READY   STATE
test-mp-mp-0                      Updating
➜ cluster-api-provider-azure git:(main) kubectl get azuremachinepoolmachines
NAME             VERSION   READY   STATE
test-mp-mp-0-0   v1.24.5           Succeeded
This repros with v1.5.1.
Interestingly, this does not seem to reproduce in our e2e tests, which run against release-1.5: https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api-provider-azure#capz-periodic-e2e-full-v1beta1 (I double-checked that the test waits for the MachinePool's ready replicas to equal the spec replicas, which would have timed out in the scenario above).
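A quick way to see what the pool is actually waiting on is to check node readiness in the workload cluster. A minimal sketch, assuming the test-mp cluster name from the repro above (both commands are standard clusterctl/kubectl usage):
# Fetch the workload cluster kubeconfig from the management cluster
clusterctl get kubeconfig test-mp > test-mp.kubeconfig
# If the nodes are stuck NotReady, the pool never reports the expected ready replicas
kubectl --kubeconfig=test-mp.kubeconfig get nodes -o wide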
What did you expect to happen:
Anything else you would like to add:
Environment:
- cluster-api-provider-azure version: v1.5.1
- Kubernetes version (use kubectl version):
- OS (e.g. from /etc/os-release):
/assign
I tried this with main and make tilt-up + the machinepool flavor from Tilt, and it behaved correctly within a few minutes:
% k get azuremachinepool
NAME                     REPLICAS   READY   STATE
machinepool-27094-mp-0   2          true    Succeeded
% k get azuremachinepoolmachines
NAME                       VERSION   READY   STATE
machinepool-27094-mp-0-0   v1.23.9   true    Succeeded
machinepool-27094-mp-0-1   v1.23.9   true    Succeeded
I'll try again specifically with v1.5.1 and the quickstart route.
I can repro by following the quick start:
% clusterctl init --infrastructure azure
Fetching providers
Installing cert-manager Version="v1.9.1"
Waiting for cert-manager to be available...
Installing Provider="cluster-api" Version="v1.2.3" TargetNamespace="capi-system"
Installing Provider="bootstrap-kubeadm" Version="v1.2.3" TargetNamespace="capi-kubeadm-bootstrap-system"
Installing Provider="control-plane-kubeadm" Version="v1.2.3" TargetNamespace="capi-kubeadm-control-plane-system"
Installing Provider="infrastructure-azure" Version="v1.5.2" TargetNamespace="capz-system"
...
% k get azuremachinepool
NAME           REPLICAS   READY   STATE
test-mp-mp-0                      Updating
% k get azuremachinepoolmachines
NAME             VERSION   READY   STATE
test-mp-mp-0-0   v1.24.5           Succeeded
Edit: I think this failed because I hadn't followed through with installing the Calico CNI on the workload cluster. In further testing, that seems to be the key.
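To confirm that, a quick check (a sketch, reusing the hypothetical test-mp.kubeconfig fetched earlier) is whether any CNI pods exist in the workload cluster and whether the nodes ever become Ready:
# No Calico (or other CNI) pods running means the kubelets stay NotReady
kubectl --kubeconfig=test-mp.kubeconfig get pods -A | grep -i calico
kubectl --kubeconfig=test-mp.kubeconfig get nodes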
Have you tried with Tilt + the v1.5.1 tag? Just to know whether this is a Tilt vs. clusterctl difference or a v1.5.1 vs. main branch difference.
Machinepool works just fine using make tilt-up in CAPZ with the v1.5.1 tag. Seems to be a clusterctl- or Quick Start-related issue, rather than a change in our code.
The template generated by clusterctl generate cluster test-mp --flavor machinepool is basically identical to that generated by clicking the "machinepool" link in CAPZ Tilt. I just wanted to rule that out as a difference. I'll use the "known working" cluster template for further testing regardless.
I'm seeing this behavior (AzureMachinePoolMachines come up but the AzureMachinePool stays stuck at "updating") if I don't install Calico as recommended for Azure in the Quick Start. Once I install the manifest and Calico starts running, both AMP resource types soon move to READY=true and STATE=Succeeded.
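For reference, this is roughly the step I had skipped. A sketch assuming the Calico addon manifest the CAPZ docs point to (the exact manifest URL and version may differ, so use the one from the Quick Start):
# Install the Calico manifest recommended for Azure into the workload cluster
# (URL shown is illustrative; use the one from the Quick Start / CAPZ docs)
kubectl --kubeconfig=test-mp.kubeconfig apply -f https://raw.githubusercontent.com/kubernetes-sigs/cluster-api-provider-azure/main/templates/addons/calico.yaml
# Then watch the pool converge on the management cluster
kubectl get azuremachinepool,azuremachinepoolmachines -w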
Maybe there's a more informative status we could apply to an AMP in this case?
This is my experience too: without a working CNI the nodes never become Ready, and so the AMP gets stuck.
@mboersma - will this be fixed, or is it already fixed, by any of your PRs? People shouldn't have to install Calico specifically to make it work (i.e., versus Azure CNI), and if we require a CNI provider (even if not Calico), we should definitely document this.
/milestone v1.8
@CecileRobertMichon: The provided milestone is not valid for this repository. Milestones in this repository: [next, v1.9]
Use /milestone clear to clear the milestone.
In response to this:
/milestone v1.8
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/milestone v1.9
/milestone v1.11
/milestone next
@willie-yao: You must be a member of the kubernetes-sigs/cluster-api-provider-azure-maintainers GitHub team to set the milestone. If you believe you should be able to issue the /milestone command, please contact your Cluster API Provider Azure Maintainers and have them propose you as an additional delegate for this responsibility.
In response to this:
/milestone next
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/unassign /milestone next
I haven't made any progress on this unfortunately and I'm not likely to for this release cycle.
/milestone next
@Jont828: You must be a member of the kubernetes-sigs/cluster-api-provider-azure-maintainers GitHub team to set the milestone. If you believe you should be able to issue the /milestone command, please contact your Cluster API Provider Azure Maintainers and have them propose you as an additional delegate for this responsibility.
In response to this:
/milestone next
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
/remove-lifecycle rotten
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale