
VMExtensionProvisioningError with capi-ubun2-2204 and capi-ubun2-2404

Open · thiDucTran opened this issue 9 months ago · 6 comments

/kind bug

What steps did you take and what happened:

  1. clusterctl generate cluster capz-private --infrastructure azure --kubernetes-version v1.31.4 --control-plane-machine-count=1 --worker-machine-count=1
  2. Then I add the following to the generated YAML file:

      image:
        computeGallery:
          gallery: "ClusterAPI-f72ceb4f-5159-4c26-a0fe-2ea738f0d019"
          name: "capi-ubun2-2404"
          version: "1.31.4"

      networkSpec:
        privateDNSZoneName: "capz-private.capz.io"
        apiServerLB:
          type: Internal
          frontendIPs:
          - name: lb-private-ip-frontend
            privateIP: 172.19.2.100
        controlPlaneOutboundLB:
          frontendIPsCount: 1
        nodeOutboundLB:
          frontendIPsCount: 1
        subnets:
        - name: control-plane-subnet
          role: control-plane
          cidrBlocks:
          - 172.19.2.0/25
        - name: node-subnet
          role: node
          cidrBlocks:
          - 172.19.2.128/25
        vnet:
          name: vnet-capz-private
          cidrBlocks:
          - 172.19.2.0/24
          peerings:
          - resourceGroup: rg-name
            remoteVnetName: vnet-capi-mgmt-stage
  3. The cluster reaches Provisioned, but I see the two errors below and the KubeadmControlPlane is never initialized:
E0116 16:26:38.836076       1 cluster_accessor.go:257] "Connect failed" err="error creating HTTP client and mapper: cluster is not reachable: Get \"https://apiserver.capz-private.capz.io:6443/?timeout=5s\": context deadline exceeded" controller="clustercache" controllerGroup="cluster.x-k8s.io" controllerKind="Cluster" Cluster="default/capz-private" namespace="default" name="capz-private" reconcileID="4ec298dc-f392-4809-9426-d35cf8bfa55d"

E0116 16:31:31.645375       1 controller.go:324] "Reconciler error" err=<failed to reconcile AzureMachine: failed to reconcile AzureMachine service vmextensions: extension state failed. This likely means the Kubernetes node bootstrapping process failed or timed out. Check VM boot diagnostics logs to learn more: failed to create or update resource thi-poc-capz-private/CAPZ.Linux.Bootstrapping (service: vmextensions): GET https://management.azure.com/subscriptions/9d54088e-89a6-4a3a-8f91-05699ba3c93a/providers/Microsoft.Compute/locations/northcentralus/operations/6921441d-f760-44e3-8db1-b7aa18c05a5e
--------------------------------------------------------------------------------
RESPONSE 200: 200 OK
ERROR CODE: VMExtensionProvisioningError
--------------------------------------------------------------------------------
"code": "VMExtensionProvisioningError",
"message": "VM has reported a failure when processing extension 'CAPZ.Linux.Bootstrapping' (publisher 'Microsoft.Azure.ContainerUpstream' and type 'CAPZ.Linux.Bootstrapping'). Error message: 'Enable failed: failed to execute command: command terminated with exit status=1\n[stdout]\n\n[stderr]\n'. More information on troubleshooting is available at https://aka.ms/vmextensionlinuxtroubleshoot."

What did you expect to happen:

  1. The KubeadmControlPlane gets initialized and the CAPZ.Linux.Bootstrapping extension executes successfully
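
For completeness, the checks I'd expect to pass on the management cluster once the control plane initializes (a sketch; cluster name as in this issue):

    # INITIALIZED should flip to true once kubeadm init succeeds on the first CP node
    kubectl get kubeadmcontrolplane -A

    # Condition-level view of where the rollout is stuck
    clusterctl describe cluster capz-private --show-conditions all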

Anything else you would like to add:

  • The logs mentioned in https://learn.microsoft.com/en-us/azure/virtual-machines/extensions/features-linux?tabs=azure-cli#troubleshoot-vm-extensions do not give any additional info, at least not to me; I can share them if requested.
  • The VNet, subnets, and network security groups do not pre-exist; CAPI creates them.
  • Even though bootstrapping does not complete successfully, I can run clusterctl get kubeconfig and then kubectl get po, which shows that the API server is functioning and there is no network connectivity issue (see the sketch after this list).
  • Compared to a cluster where I don't set the API server LB to type: Internal, I noticed that kube-proxy is missing; maybe that is related to the bootstrapping failure.
  • Using the same node image, I can create a functioning cluster if I don't specify the API server LB as type: Internal.
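
A sketch of the exact checks described in the bullets above (cluster name and namespace as in this issue; adjust as needed):

    # Fetch the workload cluster kubeconfig from the management cluster
    clusterctl get kubeconfig capz-private -n default > capz-private.kubeconfig

    # The API server responds even though bootstrapping reported failure
    kubectl --kubeconfig capz-private.kubeconfig get po -A

    # kube-proxy normally runs as a DaemonSet in kube-system; it is missing here
    kubectl --kubeconfig capz-private.kubeconfig -n kube-system get ds kube-proxy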

Environment:

  • cluster-api-provider-azure version: (shown in attached image)
  • Kubernetes version: (use kubectl version): v1.30.6
  • OS (e.g. from /etc/os-release): Ubuntu 20.04.6 LTS (Focal Fossa)
  • clusterctl: v1.9.3

thiDucTran · Jan 16 '25 18:01