cluster-api-provider-vsphere
vsphere-csi-node-xxxxx are in CrashLoopBackOff
/kind bug
What steps did you take and what happened:
- Set up a kind bootstrap cluster to create a 1-control-plane-node, 3-worker-node cluster on my vSphere account.
- I am using the Ubuntu 22.04 OVA by VMware.
- On kubectl apply I can see the VMs being created on my vSphere account.
- I installed Calico following these instructions, because the machines don't have full internet access from the on-premise environment: https://docs.tigera.io/calico/latest/getting-started/kubernetes/self-managed-onprem/onpremises
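For reference, the on-prem install from that page comes down to roughly the following (a sketch, not a verbatim transcript of what I ran; the v3.26.1 version is taken from the Calico manifest URL that appears later in this thread, and the kubeconfig file name is the one used below):

```bash
# Operator-based Calico install against the provisioned workload cluster:
kubectl create -f https://raw.githubusercontent.com/projectcalico/calico/v3.26.1/manifests/tigera-operator.yaml \
  --kubeconfig=mcluster.kubeconfig
kubectl create -f https://raw.githubusercontent.com/projectcalico/calico/v3.26.1/manifests/custom-resources.yaml \
  --kubeconfig=mcluster.kubeconfig
```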
What I see on the provisioned cluster
- Some calico pods are in pending state
- Some coredns pods are in pending state
- The vsphere-csi-controller pod is in Pending state
- vsphere-csi-node-xxxxx are in CrashLoopBackOff without much information
- There is NO log of what error has occurred. I checked the logs of the CAPI and CAPV pods in the bootstrap cluster, and there are no errors in the provisioned cluster's pods either.
What did you expect to happen: I expected to see a cluster with all pods running.
Anything else you would like to add:
Below is some of the kubectl output, for reference.
Here are some of the environment variables I have:
# VSPHERE_TEMPLATE: "ubuntu-2204-kube-v1.27.3"
# CONTROL_PLANE_ENDPOINT_IP: "10.63.32.100"
# VIP_NETWORK_INTERFACE: "ens192"
# VSPHERE_TLS_THUMBPRINT: ""
# EXP_CLUSTER_RESOURCE_SET: true
# VSPHERE_SSH_AUTHORIZED_KEY: ""
# VSPHERE_STORAGE_POLICY: ""
# CPI_IMAGE_K8S_VERSION: "v1.27.3"
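For context, the cluster manifest was generated along these lines per the CAPV quick-start (a sketch, not a verbatim copy of the command that was run; the cluster name and node counts are taken from the output below):

```bash
# Export the variables above (or place them in clusterctl's config) before generating:
export VSPHERE_TEMPLATE="ubuntu-2204-kube-v1.27.3"
export CONTROL_PLANE_ENDPOINT_IP="10.63.32.100"
# ...remaining VSPHERE_* / CPI_* variables from the list above...

# Generate the workload cluster manifest and apply it against the kind bootstrap cluster:
clusterctl generate cluster mcluster \
  --infrastructure vsphere \
  --kubernetes-version v1.27.3 \
  --control-plane-machine-count 1 \
  --worker-machine-count 3 > mcluster.yaml
kubectl apply -f mcluster.yaml
```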
All bootstrap pods are running without errors.
ubuntu@frun10926:~/k8s$ kubectl get po -A -o wide
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
capi-kubeadm-bootstrap-system capi-kubeadm-bootstrap-controller-manager-557b778d6b-qpxn7 1/1 Running 1 (24h ago) 2d22h 10.244.0.9 kind-control-plane <none> <none>
capi-kubeadm-control-plane-system capi-kubeadm-control-plane-controller-manager-55d8f6b576-8hl5r 1/1 Running 1 (24h ago) 2d22h 10.244.0.10 kind-control-plane <none> <none>
capi-system capi-controller-manager-685454967c-tnmcj 1/1 Running 3 (24h ago) 2d22h 10.244.0.8 kind-control-plane <none> <none>
capv-system capv-controller-manager-84d85cdcbd-cb2wp 1/1 Running 3 (24h ago) 2d22h 10.244.0.11 kind-control-plane <none> <none>
cert-manager cert-manager-75d57c8d4b-7j4tk 1/1 Running 1 (24h ago) 2d22h 10.244.0.6 kind-control-plane <none> <none>
cert-manager cert-manager-cainjector-69d6f4d488-rvp67 1/1 Running 2 (24h ago) 2d22h 10.244.0.5 kind-control-plane <none> <none>
cert-manager cert-manager-webhook-869b6c65c4-h6xdt 1/1 Running 0 2d22h 10.244.0.7 kind-control-plane <none> <none>
kube-system coredns-5d78c9869d-djj9s 1/1 Running 0 2d22h 10.244.0.4 kind-control-plane <none> <none>
kube-system coredns-5d78c9869d-vltjl 1/1 Running 0 2d22h 10.244.0.3 kind-control-plane <none> <none>
kube-system etcd-kind-control-plane 1/1 Running 0 2d22h 172.18.0.2 kind-control-plane <none> <none>
kube-system kindnet-zp6c5 1/1 Running 1 (24h ago) 2d22h 172.18.0.2 kind-control-plane <none> <none>
kube-system kube-apiserver-kind-control-plane 1/1 Running 1 (24h ago) 2d22h 172.18.0.2 kind-control-plane <none> <none>
kube-system kube-controller-manager-kind-control-plane 1/1 Running 1 (24h ago) 2d22h 172.18.0.2 kind-control-plane <none> <none>
kube-system kube-proxy-t2g5b 1/1 Running 0 2d22h 172.18.0.2 kind-control-plane <none> <none>
kube-system kube-scheduler-kind-control-plane 1/1 Running 1 (24h ago) 2d22h 172.18.0.2 kind-control-plane <none> <none>
local-path-storage local-path-provisioner-6bc4bddd6b-rkwwm 1/1 Running 0 2d22h 10.244.0.2 kind-control-plane <none> <none>
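For anyone retracing this, the CAPI/CAPV controller logs mentioned above can be pulled from the bootstrap cluster with something like the following (deployment names inferred from the pod names in the listing):

```bash
# CAPI and CAPV controller logs in the kind bootstrap cluster:
kubectl logs -n capi-system deploy/capi-controller-manager --since=1h
kubectl logs -n capv-system deploy/capv-controller-manager --since=1h
```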
Here are the pods on the vSphere cluster that was provisioned using CAPI
ubuntu@frun10926:~/k8s$ kubectl get po -A --kubeconfig=mcluster.kubeconfig -o wide
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
calico-system calico-kube-controllers-5f9d445bb4-hp7rt 0/1 Pending 0 2d20h <none> <none> <none> <none>
calico-system calico-node-6mrpv 1/1 Running 0 2d20h 10.63.32.83 mcluster-md-0-4kxmk-zplmd <none> <none>
calico-system calico-node-dg42m 1/1 Running 0 2d20h 10.63.32.84 mcluster-klljm <none> <none>
calico-system calico-node-f6n9r 1/1 Running 0 2d20h 10.63.32.81 mcluster-md-0-4kxmk-wfscb <none> <none>
calico-system calico-node-gtxcg 1/1 Running 0 2d20h 10.63.32.82 mcluster-md-0-4kxmk-gbcjj <none> <none>
calico-system calico-typha-5b866db66c-sdnpv 1/1 Running 0 2d20h 10.63.32.81 mcluster-md-0-4kxmk-wfscb <none> <none>
calico-system calico-typha-5b866db66c-trwlj 1/1 Running 0 2d20h 10.63.32.82 mcluster-md-0-4kxmk-gbcjj <none> <none>
calico-system csi-node-driver-drblt 2/2 Running 0 2d20h 192.168.232.193 mcluster-klljm <none> <none>
calico-system csi-node-driver-pbhvm 2/2 Running 0 2d20h 192.168.68.65 mcluster-md-0-4kxmk-zplmd <none> <none>
calico-system csi-node-driver-vflj4 2/2 Running 0 2d20h 192.168.141.66 mcluster-md-0-4kxmk-gbcjj <none> <none>
calico-system csi-node-driver-wzmtr 2/2 Running 0 2d20h 192.168.83.65 mcluster-md-0-4kxmk-wfscb <none> <none>
kube-system coredns-5d78c9869d-ckdjb 0/1 Pending 0 2d20h <none> <none> <none> <none>
kube-system coredns-5d78c9869d-vlpkw 0/1 Pending 0 2d20h <none> <none> <none> <none>
kube-system etcd-mcluster-klljm 1/1 Running 0 2d20h 10.63.32.84 mcluster-klljm <none> <none>
kube-system kube-apiserver-mcluster-klljm 1/1 Running 0 2d20h 10.63.32.84 mcluster-klljm <none> <none>
kube-system kube-controller-manager-mcluster-klljm 1/1 Running 0 2d20h 10.63.32.84 mcluster-klljm <none> <none>
kube-system kube-proxy-7dxb2 1/1 Running 0 2d20h 10.63.32.82 mcluster-md-0-4kxmk-gbcjj <none> <none>
kube-system kube-proxy-gsgzz 1/1 Running 0 2d20h 10.63.32.84 mcluster-klljm <none> <none>
kube-system kube-proxy-mp98t 1/1 Running 0 2d20h 10.63.32.83 mcluster-md-0-4kxmk-zplmd <none> <none>
kube-system kube-proxy-x97w4 1/1 Running 0 2d20h 10.63.32.81 mcluster-md-0-4kxmk-wfscb <none> <none>
kube-system kube-scheduler-mcluster-klljm 1/1 Running 0 2d20h 10.63.32.84 mcluster-klljm <none> <none>
kube-system kube-vip-mcluster-klljm 1/1 Running 0 2d20h 10.63.32.84 mcluster-klljm <none> <none>
kube-system vsphere-cloud-controller-manager-hzvzj 1/1 Running 0 2d20h 10.63.32.84 mcluster-klljm <none> <none>
kube-system vsphere-csi-controller-664c45f69b-6ddz4 0/5 Pending 0 2d20h <none> <none> <none> <none>
kube-system vsphere-csi-node-dtvrg 2/3 CrashLoopBackOff 809 (3m57s ago) 2d20h 192.168.141.65 mcluster-md-0-4kxmk-gbcjj <none> <none>
kube-system vsphere-csi-node-jcpxj 2/3 CrashLoopBackOff 810 (73s ago) 2d20h 192.168.232.194 mcluster-klljm <none> <none>
kube-system vsphere-csi-node-lpjxj 2/3 CrashLoopBackOff 809 (2m22s ago) 2d20h 192.168.83.66 mcluster-md-0-4kxmk-wfscb <none> <none>
kube-system vsphere-csi-node-nkh6m 2/3 CrashLoopBackOff 809 (3m35s ago) 2d20h 192.168.68.66 mcluster-md-0-4kxmk-zplmd <none> <none>
tigera-operator tigera-operator-84cf9b6dbb-w6lkf 1/1 Running 0 2d20h 10.63.32.83 mcluster-md-0-4kxmk-zplmd <none> <none>
Here is sample kubectl describe output for one of the vsphere-csi-node-xxxxx pods:
ubuntu@frun10926:~/k8s$ kubectl describe pod vsphere-csi-node-dtvrg -n kube-system --kubeconfig=mcluster.kubeconfig
Name: vsphere-csi-node-dtvrg
Namespace: kube-system
Priority: 0
Service Account: default
Node: mcluster-md-0-4kxmk-gbcjj/10.63.32.82
Start Time: Fri, 24 Nov 2023 19:14:52 +0000
Labels: app=vsphere-csi-node
controller-revision-hash=69967bd89d
pod-template-generation=1
role=vsphere-csi
Annotations: cni.projectcalico.org/containerID: 0e30215c3f275ce821e98584c24cd139273c8c061af590ef5ddeb915b421e6ec
cni.projectcalico.org/podIP: 192.168.141.65/32
cni.projectcalico.org/podIPs: 192.168.141.65/32
Status: Running
IP: 192.168.141.65
IPs:
IP: 192.168.141.65
Controlled By: DaemonSet/vsphere-csi-node
Containers:
node-driver-registrar:
Container ID: containerd://075a9e6aa183294562e6edfbd55577f8eeca891c19cb43603973a1057d2f8125
Image: quay.io/k8scsi/csi-node-driver-registrar:v2.0.1
Image ID: quay.io/k8scsi/csi-node-driver-registrar@sha256:a104f0f0ec5fdd007a4a85ffad95a93cfb73dd7e86296d3cc7846fde505248d3
Port: <none>
Host Port: <none>
Args:
--v=5
--csi-address=$(ADDRESS)
--kubelet-registration-path=$(DRIVER_REG_SOCK_PATH)
State: Running
Started: Fri, 24 Nov 2023 19:31:30 +0000
Ready: True
Restart Count: 0
Environment:
ADDRESS: /csi/csi.sock
DRIVER_REG_SOCK_PATH: /var/lib/kubelet/plugins/csi.vsphere.vmware.com/csi.sock
Mounts:
/csi from plugin-dir (rw)
/registration from registration-dir (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-glb6m (ro)
vsphere-csi-node:
Container ID: containerd://b8ec60cc34ad576e31564f0d993b2b50440f8de2753f744c545cb772407ee654
Image: gcr.io/cloud-provider-vsphere/csi/release/driver:v3.1.2
Image ID: gcr.io/cloud-provider-vsphere/csi/release/driver@sha256:471db9143b6daf2abdb656383f9d7ad34123a22c163c3f0e62dc8921048566bb
Port: 9808/TCP
Host Port: 0/TCP
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Mon, 27 Nov 2023 15:56:46 +0000
Finished: Mon, 27 Nov 2023 15:56:46 +0000
Ready: False
Restart Count: 807
Liveness: http-get http://:healthz/healthz delay=10s timeout=3s period=5s #success=1 #failure=3
Environment:
CSI_ENDPOINT: unix:///csi/csi.sock
X_CSI_MODE: node
X_CSI_SPEC_REQ_VALIDATION: false
VSPHERE_CSI_CONFIG: /etc/cloud/csi-vsphere.conf
LOGGER_LEVEL: PRODUCTION
X_CSI_LOG_LEVEL: INFO
NODE_NAME: (v1:spec.nodeName)
Mounts:
/csi from plugin-dir (rw)
/dev from device-dir (rw)
/etc/cloud from vsphere-config-volume (rw)
/var/lib/kubelet from pods-mount-dir (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-glb6m (ro)
liveness-probe:
Container ID: containerd://3ccf0d77472d57ac853a20305fd7862c97163b2509e40977cdc735e26b21665a
Image: quay.io/k8scsi/livenessprobe:v2.1.0
Image ID: quay.io/k8scsi/livenessprobe@sha256:04a9c4a49de1bd83d21e962122da2ac768f356119fb384660aa33d93183996c3
Port: <none>
Host Port: <none>
Args:
--csi-address=/csi/csi.sock
State: Running
Started: Fri, 24 Nov 2023 19:31:54 +0000
Ready: True
Restart Count: 0
Environment: <none>
Mounts:
/csi from plugin-dir (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-glb6m (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
vsphere-config-volume:
Type: Secret (a volume populated by a Secret)
SecretName: csi-vsphere-config
Optional: false
registration-dir:
Type: HostPath (bare host directory volume)
Path: /var/lib/kubelet/plugins_registry
HostPathType: Directory
plugin-dir:
Type: HostPath (bare host directory volume)
Path: /var/lib/kubelet/plugins/csi.vsphere.vmware.com/
HostPathType: DirectoryOrCreate
pods-mount-dir:
Type: HostPath (bare host directory volume)
Path: /var/lib/kubelet
HostPathType: Directory
device-dir:
Type: HostPath (bare host directory volume)
Path: /dev
HostPathType:
kube-api-access-glb6m:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: :NoSchedule op=Exists
:NoExecute op=Exists
node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning DNSConfigForming 28s (x20490 over 2d20h) kubelet Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 10.242.46.35 10.242.46.36 10.250.46.36
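For anyone debugging the same symptom, two commands that may surface more detail than the describe output above (a sketch; the pod and Secret names are taken from that output):

```bash
# Logs from the previous (terminated) run of the crashing vsphere-csi-node container:
kubectl logs vsphere-csi-node-dtvrg -n kube-system -c vsphere-csi-node --previous \
  --kubeconfig=mcluster.kubeconfig

# The driver config mounted at /etc/cloud/csi-vsphere.conf comes from this Secret:
kubectl get secret csi-vsphere-config -n kube-system -o yaml --kubeconfig=mcluster.kubeconfig
```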
Environment:
- Cluster-api-provider-vsphere version: 1.5.3
- Kubernetes version (use kubectl version): 1.27.3
- OS (e.g. from /etc/os-release): Ubuntu 22.04 OVA image that vSphere recommends (with no changes to the OVA).
Could you take a look at why vsphere-csi-controller-664c45f69b-6ddz4 is Pending (via kubectl describe pod)?
If I got it right, this pod needs to be up first so that the DaemonSet pods can succeed.
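For example (a sketch using the pod name and kubeconfig from the output above; the Events section of the describe output normally names the scheduling blocker):

```bash
kubectl describe pod vsphere-csi-controller-664c45f69b-6ddz4 -n kube-system \
  --kubeconfig=mcluster.kubeconfig
```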
Did you use the default templates provided by CAPV or did you manually deploy CSI?
I posted the sample output from the kubectl describe <pod> above.
I used the default template and followed the instructions from the quick-start page to generate the cluster YAML file.
I am not using the YAML files from the templates folder.
So something prevents the vsphere-csi-controller from getting scheduled. There may be taints or something else causing this.
You need to figure out why that is, and then the DaemonSet pods should also get ready.
Can you get the events from that namespace?
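Something along these lines (a sketch; the kubeconfig file name is the one used earlier in this thread):

```bash
# Recent events in the namespace, oldest first:
kubectl get events -n kube-system --sort-by=.lastTimestamp --kubeconfig=mcluster.kubeconfig

# Node taints, since a taint without a matching toleration on the controller
# Deployment would keep it Pending:
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints}{"\n"}{end}' \
  --kubeconfig=mcluster.kubeconfig
```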
The csi-node-driver that is installed by the tigera-operator conflicts with vsphere-csi-node. I couldn't disable the installation of csi-node-driver, so I used kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.26.1/manifests/calico.yaml instead.
It would be interesting to figure out, together with https://github.com/kubernetes/cloud-provider-vsphere, where the gaps are so that both can run at the same time (for CSI we simply consume the above).
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.