
vsphere-csi-node-xxxxx are in CrashLoopBackOff

dattebayo6716 opened this issue 1 year ago • 7 comments

/kind bug

What steps did you take and what happened:

  • Set up a kind bootstrap cluster to create a cluster with 1 control-plane node and 3 worker nodes on my vSphere account.
  • I am using the Ubuntu 22.04 OVA provided by VMware.
  • On kubectl apply I can see the VMs being created in my vSphere account.
  • I installed Calico following these instructions: https://docs.tigera.io/calico/latest/getting-started/kubernetes/self-managed-onprem/onpremises (because the machines don't have full internet access from the on-premises environment). The overall workflow is sketched after this list.
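For reference, the workflow looked roughly like this (a sketch; the cluster name, versions, and Calico manifest version are approximate, not exact values from my setup):

clusterctl generate cluster mcluster \
  --infrastructure vsphere \
  --kubernetes-version v1.27.3 \
  --control-plane-machine-count 1 \
  --worker-machine-count 3 > mcluster.yaml
kubectl apply -f mcluster.yaml

# Calico operator-based install on the provisioned cluster, per the Tigera docs linked above
kubectl create -f https://raw.githubusercontent.com/projectcalico/calico/v3.26.1/manifests/tigera-operator.yaml --kubeconfig=mcluster.kubeconfig
kubectl create -f https://raw.githubusercontent.com/projectcalico/calico/v3.26.1/manifests/custom-resources.yaml --kubeconfig=mcluster.kubeconfig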

What I see on the provisioned cluster

  1. Some Calico pods are in Pending state.
  2. Some CoreDNS pods are in Pending state.
  3. The vsphere-csi-controller pod is in Pending state.
  4. The vsphere-csi-node-xxxxx pods are in CrashLoopBackOff without much information.
  5. There is NO log of what error has occurred. I checked the logs of the CAPI and CAPV pods in the bootstrap cluster, and there are no errors in the provisioned cluster's pods either. (The commands I used are sketched after this list.)
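These are roughly the commands I used when looking for errors (a sketch; deployment and pod names are taken from the listings below):

# Controller logs on the kind bootstrap cluster
kubectl logs -n capi-system deploy/capi-controller-manager
kubectl logs -n capv-system deploy/capv-controller-manager

# Crashing CSI container on the provisioned cluster
kubectl logs -n kube-system vsphere-csi-node-dtvrg -c vsphere-csi-node --previous --kubeconfig=mcluster.kubeconfig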

What did you expect to happen: I expected to see a cluster with all pods running.

Anything else you would like to add: Below is some of the kubectl output for reference.

Here are some of the environment variables I have:

# VSPHERE_TEMPLATE: "ubuntu-2204-kube-v1.27.3"
# CONTROL_PLANE_ENDPOINT_IP: "10.63.32.100"
# VIP_NETWORK_INTERFACE: "ens192"
# VSPHERE_TLS_THUMBPRINT: ""
# EXP_CLUSTER_RESOURCE_SET: true  
# VSPHERE_SSH_AUTHORIZED_KEY: ""

# VSPHERE_STORAGE_POLICY: ""
# CPI_IMAGE_K8S_VERSION: "v1.27.3"

All bootstrap pods are running without errors.

ubuntu@frun10926:~/k8s$ kubectl get po -A -o wide
NAMESPACE                           NAME                                                             READY   STATUS    RESTARTS      AGE     IP            NODE                 NOMINATED NODE   READINESS GATES
capi-kubeadm-bootstrap-system       capi-kubeadm-bootstrap-controller-manager-557b778d6b-qpxn7       1/1     Running   1 (24h ago)   2d22h   10.244.0.9    kind-control-plane   <none>           <none>
capi-kubeadm-control-plane-system   capi-kubeadm-control-plane-controller-manager-55d8f6b576-8hl5r   1/1     Running   1 (24h ago)   2d22h   10.244.0.10   kind-control-plane   <none>           <none>
capi-system                         capi-controller-manager-685454967c-tnmcj                         1/1     Running   3 (24h ago)   2d22h   10.244.0.8    kind-control-plane   <none>           <none>
capv-system                         capv-controller-manager-84d85cdcbd-cb2wp                         1/1     Running   3 (24h ago)   2d22h   10.244.0.11   kind-control-plane   <none>           <none>
cert-manager                        cert-manager-75d57c8d4b-7j4tk                                    1/1     Running   1 (24h ago)   2d22h   10.244.0.6    kind-control-plane   <none>           <none>
cert-manager                        cert-manager-cainjector-69d6f4d488-rvp67                         1/1     Running   2 (24h ago)   2d22h   10.244.0.5    kind-control-plane   <none>           <none>
cert-manager                        cert-manager-webhook-869b6c65c4-h6xdt                            1/1     Running   0             2d22h   10.244.0.7    kind-control-plane   <none>           <none>
kube-system                         coredns-5d78c9869d-djj9s                                         1/1     Running   0             2d22h   10.244.0.4    kind-control-plane   <none>           <none>
kube-system                         coredns-5d78c9869d-vltjl                                         1/1     Running   0             2d22h   10.244.0.3    kind-control-plane   <none>           <none>
kube-system                         etcd-kind-control-plane                                          1/1     Running   0             2d22h   172.18.0.2    kind-control-plane   <none>           <none>
kube-system                         kindnet-zp6c5                                                    1/1     Running   1 (24h ago)   2d22h   172.18.0.2    kind-control-plane   <none>           <none>
kube-system                         kube-apiserver-kind-control-plane                                1/1     Running   1 (24h ago)   2d22h   172.18.0.2    kind-control-plane   <none>           <none>
kube-system                         kube-controller-manager-kind-control-plane                       1/1     Running   1 (24h ago)   2d22h   172.18.0.2    kind-control-plane   <none>           <none>
kube-system                         kube-proxy-t2g5b                                                 1/1     Running   0             2d22h   172.18.0.2    kind-control-plane   <none>           <none>
kube-system                         kube-scheduler-kind-control-plane                                1/1     Running   1 (24h ago)   2d22h   172.18.0.2    kind-control-plane   <none>           <none>
local-path-storage                  local-path-provisioner-6bc4bddd6b-rkwwm                          1/1     Running   0             2d22h   10.244.0.2    kind-control-plane   <none>           <none>

Here are the pods on the vSphere cluster that was provisioned using CAPI

ubuntu@frun10926:~/k8s$ kubectl get po -A --kubeconfig=mcluster.kubeconfig -o wide
NAMESPACE         NAME                                       READY   STATUS             RESTARTS          AGE     IP                NODE                        NOMINATED NODE   READINESS GATES
calico-system     calico-kube-controllers-5f9d445bb4-hp7rt   0/1     Pending            0                 2d20h   <none>            <none>                      <none>           <none>
calico-system     calico-node-6mrpv                          1/1     Running            0                 2d20h   10.63.32.83       mcluster-md-0-4kxmk-zplmd   <none>           <none>
calico-system     calico-node-dg42m                          1/1     Running            0                 2d20h   10.63.32.84       mcluster-klljm              <none>           <none>
calico-system     calico-node-f6n9r                          1/1     Running            0                 2d20h   10.63.32.81       mcluster-md-0-4kxmk-wfscb   <none>           <none>
calico-system     calico-node-gtxcg                          1/1     Running            0                 2d20h   10.63.32.82       mcluster-md-0-4kxmk-gbcjj   <none>           <none>
calico-system     calico-typha-5b866db66c-sdnpv              1/1     Running            0                 2d20h   10.63.32.81       mcluster-md-0-4kxmk-wfscb   <none>           <none>
calico-system     calico-typha-5b866db66c-trwlj              1/1     Running            0                 2d20h   10.63.32.82       mcluster-md-0-4kxmk-gbcjj   <none>           <none>
calico-system     csi-node-driver-drblt                      2/2     Running            0                 2d20h   192.168.232.193   mcluster-klljm              <none>           <none>
calico-system     csi-node-driver-pbhvm                      2/2     Running            0                 2d20h   192.168.68.65     mcluster-md-0-4kxmk-zplmd   <none>           <none>
calico-system     csi-node-driver-vflj4                      2/2     Running            0                 2d20h   192.168.141.66    mcluster-md-0-4kxmk-gbcjj   <none>           <none>
calico-system     csi-node-driver-wzmtr                      2/2     Running            0                 2d20h   192.168.83.65     mcluster-md-0-4kxmk-wfscb   <none>           <none>
kube-system       coredns-5d78c9869d-ckdjb                   0/1     Pending            0                 2d20h   <none>            <none>                      <none>           <none>
kube-system       coredns-5d78c9869d-vlpkw                   0/1     Pending            0                 2d20h   <none>            <none>                      <none>           <none>
kube-system       etcd-mcluster-klljm                        1/1     Running            0                 2d20h   10.63.32.84       mcluster-klljm              <none>           <none>
kube-system       kube-apiserver-mcluster-klljm              1/1     Running            0                 2d20h   10.63.32.84       mcluster-klljm              <none>           <none>
kube-system       kube-controller-manager-mcluster-klljm     1/1     Running            0                 2d20h   10.63.32.84       mcluster-klljm              <none>           <none>
kube-system       kube-proxy-7dxb2                           1/1     Running            0                 2d20h   10.63.32.82       mcluster-md-0-4kxmk-gbcjj   <none>           <none>
kube-system       kube-proxy-gsgzz                           1/1     Running            0                 2d20h   10.63.32.84       mcluster-klljm              <none>           <none>
kube-system       kube-proxy-mp98t                           1/1     Running            0                 2d20h   10.63.32.83       mcluster-md-0-4kxmk-zplmd   <none>           <none>
kube-system       kube-proxy-x97w4                           1/1     Running            0                 2d20h   10.63.32.81       mcluster-md-0-4kxmk-wfscb   <none>           <none>
kube-system       kube-scheduler-mcluster-klljm              1/1     Running            0                 2d20h   10.63.32.84       mcluster-klljm              <none>           <none>
kube-system       kube-vip-mcluster-klljm                    1/1     Running            0                 2d20h   10.63.32.84       mcluster-klljm              <none>           <none>
kube-system       vsphere-cloud-controller-manager-hzvzj     1/1     Running            0                 2d20h   10.63.32.84       mcluster-klljm              <none>           <none>
kube-system       vsphere-csi-controller-664c45f69b-6ddz4    0/5     Pending            0                 2d20h   <none>            <none>                      <none>           <none>
kube-system       vsphere-csi-node-dtvrg                     2/3     CrashLoopBackOff   809 (3m57s ago)   2d20h   192.168.141.65    mcluster-md-0-4kxmk-gbcjj   <none>           <none>
kube-system       vsphere-csi-node-jcpxj                     2/3     CrashLoopBackOff   810 (73s ago)     2d20h   192.168.232.194   mcluster-klljm              <none>           <none>
kube-system       vsphere-csi-node-lpjxj                     2/3     CrashLoopBackOff   809 (2m22s ago)   2d20h   192.168.83.66     mcluster-md-0-4kxmk-wfscb   <none>           <none>
kube-system       vsphere-csi-node-nkh6m                     2/3     CrashLoopBackOff   809 (3m35s ago)   2d20h   192.168.68.66     mcluster-md-0-4kxmk-zplmd   <none>           <none>
tigera-operator   tigera-operator-84cf9b6dbb-w6lkf           1/1     Running            0                 2d20h   10.63.32.83       mcluster-md-0-4kxmk-zplmd   <none>           <none>

Here is a sample kubectl describe for a vsphere-csi-node-xxxxx pod

ubuntu@frun10926:~/k8s$ kubectl describe pod  vsphere-csi-node-dtvrg -n kube-system --kubeconfig=mcluster.kubeconfig
Name:             vsphere-csi-node-dtvrg
Namespace:        kube-system
Priority:         0
Service Account:  default
Node:             mcluster-md-0-4kxmk-gbcjj/10.63.32.82
Start Time:       Fri, 24 Nov 2023 19:14:52 +0000
Labels:           app=vsphere-csi-node
                  controller-revision-hash=69967bd89d
                  pod-template-generation=1
                  role=vsphere-csi
Annotations:      cni.projectcalico.org/containerID: 0e30215c3f275ce821e98584c24cd139273c8c061af590ef5ddeb915b421e6ec
                  cni.projectcalico.org/podIP: 192.168.141.65/32
                  cni.projectcalico.org/podIPs: 192.168.141.65/32
Status:           Running
IP:               192.168.141.65
IPs:
  IP:           192.168.141.65
Controlled By:  DaemonSet/vsphere-csi-node
Containers:
  node-driver-registrar:
    Container ID:  containerd://075a9e6aa183294562e6edfbd55577f8eeca891c19cb43603973a1057d2f8125
    Image:         quay.io/k8scsi/csi-node-driver-registrar:v2.0.1
    Image ID:      quay.io/k8scsi/csi-node-driver-registrar@sha256:a104f0f0ec5fdd007a4a85ffad95a93cfb73dd7e86296d3cc7846fde505248d3
    Port:          <none>
    Host Port:     <none>
    Args:
      --v=5
      --csi-address=$(ADDRESS)
      --kubelet-registration-path=$(DRIVER_REG_SOCK_PATH)
    State:          Running
      Started:      Fri, 24 Nov 2023 19:31:30 +0000
    Ready:          True
    Restart Count:  0
    Environment:
      ADDRESS:               /csi/csi.sock
      DRIVER_REG_SOCK_PATH:  /var/lib/kubelet/plugins/csi.vsphere.vmware.com/csi.sock
    Mounts:
      /csi from plugin-dir (rw)
      /registration from registration-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-glb6m (ro)
  vsphere-csi-node:
    Container ID:   containerd://b8ec60cc34ad576e31564f0d993b2b50440f8de2753f744c545cb772407ee654
    Image:          gcr.io/cloud-provider-vsphere/csi/release/driver:v3.1.2
    Image ID:       gcr.io/cloud-provider-vsphere/csi/release/driver@sha256:471db9143b6daf2abdb656383f9d7ad34123a22c163c3f0e62dc8921048566bb
    Port:           9808/TCP
    Host Port:      0/TCP
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Mon, 27 Nov 2023 15:56:46 +0000
      Finished:     Mon, 27 Nov 2023 15:56:46 +0000
    Ready:          False
    Restart Count:  807
    Liveness:       http-get http://:healthz/healthz delay=10s timeout=3s period=5s #success=1 #failure=3
    Environment:
      CSI_ENDPOINT:               unix:///csi/csi.sock
      X_CSI_MODE:                 node
      X_CSI_SPEC_REQ_VALIDATION:  false
      VSPHERE_CSI_CONFIG:         /etc/cloud/csi-vsphere.conf
      LOGGER_LEVEL:               PRODUCTION
      X_CSI_LOG_LEVEL:            INFO
      NODE_NAME:                   (v1:spec.nodeName)
    Mounts:
      /csi from plugin-dir (rw)
      /dev from device-dir (rw)
      /etc/cloud from vsphere-config-volume (rw)
      /var/lib/kubelet from pods-mount-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-glb6m (ro)
  liveness-probe:
    Container ID:  containerd://3ccf0d77472d57ac853a20305fd7862c97163b2509e40977cdc735e26b21665a
    Image:         quay.io/k8scsi/livenessprobe:v2.1.0
    Image ID:      quay.io/k8scsi/livenessprobe@sha256:04a9c4a49de1bd83d21e962122da2ac768f356119fb384660aa33d93183996c3
    Port:          <none>
    Host Port:     <none>
    Args:
      --csi-address=/csi/csi.sock
    State:          Running
      Started:      Fri, 24 Nov 2023 19:31:54 +0000
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /csi from plugin-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-glb6m (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  vsphere-config-volume:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  csi-vsphere-config
    Optional:    false
  registration-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/plugins_registry
    HostPathType:  Directory
  plugin-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/plugins/csi.vsphere.vmware.com/
    HostPathType:  DirectoryOrCreate
  pods-mount-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet
    HostPathType:  Directory
  device-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /dev
    HostPathType:  
  kube-api-access-glb6m:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 :NoSchedule op=Exists
                             :NoExecute op=Exists
                             node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type     Reason            Age                      From     Message
  ----     ------            ----                     ----     -------
  Warning  DNSConfigForming  28s (x20490 over 2d20h)  kubelet  Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 10.242.46.35 10.242.46.36 10.250.46.36

Environment:

  • Cluster-api-provider-vsphere version: 1.5.3
  • Kubernetes version: (use kubectl version): 1.27.3
  • OS (e.g. from /etc/os-release): Ubuntu 22.04 OVA image that vSphere recommends (with no changes to the OVA).

dattebayo6716 · Nov 27 '23 19:11

Could you take a look at why vsphere-csi-controller-664c45f69b-6ddz4 is Pending (via kubectl describe pod)?
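For example, something along these lines (pod name taken from your output above):

kubectl describe pod vsphere-csi-controller-664c45f69b-6ddz4 -n kube-system --kubeconfig=mcluster.kubeconfig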

If I got it right, this pod needs to be up first so that the DaemonSet pods can succeed.

Did you use the default templates provided by CAPV or did you manually deploy CSI?

chrischdi · Dec 06 '23 09:12

I posted the sample output from kubectl describe <pod> above.

I used the default template and followed the instructions on the quick-start page to generate the cluster YAML file. I am not using the YAML files from the templates folder.

dattebayo6716 · Dec 14 '23 03:12

So something prevents the vsphere-csi-controller from getting scheduled. There may be taints or something else causing this.

You need to figure out why that is; then the DaemonSet pods should also become ready.
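For example, a quick way to check the taints on the workload cluster nodes (a sketch; the uninitialized taint mentioned in the comment below the command is just one common possibility, shown for illustration):

kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints' --kubeconfig=mcluster.kubeconfig
# e.g. node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule showing up here
# would mean the cloud provider has not initialized the nodes yet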

chrischdi · Dec 14 '23 08:12

Can you get the events from that namespace?
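For example (a sketch, run against the provisioned cluster):

kubectl get events -n kube-system --sort-by=.lastTimestamp --kubeconfig=mcluster.kubeconfig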

rvanderp3 · Dec 14 '23 20:12

The csi-node-driver, which is installed by the tigera-operator, conflicts with vsphere-csi-node. I couldn't disable the installation of csi-node-driver, so I used kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.26.1/manifests/calico.yaml instead.
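A minimal sketch of that workaround, assuming the same kubeconfig naming as in the outputs above:

# Manifest-based Calico install instead of the tigera-operator install
kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.26.1/manifests/calico.yaml --kubeconfig=mcluster.kubeconfig
# If the manifest install skips Calico's csi-node-driver (the point of this workaround), this returns nothing
kubectl get pods -A --kubeconfig=mcluster.kubeconfig | grep csi-node-driver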

habibullinrsh · Feb 01 '24 07:02

It would be interesting to figure out, together with https://github.com/kubernetes/cloud-provider-vsphere, where the gaps are so that both can run at the same time (for CSI we simply consume the above).

chrischdi · Feb 01 '24 10:02

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · May 01 '24 10:05

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot · May 31 '24 10:05

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot · Jun 30 '24 11:06

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot · Jun 30 '24 11:06