vsphere CSI nodes in crash loop backoff because `csi-vsphere-config` secret is missing vSphere creds
What happened:
The vsphere-csi-controller and vsphere-csi-node pods are in CrashLoopBackOff on a workload cluster after performing an upgrade.
Checking the logs shows:

```
{"level":"error","time":"2022-09-19T20:15:16.271474904Z","caller":"config/config.go:272","msg":"vcConfig.User is empty for vc vsphere.testlab.local"
```
Finally, describing the csi-vsphere-config secret shows that the user and password fields are empty, while all the other fields are populated correctly:
```yaml
apiVersion: v1
kind: Secret
metadata:
  name: csi-vsphere-config
  namespace: kube-system
stringData:
  csi-vsphere.conf: |+
    [Global]
    cluster-id = "default/abhinav-workload"
    thumbprint = ""

    [VirtualCenter "vsphere.testlab.local"]
    user = ""
    password = ""
    datacenters = "Datacenter"
    insecure-flag = "false"

    [Network]
    public-network = "/Datacenter/network/network-1"
type: Opaque
```
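The empty credentials can be confirmed mechanically by decoding the mounted conf and checking for blank fields. A minimal sketch — the kubectl pipeline in the comment assumes access to the affected cluster and is not run here; the inline `CONF` simply mirrors the broken secret above:

```shell
# Sketch: detect empty vSphere credentials in a decoded csi-vsphere.conf.
# Against a live cluster, CONF would be populated like this (needs kubeconfig access):
#   CONF="$(kubectl -n kube-system get secret csi-vsphere-config \
#             -o jsonpath='{.data.csi-vsphere\.conf}' | base64 -d)"
# Here CONF is inlined with the values seen in the broken secret.
CONF='[Global]
cluster-id = "default/abhinav-workload"

[VirtualCenter "vsphere.testlab.local"]
user = ""
password = ""'

# Count credential fields that are literal empty strings.
EMPTY=$(printf '%s\n' "$CONF" | grep -cE '^(user|password) = ""$')
echo "empty credential fields: $EMPTY"
```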
How to reproduce it (as minimally and precisely as possible): Create a workload cluster from an existing management cluster and upgrade the workload cluster.
Environment:
- EKS Anywhere Release: v0.11.1
- EKS Distro Release: v1.23
- OS: Bottlerocket
I wasn't able to reproduce the missing-secret issue when building an artifact from source tagged at v0.11.1.
I was able to reproduce the vsphere-csi-controller and vsphere-csi-node pods going into CrashLoopBackOff, though, both with the following log message:

```
W0921 13:45:36.467668 1 connection.go:173] Still connecting to unix:///csi/csi.sock
```
Even before the upgrade, I see warnings that seem significant in the vsphere-csi-node pod descriptions:

```
Events:
  Type     Reason            Age                    From               Message
  ----     ------            ----                   ----               -------
  Normal   Scheduled         35m                    default-scheduler  Successfully assigned kube-system/vsphere-csi-node-fn8gx to 10.61.250.110
  Warning  NetworkNotReady   35m (x3 over 35m)      kubelet            network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized
  Warning  FailedMount       35m (x4 over 35m)      kubelet            MountVolume.SetUp failed for volume "vsphere-config-volume" : object "kube-system"/"csi-vsphere-config" not registered
  Warning  FailedMount       34m (x7 over 35m)      kubelet            MountVolume.SetUp failed for volume "vsphere-config-volume" : object "kube-system"/"csi-vsphere-config" not registered
  Warning  NetworkNotReady   34m (x18 over 35m)     kubelet            network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized
  Warning  DNSConfigForming  10m (x22 over 32m)     kubelet            Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 10.106.49.30 10.106.49.51 10.106.151.90
  Warning  DNSConfigForming  4m12s (x2 over 5m24s)  kubelet            Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 10.106.49.30 10.106.151.90 10.106.49.51
```
Edit: I also found these in the vsphere-csi-controller logs:
```
{"level":"error","time":"2022-09-21T14:36:32.997536669Z","caller":"k8sorchestrator/k8sorchestrator.go:167","msg":"Failed to initialize the orchestrator. Error: configmaps \"internal-feature-states.csi.vsphere.vmware.com\" not found","TraceId":"c6eaff1d-85c2-4a23-8131-43cafa6cb097","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v2/pkg/csi/service/common/commonco/k8sorchestrator.Newk8sOrchestrator\n\tsigs.k8s.io/vsphere-csi-driver/v2/pkg/csi/service/common/commonco/k8sorchestrator/k8sorchestrator.go:167\nsigs.k8s.io/vsphere-csi-driver/v2/pkg/csi/service/common/commonco.GetContainerOrchestratorInterface\n\tsigs.k8s.io/vsphere-csi-driver/v2/pkg/csi/service/common/commonco/coagnostic.go:63\nsigs.k8s.io/vsphere-csi-driver/v2/pkg/csi/service.(*vsphereCSIDriver).BeforeServe\n\tsigs.k8s.io/vsphere-csi-driver/v2/pkg/csi/service/driver.go:119\ngithub.com/rexray/gocsi.(*StoragePlugin).Serve.func1\n\tgithub.com/rexray/[email protected]/gocsi.go:246\nsync.(*Once).doSlow\n\tsync/once.go:68\nsync.(*Once).Do\n\tsync/once.go:59\ngithub.com/rexray/gocsi.(*StoragePlugin).Serve\n\tgithub.com/rexray/[email protected]/gocsi.go:211\ngithub.com/rexray/gocsi.Run\n\tgithub.com/rexray/[email protected]/gocsi.go:130\nmain.main\n\tsigs.k8s.io/vsphere-csi-driver/v2/cmd/vsphere-csi/main.go:72\nruntime.main\n\truntime/proc.go:255"}
{"level":"error","time":"2022-09-21T14:36:32.997561517Z","caller":"commonco/coagnostic.go:65","msg":"creating k8sOrchestratorInstance failed. Err: configmaps \"internal-feature-states.csi.vsphere.vmware.com\" not found","TraceId":"c6eaff1d-85c2-4a23-8131-43cafa6cb097","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v2/pkg/csi/service/common/commonco.GetContainerOrchestratorInterface\n\tsigs.k8s.io/vsphere-csi-driver/v2/pkg/csi/service/common/commonco/coagnostic.go:65\nsigs.k8s.io/vsphere-csi-driver/v2/pkg/csi/service.(*vsphereCSIDriver).BeforeServe\n\tsigs.k8s.io/vsphere-csi-driver/v2/pkg/csi/service/driver.go:119\ngithub.com/rexray/gocsi.(*StoragePlugin).Serve.func1\n\tgithub.com/rexray/[email protected]/gocsi.go:246\nsync.(*Once).doSlow\n\tsync/once.go:68\nsync.(*Once).Do\n\tsync/once.go:59\ngithub.com/rexray/gocsi.(*StoragePlugin).Serve\n\tgithub.com/rexray/[email protected]/gocsi.go:211\ngithub.com/rexray/gocsi.Run\n\tgithub.com/rexray/[email protected]/gocsi.go:130\nmain.main\n\tsigs.k8s.io/vsphere-csi-driver/v2/cmd/vsphere-csi/main.go:72\nruntime.main\n\truntime/proc.go:255"}
{"level":"error","time":"2022-09-21T14:36:32.99758526Z","caller":"service/driver.go:122","msg":"Failed to create CO agnostic interface. Error: configmaps \"internal-feature-states.csi.vsphere.vmware.com\" not found","TraceId":"c6eaff1d-85c2-4a23-8131-43cafa6cb097","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v2/pkg/csi/service.(*vsphereCSIDriver).BeforeServe\n\tsigs.k8s.io/vsphere-csi-driver/v2/pkg/csi/service/driver.go:122\ngithub.com/rexray/gocsi.(*StoragePlugin).Serve.func1\n\tgithub.com/rexray/[email protected]/gocsi.go:246\nsync.(*Once).doSlow\n\tsync/once.go:68\nsync.(*Once).Do\n\tsync/once.go:59\ngithub.com/rexray/gocsi.(*StoragePlugin).Serve\n\tgithub.com/rexray/[email protected]/gocsi.go:211\ngithub.com/rexray/gocsi.Run\n\tgithub.com/rexray/[email protected]/gocsi.go:130\nmain.main\n\tsigs.k8s.io/vsphere-csi-driver/v2/cmd/vsphere-csi/main.go:72\nruntime.main\n\truntime/proc.go:255"}
```
An offline convo with @abhinavmpandey08 was very helpful. I was able to replicate the issue, seeing empty username/password values in csi-vsphere-config in eksa-system:
```
$ k --kubeconfig minimal-privs-cluster/minimal-privs-cluster-eks-a-cluster.kubeconfig -n eksa-system get secrets csi-vsphere-config -o jsonpath='{.data.*}' | base64 -d
apiVersion: v1
kind: Secret
metadata:
  name: csi-vsphere-config
  namespace: kube-system
stringData:
  csi-vsphere.conf: |+
    [Global]
    cluster-id = "default/workload-cluster-2"
    thumbprint = ""

    [VirtualCenter "10.61.250.74"]
    user = ""
    password = ""
    datacenters = "Datacenter"
    insecure-flag = "true"

    [Network]
    public-network = "/Datacenter/network/VM Network"
type: Opaque
```
However, the secrets were correctly populated in the workload clusters:
```
$ k --kubeconfig workload-cluster/workload-cluster-eks-a-cluster.kubeconfig -n kube-system get secrets csi-vsphere-config -o jsonpath='{.data.*}' | base64 -d
[Global]
cluster-id = "default/workload-cluster"
thumbprint = ""

[VirtualCenter "10.61.250.74"]
user = "[email protected]"
password = "***"
datacenters = "Datacenter"
insecure-flag = "true"

[Network]
public-network = "/Datacenter/network/VM Network"
```
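To see exactly which fields diverge between the two clusters, the decoded confs can be diffed directly. A sketch — the kubectl pipelines are shown as comments because they need both kubeconfigs, and the inline sample files use hypothetical credentials for illustration:

```shell
# In practice, populate the files from each cluster (assumes both kubeconfigs):
#   kubectl --kubeconfig mgmt.kubeconfig -n eksa-system get secret csi-vsphere-config \
#     -o jsonpath='{.data.*}' | base64 -d > mgmt.conf
#   kubectl --kubeconfig workload.kubeconfig -n kube-system get secret csi-vsphere-config \
#     -o jsonpath='{.data.*}' | base64 -d > workload.conf
# Hypothetical sample contents, mirroring the outputs above:
printf 'user = ""\npassword = ""\n' > mgmt.conf
printf 'user = "[email protected]"\npassword = "example-password"\n' > workload.conf

# diff exits non-zero when the files differ, so swallow the status with || true
diff mgmt.conf workload.conf || true
```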
I'm still not sure why csi-vsphere-config in the management cluster's eksa-system namespace is losing its creds.
To make things more confusing, I was able to run an upgrade on a workload cluster from a management cluster with a compromised eksa-system csi-vsphere-config secret without breaking the workload cluster.
In my situation, the CSI driver in my workload cluster was broken because this ConfigMap was missing from the workload cluster:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: internal-feature-states.csi.vsphere.vmware.com
  namespace: kube-system
data:
  csi-migration: "false"
```
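In the meantime, a possible manual workaround (a sketch, not verified across versions) is to recreate this ConfigMap in the workload cluster. The `kubectl apply` is left commented because it needs the workload cluster's kubeconfig:

```shell
# Sketch: manually restore the ConfigMap the CSI controller expects.
cat > internal-feature-states.yaml <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: internal-feature-states.csi.vsphere.vmware.com
  namespace: kube-system
data:
  csi-migration: "false"
EOF

# Apply it against the *workload* cluster (needs its kubeconfig; not run here):
#   kubectl --kubeconfig workload-cluster-eks-a-cluster.kubeconfig \
#     apply -f internal-feature-states.yaml
echo "manifest written"
```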
This should be resolved by https://github.com/aws/eks-anywhere/pull/3424

/close
@abhinavmpandey08: Closing this issue.
In response to this:
This should be resolved by https://github.com/aws/eks-anywhere/pull/3424 /close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.