vsphere CSI nodes in crash loop backoff because `csi-vsphere-config` secret is missing vSphere creds
What happened:
The vsphere-csi-controller and vsphere-csi-node pods are in CrashLoopBackOff on a workload cluster after performing an upgrade.
Checking the logs shows:

```
{"level":"error","time":"2022-09-19T20:15:16.271474904Z","caller":"config/config.go:272","msg":"vcConfig.User is empty for vc vsphere.testlab.local"
```
Finally, describing the csi-vsphere-config secret shows that the user and password fields are empty, while all the other fields are populated correctly:
```yaml
apiVersion: v1
kind: Secret
metadata:
  name: csi-vsphere-config
  namespace: kube-system
stringData:
  csi-vsphere.conf: |+
    [Global]
    cluster-id = "default/abhinav-workload"
    thumbprint = ""

    [VirtualCenter "vsphere.testlab.local"]
    user = ""
    password = ""
    datacenters = "Datacenter"
    insecure-flag = "false"

    [Network]
    public-network = "/Datacenter/network/network-1"
type: Opaque
```
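The empty credentials can be confirmed mechanically by decoding the mounted conf and checking for blank fields. A minimal sketch — the kubectl pipeline in the comment assumes access to the affected cluster and is not run here; the inline `CONF` simply mirrors the broken secret above:

```shell
# Sketch: detect empty vSphere credentials in a decoded csi-vsphere.conf.
# Against a live cluster, CONF would be populated like this (needs kubeconfig access):
#   CONF="$(kubectl -n kube-system get secret csi-vsphere-config \
#             -o jsonpath='{.data.csi-vsphere\.conf}' | base64 -d)"
# Here CONF is inlined with the values seen in the broken secret.
CONF='[Global]
cluster-id = "default/abhinav-workload"

[VirtualCenter "vsphere.testlab.local"]
user = ""
password = ""'

# Count credential fields that are literal empty strings.
EMPTY=$(printf '%s\n' "$CONF" | grep -cE '^(user|password) = ""$')
echo "empty credential fields: $EMPTY"
```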
How to reproduce it (as minimally and precisely as possible): Create a workload cluster from an existing management cluster and upgrade the workload cluster.
Environment:
- EKS Anywhere Release: v0.11.1
- EKS Distro Release: v1.23
- OS: Bottlerocket
I wasn't able to reproduce the missing-secret issue when building an artifact from source tagged at v0.11.1.
I was able to reproduce the vsphere-csi-controller and vsphere-csi-node pods going into CrashLoopBackOff, though, both with the following log message:

```
W0921 13:45:36.467668 1 connection.go:173] Still connecting to unix:///csi/csi.sock
```
Even before the upgrade, I see warnings that seem significant in the vsphere-csi-node pod descriptions:

```
Events:
  Type     Reason            Age                    From               Message
  ----     ------            ----                   ----               -------
  Normal   Scheduled         35m                    default-scheduler  Successfully assigned kube-system/vsphere-csi-node-fn8gx to 10.61.250.110
  Warning  NetworkNotReady   35m (x3 over 35m)      kubelet            network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized
  Warning  FailedMount       35m (x4 over 35m)      kubelet            MountVolume.SetUp failed for volume "vsphere-config-volume" : object "kube-system"/"csi-vsphere-config" not registered
  Warning  FailedMount       34m (x7 over 35m)      kubelet            MountVolume.SetUp failed for volume "vsphere-config-volume" : object "kube-system"/"csi-vsphere-config" not registered
  Warning  NetworkNotReady   34m (x18 over 35m)     kubelet            network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized
  Warning  DNSConfigForming  10m (x22 over 32m)     kubelet            Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 10.106.49.30 10.106.49.51 10.106.151.90
  Warning  DNSConfigForming  4m12s (x2 over 5m24s)  kubelet            Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 10.106.49.30 10.106.151.90 10.106.49.51
```
Edit: I also found these in the vsphere-csi-controller logs:
```
{"level":"error","time":"2022-09-21T14:36:32.997536669Z","caller":"k8sorchestrator/k8sorchestrator.go:167","msg":"Failed to initialize the orchestrator. Error: configmaps \"internal-feature-states.csi.vsphere.vmware.com\" not found","TraceId":"c6eaff1d-85c2-4a23-8131-43cafa6cb097","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v2/pkg/csi/service/common/commonco/k8sorchestrator.Newk8sOrchestrator\n\tsigs.k8s.io/vsphere-csi-driver/v2/pkg/csi/service/common/commonco/k8sorchestrator/k8sorchestrator.go:167\nsigs.k8s.io/vsphere-csi-driver/v2/pkg/csi/service/common/commonco.GetContainerOrchestratorInterface\n\tsigs.k8s.io/vsphere-csi-driver/v2/pkg/csi/service/common/commonco/coagnostic.go:63\nsigs.k8s.io/vsphere-csi-driver/v2/pkg/csi/service.(*vsphereCSIDriver).BeforeServe\n\tsigs.k8s.io/vsphere-csi-driver/v2/pkg/csi/service/driver.go:119\ngithub.com/rexray/gocsi.(*StoragePlugin).Serve.func1\n\tgithub.com/rexray/[email protected]/gocsi.go:246\nsync.(*Once).doSlow\n\tsync/once.go:68\nsync.(*Once).Do\n\tsync/once.go:59\ngithub.com/rexray/gocsi.(*StoragePlugin).Serve\n\tgithub.com/rexray/[email protected]/gocsi.go:211\ngithub.com/rexray/gocsi.Run\n\tgithub.com/rexray/[email protected]/gocsi.go:130\nmain.main\n\tsigs.k8s.io/vsphere-csi-driver/v2/cmd/vsphere-csi/main.go:72\nruntime.main\n\truntime/proc.go:255"}
{"level":"error","time":"2022-09-21T14:36:32.997561517Z","caller":"commonco/coagnostic.go:65","msg":"creating k8sOrchestratorInstance failed. Err: configmaps \"internal-feature-states.csi.vsphere.vmware.com\" not found","TraceId":"c6eaff1d-85c2-4a23-8131-43cafa6cb097","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v2/pkg/csi/service/common/commonco.GetContainerOrchestratorInterface\n\tsigs.k8s.io/vsphere-csi-driver/v2/pkg/csi/service/common/commonco/coagnostic.go:65\nsigs.k8s.io/vsphere-csi-driver/v2/pkg/csi/service.(*vsphereCSIDriver).BeforeServe\n\tsigs.k8s.io/vsphere-csi-driver/v2/pkg/csi/service/driver.go:119\ngithub.com/rexray/gocsi.(*StoragePlugin).Serve.func1\n\tgithub.com/rexray/[email protected]/gocsi.go:246\nsync.(*Once).doSlow\n\tsync/once.go:68\nsync.(*Once).Do\n\tsync/once.go:59\ngithub.com/rexray/gocsi.(*StoragePlugin).Serve\n\tgithub.com/rexray/[email protected]/gocsi.go:211\ngithub.com/rexray/gocsi.Run\n\tgithub.com/rexray/[email protected]/gocsi.go:130\nmain.main\n\tsigs.k8s.io/vsphere-csi-driver/v2/cmd/vsphere-csi/main.go:72\nruntime.main\n\truntime/proc.go:255"}
{"level":"error","time":"2022-09-21T14:36:32.99758526Z","caller":"service/driver.go:122","msg":"Failed to create CO agnostic interface. Error: configmaps \"internal-feature-states.csi.vsphere.vmware.com\" not found","TraceId":"c6eaff1d-85c2-4a23-8131-43cafa6cb097","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v2/pkg/csi/service.(*vsphereCSIDriver).BeforeServe\n\tsigs.k8s.io/vsphere-csi-driver/v2/pkg/csi/service/driver.go:122\ngithub.com/rexray/gocsi.(*StoragePlugin).Serve.func1\n\tgithub.com/rexray/[email protected]/gocsi.go:246\nsync.(*Once).doSlow\n\tsync/once.go:68\nsync.(*Once).Do\n\tsync/once.go:59\ngithub.com/rexray/gocsi.(*StoragePlugin).Serve\n\tgithub.com/rexray/[email protected]/gocsi.go:211\ngithub.com/rexray/gocsi.Run\n\tgithub.com/rexray/[email protected]/gocsi.go:130\nmain.main\n\tsigs.k8s.io/vsphere-csi-driver/v2/cmd/vsphere-csi/main.go:72\nruntime.main\n\truntime/proc.go:255"}
```
An offline convo with @abhinavmpandey08 was very helpful. I was able to replicate the issue, seeing empty username/password values in csi-vsphere-config in eksa-system:
```
$ k --kubeconfig minimal-privs-cluster/minimal-privs-cluster-eks-a-cluster.kubeconfig -n eksa-system get secrets csi-vsphere-config -o jsonpath='{.data.*}' | base64 -d
apiVersion: v1
kind: Secret
metadata:
  name: csi-vsphere-config
  namespace: kube-system
stringData:
  csi-vsphere.conf: |+
    [Global]
    cluster-id = "default/workload-cluster-2"
    thumbprint = ""

    [VirtualCenter "10.61.250.74"]
    user = ""
    password = ""
    datacenters = "Datacenter"
    insecure-flag = "true"

    [Network]
    public-network = "/Datacenter/network/VM Network"
type: Opaque
```
However, the secrets were correctly populated in the workload clusters:
```
$ k --kubeconfig workload-cluster/workload-cluster-eks-a-cluster.kubeconfig -n kube-system get secrets csi-vsphere-config -o jsonpath='{.data.*}' | base64 -d
[Global]
cluster-id = "default/workload-cluster"
thumbprint = ""

[VirtualCenter "10.61.250.74"]
user = "[email protected]"
password = "***"
datacenters = "Datacenter"
insecure-flag = "true"

[Network]
public-network = "/Datacenter/network/VM Network"
```
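To see exactly which fields diverge between the two clusters, the decoded confs can be diffed directly. A sketch — the kubectl pipelines are shown as comments because they need both kubeconfigs, and the inline sample files use hypothetical credentials for illustration:

```shell
# In practice, populate the files from each cluster (assumes both kubeconfigs):
#   kubectl --kubeconfig mgmt.kubeconfig -n eksa-system get secret csi-vsphere-config \
#     -o jsonpath='{.data.*}' | base64 -d > mgmt.conf
#   kubectl --kubeconfig workload.kubeconfig -n kube-system get secret csi-vsphere-config \
#     -o jsonpath='{.data.*}' | base64 -d > workload.conf
# Hypothetical sample contents, mirroring the outputs above:
printf 'user = ""\npassword = ""\n' > mgmt.conf
printf 'user = "[email protected]"\npassword = "example-password"\n' > workload.conf

# diff exits non-zero when the files differ, so swallow the status with || true
diff mgmt.conf workload.conf || true
```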
I'm still not sure why csi-vsphere-config in the management cluster's eksa-system namespace is losing its creds.
To make things more confusing, I was able to run an upgrade on a workload cluster from a management cluster with a compromised eksa-system csi-vsphere-config secret without breaking the workload cluster.
In my situation, the CSI driver in my workload cluster was broken because this ConfigMap was missing from the workload cluster:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: internal-feature-states.csi.vsphere.vmware.com
  namespace: kube-system
data:
  csi-migration: "false"
```
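In the meantime, a possible manual workaround (a sketch, not verified across versions) is to recreate this ConfigMap in the workload cluster. The `kubectl apply` is left commented because it needs the workload cluster's kubeconfig:

```shell
# Sketch: manually restore the ConfigMap the CSI controller expects.
cat > internal-feature-states.yaml <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: internal-feature-states.csi.vsphere.vmware.com
  namespace: kube-system
data:
  csi-migration: "false"
EOF

# Apply it against the *workload* cluster (needs its kubeconfig; not run here):
#   kubectl --kubeconfig workload-cluster-eks-a-cluster.kubeconfig \
#     apply -f internal-feature-states.yaml
echo "manifest written"
```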
This should be resolved by https://github.com/aws/eks-anywhere/pull/3424

/close
@abhinavmpandey08: Closing this issue.
In response to this:
This should be resolved by https://github.com/aws/eks-anywhere/pull/3424 /close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.