vsphere csi controller pods crashloop without multi-vcenter feature gate

Open gnufied opened this issue 8 months ago • 5 comments

We do not yet support multiple vCenters in our deployment, so we do not have this feature gate enabled. However, in topologies with a single datacenter and multiple compute clusters, or with multiple datacenters under a single vCenter, the controller pod crashes with:

{"level":"error","time":"2023-11-24T17:16:30.532383276Z","caller":"service/driver.go:203","msg":"failed to run the driver. Err: +failed to update cache with topology information. Error: failed to get vCenterInstance for vCenter Host: \"vcs8e-vc.ocp2.dev.cluster.com\". Error: virtual center was already registered","TraceId":"da5779b6-e99a-475b-b300-350dfa441f1e","stacktrace":"..."}

More information in CI - https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_vsphere-problem-detector/139/pull-ci-openshift-vsphere-problem-detector-master-e2e-vsphere-zones/1728081801684979712/artifacts/e2e-vsphere-zones/gather-extra/artifacts/pods/openshift-cluster-csi-drivers_vmware-vsphere-csi-driver-controller-574c8c86db-cs8gh_csi-driver.log

Ideally this should not be the case. If a feature is needed for the driver to function, it should be enabled by default.

What is more interesting is that enabling the multi-vcenter-csi-topology feature gate stops the driver pod from crashing, but volume provisioning then works in only one zone and fails in the other:

  Warning  ProvisioningFailed    17s (x8 over 80s)  csi.vsphere.vmware.com_vmware-vsphere-csi-driver-controller-566cb79f5c-jv7kp_a0a2e0c8-2c18-4845-9ae1-8fb0a6bddfec  failed to provision volume with StorageClass "thin-csi": rpc error: code = Internal desc = failed to create volume. Errors encountered: [No compatible datastores found for accessibility requirements [map[topology.csi.vmware.com/openshift-region:us-east-1 topology.csi.vmware.com/openshift-zone:us-east-1a]] pertaining to vCenter "vcenter.blah.lan"]

This used to work before without any problem.

cc @divyenpatel @xing-yang

gnufied avatar Dec 06 '23 13:12 gnufied

More specifically, what we have noticed is that in 3.1.1, volume provisioning does not work if hosts are tagged. It only works if either compute clusters or datacenters are tagged. Is that expected?

Provisioning also does not work if hosts and compute clusters happen to have identical topology tags. For example:

I have a compute cluster, "compute1", with zone/region tags us-east-1a/us-east-1. It has a single host, "exga.home.dev", with the same tags as the cluster.

I also have a second compute cluster, "compute2", with similar tags.

With the new version of the CSI driver, volume provisioning no longer works in this environment.

gnufied avatar Dec 07 '23 19:12 gnufied

@gnufied We are aware that volume provisioning will not work if topology tags are applied on a standalone host, i.e., a ComputeResource, but the use case you are describing should work. Volume provisioning should work on a HostSystem under a ClusterComputeResource (i.e., a vSphere cluster) when both have the same tags. What error do you see on the PVC when you try this? Could you share the logs from the vsphere-csi-controller container for a failing volume provisioning request?

shalini-b avatar Dec 07 '23 19:12 shalini-b

Here is the backtrace from the failure. We are not using standalone hosts; the hosts are part of a compute cluster. It is just that the hosts themselves are tagged:

2023-12-07T01:05:34.354255765Z common/topology.go:158 Hosts returned for topology category: "topology.csi.vmware.com/openshift-region" and tag: "us-east-1" are [HostSystem:host-14 HostSystem:host-14 HostSystem:host-10088]
  2023-12-07T01:05:34.354266166Z common/topology.go:170 finding common hosts for hostlists: [[HostSystem:host-14 HostSystem:host-14] [HostSystem:host-14 HostSystem:host-14 HostSystem:host-10088]]
  2023-12-07T01:05:34.354277416Z common/topology.go:162 common hosts: [] for all segments: map[topology.csi.vmware.com/openshift-region:us-east-1 topology.csi.vmware.com/openshift-zone:us-east-1a]
  2023-12-07T01:05:34.354285936Z placementengine/placement.go:86 Obtained list of shared datastores [] for hosts []
  2023-12-07T01:05:34.383697314Z placementengine/placement.go:107 Datastores compatible with storage policy "f1f4efa3-7546-4a26-907c-f464ccfb0e8b" are map[datastore-10112:{} datastore-20001:{}] for vCenter: "vcenter.home.lan"
  2023-12-07T01:05:34.383725315Z placementengine/placement.go:119 No compatible shared datastores found for storage policy "f1f4efa3-7546-4a26-907c-f464ccfb0e8b" on vCenter: "vcenter.home.lan"
  2023-12-07T01:05:34.383735945Z vanilla/controller.go:1321 No compatible datastores found for accessibility requirements [map[topology.csi.vmware.com/openshift-region:us-east-1 topology.csi.vmware.com/openshift-zone:us-east-1a]] pertaining to vCenter "vcenter.home.lan"
  2023-12-07T01:05:34.383750225Z vanilla/controller.go:1433 failed to create volume. Errors encountered: [No compatible datastores found for accessibility requirements [map[topology.csi.vmware.com/openshift-region:us-east-1 topology.csi.vmware.com/openshift-zone:us-east-1a]] pertaining to vCenter "vcenter.home.lan"]
  sigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).createBlockVolumeWithPlacementEngineForMultiVC
          /go/src/github.com/kubernetes-sigs/vsphere-csi-driver/pkg/csi/service/vanilla/controller.go:1433
  sigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).CreateVolume.func1
          /go/src/github.com/kubernetes-sigs/vsphere-csi-driver/pkg/csi/service/vanilla/controller.go:1842
  sigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).CreateVolume
          /go/src/github.com/kubernetes-sigs/vsphere-csi-driver/pkg/csi/service/vanilla/controller.go:1847
  github.com/container-storage-interface/spec/lib/go/csi._Controller_CreateVolume_Handler
          /go/src/github.com/kubernetes-sigs/vsphere-csi-driver/vendor/github.com/container-storage-interface/spec/lib/go/csi/csi.pb.go:5671
  google.golang.org/grpc.(*Server).processUnaryRPC
          /go/src/github.com/kubernetes-sigs/vsphere-csi-driver/vendor/google.golang.org/grpc/server.go:1283
  google.golang.org/grpc.(*Server).handleStream
          /go/src/github.com/kubernetes-sigs/vsphere-csi-driver/vendor/google.golang.org/grpc/server.go:1620
  google.golang.org/grpc.(*Server).serveStreams.func1.2
          /go/src/github.com/kubernetes-sigs/vsphere-csi-driver/vendor/google.golang.org/grpc/server.go:922
  2023-12-07T01:05:34.383788495Z vanilla/controller.go:1853 Operation failed, reporting failure status to Prometheus. Operation Type: "create-volume", Volume Type: "block", Fault Type: "csi.fault.Internal"
  sigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).CreateVolume
          /go/src/github.com/kubernetes-sigs/vsphere-csi-driver/pkg/csi/service/vanilla/controller.go:1853
  github.com/container-storage-interface/spec/lib/go/csi._Controller_CreateVolume_Handler
          /go/src/github.com/kubernetes-sigs/vsphere-csi-driver/vendor/github.com/container-storage-interface/spec/lib/go/csi/csi.pb.go:5671
  google.golang.org/grpc.(*Server).processUnaryRPC
          /go/src/github.com/kubernetes-sigs/vsphere-csi-driver/vendor/google.golang.org/grpc/server.go:1283
  google.golang.org/grpc.(*Server).handleStream
          /go/src/github.com/kubernetes-sigs/vsphere-csi-driver/vendor/google.golang.org/grpc/server.go:1620
  google.golang.org/grpc.(*Server).serveStreams.func1.2
          /go/src/github.com/kubernetes-sigs/vsphere-csi-driver/vendor/google.golang.org/grpc/server.go:92
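
In the log above, host-14 appears in both host lists (twice in each), yet the common-hosts result is empty. For comparison, here is a minimal sketch of a set-based intersection that tolerates duplicate entries and would return host-14 for that input. This is not the driver's code: the function name `commonHosts` and the use of plain strings for host IDs are assumptions made for illustration only.

```go
package main

import (
	"fmt"
	"sort"
)

// commonHosts returns the hosts present in every list, ignoring duplicate
// entries within a list. Hypothetical helper; host IDs are plain strings.
func commonHosts(hostLists [][]string) []string {
	if len(hostLists) == 0 {
		return nil
	}
	// Start with the set of hosts from the first list.
	common := make(map[string]struct{})
	for _, h := range hostLists[0] {
		common[h] = struct{}{}
	}
	// Keep only hosts that also appear in each remaining list.
	for _, list := range hostLists[1:] {
		seen := make(map[string]struct{}, len(list))
		for _, h := range list {
			seen[h] = struct{}{}
		}
		for h := range common {
			if _, ok := seen[h]; !ok {
				delete(common, h)
			}
		}
	}
	result := make([]string, 0, len(common))
	for h := range common {
		result = append(result, h)
	}
	sort.Strings(result) // deterministic order for logging and tests
	return result
}

func main() {
	// Host lists taken from the "finding common hosts" log line above.
	lists := [][]string{
		{"host-14", "host-14"},
		{"host-14", "host-14", "host-10088"},
	}
	fmt.Println(commonHosts(lists)) // prints [host-14]
}
```

Using sets means a host that is returned once via its own tag and once via its parent cluster's tag is still counted as a single host.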

gnufied avatar Dec 07 '23 22:12 gnufied

Remove the tag from the cluster if you want only specific hosts within the cluster to be part of the AZ, or remove the tag from the hosts if you want all hosts to be part of the AZ.

The same tag should not be applied to both a parent entity and its child entity; that is an invalid configuration.

If an AZ tag is assigned to a parent entity, all of its child entities are assumed to be part of the parent entity's AZ.
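
The rule above can be checked mechanically before upgrading. The following is a minimal sketch, assuming a simplified in-memory model of the inventory: the `Entity` type, the `findDuplicateTags` function, and the category names are all hypothetical and do not correspond to the driver's or govmomi's API. It reports every child that repeats a category/tag pair already set on an ancestor.

```go
package main

import (
	"fmt"
	"sort"
)

// Entity is a hypothetical, simplified model of a tagged inventory object
// (datacenter, cluster, or host). Tags maps tag category to tag name.
type Entity struct {
	Name     string
	Tags     map[string]string
	Children []*Entity
}

// findDuplicateTags walks the tree and reports each entity that carries the
// same category/tag pair as one of its ancestors.
func findDuplicateTags(e *Entity, inherited map[string]string) []string {
	var problems []string

	// Sort categories so the report order is deterministic.
	categories := make([]string, 0, len(e.Tags))
	for c := range e.Tags {
		categories = append(categories, c)
	}
	sort.Strings(categories)
	for _, c := range categories {
		if v, ok := inherited[c]; ok && v == e.Tags[c] {
			problems = append(problems,
				fmt.Sprintf("%s repeats %s=%s set on an ancestor", e.Name, c, e.Tags[c]))
		}
	}

	// Children inherit the ancestors' tags plus this entity's own tags.
	merged := make(map[string]string, len(inherited)+len(e.Tags))
	for c, v := range inherited {
		merged[c] = v
	}
	for c, v := range e.Tags {
		merged[c] = v
	}
	for _, child := range e.Children {
		problems = append(problems, findDuplicateTags(child, merged)...)
	}
	return problems
}

func main() {
	// Mirrors the environment described earlier in this thread; the
	// category names are illustrative.
	tags := map[string]string{"openshift-region": "us-east-1", "openshift-zone": "us-east-1a"}
	compute1 := &Entity{
		Name:     "compute1",
		Tags:     tags,
		Children: []*Entity{{Name: "exga.home.dev", Tags: tags}},
	}
	for _, p := range findDuplicateTags(compute1, nil) {
		fmt.Println(p)
	}
}
```

The sketch only flags identical tags; a child whose tag conflicts with its parent's tag in the same category would need a separate check.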

divyenpatel avatar Dec 08 '23 00:12 divyenpatel

> remove tag from cluster if you want specific host within cluster to be part of az or remove tag from host, if you want all hosts to be part of az.
>
> we should not have same tag on the parent entity and child entity. it is invalid configuration.
>
> if AZ tag is assigned on parent entity, it is assumed that all child entities are part of parent entity's AZ.

Although it's an invalid configuration, this used to work in previous versions of the driver. I worry this change in behavior could break volume provisioning for some users after an upgrade.

Since all child entities are part of the parent entity's AZ, I tested a relatively small change in the driver to ignore child entity tags in that case and allow volume provisioning to proceed. Please take a look: https://github.com/kubernetes-sigs/vsphere-csi-driver/pull/2814
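
To illustrate the idea only (this is not the code in the linked pull request), here is a minimal sketch of resolving the hosts for a topology segment while letting a tagged parent cluster take precedence over its hosts' own tags. The `Cluster` and `Host` types, the `hostsInSegment` function, the category name, and the compute2 values are all assumptions made for the example.

```go
package main

import "fmt"

// Host and Cluster are hypothetical, simplified inventory types.
// Tags maps tag category to tag name.
type Host struct {
	Name string
	Tags map[string]string
}

type Cluster struct {
	Name  string
	Tags  map[string]string
	Hosts []Host
}

// hostsInSegment returns the hosts that belong to the given category/tag.
// If a cluster is tagged in that category, the cluster's tag decides for all
// of its hosts and the hosts' own tags are ignored. Otherwise each host is
// matched by its own tag.
func hostsInSegment(clusters []Cluster, category, tag string) []string {
	var hosts []string
	for _, c := range clusters {
		if clusterTag, ok := c.Tags[category]; ok {
			if clusterTag == tag {
				for _, h := range c.Hosts {
					hosts = append(hosts, h.Name)
				}
			}
			continue // child tags are ignored when the parent is tagged
		}
		for _, h := range c.Hosts {
			if hostTag, ok := h.Tags[category]; ok && hostTag == tag {
				hosts = append(hosts, h.Name)
			}
		}
	}
	return hosts
}

func main() {
	zoneA := map[string]string{"openshift-zone": "us-east-1a"}
	zoneB := map[string]string{"openshift-zone": "us-east-1b"} // made-up value
	clusters := []Cluster{
		// Cluster and host carry the same tag, as in the reported environment.
		{Name: "compute1", Tags: zoneA, Hosts: []Host{{Name: "exga.home.dev", Tags: zoneA}}},
		// Host name below is invented for the example.
		{Name: "compute2", Tags: zoneB, Hosts: []Host{{Name: "host-b", Tags: zoneB}}},
	}
	fmt.Println(hostsInSegment(clusters, "openshift-zone", "us-east-1a")) // prints [exga.home.dev]
}
```

Because each host is visited once through its parent cluster, a host tagged identically to its cluster is returned once rather than twice.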

dobsonj avatar Mar 07 '24 01:03 dobsonj