
Bug Report - vsphere-csi-driver disabled px storage cluster

ibrassfield opened this issue 11 months ago • 6 comments

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened: I have an OpenShift 4.12 bare-metal cluster with a mix of vSphere VMs and bare-metal nodes. The bare-metal nodes are configured as Portworx storage providers, so I do not want the vsphere.csi.driver to interact with those specific worker nodes. I want to apply the vsphere.csi.driver only to the infrastructure nodes in my cluster -- meaning vSphere would provide storage only to 3 specific nodes. When I tried to install it following the instructions, the vsphere.csi.driver knocked my Portworx cluster offline, which caused a bunch of problems.

What you expected to happen: I was expecting to be able to have multiple CSI drivers in my cluster and to apply this vSphere storage only to the infrastructure nodes, which are running in VMware.

How to reproduce it (as minimally and precisely as possible): To reproduce this error, I would just apply the vsphere-csi-driver configs to the cluster.

Anything else we need to know?:

Environment:

  • csi-vsphere version: latest
  • vsphere-cloud-controller-manager version: latest
  • Kubernetes version: 1.25/OpenShift 4.12
  • vSphere version:
  • OS (e.g. from /etc/os-release): RHCOS/OpenShift
  • Kernel (e.g. uname -a):
  • Install tools: oc client
  • Others:

ibrassfield avatar Sep 20 '23 19:09 ibrassfield

cc: @gnufied

divyenpatel avatar Sep 21 '23 23:09 divyenpatel

So I assume that this cluster was deployed with the baremetal platform type when deploying OCP? Because OpenShift by default installs a vSphere CSI driver on all nodes in the cluster in 4.12, and it can't be disabled or turned off.

Can you confirm what kind of platform integration you chose when installing OCP?

Assuming a baremetal install, it should be possible to install the vSphere driver and the Portworx driver separately (at least in theory).

The vsphere.csi.driver knocked my portworx cluster offline which caused a bunch of problems.

Can you elaborate? Did you set node-selectors for both controller and daemonset appropriately?
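For illustration, a minimal sketch of what that could look like with oc patch -- the vmware-system-csi namespace, the vsphere-csi-controller Deployment and vsphere-csi-node DaemonSet names, and the node-role.kubernetes.io/infra label are assumptions based on the upstream manifests, so adjust them to your actual deployment:

    # Pin the controller Deployment to the vSphere-backed infra nodes
    # (object names, namespace, and label below are assumptions).
    oc patch deployment vsphere-csi-controller -n vmware-system-csi \
      --type merge \
      -p '{"spec":{"template":{"spec":{"nodeSelector":{"node-role.kubernetes.io/infra":""}}}}}'

    # Pin the node DaemonSet the same way so it never schedules onto
    # the Portworx bare-metal workers.
    oc patch daemonset vsphere-csi-node -n vmware-system-csi \
      --type merge \
      -p '{"spec":{"template":{"spec":{"nodeSelector":{"node-role.kubernetes.io/infra":""}}}}}'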

gnufied avatar Sep 25 '23 20:09 gnufied


Thanks for the response.

Yes, this is a baremetal cluster type, so there is no specific platform integration.

I did use node selectors on the DaemonSets, but maybe not on the controller, and I'm not sure how to do that.

When I say it knocked my Portworx cluster offline, what I mean is that it took priority over my Portworx CSI driver and set itself as the default. Somehow that disconnected the Array/Blade communication from the OpenShift cluster, which forced us to redeploy and request new licensing for the cluster.
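For reference, a minimal sketch of how the default StorageClass flag can be checked and moved back -- the StorageClass names below are placeholders, not the actual names from this cluster:

    # See which StorageClass is currently marked "(default)".
    oc get storageclass

    # Clear the default flag on the vSphere class (placeholder name).
    oc patch storageclass vsphere-sc \
      -p '{"metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"false"}}}'

    # Mark the Portworx class (placeholder name) as default again.
    oc patch storageclass portworx-sc \
      -p '{"metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'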

ibrassfield avatar Sep 26 '23 12:09 ibrassfield

It is hard to say much without looking at logs and cluster configuration. I would recommend opening a ticket against OpenShift and providing all details, such as a must-gather and the oc adm inspect (https://docs.openshift.com/container-platform/4.13/cli_reference/openshift_cli/administrator-cli-commands.html#oc-adm-inspect) output of both namespaces in which the vSphere and Portworx drivers are deployed.
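For example (the namespace names here are assumptions -- substitute whichever namespaces the two drivers actually run in):

    # Collect a standard OpenShift must-gather.
    oc adm must-gather

    # Inspect the namespaces running the two drivers and attach the output.
    # (Namespace names are assumptions; use your own.)
    oc adm inspect ns/vmware-system-csi --dest-dir=vsphere-inspect
    oc adm inspect ns/portworx --dest-dir=portworx-inspect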

gnufied avatar Sep 29 '23 16:09 gnufied

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jan 29 '24 10:01 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Feb 28 '24 11:02 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Mar 29 '24 11:03 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Mar 29 '24 11:03 k8s-ci-robot