csi-driver icon indicating copy to clipboard operation
csi-driver copied to clipboard

Support Zone-Aware Topologies in the KubeVirt CSI Driver

Open moadqassem opened this issue 10 months ago • 23 comments

What this PR does / why we need it: This pull request introduces support for zone-aware topology in the KubeVirt CSI driver. This enhancement allows the CSI driver to handle storage provisioning and attachment in multi-zone Kubernetes clusters. Integrates with Kubernetes topology-aware volume provisioners to ensure volumes are created in the same zone/region as the requesting node by:

  • Implements support for TopologyKeys to enable zone/region-specific volume scheduling and attachment
  • Modifies the Controller plugin to handle topology constraints during volume provisioning
  • Updates to the Node plugin to include zone information in NodeGetInfo responses

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged): Fixes #

Special notes for your reviewer:

Release note:

Support zone and region-aware topologies in the KubeVirt CSI Driver

moadqassem avatar Jan 14 '25 14:01 moadqassem

Hi @moadqassem. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

kubevirt-bot avatar Jan 14 '25 14:01 kubevirt-bot

/test all

awels avatar Jan 14 '25 23:01 awels

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Once this PR has been reviewed and has the lgtm label, please assign mhenriks for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

kubevirt-bot avatar Mar 30 '25 16:03 kubevirt-bot

@awels I have fixed some tests files that were failing and update the PR to resolve few conflicts. However, there were some other issues which I believe are not relevant to my PR:

I0114 23:43:26.714262   37496 create-pvc_test.go:608] Event: AttachVolume.Attach failed for volume "tenant-pv" : rpc error: code = Unknown desc = invalid volume name [FailedAttachVolume]

And

 failed to provision volume with StorageClass "kubevirt": rpc error: code = InvalidArgument desc = non-block volume with RWX access mode is not supported

I also took a quick look into the testing process and In order to add an e2e k8s test to make sure that the zone and region are respected, I must change few things in the kubevirtci cluster provider(or add a new provider, haven't looked deep so not sure). This is where it needs to be adjusted to add labels that points the allowed topology: https://github.com/kubevirt/kubevirtci/blob/a291de27bb596074c79729ea6f88533555f523fd/cluster-up/cluster/ephemeral-provider-common.sh#L90

Let me know what do you think?

moadqassem avatar Mar 30 '25 18:03 moadqassem

/test all

awels avatar Apr 02 '25 12:04 awels

Sorry been very busy with other stuff, I will try to take a look at this soon.

awels avatar Apr 02 '25 12:04 awels

If you look at the testing, we do actually exclude a few tests from the k8s test suite. In particular the RWX filesystem one, since we don't support that.

awels avatar Apr 02 '25 12:04 awels

If you look at the testing, we do actually exclude a few tests from the k8s test suite. In particular the RWX filesystem one, since we don't support that.

I see. Alright let me check this filtering criteria.

moadqassem avatar Apr 02 '25 12:04 moadqassem

https://github.com/kubevirt/csi-driver/blob/main/hack/run-k8s-e2e.sh#L117-L128 is what we use, I believe the last skip is what skips the RWX filesystem test.

awels avatar Apr 02 '25 12:04 awels

Okay so I think I know what happened here, we made a change in how we identify which VM the volume is hotplugged into. Before we used the VMI, but that proved to be problematic, now we look at the VM directly. I am not sure if that affects how you look up VMs, it likely does.

awels avatar Apr 03 '25 16:04 awels

Okay so I think I know what happened here, we made a change in how we identify which VM the volume is hotplugged into. Before we used the VMI, but that proved to be problematic, now we look at the VM directly. I am not sure if that affects how you look up VMs, it likely does.

hmm I don't think that I am listing or trying to get any VM as the code that I wrote is running on the DS which is in general deals with Node reference rather than the VM itself. The problem is, those tests are failing and I can't really check out what was the issue as you probably know the error traces(such as log messages) might come from other controllers which are not tracked. Also I noticed other PRs that have the exact failures. Do you know how we should proceed here ?

moadqassem avatar May 13 '25 10:05 moadqassem

/test all

awels avatar May 13 '25 12:05 awels

@awels After taking a deeper look, seems that these lines of code here broke the VMs lookups to fetch VMs info from the CSINode object https://github.com/kubevirt/csi-driver/pull/130/files#diff-6cb93ab3a359a5bd44c54dbb81ac13c78f2051b868087906408564274b5e94bbR337-R349. These new additions, changed how we fetched those VMs where the volume attachments should be happening in regards how the CSINode node id is being processed.

moadqassem avatar May 13 '25 15:05 moadqassem

@awels can you please rerun the ci tests again. I have tested my recent changes and I was able to fix the problem,

moadqassem avatar May 13 '25 15:05 moadqassem

/test all

awels avatar May 13 '25 15:05 awels

/test pull-csi-driver-split-k8s-suite-k8s

awels avatar May 13 '25 17:05 awels

PR looks good, one thing I forgot to ask about, did you enabled the topology tests in the k8s test suite. You can modify https://github.com/kubevirt/csi-driver/blob/main/hack/test-driver.yaml and set topology: true and that should be all that is needed to run the topology tests.

awels avatar May 13 '25 20:05 awels

/test all

awels avatar May 13 '25 23:05 awels

/test all

awels avatar May 14 '25 14:05 awels

@moadqassem: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-csi-driver-e2e-k8s 4b2711d22d9b43ce2feab12eb5bec82e8ed6c434 link true /test pull-csi-driver-e2e-k8s
pull-csi-driver-split-e2e-k8s 4b2711d22d9b43ce2feab12eb5bec82e8ed6c434 link true /test pull-csi-driver-split-e2e-k8s
pull-csi-driver-split-k8s-suite-k8s 4b2711d22d9b43ce2feab12eb5bec82e8ed6c434 link true /test pull-csi-driver-split-k8s-suite-k8s

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

kubevirt-bot avatar May 14 '25 16:05 kubevirt-bot

Looks like we are having an issue with the tests when we enable topology here. I still have not had a change to look at why.

awels avatar Jun 05 '25 13:06 awels

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

kubevirt-bot avatar Sep 03 '25 13:09 kubevirt-bot

/remove-lifecycle stale

awels avatar Sep 03 '25 14:09 awels

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

kubevirt-bot avatar Dec 02 '25 14:12 kubevirt-bot

/remove-lifecycle stale

awels avatar Dec 02 '25 14:12 awels