gcp-compute-persistent-disk-csi-driver

hyperdisk-balanced topology issues

Open jsafrane opened this issue 1 year ago • 2 comments

hyperdisk-balanced disks are not usable on most (?) VM types. Similarly, regular persistent disks are not usable on N4/C4 VMs. This makes scheduling Pods that use hyperdisk-balanced PVs challenging on clusters with mixed VM types, say N2 and N4.

Are there any guidelines on how to configure the CSI driver and StorageClasses so that PVCs scheduled to N4 VMs use hyperdisk-balanced disks and PVCs scheduled to N2 VMs use standard PDs?

Right now I can imagine putting all N4 machines into a single availability zone and making sure that there is no N2 VM there. I can then create two dedicated StorageClasses:

  1. hyperdisk: with allowedTopologies targeting the availability zone with N4 machines + type: hyperdisk-balanced.
  2. disk: with allowedTopologies targeting all other AZs with type: pd-standard.

The scheduler is then able to choose the right nodes for Pods that use PVs provisioned from these StorageClasses (see the sketch below), but it's quite cumbersome to set up.
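A minimal sketch of the two StorageClasses described above, assuming the GCE PD CSI driver name pd.csi.storage.gke.io and its zone topology key topology.gke.io/zone; the zone names are placeholders for wherever the N4 and N2 machines actually live:

```yaml
# StorageClass for volumes that should land in the zone dedicated to N4 machines.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: hyperdisk
provisioner: pd.csi.storage.gke.io
parameters:
  type: hyperdisk-balanced
volumeBindingMode: WaitForFirstConsumer
allowedTopologies:
- matchLabelExpressions:
  - key: topology.gke.io/zone
    values:
    - us-central1-a            # placeholder: the zone holding only N4 machines
---
# StorageClass for volumes that should land in the remaining (N2) zones.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: disk
provisioner: pd.csi.storage.gke.io
parameters:
  type: pd-standard
volumeBindingMode: WaitForFirstConsumer
allowedTopologies:
- matchLabelExpressions:
  - key: topology.gke.io/zone
    values:
    - us-central1-b            # placeholders: zones without N4 machines
    - us-central1-c
```

With volumeBindingMode: WaitForFirstConsumer, provisioning is delayed until a Pod using the PVC is scheduled, so the chosen zone has to satisfy both the Pod's node constraints and the StorageClass's allowedTopologies.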

It feels like there should be two separate CSI drivers, with separate topologies and attach limits.

jsafrane · Sep 02 '24 13:09

We don't have a great solution for this. We're working on some ideas. The attach limit is a problem for sure. Using separate CSI drivers would fix it, but it starts getting silly in terms of node resource consumption, especially given that we need to reserve space for mount-time operations like fsck and mkfs that can consume a lot of memory for large volumes.
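For context on why a single driver cannot express per-disk-type limits: the kubelet publishes one attach limit per registered driver through the node's CSINode object. A purely illustrative snippet, with placeholder node name, node ID, and count:

```yaml
apiVersion: storage.k8s.io/v1
kind: CSINode
metadata:
  name: example-n4-node                # placeholder node name
spec:
  drivers:
  - name: pd.csi.storage.gke.io
    nodeID: projects/my-project/zones/us-central1-a/instances/example-n4-node   # placeholder
    topologyKeys:
    - topology.gke.io/zone
    allocatable:
      count: 15                        # placeholder: a single limit shared by every disk type the driver serves
```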

mattcary · Sep 03 '24 16:09

The problem is worse. Each hyperdisk type has different supported machine types and volume limits. So you would essentially need one CSI driver per disk type.

One idea we discussed in the past was the ability for a CSI driver to be registered under multiple names. It would require all the sidecars to be able to handle requests for multiple driver names. It would also require the user to explicitly use a different driver name in the StorageClass, which could also complicate things if we wanted to support transparently changing disk types.
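A purely hypothetical sketch of what that could look like for StorageClass authors if multi-name registration existed; the alias hyperdisk.pd.csi.storage.gke.io is invented for illustration and is not a real registered driver name:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: hyperdisk-via-alias
provisioner: hyperdisk.pd.csi.storage.gke.io   # invented alias, not an existing driver name
parameters:
  type: hyperdisk-balanced
volumeBindingMode: WaitForFirstConsumer
```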

msau42 · Sep 04 '24 18:09

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · Dec 03 '24 19:12

Some ideas for the general problem are being discussed in this doc. When those get closer to reality we'll add issues & PRs in this repo.

/close

mattcary · Dec 03 '24 20:12

@mattcary: Closing this issue.

In response to this:

Some ideas for the general problem are being discussed in this doc. When those get closer to reality we'll add issues & PRs in this repo.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot · Dec 03 '24 20:12