nvmeof: QoS Support for NVMe-oF CSI Driver

Open gadididi opened this issue 2 months ago • 7 comments

QoS Support for NVMe-oF CSI Driver

Overview

Add QoS (Quality of Service) support for NVMe-oF namespaces, allowing users to control IOPS and bandwidth limits both at volume creation and during runtime.

Proposed Implementation

1. QoS at Volume Creation (StorageClass)

Set initial QoS limits via StorageClass parameters. If omitted, namespaces remain unlimited.

Example StorageClass:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-nvmeof-standard
provisioner: nvmeof.csi.ceph.com
parameters:
  pool: mypool
  # Optional QoS parameters
  qosRwIopsPerSecond: "10000"
  qosRwMegabytesPerSecond: "100"
  qosReadMegabytesPerSecond: "150"
  qosWriteMegabytesPerSecond: "50"

Implementation: Modify ControllerCreateVolume() to (see the sketch after this list):

  • Parse QoS parameters from StorageClass
  • Call ns set_qos API after namespace creation (if parameters exist)
  • Handle missing parameters gracefully (no QoS = unlimited)
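
A rough Go sketch of that parameter handling; the qosSpec type, the parseQoSParameters() helper and the package layout are hypothetical names used for illustration, not existing ceph-csi code:

package nvmeof

import (
	"fmt"
	"strconv"
)

// qosSpec holds the static limits that are passed straight through to the
// gateway; 0 means unlimited for every field.
type qosSpec struct {
	RwIopsPerSecond         uint64
	RwMegabytesPerSecond    uint64
	ReadMegabytesPerSecond  uint64
	WriteMegabytesPerSecond uint64
	set                     bool // true if at least one key was provided
}

// parseQoSParameters reads the optional QoS keys from the StorageClass (or
// VolumeAttributesClass) parameters. A nil result means no QoS was requested
// and the namespace stays unlimited.
func parseQoSParameters(params map[string]string) (*qosSpec, error) {
	q := &qosSpec{}
	for key, dst := range map[string]*uint64{
		"qosRwIopsPerSecond":         &q.RwIopsPerSecond,
		"qosRwMegabytesPerSecond":    &q.RwMegabytesPerSecond,
		"qosReadMegabytesPerSecond":  &q.ReadMegabytesPerSecond,
		"qosWriteMegabytesPerSecond": &q.WriteMegabytesPerSecond,
	} {
		val, ok := params[key]
		if !ok {
			continue // key omitted, leave this limit unset (unlimited)
		}
		n, err := strconv.ParseUint(val, 10, 64)
		if err != nil {
			return nil, fmt.Errorf("invalid value %q for %s: %w", val, key, err)
		}
		*dst = n
		q.set = true
	}
	if !q.set {
		return nil, nil // no QoS parameters at all, skip the ns set_qos call
	}
	return q, nil
}

ControllerCreateVolume() would call this on the StorageClass parameters and, when the result is non-nil, issue the ns set_qos call right after the namespace is created.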

2. Runtime QoS Modification (VolumeAttributesClass)

Enable QoS changes on existing volumes without recreation using CSI ControllerModifyVolume().

Example VolumeAttributesClass:

apiVersion: storage.k8s.io/v1beta1
kind: VolumeAttributesClass
metadata:
  name: high-performance
driverName: nvmeof.csi.ceph.com
parameters:
  qosRwIopsPerSecond: "50000"
  qosRwMegabytesPerSecond: "500"

To apply the new QoS, bind the PVC to the VolumeAttributesClass. Usage:

# Apply QoS to existing PVC
kubectl patch pvc my-pvc -p '{"spec":{"volumeAttributesClassName":"high-performance"}}'

Implementation: Add a ControllerModifyVolume() RPC to (see the sketch after this list):

  • Parse QoS parameters from VolumeAttributesClass
  • Call ns set_qos via gRPC to the gateway
  • Deploy csi-resizer sidecar to monitor VAC changes
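
A minimal sketch of the new RPC, assuming the hypothetical parseQoSParameters() helper from the CreateVolume sketch above and a hypothetical gwClient wrapper around the gateway gRPC API; this is illustrative only, not ceph-csi's actual controller code:

package nvmeof

import (
	"context"

	"github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// ControllerServer stands in for the driver's controller service; gwClient is
// an assumed wrapper around the gateway's namespace gRPC API.
type ControllerServer struct {
	gwClient interface {
		SetNamespaceQoS(ctx context.Context, volumeID string, qos *qosSpec) error
	}
}

func (cs *ControllerServer) ControllerModifyVolume(
	ctx context.Context,
	req *csi.ControllerModifyVolumeRequest,
) (*csi.ControllerModifyVolumeResponse, error) {
	volumeID := req.GetVolumeId()
	if volumeID == "" {
		return nil, status.Error(codes.InvalidArgument, "volume ID missing in request")
	}

	// The mutable parameters come from the VolumeAttributesClass referenced by
	// the PVC and use the same keys as the StorageClass QoS parameters.
	qos, err := parseQoSParameters(req.GetMutableParameters())
	if err != nil {
		return nil, status.Error(codes.InvalidArgument, err.Error())
	}
	if qos == nil {
		// Nothing to change; explicit "0" values would reset limits instead.
		return &csi.ControllerModifyVolumeResponse{}, nil
	}

	// Resolve the namespace behind volumeID and push the new limits to the
	// NVMe-oF gateway (ns set_qos) over gRPC.
	if err := cs.gwClient.SetNamespaceQoS(ctx, volumeID, qos); err != nil {
		return nil, status.Error(codes.Internal, err.Error())
	}

	return &csi.ControllerModifyVolumeResponse{}, nil
}

The csi-resizer sidecar is what invokes this RPC when a PVC's volumeAttributesClassName changes.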

Supported QoS Parameters

All parameters map directly to the NVMe-oF gateway ns set_qos command (a mapping sketch follows the list):

  • qosRwIopsPerSecond - R/W IOPS limit (0 = unlimited)
  • qosRwMegabytesPerSecond - R/W bandwidth limit
  • qosReadMegabytesPerSecond - Read bandwidth limit
  • qosWriteMegabytesPerSecond - Write bandwidth limit
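
For reference, a sketch of the intended one-to-one key mapping onto the gateway options (flag names from the gateway's ns set_qos --help output); values are passed through unchanged:

package nvmeof

// qosKeyToGatewayFlag documents how each StorageClass/VAC key translates to an
// ns set_qos option; no scaling or calculation is applied.
var qosKeyToGatewayFlag = map[string]string{
	"qosRwIopsPerSecond":         "--rw-ios-per-second",
	"qosRwMegabytesPerSecond":    "--rw-megabytes-per-second",
	"qosReadMegabytesPerSecond":  "--r-megabytes-per-second",
	"qosWriteMegabytesPerSecond": "--w-megabytes-per-second",
}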

Requirements

  • Kubernetes 1.29+ (for VolumeAttributesClass support)
  • CSI spec 1.9.0+ (for ControllerModifyVolume)
  • MODIFY_VOLUME controller capability (see the sketch after this list)
  • csi-resizer sidecar container
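
A minimal sketch of advertising that capability via the CSI Go bindings (the surrounding ControllerGetCapabilities() wiring is simplified and not ceph-csi's actual helper code):

package nvmeof

import "github.com/container-storage-interface/spec/lib/go/csi"

// modifyVolumeCapability returns the capability entry that gets appended to the
// controller's advertised capability list, signalling that the driver
// implements ControllerModifyVolume().
func modifyVolumeCapability() *csi.ControllerServiceCapability {
	return &csi.ControllerServiceCapability{
		Type: &csi.ControllerServiceCapability_Rpc{
			Rpc: &csi.ControllerServiceCapability_RPC{
				Type: csi.ControllerServiceCapability_RPC_MODIFY_VOLUME,
			},
		},
	}
}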

gadididi avatar Oct 21 '25 09:10 gadididi

@gadididi Are the NVMe-oF QoS keys similar to the nbd QoS keys? We already have QoS for nbd: https://github.com/ceph/ceph-csi/blob/72c09d3d8758d058575d34b2da4b09eb0a591f8f/examples/rbd/storageclass.yaml#L168-L217. Can we use the same or similar keys, so that we can reuse a lot of the internal functions? It would also be easier for users to have the same keys in the SC.

Madhu-1 avatar Oct 21 '25 09:10 Madhu-1

@Madhu-1 Hi, sure, sharing common code where we can would be good. The current parameters for NVMe-oF QoS are:

Set QOS limits for a namespace

optional arguments:
  -h, --help            show this help message and exit
  --subsystem SUBSYSTEM, -n SUBSYSTEM
                        Subsystem NQN
  --nsid NSID           Namespace ID
  --rw-ios-per-second RW_IOS_PER_SECOND
                        R/W IOs per second limit, 0 means unlimited
  --rw-megabytes-per-second RW_MEGABYTES_PER_SECOND
                        R/W megabytes per second limit, 0 means unlimited
  --r-megabytes-per-second R_MEGABYTES_PER_SECOND
                        Read megabytes per second limit, 0 means unlimited
  --w-megabytes-per-second W_MEGABYTES_PER_SECOND
                        Write megabytes per second limit, 0 means unlimited
  --force               Set QOS limits even if they were changed by RBD

There is a parameter named force. I need to check the consequences of using the same keys in the SC and will let you know. There is also a requirement to modify the volume (= NVMe-oF namespace) "on the fly", so do you think ControllerModifyVolume() is the proper solution for that?

gadididi avatar Oct 22 '25 08:10 gadididi

so do you think ControllerModifyVolume() is proper solution for it?

Yes, that's correct. We can support changing the QoS without any remount or other node operations; it's the way to go.

Madhu-1 avatar Oct 22 '25 08:10 Madhu-1

@Madhu-1 Hi!!,

Response: Should we reuse RBD QoS keys for NVMe-oF?

After looking into this more deeply, I don't think we should reuse the RBD QoS keys for NVMe-oF. Here's why:

The RBD QoS parameters like baseIops, maxIops, and iopsPerGiB are designed for a capacity-based calculation model where the QoS limits scale dynamically with the volume size. This makes sense for RBD because the QoS is applied at the image level in the storage backend.

NVMe-oF gateway QoS works completely differently. It's applied at the network/SPDK layer and just takes static absolute values - there's no calculation or scaling involved. You just tell it "limit this to 10000 IOPS" and that's what it does, regardless of volume size.

More importantly, these two QoS mechanisms don't actually work well together. If an RBD image already has QoS configured, the NVMe-oF gateway QoS won't do anything unless you use the --force flag, which isn't recommended. They're operating at different layers and can conflict with each other.

So for NVMe-oF, I think we should use simple, descriptive parameters like nvmeofRwIopsPerSecond and nvmeofRwMegabytesPerSecond. The implementation is straightforward - we just pass these values directly to the gateway API (via gRPC) without any calculation. Different keys will also make it clear to users that this is a different QoS mechanism.

What do you think?

gadididi avatar Oct 23 '25 09:10 gadididi

@gadididi Thanks for the detailed explanation. Makes sense; as both are completely different implementations and use different keys, we can have different keys in the SC for this driver.

Madhu-1 avatar Oct 23 '25 11:10 Madhu-1

A few more things to do once #5614 is merged:

  • add example yaml files for a VolumeAttributeClass with QoS limits and one that removes the limits
  • add e2e testing, depends on #5641
  • document the feature, and its dependency on Kubernetes 1.34 (and maybe kubernetes-csi/external-provisioner#1440 and kubernetes-csi/external-resizer#544)

nixpanic avatar Nov 14 '25 16:11 nixpanic

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Dec 14 '25 21:12 github-actions[bot]

This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.

github-actions[bot] avatar Dec 22 '25 21:12 github-actions[bot]