csi-digitalocean

Support for NVMe volumes

Open artem-zinnatullin opened this issue 3 years ago • 9 comments

Hi!

We're looking for an automated way to provision PersistentVolumeClaims against locally mounted NVMe drives on DigitalOcean https://www.digitalocean.com/blog/introducing-storage-optimized-droplets-with-nvme-ssds/

We've tried the local StorageClass (https://kubernetes.io/docs/concepts/storage/storage-classes/#local). It does work, but unlike DO Block Storage in k8s it is not automated at all:

  • We have to manually create PersistentVolumes
  • Each PersistentVolume has to be constrained to a particular node with nodeAffinity
  • Each PersistentVolume has to have its capacity defined manually; however, that capacity does not act as a limit, since NVMe storage is mounted as the root / filesystem on Premium and Storage Optimized Droplets with NVMe
  • Each PersistentVolume must have only one associated PersistentVolumeClaim, otherwise Pods using it will not be scheduled
  • Each new Node added to the cluster has to have PVs and PVCs configured, which defeats the benefit of k8s autoscaling.
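
For readers unfamiliar with the manual setup listed above, here is a minimal sketch of the two objects involved: a StorageClass with no dynamic provisioner, and one local PersistentVolume pinned to a single node via nodeAffinity. It is written as Python that prints JSON manifests (which `kubectl apply -f` accepts). The names `local-storage`, `nvme-pv-node-a`, `node-a`, the path `/mnt/nvme/vol1`, and the `100Gi` capacity are illustrative assumptions, not values from this thread.

```python
import json


def local_storage_class(name="local-storage"):
    """StorageClass for statically created local volumes (no dynamic provisioning)."""
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": name},
        # No provisioner exists for local volumes, so PVs must be created by hand.
        "provisioner": "kubernetes.io/no-provisioner",
        # Delay binding until a Pod is scheduled, so the PV's node is respected.
        "volumeBindingMode": "WaitForFirstConsumer",
    }


def local_pv_manifest(name, node_name, path, capacity, storage_class="local-storage"):
    """PersistentVolume backed by a local path and pinned to exactly one node."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolume",
        "metadata": {"name": name},
        "spec": {
            # Declared capacity; for a plain directory on the root filesystem
            # this is informational and is not enforced as a quota.
            "capacity": {"storage": capacity},
            "volumeMode": "Filesystem",
            "accessModes": ["ReadWriteOnce"],
            "persistentVolumeReclaimPolicy": "Retain",
            "storageClassName": storage_class,
            "local": {"path": path},
            "nodeAffinity": {
                "required": {
                    "nodeSelectorTerms": [
                        {
                            "matchExpressions": [
                                {
                                    "key": "kubernetes.io/hostname",
                                    "operator": "In",
                                    "values": [node_name],
                                }
                            ]
                        }
                    ]
                }
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical node name and path, for illustration only.
    print(json.dumps(local_storage_class(), indent=2))
    print(json.dumps(
        local_pv_manifest("nvme-pv-node-a", "node-a", "/mnt/nvme/vol1", "100Gi"),
        indent=2,
    ))
```

One such PersistentVolume has to be generated per node (and per claim), which is the repetitive step the list above complains about.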

We're looking into CSI implementations like https://github.com/minio/direct-csi; however, the major blocker there is that it only works with additional (non-root) disks, while DigitalOcean Premium Droplets use the NVMe drive as root /.

The question is: can you consider adding support for DigitalOcean NVMe drives to csi-digitalocean please? :)

Thanks!

artem-zinnatullin Jun 22 '21 10:06

Hello,

We are considering adding support for dynamic provisioning of local storage volumes in DOKS, however it likely will not be implemented in this CSI driver.

The significant caveat to using node-local NVMe/SSD storage is that it is indeed node-local - we can't detach it from one node and attach it to another. This means it's really only useful for ephemeral purposes, since we expect nodes to be replaced in the course of normal cluster operations (e.g., due to health or for upgrade).

If you're able to share, I'd be interested to hear more about your use-case for local storage. We can connect over email if you'd rather discuss privately.

Thanks!

cc @bikram20

adamwg Jun 22 '21 15:06

> We are considering adding support for dynamic provisioning of local storage volumes in DOKS

That's great news!

> however it likely will not be implemented in this CSI driver.

Interesting, how would it be exposed and mounted then?

> The significant caveat to using node-local NVMe/SSD storage is that it is indeed node-local - we can't detach it from one node and attach it to another. This means it's really only useful for ephemeral purposes, since we expect nodes to be replaced in the course of normal cluster operations (e.g., due to health or for upgrade).

We do understand this caveat. There are cases where it's fine: we want to run a distributed database and a distributed object store on NVMe storage. Due to performance requirements, we do want to use the NVMe drives that DigitalOcean offers. In our case the applications are distributed, meaning that a Node shutdown for, say, upgrades is fine, since other nodes will act as replicas. This is achieved via nodeAffinity rules in the app deployment, so that pods of these apps are not scheduled onto nodes that already have them running.
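
As an illustration of the spreading described above, here is a minimal sketch of an affinity stanza that keeps two replicas of the same app off the same node. It uses Kubernetes pod anti-affinity, which is an assumption on the editor's part: the comment mentions nodeAffinity rules, and the exact rules used are not shown in the thread. The label key `app` and the value `my-db` are hypothetical.

```python
import json


def pod_anti_affinity(app_label, topology_key="kubernetes.io/hostname"):
    """Affinity stanza that forbids two Pods with the same app label on one node.

    The returned dict belongs under spec.template.spec.affinity of a
    Deployment or StatefulSet.
    """
    return {
        "podAntiAffinity": {
            # Hard requirement: the scheduler refuses to place a Pod on a node
            # (one topology domain per hostname) that already runs a matching Pod.
            "requiredDuringSchedulingIgnoredDuringExecution": [
                {
                    "labelSelector": {"matchLabels": {"app": app_label}},
                    "topologyKey": topology_key,
                }
            ]
        }
    }


if __name__ == "__main__":
    # "my-db" is a hypothetical app label.
    print(json.dumps(pod_anti_affinity("my-db"), indent=2))
```

With a hard rule like this, losing one node takes out at most one replica, but a replica stays unschedulable if no node without a matching Pod is available.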

> If you're able to share, I'd be interested to hear more about your use-case for local storage. We can connect over email if you'd rather discuss privately.

Let's continue publicly in this issue. There is very little public discussion on this topic, so I'd like to use this thread as an opportunity to add more information to the internet on using local NVMe drives with Kubernetes :)

artem-zinnatullin Jun 22 '21 16:06

> > We are considering adding support for dynamic provisioning of local storage volumes in DOKS
>
> That's great news!
>
> > however it likely will not be implemented in this CSI driver.
>
> Interesting, how it'd be exposed and mounted then?

We would add an additional StorageClass with a separate provisioner, potentially leveraging an existing project like the direct-csi driver you linked. There's nothing DO-specific about node-local storage, so no need to add it to the DO CSI driver.

adamwg Jun 22 '21 16:06

Sounds good!

artem-zinnatullin Jun 22 '21 16:06

Submitted a related issue on partitioning NVMe drives for DOKS nodes (https://github.com/digitalocean/DOKS/issues/27). Basically, we can't repartition the NVMe drive right now.

artem-zinnatullin Jun 23 '21 07:06

This sort of provisioning is also useful for running your own database workloads on nodes if you need something with local NVMe performance. Yes, the storage is 'ephemeral', but that is something database management tools like zalando or stolon can take into account, especially when combined with things like pod disruption budgets.

You can implement solutions for that need today by running self-managed k8s clusters alongside a managed one, but the administration workload multiplies accordingly in that case. Managed DOKS, as of 1.20 at least, is almost there with the ability to run your so_1.5_* plan node pools. If you offered a way to allow a node pool to upgrade in place, an operator needing to run a local datastore could run it entirely in managed DOKS.

In my particular use case, I have clients who need to run PostgreSQL services with custom extensions and replication patterns, which disqualifies most managed SQL offerings as well; hence my interest in closing the feature gaps in managing ephemeral storage on cloud instances/droplets.
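
For context on the pod disruption budgets mentioned in this comment, here is a minimal sketch of a PodDisruptionBudget that lets voluntary disruptions (such as node drains during upgrades) take down at most one database Pod at a time. The name `pg-pdb` and the label `app: postgres` are hypothetical; `policy/v1` assumes Kubernetes 1.21 or newer (older clusters used `policy/v1beta1`).

```python
import json


def pod_disruption_budget(name, app_label, max_unavailable=1):
    """PodDisruptionBudget limiting how many matching Pods may be evicted at once."""
    return {
        # policy/v1 is available from Kubernetes 1.21; assumed here.
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": name},
        "spec": {
            # Evictions (e.g. from `kubectl drain`) are refused while they
            # would make more than this many matching Pods unavailable.
            "maxUnavailable": max_unavailable,
            "selector": {"matchLabels": {"app": app_label}},
        },
    }


if __name__ == "__main__":
    # Hypothetical name and label, for illustration only.
    print(json.dumps(pod_disruption_budget("pg-pdb", "postgres"), indent=2))
```

A budget like this only guards against voluntary evictions; it does not protect data on a node that fails or is deleted outright, so replication at the database level is still needed.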

kainz Aug 04 '21 04:08

Hm. Vultr has been doing NVMe for a while as default for their Managed Kubernetes solution. This is a big difference with no additional cost.

kallisti5 Feb 09 '22 16:02

@kallisti5 What kind of workloads are you looking to run on NVMe local storage? Would you be okay with ephemeral nodes? Nodes are recycled during release upgrade.

bikram20 Feb 14 '22 03:02

@bikram20 Overall I'm trying to find a cost-effective way to leverage the standard DO instance sizes.

Running a reliable ReadWriteMany storage model is pretty difficult at DigitalOcean. My solution was Longhorn storage (https://longhorn.io), since it maintains and grooms RWX replicas between all of the Kubernetes nodes directly. It uses the massive amount of otherwise wasted space on each k8s node pool droplet, which saves costs: the 4 vCPU / 8 GiB nodes have over 100 GiB that will go unused for most people using DO's CSI. It also automatically backs up data to S3.

NVMe though would probably be the minimum requirement to maintain replicas within a reasonable timeframe.

DO really needs a managed storage solution that can do RWX like Gluster or NFS.

The workload itself is 300 GiB+ of software packages for Haiku (https://haiku-os.org) plus some other infrastructure.

kallisti5 Feb 14 '22 13:02