CSI support for hypervisor container runtimes
As discussed in the sig-storage meeting, we would like to propose a feature for the CSI spec that lets hypervisor-based container runtimes (e.g. KataContainers, virtlet, KubeVirt) use CSI in the future.
- The aim is to let runtimes like KataContainers bypass the attach phase and go directly to the mount phase. Kata would then mount a block device (UPDATE: and handle other cases as well) into the VM-based pod directly, instead of doing a bind mount, which is much slower in the hypervisor case.
- Currently, we (Mirantis, Hyper, etc.) are using flexvolume as a workaround, e.g. https://github.com/kubernetes/frakti/blob/master/pkg/flexvolume/flexvolume.go However, this approach is not portable and cannot serve general purposes, since it is bound to a specific plugin (e.g. Cinder).
- This feature is also in scope for the Secure Runtime feature in sig-node's Q1 plan (p0). We have already integrated Kata with CRI and CNI, and CSI will help us a lot in integrating Kata with containerd, cri-o, etc. To serve the minimal purpose, only a minor change is expected on the CSI side. Please refer to these slides for details:
https://docs.google.com/presentation/d/1kPeia7wLqoKQI0oX4pvVdH1UpcPx3lpmFK4P_E6oiIc/edit#slide=id.p
The pseudocode for the CSI change is here: https://github.com/bergwolf/spec/tree/detached_volume
We can of course schedule a meeting or talk in the next sync for further discussion; meanwhile, this issue can be used as a feature tracker.
CC: Kata maintainers @bergwolf @sameo @gnawux sig-storage @saad-ali @jingxu97 CSI @jieyu RH: @rootfs Mirantis: @ivan4th
Instead of changing the API fingerprint, adding new APIs makes compatibility easier.
In our use case, we also need Node Reserve/Release Volume to ensure that volumes are used by only one node if they don't support multi-attach. I believe this also helps Kata containers.
cc @fabiand
Please also take into account that delegating the whole volume consumption to the hypervisor runtime is beneficial as well, for instance if we want to let qemu connect directly to the iSCSI target.
Not sure if this issue should serve just the original request, or also the ones from @rootfs and me (all three are different).
@rootfs @fabiand Thanks for bringing it up. And yes, Reserve/Release Volume is useful for Kata containers as well. But I think it should be a Controller API instead, because a node can be disconnected from the CO but still have access to the storage network, in which case the CO needs to call Release Volume before Reserving/Publishing the volume to another node. Also, IMO Reserve Volume can take an owner argument so that the CO can decide who (a node or a VM on the node) shall have exclusive access to the volume. WDYT?
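To make the controller-side Reserve/Release idea concrete, here is a minimal sketch in Go. Everything in it is an assumption for illustration: the `Reservations` type, the `Reserve`/`Release` method names, and the owner strings are hypothetical and are not part of the CSI spec. It only shows the semantics described above: one exclusive owner per volume, where the owner can be a node or a VM on a node, and the release is performed by the CO rather than by the (possibly disconnected) node.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrReserved is returned when a volume is already held by a different owner.
var ErrReserved = errors.New("volume reserved by another owner")

// Reservations is a hypothetical controller-side table mapping volume ID to owner.
// It is NOT a CSI type; it only illustrates the proposed semantics.
type Reservations struct {
	mu     sync.Mutex
	owners map[string]string
}

// NewReservations returns an empty reservation table.
func NewReservations() *Reservations {
	return &Reservations{owners: make(map[string]string)}
}

// Reserve grants owner exclusive access to volumeID.
// The owner may name a node or a VM on a node. Re-reserving by the same owner is a no-op.
func (r *Reservations) Reserve(volumeID, owner string) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if cur, ok := r.owners[volumeID]; ok && cur != owner {
		return fmt.Errorf("%w: %s is held by %s", ErrReserved, volumeID, cur)
	}
	r.owners[volumeID] = owner
	return nil
}

// Release drops the reservation. It is called by the CO, so it still works
// when the previous owner node is unreachable from the CO.
func (r *Reservations) Release(volumeID string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.owners, volumeID)
}

func main() {
	r := NewReservations()
	fmt.Println(r.Reserve("vol-1", "node-a"))      // first owner succeeds
	fmt.Println(r.Reserve("vol-1", "node-b/vm-7")) // rejected while node-a holds it
	r.Release("vol-1")                             // CO releases on behalf of node-a
	fmt.Println(r.Reserve("vol-1", "node-b/vm-7")) // now succeeds
}
```

Keeping the table on the controller side is what lets the CO revoke a reservation without any cooperation from the old node.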
> Instead of changing the API fingerprint, adding new APIs makes compatibility easier.
@rootfs, how about introducing a new NodePublishDetachedVolume() API? It keeps most of the semantics of NodePublishVolume(), except for presenting/mounting the volume at target_path, which would not be included in NodePublishDetachedVolumeRequest.
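A rough sketch of how the two request shapes might differ, written in Go with simplified stand-in structs. These are assumptions for illustration only: the real CSI messages are defined in protobuf and carry more fields, and `NodePublishDetachedVolumeRequest`, `toDetached`, and the validation helpers below are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// NodePublishVolumeRequest is a simplified stand-in for the CSI message.
type NodePublishVolumeRequest struct {
	VolumeID         string
	TargetPath       string // host path where the volume is presented
	Readonly         bool
	VolumeAttributes map[string]string
}

// NodePublishDetachedVolumeRequest is the hypothetical detached variant:
// the same fields, minus TargetPath, since nothing is mounted on the host.
type NodePublishDetachedVolumeRequest struct {
	VolumeID         string
	Readonly         bool
	VolumeAttributes map[string]string
}

// toDetached copies every field except TargetPath.
func toDetached(req NodePublishVolumeRequest) NodePublishDetachedVolumeRequest {
	return NodePublishDetachedVolumeRequest{
		VolumeID:         req.VolumeID,
		Readonly:         req.Readonly,
		VolumeAttributes: req.VolumeAttributes,
	}
}

// validatePublish requires both a volume ID and a host target path.
func validatePublish(req NodePublishVolumeRequest) error {
	if req.VolumeID == "" {
		return errors.New("volume_id is required")
	}
	if req.TargetPath == "" {
		return errors.New("target_path is required")
	}
	return nil
}

// validateDetached requires only a volume ID.
func validateDetached(req NodePublishDetachedVolumeRequest) error {
	if req.VolumeID == "" {
		return errors.New("volume_id is required")
	}
	return nil
}

func main() {
	req := NodePublishVolumeRequest{VolumeID: "vol-1", Readonly: true}
	fmt.Println("host publish:", validatePublish(req))                 // fails: no target_path
	fmt.Println("detached publish:", validateDetached(toDetached(req))) // passes
}
```

A separate message type, rather than an optional target_path on the existing one, keeps the existing call's contract unchanged for plugins that never implement the detached path.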
@bergwolf +1
A couple of questions... and apologies if you explained this on the CSI call yesterday; I had to miss the first half of it, and I'm sure I don't understand the issues faced by VM-based runtimes...
- Is detached mode an optimization for VM-based runtimes, or is it a requirement for them to work at all?
- Is this only for block devices?
For compatibility/extensibility purposes, it may be good to give the mode its own message type, or perhaps make it an enum rather than a bool.
The ability to support the detached mode likely should also be a capability returned by the plugin.
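The enum-plus-capability suggestion could look roughly like the following Go sketch. The names `PublishMode`, `PluginCapabilities`, and `Supports` are hypothetical and not taken from the CSI spec; the point is only that an enum leaves room for further modes later, and that the CO can check what the plugin advertises before choosing one.

```go
package main

import "fmt"

// PublishMode is a hypothetical enum replacing a "detached" bool.
type PublishMode int

const (
	PublishModeUnknown   PublishMode = iota // zero value: mode not specified
	PublishModeHostMount                    // volume is presented at a host path
	PublishModeDetached                     // volume is handed to the runtime, not mounted on the host
)

// String returns a readable name for logging.
func (m PublishMode) String() string {
	switch m {
	case PublishModeHostMount:
		return "HOST_MOUNT"
	case PublishModeDetached:
		return "DETACHED"
	default:
		return "UNKNOWN"
	}
}

// PluginCapabilities is a hypothetical list of modes a plugin advertises.
type PluginCapabilities struct {
	Modes []PublishMode
}

// Supports reports whether the plugin advertised the given mode.
func (c PluginCapabilities) Supports(m PublishMode) bool {
	for _, have := range c.Modes {
		if have == m {
			return true
		}
	}
	return false
}

func main() {
	caps := PluginCapabilities{Modes: []PublishMode{PublishModeHostMount}}
	for _, m := range []PublishMode{PublishModeHostMount, PublishModeDetached} {
		fmt.Printf("%s supported: %v\n", m, caps.Supports(m))
	}
}
```

With this shape, a CO scheduling a VM-based pod can refuse a plugin that does not advertise the detached mode instead of discovering the gap at publish time.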
It is both an optimization (going through QEMU block devices without having to attach the volume to the host) and a requirement (for isolation) for VMs, and it is limited to block devices.
@bergwolf Detached is probably ambiguous here - the volume may never be attached in the first place.
@rootfs @cpuguy83 It is not limited to block devices. We have implemented NFS support, and SMB can be added as well. In theory, any remote storage can be added in detached mode. There is an agent program in Kata containers that can help set up storage directly in the guest.
@rootfs Detached is in contrast with NodePublishVolume(), which IIUC always attaches the volume to the host.
Yep, as @bergwolf says, this could also be relevant for file mode (specifically NFS).
For KubeVirt, however, we are primarily interested in delegating the block storage attach to qemu (not file).
Will address post v1. Related issue by @cpuguy83 about letting CO control mount -- should align these designs
@saad-ali That's great. Has the issue been sent out?
@resouer I believe that Saad was referring to #96
CC @xing-yang @jingxu97
Any progress on this issue? Would love CSI support for hypervisor runtimes like Kata.