
GPU Support

Open andhartl opened this issue 4 years ago • 2 comments

We get a lot of questions on GPU support from AI/ML users and farmers. So the demand is definitely there. We had the GPU support on the roadmap a while ago but I do not know where we are at right now. Can we open a discussion about it?

andhartl avatar Jun 04 '21 15:06 andhartl

I think we can start working on this now. I know it has been in the backlog for a long time. I will move it to the current active project.

Questions I need to research:

  • [ ] Working with GPUs under cloud-hypervisor
  • [ ] Can a node have multiple GPUs?
  • [ ] Tracking the node's GPU(s) and whether each one is free to be allocated to a VM. This probably needs to be added to the node contract.
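
As a rough illustration of the tracking question, here is a minimal in-memory sketch of "which GPU is free, which is held by a VM". Everything in it (the `GpuInventory` class, keying GPUs by PCI address, the VM ids) is hypothetical and not taken from zos or the node contract; it only shows the bookkeeping that would be needed.

```python
class GpuInventory:
    """Hypothetical tracker: maps each GPU (by PCI address) to its owning VM."""

    def __init__(self, pci_addrs):
        # None means "not allocated to any VM".
        self._owner = {addr: None for addr in pci_addrs}

    def free(self):
        """Return the PCI addresses of GPUs that no VM currently holds."""
        return sorted(a for a, owner in self._owner.items() if owner is None)

    def allocate(self, pci_addr, vm_id):
        """Reserve a GPU for a VM, refusing unknown or already-taken GPUs."""
        if pci_addr not in self._owner:
            raise KeyError(f"unknown GPU {pci_addr}")
        if self._owner[pci_addr] is not None:
            raise RuntimeError(f"GPU {pci_addr} is already allocated")
        self._owner[pci_addr] = vm_id

    def release(self, pci_addr):
        """Mark a GPU as free again, e.g. when its VM is destroyed."""
        if pci_addr not in self._owner:
            raise KeyError(f"unknown GPU {pci_addr}")
        self._owner[pci_addr] = None
```

A node with several GPUs is just a longer list passed to the constructor; persisting this state and exposing it through the node contract is the open part.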

muhamadazmy avatar Feb 25 '22 15:02 muhamadazmy

GPUs can be attached to a VM using cloud-hypervisor by unbinding the GPU from its driver and then binding it to the vfio driver, as described here.
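
The unbind/bind sequence can be sketched through the standard Linux sysfs PCI interface (`driver/unbind`, `driver_override`, `drivers_probe`). This is a minimal sketch, not the code used in this experiment: the function names and the `sysfs_root` parameter (there so the logic can be dry-run against a fake tree) are assumptions, and the PCI address in the usage note is a placeholder.

```python
from pathlib import Path


def vfio_rebind_steps(pci_addr, sysfs_root="/sys"):
    """Return the (path, value) writes that hand a PCI device to vfio-pci."""
    pci = Path(sysfs_root) / "bus" / "pci"
    dev = pci / "devices" / pci_addr
    return [
        # 1. Detach the device from the driver that currently owns it.
        (dev / "driver" / "unbind", pci_addr),
        # 2. Tell the kernel that only vfio-pci may claim this device.
        (dev / "driver_override", "vfio-pci"),
        # 3. Ask the PCI core to re-probe the device so vfio-pci binds to it.
        (pci / "drivers_probe", pci_addr),
    ]


def apply_steps(steps):
    """Perform the writes. On a real system this needs root and vfio-pci loaded."""
    for path, value in steps:
        if not path.exists():
            # E.g. no "driver" link because the device is not bound to anything.
            continue
        path.write_text(value)
```

For example, `apply_steps(vfio_rebind_steps("0000:01:00.0"))` would attempt the rebind for a device at that (placeholder) address. The same steps have to be repeated for every device in the GPU's IOMMU group before the GPU can be passed to the VM.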

I have an NVIDIA GPU and couldn't do the dynamic unbinding/binding while the machine was running. So instead I gave vfio control over the GPU (and other neighboring devices) through kernel params, as described here. I imagine this won't be necessary on the node, since the GPU shouldn't be bound to any driver there, but I haven't gotten to this yet.
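
For reference, the kernel-parameter route usually means passing the devices' `vendor:device` IDs to `vfio-pci` via its `ids` option. Below is a small sketch that builds such a command-line fragment. The helper name is hypothetical, the IDs in the usage note are placeholders rather than the ones used here, and the default `intel_iommu=on` flag assumes an Intel platform.

```python
import re

_PCI_ID = re.compile(r"[0-9a-fA-F]{4}:[0-9a-fA-F]{4}")


def vfio_kernel_params(device_ids, iommu_flag="intel_iommu=on"):
    """Build a kernel command-line fragment that reserves devices for vfio-pci.

    device_ids: "vendor:device" hex pairs, one per device in the IOMMU group.
    iommu_flag: platform-specific IOMMU switch (assumption: Intel by default).
    """
    for dev_id in device_ids:
        if not _PCI_ID.fullmatch(dev_id):
            raise ValueError(f"not a vendor:device id: {dev_id!r}")
    return f"{iommu_flag} vfio-pci.ids={','.join(device_ids)}"
```

Calling `vfio_kernel_params(["10de:1c03", "10de:10f1"])` yields `intel_iommu=on vfio-pci.ids=10de:1c03,10de:10f1`, which would be appended to the boot loader's kernel command line.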

The part about "neighboring devices" is that the GPU belongs to an "IOMMU group", and the VM has to control all devices belonging to that group; in my case these were an audio device and a USB device. It's possible to bypass this, but it comes with risks (I haven't read up on them yet).
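
The members of a device's IOMMU group can be read from sysfs, where each PCI device has an `iommu_group/devices` directory listing every device in the same group. A minimal sketch follows; the function name and the `sysfs_root` parameter are assumptions added so it can be tried against a fake tree.

```python
import os


def iommu_group_members(pci_addr, sysfs_root="/sys"):
    """List the PCI addresses sharing an IOMMU group with pci_addr (itself included)."""
    group_dir = os.path.join(
        sysfs_root, "bus", "pci", "devices", pci_addr, "iommu_group", "devices"
    )
    # Each entry is named after the PCI address of one device in the group.
    return sorted(os.listdir(group_dir))
```

On a machine like the one described, this would return the GPU plus its audio and USB functions, which is the full set that has to be handed to vfio.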

The GPU appears successfully in the VM, but a driver must then be installed to use it. AFAIK, the kernel we use doesn't allow dynamic module loading, so that must be enabled (or the driver should be pre-installed(?), but that looks like a complicated solution).

All of this was tried on my machine, not on a node. I think the node's kernel must be updated to include vfio support.
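
Whether a given kernel has module loading and vfio enabled can be checked from its build configuration. The sketch below parses kernel config text and reports the relevant options. The option names (`CONFIG_MODULES`, `CONFIG_VFIO`, `CONFIG_VFIO_PCI`, `CONFIG_VFIO_IOMMU_TYPE1`) are standard kernel symbols, but the helper itself is hypothetical, and where the config text comes from (for instance `/proc/config.gz` or a `/boot/config-*` file) depends on how the kernel was built.

```python
def kernel_config_flags(
    config_text,
    wanted=("CONFIG_MODULES", "CONFIG_VFIO", "CONFIG_VFIO_PCI", "CONFIG_VFIO_IOMMU_TYPE1"),
):
    """Map each wanted option to "y" (built in), "m" (module) or "n" (absent)."""
    values = {}
    for line in config_text.splitlines():
        line = line.strip()
        # Disabled options appear as comments: "# CONFIG_FOO is not set".
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        values[key] = value
    return {name: values.get(name, "n") for name in wanted}
```

A result of `"n"` for `CONFIG_MODULES` would confirm that drivers cannot be loaded dynamically on that kernel, and `"n"` for the vfio options would confirm that the kernel needs to be rebuilt before passthrough can work.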

TLDR:

Done:

  • attaching the GPU to the VM through cloud-hypervisor on a normal kernel

Next:

  • Updating the node's kernel with vfio support and trying this on it instead
  • Looking into how the GPU driver can be used inside the zmachine (by updating the kernel to allow dynamic module loading)

Notes:

  • The GPU is accompanied by neighboring devices that won't be known until runtime, which might pose a security problem (we don't want a zmachine owner to control a USB device of the node).
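
One possible guard for this concern is to inspect the PCI class of every device in the GPU's IOMMU group before passing the group through, and flag anything that is not a display or multimedia device. This is only a sketch under assumptions: the allow-list policy and the function name are hypothetical, while the sysfs `class` file and the PCI base-class codes (0x03 display, 0x04 multimedia, 0x0c serial bus, which includes USB) are standard.

```python
import os

# Assumed policy: only these PCI base classes may accompany a GPU into a VM.
ALLOWED_BASE_CLASSES = {0x03: "display controller", 0x04: "multimedia device"}


def risky_group_members(pci_addr, sysfs_root="/sys"):
    """Return (address, class code) for group members outside the allow-list."""
    devices = os.path.join(sysfs_root, "bus", "pci", "devices")
    group_dir = os.path.join(devices, pci_addr, "iommu_group", "devices")
    risky = []
    for member in sorted(os.listdir(group_dir)):
        # The class file holds a 24-bit hex value such as "0x0c0330".
        with open(os.path.join(devices, member, "class")) as f:
            class_code = int(f.read().strip(), 16)
        base_class = class_code >> 16  # top byte identifies the device family
        if base_class not in ALLOWED_BASE_CLASSES:
            risky.append((member, f"0x{class_code:06x}"))
    return risky
```

A non-empty result (for example a USB controller with class `0x0c0330`) would be a reason to refuse the passthrough or to require an explicit decision from the farmer.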

OmarElawady avatar Mar 17 '22 09:03 OmarElawady