WSL2 Support

Open mchikyt3 opened this issue 3 years ago • 10 comments

Hi, I wonder if it's possible to use the gpu-operator in a single-node MicroK8s cluster hosted on a WSL2 Ubuntu distribution. Thanks.

mchikyt3 avatar Feb 02 '22 10:02 mchikyt3

@elezar to comment if this is supported by our container-toolkit.

shivamerla avatar Mar 25 '22 19:03 shivamerla

Hi @mchikyt3, the combination you mention is untested by us, so I cannot provide a concrete answer.

The NVIDIA Container Toolkit, which ensures that a launched container includes the devices and libraries required to use GPUs, does offer some support for WSL2. Note, however, that there may be some use cases that do not work as expected.

Also note that I am not sure whether the other operands, such as GPU Feature Discovery or the NVIDIA Device Plugin, will function as expected.

elezar avatar Mar 28 '22 04:03 elezar

It appears that GPU Feature Discovery does not work properly. @elezar, are there any plans to address this? I have no problems running CUDA code inside containers on WSL2 with Docker or Podman, but it doesn't work with several Kubernetes distributions I have tried. I posted several logs from my laptop in this MicroK8s thread and would be grateful if someone could help me solve this issue.

Maybe the problem could be solved by creating a couple of symlinks.

valxv avatar Jul 07 '23 10:07 valxv

Can someone fix this?

(combined from similar events): Error: failed to generate container "0520a1a018b798ce299be6171c3daa405d549219457b6c1e42cb1774b1b92e9e" spec: failed to generate spec: path "/" is mounted on "/" but it is not a shared or slave mount

https://github.com/NVIDIA/gpu-operator/blob/2f0a16684157a9171939702a8b5322363c6d93e9/assets/state-container-toolkit/0500_daemonset.yaml#L110-L112

This is not working in WSL2; I confirmed this on k0s.

EDIT: fixed, just run mount --make-rshared /
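
If anyone wants to check before applying the fix, something like this should show the propagation flags of the root mount (a sketch, assuming util-linux's findmnt is available in the WSL2 distro):

$ findmnt -o TARGET,PROPAGATION /
# if PROPAGATION shows "private", make it recursively shared so mounts can
# propagate between the host and the toolkit container; this typically needs
# to be re-applied after a WSL restart
$ sudo mount --make-rshared /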

wizpresso-steve-cy-fan avatar Jul 25 '23 06:07 wizpresso-steve-cy-fan

Also make sure you set the following labels on the specific WSL2 node to trick the GPU Operator:

    feature.node.kubernetes.io/pci-10de.present: 'true'
    nvidia.com/device-plugin.config: RTX-4070-Ti # needed because GFD is not available
    nvidia.com/gpu.count: '1'
    nvidia.com/gpu.deploy.container-toolkit: 'true'
    nvidia.com/gpu.deploy.dcgm: 'true' # optional
    nvidia.com/gpu.deploy.dcgm-exporter: 'true' # optional
    nvidia.com/gpu.deploy.device-plugin: 'true' 
    nvidia.com/gpu.deploy.driver: 'false' # need special treatments
    nvidia.com/gpu.deploy.gpu-feature-discovery: 'false' # incompatible with WSL2
    nvidia.com/gpu.deploy.node-status-exporter: 'false' # optional
    nvidia.com/gpu.deploy.operator-validator: 'true'
    nvidia.com/gpu.present: 'true'
    nvidia.com/gpu.replicas: '16'

You can either auto-insert those labels if you use k0sctl or add them manually once the node is onboarded.
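
For reference, adding a few of them manually with kubectl would look something like this (the node name wsl-node-1 is just a placeholder):

$ kubectl label node wsl-node-1 \
    feature.node.kubernetes.io/pci-10de.present=true \
    nvidia.com/gpu.present=true \
    nvidia.com/gpu.deploy.driver=false \
    nvidia.com/gpu.deploy.gpu-feature-discovery=false \
    --overwrite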

The driver and container-toolkit are technically optional, as WSL2 already installs all the prerequisites... but we still need to trick the system by creating the following files:

$ touch /run/nvidia/validations/host-driver-ready
$ touch /run/nvidia/validations/toolkit-ready # if you skipped validator
$ touch /run/nvidia/validations/cuda-ready # if you skipped validator
$ touch /run/nvidia/validations/plugin-ready # if you skipped validator

This effectively bypasses the GPU Operator's checks, after which the operator finally registers the node as compatible with the nvidia runtime and schedules workloads on it. You can use a DaemonSet to create these files automatically.
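
Something along these lines should do it (an untested sketch; the DaemonSet name and namespace are placeholders, and it assumes the nvidia.com/gpu.present label and taint from above):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: wsl2-validation-bypass # hypothetical name
  namespace: gpu-operator # assumes the operator's namespace
spec:
  selector:
    matchLabels:
      app: wsl2-validation-bypass
  template:
    metadata:
      labels:
        app: wsl2-validation-bypass
    spec:
      nodeSelector:
        nvidia.com/gpu.present: "true"
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      containers:
      - name: touch-validations
        image: busybox:1.36
        command: ["sh", "-c"]
        args:
        - |
          # create the marker files the operator validator would normally create
          touch /run/nvidia/validations/host-driver-ready \
                /run/nvidia/validations/toolkit-ready \
                /run/nvidia/validations/cuda-ready \
                /run/nvidia/validations/plugin-ready
          sleep 2147483647
        volumeMounts:
        - name: validations
          mountPath: /run/nvidia/validations
      volumes:
      - name: validations
        hostPath:
          path: /run/nvidia/validations
          type: DirectoryOrCreate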

Note also that if you have preinstalled drivers, you don't need to touch these files at all, but then you need to figure out another way to pass the validation checks.

I'm using k0s in my company's local cluster under WSL2, but this should apply to all k8s distributions that run under WSL2.

By the way, these are the Helm values for the GPU Operator that should work on k0s:

cdi:
  enabled: false
daemonsets:
  tolerations:
  - effect: NoSchedule
    key: nvidia.com/gpu
    operator: Exists
  - effect: NoSchedule
    key: k8s.wizpresso.com/wsl-node
    operator: Exists
devicePlugin:
  config:
    name: time-slicing-config
driver:
  enabled: true
operator:
  defaultRuntime: containerd
toolkit:
  enabled: true
  env:
  - name: CONTAINERD_CONFIG
    value: /etc/k0s/containerd.d/nvidia.toml
  - name: CONTAINERD_SOCKET
    value: /run/k0s/containerd.sock
  - name: CONTAINERD_RUNTIME_CLASS
    value: nvidia
  - name: CONTAINERD_SET_AS_DEFAULT
    value: "false"

wizpresso-steve-cy-fan avatar Oct 06 '23 09:10 wizpresso-steve-cy-fan

@wizpresso-steve-cy-fan, you wouldn't happen to have an install doc or script you could share for getting k0s set up on WSL2, by any chance?

AntonOfTheWoods avatar Nov 05 '23 06:11 AntonOfTheWoods

@AntonOfTheWoods let me push the changes to GitLab first https://gitlab.com/nvidia/kubernetes/gpu-operator/-/merge_requests/881 https://gitlab.com/nvidia/container-toolkit/container-toolkit/-/merge_requests/481

wizpresso-steve-cy-fan avatar Nov 06 '23 03:11 wizpresso-steve-cy-fan

@AntonOfTheWoods see comment for full instructions on how to make this work locally with:

  • Windows 11
  • WSL2
  • Docker cgroup v2
  • Nvidia GPU operator
  • Kubeflow

on kind or qbo Kubernetes

alexeadem avatar Jan 30 '24 20:01 alexeadem

@alexeadem thanks so much for this! I was dragging my feet on creating images for wizpresso-steve-cy-fan's PRs, so this saved me some time.

I am curious to know whether you've had success running a CUDA workload with this implemented. I am able to successfully get the gpu-operator Helm chart running with these values:

cdi:
  enabled: false
daemonsets:
  tolerations:
  - effect: NoSchedule
    key: nvidia.com/gpu
    operator: Exists
devicePlugin:
  image: k8s-device-plugin
  repository: eadem
  version: v0.14.3-ubuntu20.04
driver:
  enabled: true
operator:
  defaultRuntime: containerd
  image: gpu-operator
  repository: eadem
  version: v23.9.1-ubi8
runtimeClassName: "nvidia"
toolkit:
  enabled: true
  env:
  - name: CONTAINERD_RUNTIME_CLASS
    value: nvidia
  - name: CONTAINERD_SET_AS_DEFAULT
    value: "false"
  image: container-toolkit
  repository: eadem
  version: 1.14.3-ubuntu20.04
validator:
  driver:
    env:
    - name: DISABLE_DEV_CHAR_SYMLINK_CREATION
      value: "true"
  image: gpu-operator-validator
  repository: eadem
  version: v23.9.1-ubi8

My pods are now successfully getting past preemption when specifying GPU limits; however, when I try to run a GPU workload (e.g. nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04), it fails with the error:

Failed to allocate device vector A (error code CUDA driver version is insufficient for CUDA runtime version)!

Just curious to know if you're having this problem or not.

Oh, and not that it matters much, but just a heads-up that the custom Docker image you linked for the operator in your comment actually links to your custom validator image.

Thanks again!

cbrendanprice avatar Jan 30 '24 23:01 cbrendanprice

np @cbrendanprice. Thanks for the Docker links; I fixed it.

The NVIDIA driver, CUDA, toolkit, and operator are tightly coupled when it comes to versions, so that error should be easily fixed by using the right versions. Here is an example of the versions needed for CUDA 12.2, plus a full example of a CUDA workload in Kubeflow and directly in a pod with the operator; I don't see that error with the eadem images:

https://ce.qbo.io/#/ai_and_ml
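
A quick way to confirm a driver/runtime mismatch on the WSL2 host (assuming nvidia-smi is on the PATH inside the distro):

# the "CUDA Version" printed in the nvidia-smi header is the highest CUDA
# runtime the installed driver supports; the container image's CUDA runtime
# must not be newer than this
$ nvidia-smi | grep "CUDA Version"
$ nvidia-smi --query-gpu=driver_version,name --format=csv,noheader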

Try this one instead; the link you provided is an old version: https://ce.qbo.io/#/ai_and_ml?id=_3-deploy-vector-add
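
For reference, a minimal test-pod sketch along those lines (assuming the operator has created the nvidia RuntimeClass and the node is labeled and tainted as discussed above):

apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd-test # hypothetical name
spec:
  restartPolicy: OnFailure
  runtimeClassName: nvidia
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  containers:
  - name: vectoradd
    # pick a CUDA tag no newer than the CUDA version reported by nvidia-smi on the host
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04
    resources:
      limits:
        nvidia.com/gpu: 1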

alexeadem avatar Jan 30 '24 23:01 alexeadem