
[Bug] - NVIDIA UnRaid - Unable to recognize GPU

Open mfoti opened this issue 1 year ago • 2 comments

Existing Resources

  • [x] Please search the existing issues for related problems
  • [x] Consult the product documentation : Docs
  • [x] Consult the FAQ : FAQ
  • [x] Consult the Troubleshooting Guide : Guide
  • [x] Reviewed existing training videos: Youtube

Describe the bug In the UnRaid installation wizard, the GPU is not recognized.

I've updated the installation template as follows: in extra parameters, --runtime=nvidia; as a variable, NVIDIA_VISIBLE_DEVICES=all; as a variable, NVIDIA_DRIVER_CAPABILITIES=all. With this, nvidia-smi works:

# docker exec kasm nvidia-smi

Wed Mar 20 11:59:36 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.40.07              Driver Version: 550.40.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Quadro P400                    Off |   00000000:AF:00.0 Off |                  N/A |
| 56%   54C    P0             N/A /  N/A  |       0MiB /   2048MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
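For reference, the template settings above roughly correspond to a docker run invocation like the following (a sketch only; the container name and the <kasm-image> placeholder are assumptions, not the actual UnRaid template values):

# Sketch of an equivalent docker run with the NVIDIA runtime and variables set
docker run -d --name kasm \
  --runtime=nvidia \
  -e NVIDIA_VISIBLE_DEVICES=all \
  -e NVIDIA_DRIVER_CAPABILITIES=all \
  <kasm-image>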

but drm_info does not include /dev/dri/card1, my GPU device:

# docker exec kasm ./gpuinfo.sh

{"/dev/dri/card0":"MGA G200 SE"}

I've tried to force this card during the installation process (with a hard-coded modification of that script so it outputs {"/dev/dri/card1":"NVIDIA P400"} and {"/dev/dri/card1":"Quadro P400"}), but after installation completed I was unable to start any workspace. I get this error:

error gathering device information while adding custom device "/dev/dri/renderD129": no such file or directory

Full log:

Error during Create request for Server(a89aa3ec-ede1-4152-8a43-1dc99cb1950b) : (Exception creating Kasm: Traceback (most recent call last):
  File "docker/api/client.py", line 268, in _raise_for_status
  File "requests/models.py", line 1021, in raise_for_status
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: http+docker://localhost/v1.44/containers/f683e85b8fb6c257831f3a664eac0adc36d1ccfcd8f63075d69f732c88c9765f/start

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "__init__.py", line 573, in post
  File "provision.py", line 1871, in provision
  File "provision.py", line 1863, in provision
  File "docker/models/containers.py", line 818, in run
  File "docker/models/containers.py", line 404, in start
  File "docker/utils/decorators.py", line 19, in wrapped
  File "docker/api/container.py", line 1111, in start
  File "docker/api/client.py", line 270, in _raise_for_status
  File "docker/errors.py", line 31, in create_api_error_from_http_exception
docker.errors.APIError: 500 Server Error for http+docker://localhost/v1.44/containers/f683e85b8fb6c257831f3a664eac0adc36d1ccfcd8f63075d69f732c88c9765f/start: Internal Server Error ("error gathering device information while adding custom device "/dev/dri/renderD129": no such file or directory")
)

The device node is not present in the kasm_agent container:

# docker exec kasm_agent ls /dev/dri/card1

ls: cannot access '/dev/dri/card1': No such file or directory

# docker exec kasm_agent ls /dev/dri/renderD129

ls: cannot access '/dev/dri/renderD129': No such file or directory
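A quick way to compare what the host kernel exposes with what the agent container sees (a minimal check, assuming shell access to the UnRaid host):

# On the UnRaid host: list every DRM node the kernel has created
ls -l /dev/dri/
# Same check inside the agent container, for comparison
docker exec kasm_agent ls -l /dev/dri/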

But I can find it in /proc:

# docker exec kasm_agent cat /proc/driver/nvidia/gpus/0000\:af\:00.0/information

Model: 		 Quadro P400
IRQ:   		 304
GPU UUID: 	 GPU-226266ed-48f0-0e03-4d64-780bc2e08ccb
Video BIOS: 	 86.07.8f.00.02
Bus Type: 	 PCIe
DMA Size: 	 47 bits
DMA Mask: 	 0x7fffffffffff
Bus Location: 	 0000:af:00.0
Device Minor: 	 0
GPU Excluded:	 No

To Reproduce Steps to reproduce the behavior:

  1. Add the Kasm app from the UnRaid app installation
  2. Open the GPU selection dropdown
  3. No NVIDIA card is listed

Expected behavior Be able to use the NVIDIA card with Kasm on UnRaid.

Workspaces Version: 1.15

Workspaces Installation Method: UnRaid

Workspace Server Information (please provide the output of the following commands):

  • uname -a
Linux fe5d658a8112 6.1.74-Unraid #1 SMP PREEMPT_DYNAMIC Fri Feb  2 11:06:32 PST 2024 x86_64 x86_64 x86_64 GNU/Linux
  • cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.2 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.2 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
  • sudo docker info
Client: Docker Engine - Community
Version:    25.0.4
Context:    default
Debug Mode: false
Plugins:
 compose: Docker Compose (Docker Inc.)
   Version:  v2.5.0
   Path:     /usr/local/lib/docker/cli-plugins/docker-compose

Server:
Containers: 9
 Running: 8
 Paused: 0
 Stopped: 1
Images: 9
Server Version: 25.0.4
Storage Driver: fuse-overlayfs
Logging Driver: json-file
Cgroup Driver: cgroupfs
Cgroup Version: 2
Plugins:
 Volume: local
 Network: bridge host ipvlan macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
Swarm: inactive
Runtimes: io.containerd.runc.v2 runc
Default Runtime: runc
Init Binary: docker-init
containerd version: ae07eda36dd25f8a1b98dfbf587313b99c0190bb
runc version: v1.1.12-0-g51d5e94
init version: de40ad0
Security Options:
 seccomp
  Profile: builtin
 cgroupns
Kernel Version: 6.1.74-Unraid
Operating System: Ubuntu 22.04.2 LTS (containerized)
OSType: linux
Architecture: x86_64
CPUs: 88
Total Memory: 251.5GiB
Name: fe5d658a8112
ID: d62537f3-97b0-482e-a489-4e00a573cd4c
Docker Root Dir: /opt/docker
Debug Mode: false
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false
  • sudo docker ps | grep kasm
4feeb4b6b2cf   kasmweb/nginx:1.25.3       "/docker-entrypoint.…"   15 hours ago   Up 14 hours                        80/tcp, 0.0.0.0:6333->6333/tcp   kasm_proxy
261d67c5ccc3   kasmweb/agent:1.15.0       "/bin/sh -c '/usr/bi…"   15 hours ago   Up 14 hours (healthy)              4444/tcp                         kasm_agent
ad3e62cd7871   kasmweb/share:1.15.0       "/bin/sh -c '/usr/bi…"   15 hours ago   Up 14 hours (healthy)              8182/tcp                         kasm_share
b1f718129357   kasmweb/kasm-guac:1.15.0   "/dockerentrypoint.sh"   15 hours ago   Up 16 seconds (health: starting)                                    kasm_guac
6150582c13bb   kasmweb/api:1.15.0         "/bin/sh -c '/usr/bi…"   15 hours ago   Up 14 hours (healthy)              8080/tcp                         kasm_api
a95638e0e39a   kasmweb/manager:1.15.0     "/bin/sh -c '/usr/bi…"   15 hours ago   Up 14 hours (healthy)              8181/tcp                         kasm_manager
bdfc0ef3df36   redis:5-alpine             "docker-entrypoint.s…"   15 hours ago   Up 14 hours                        6379/tcp                         kasm_redis
8436c39024bc   postgres:12-alpine         "docker-entrypoint.s…"   15 hours ago   Up 14 hours (healthy)              5432/tcp                         kasm_db

Additional context I'd like to try adding this to my boot modprobe config:

cat /boot/config/modprobe.d/nvidia.conf
options nvidia-drm modeset=1
options nvidia-drm fbdev=1

but that requires shutting down the server, which is not something I can do easily.
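A non-disruptive way to check whether those options are already in effect (a sketch, assuming the nvidia_drm module is currently loaded and exposes its parameters in the standard sysfs location):

# Read the modeset parameter of the loaded nvidia_drm module ('Y' means enabled)
cat /sys/module/nvidia_drm/parameters/modeset
# Newer drivers expose the fbdev parameter the same way
cat /sys/module/nvidia_drm/parameters/fbdev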

mfoti avatar Mar 20 '24 11:03 mfoti

I fixed it by running this:

docker exec -ti kasm nvidia-ctk runtime configure --runtime=docker
docker restart kasm
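For reference, nvidia-ctk runtime configure --runtime=docker registers the NVIDIA runtime in /etc/docker/daemon.json of the target Docker daemon; inside the kasm container the result typically looks roughly like this (exact contents can differ by version):

docker exec kasm cat /etc/docker/daemon.json
{
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }
}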

and updating the Chrome Workspace's "Docker Run Config Override (JSON)" with this configuration:

{
  "device_requests": [
    {
      "capabilities": [
        [
          "gpu"
        ]
      ],
      "count": -1,
      "device_ids": null,
      "driver": "",
      "options": {}
    }
  ],
  "devices": [
    "/dev/dri/card1:/dev/dri/card1:rwm",
    "/dev/dri/renderD128:/dev/dri/renderD128:rwm"
  ],
  "environment": {
    "KASM_EGL_CARD": "/dev/dri/card1",
    "KASM_RENDERD": "/dev/dri/renderD128"
  },
  "hostname": "kasm"
}
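Note that the card*/renderD* numbering is not guaranteed to line up between GPUs, so it is worth confirming which node the NVIDIA driver actually owns before hard-coding paths (a minimal sketch, run on the host, assuming the standard sysfs layout):

# Print the kernel driver that owns each DRM node
for d in /dev/dri/card* /dev/dri/renderD*; do
  node=$(basename "$d")
  echo "$node -> $(basename "$(readlink /sys/class/drm/$node/device/driver)")"
done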

But I get a black screen, and at the very least Chrome doesn't start.

mfoti avatar Mar 20 '24 23:03 mfoti

But I get a black screen, and at the very least Chrome doesn't start.

Replace your Docker run config override with:

{ "environment": { "NVIDIA_DRIVER_CAPABILITIES": "all" } }

I think you almost had it. I had to scrounge around to figure out what the issues were, but step 1 is:

Add the variables to the container:

Variables:

NVIDIA_DRIVER_CAPABILITIES=all
NVIDIA_VISIBLE_DEVICES=all (or a specific GPU ID to limit the visible devices)

Argument: --runtime=nvidia

Command: docker exec -ti kasm nvidia-ctk runtime configure --runtime=docker (as long as the container name is kasm; run it from the CLI of the host, or alternatively run nvidia-ctk runtime configure --runtime=docker within the container).

Set the Docker run config override JSON to:

{ "environment": { "NVIDIA_DRIVER_CAPABILITIES": "all" } }

tknz avatar Jun 23 '24 04:06 tknz