GPU / render / video access not always working (permissions?)
Hi,
I ran into some kind of issue regarding GPU support.
Image: kasmweb/ubuntu-focal-desktop:1.11.0 and anything built on kasmweb/core-ubuntu-focal:1.11.0
Host OS: Ubuntu Server 22.04 LTS
Expected behavior:
- Images for kasm, whether executed standalone or through kasm, should have access to /dev/dri/card0 and /dev/dri/renderD128; after a reboot, "old" kasm sessions should work as before
Observed behavior:
- after a reboot, standalone execution has no access: glxinfo -B shows software/MESA rendering, and no process shows up in nvidia-smi
- launching the same image through kasm makes it work: glxinfo -B shows my NVIDIA card, nvidia-smi shows an xfce4-session process
- any subsequently launched standalone image works: glxinfo -B shows the NVIDIA card and nvidia-smi shows an additional xfce4-session
- after a reboot, re-entering the old kasm session: glxinfo -B shows software/MESA rendering, no processes in nvidia-smi
- launching an additional kasm session does not work due to lack of resources, which is expected given that the old session still occupies them
- launching standalone reverts back to software/MESA rendering
- deleting the old session and starting a new one will again "fix" it, until the next reboot.
Other observations:
- a kasm session "loses" access to NVENC after a reboot; standalone sessions always have NVENC available in OBS, regardless of access to GPU rendering
- vglrun -d /dev/dri/card0 glxinfo -B shows normal NVIDIA info in all working instances/sessions; in non-working sessions it outputs "libEGL warning: failed to open /dev/dri/renderD128: Permission denied"
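For completeness, this is the combined check I run inside a session to tell a working one from a broken one (all commands are the ones mentioned above, just collected in one place):

```bash
ls -l /dev/dri/                      # can the unprivileged container user read/write card0 and renderD128?
glxinfo -B | grep "OpenGL renderer"  # software/MESA (broken) vs. NVIDIA (working)
vglrun -d /dev/dri/card0 glxinfo -B  # broken sessions print the libEGL "Permission denied" warning
nvidia-smi                           # working sessions show an additional xfce4-session process
```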
I assume this is some kind of permissions issue, but I could not find a fix. I believe the kasm agent somehow fixes it when initially launching a GPU-accelerated session, but handles it differently after a reboot while still believing the resources are allocated. My main concern is that after a reboot of the host/server OS, the user of a GPU-accelerated session is forced to delete the session and create a new one to regain hardware acceleration. For testing/development purposes it would also be handy to launch an image standalone without having to "pre-initialize" the host by launching a GPU-accelerated image through kasm first.
I am not much of a Linux/permissions guru, so any idea for a fix or workaround is highly appreciated.
Best,
H.
What I've found so far is that running the container as the root user (USER root in the Dockerfile) fixes it 100%, but there has to be a smarter solution :)
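For reference, the root workaround can also be applied at run time with --user root (which should be equivalent to USER root in the Dockerfile). A possibly less invasive alternative I have not fully verified is to keep the default unprivileged user but add the host's video/render group IDs so it can open the DRI nodes; this only helps if those groups exist and own the devices on the host, and the usual standalone flags (ports, VNC password, shm size) still need to be added:

```bash
# hypothetical standalone launch; device paths and group names assumed from my setup
docker run --rm -it \
  --device /dev/dri/card0 \
  --device /dev/dri/renderD128 \
  --group-add "$(getent group video | cut -d: -f3)" \
  --group-add "$(getent group render | cut -d: -f3)" \
  kasmweb/ubuntu-focal-desktop:1.11.0
```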
Hello,
I'll primarily address the case where sessions are executed via Workspaces. There are a number of things we do when launching GPU sessions that are required to make GPU acceleration via VirtualGL function. We will eventually get around to documenting what needs to be done for standalone sessions.
So for now, I'd like to ensure I understand what you are seeing when running GPU sessions in Workspaces.
When starting a GPU session as a standard unprivileged user (UID/GID 1000), the Workspaces Agent will do a chown 1000 on the card and render devices (e.g. /dev/dri/card0, /dev/dri/renderD128) to ensure the user can properly access them.
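If you want to replicate that step by hand on the host after a reboot, it looks roughly like this (a sketch only; the device paths are the ones from your report):

```bash
sudo chown 1000 /dev/dri/card0 /dev/dri/renderD128
ls -l /dev/dri/   # verify the nodes are now owned by UID 1000
```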
What is likely happening is that the permissions on the devices are getting reset when you reboot, so when the container starts again the standard user can't access them. The Kasm Agent doesn't actually restart or recreate the container upon reboot; Docker does, because of the default restart policy. So those permission changes are not occurring. In our testing (which was primarily scoped to Ubuntu 20.04), it was inconsistent whether the permission changes were actually needed. I suspect there are other OS settings (e.g. umask) and/or driver versions that impact this.
On your system it seems to be necessary, which explains why running the container as root fixes the issue.
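One possible workaround on your side (untested here, and dependent on your udev/driver setup) is a udev rule that re-applies the ownership whenever the DRM nodes are created, so it also survives reboots. A sketch, assuming UID 1000 matches the in-container user:

```bash
# write a rule that chowns the DRI nodes to UID 1000 whenever they appear
sudo tee /etc/udev/rules.d/99-kasm-dri.rules >/dev/null <<'EOF'
SUBSYSTEM=="drm", KERNEL=="card*", OWNER="1000", MODE="0660"
SUBSYSTEM=="drm", KERNEL=="renderD*", OWNER="1000", MODE="0666"
EOF
# reload the rules and re-trigger the drm subsystem so it takes effect immediately
sudo udevadm control --reload-rules
sudo udevadm trigger --subsystem-match=drm
```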
Generally speaking, in the Workspaces ecosystem unexpected server reboots are treated as a fault, primarily because they are always disruptive to the user: the processes and in-memory state of whatever they were working on will be wiped.
There are options for planned maintenance, like disabling an Agent so new sessions get provisioned to other Agents, allowing you to take the first one offline, patch it, etc.
Equipment failures can happen; there needs to be a way to fix this.
I recently saw this when first testing Kasm in a single-node setup. We had Kasm running just fine; we rebooted for a kernel update, and when it came back up it was throwing an error about the containers not being able to start due to missing dev paths.