Permissions for the /dev/{kfd,dri/renderXXXX} devices in containers

Open elukey opened this issue 2 years ago • 1 comments

Hi folks!

I am trying out the AMD device plugin on my system, deployed as a systemd unit on Debian 11 (so not as a DaemonSet, but directly on the K8s node). Everything works fine, and I can see two devices in my test container:

  • /dev/kfd
  • /dev/dri/renderD128

I am trying to run the container as an unprivileged user, such as nobody, but I am struggling to give that user access to the devices above. Inside the container I see the following (checked via nsenter):

root@alexnet-tf-gpu-pod:/# ls -l /dev/kfd 
crw-rw---- 1 root 106 242, 0 Apr 18 15:58 /dev/kfd

root@alexnet-tf-gpu-pod:/# ls -l /dev/dri/renderD128 
crw-rw---- 1 root 106 226, 128 Apr 18 15:58 /dev/dri/renderD128

GID 106 is the render group on the underlying bare-metal K8s worker OS, and the device nodes show up in the container with that same numeric GID. Because of that, I don't see a clean way to add nobody to render (or an equivalent group) in the Docker image. Is there a best practice you can suggest?

Thanks in advance!

elukey, Apr 18 '23 16:04

In the pod's securityContext you can set supplementalGroups, a list of extra GIDs that the containers' processes run with. I found that adding the render GID there enabled me to use the hardware.

https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.29/#podsecuritycontext-v1-core
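
A minimal sketch of what such a pod spec might look like, assuming the host's render group is GID 106 as in the ls -l output above. The container name, the image, and the UID/GID 65534 for nobody are illustrative assumptions rather than values from this thread, and amd.com/gpu is assumed to be the resource name under which the plugin advertises the GPU:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: alexnet-tf-gpu-pod
spec:
  securityContext:
    runAsUser: 65534          # assumed UID of "nobody" in the image
    runAsGroup: 65534         # assumed primary GID of "nobody"
    supplementalGroups:
      - 106                   # host "render" GID that owns /dev/kfd and /dev/dri/renderD128
  containers:
    - name: tf-gpu            # hypothetical container name
      image: example.com/alexnet-tf-gpu:latest   # placeholder image
      resources:
        limits:
          amd.com/gpu: 1      # assumed resource name exposed by the device plugin
```

With a spec along these lines, running id inside the container should list 106 among the process's groups, which matches the crw-rw---- group permission on both device nodes. The GID is taken from the host, so it may need adjusting if the render group has a different number on other nodes.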

sdwilsh, Apr 11 '24 23:04