
Non-Root access to nvidia socket

Open mwm5945 opened this issue 5 years ago • 1 comments

1. Issue or feature description

We have a mandate that containers are not allowed to run as root in our clusters. However, this causes access issues when the daemonset tries to bind to /var/lib/kubelet/device-plugins/nvidia.sock. I've created our own Docker container that switches users before running the plugin, but it can't access the socket because it's owned by root. Removing the user change and remaining as root fixes the problem.

2. Steps to reproduce the issue

  1. Change user to non-root in container before starting the plugin.

Common error checking:

  • [x] The output of nvidia-smi -a on your host
  • [x] Your docker configuration file (e.g: /etc/docker/daemon.json)
  • [x] The k8s-device-plugin container logs
  • [x] The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet)

Additional information that might help better understand your environment and reproduce the bug:

  • [x] Docker version from docker version --> 18.09.9
  • [x] Docker command, image and tag used --> custom (downloads release tag, builds go app, runs as non-root)
  • [x] Kernel version from uname -a --> 4.4.0-1105-aws
  • [x] NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*' --> nvidia-container-runtime 3.1.1-1
  • [x] NVIDIA container library version from nvidia-container-cli -V --> 1.0.3
  • [x] NVIDIA container library logs (see troubleshooting) (see below)

Here are the logs from the pod:

 nvidia-device-plugin-ctr 2020/05/06 18:56:22 Loading NVML
 nvidia-device-plugin-ctr 2020/05/06 18:56:22 Fetching devices.
 nvidia-device-plugin-ctr 2020/05/06 18:56:22 Starting FS watcher.
 nvidia-device-plugin-ctr 2020/05/06 18:56:22 Starting OS watcher.
 nvidia-device-plugin-ctr 2020/05/06 18:56:22 Could not start device plugin: remove /var/lib/kubelet/device-plugins/nvidia.sock: permission denied
 nvidia-device-plugin-ctr 2020/05/06 18:56:22 Could not contact Kubelet, retrying. Did you enable the device plugin feature gate?
 nvidia-device-plugin-ctr 2020/05/06 18:56:22 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
 nvidia-device-plugin-ctr 2020/05/06 18:56:22 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start

It seems the pod needs root access to bind to the socket, which I'm trying to avoid. Any pointers?

mwm5945 · May 06 '20 19:05

So the problem appears to be that the user in the container needs to match the owner on the node. Running chown on /var/lib/kubelet/* and on /var/lib/kubelet/kubelet.sock works, but the next issue is that when kubelet restarts, it deletes kubelet.sock and recreates it with root as the owner...

mwm5945 · May 07 '20 15:05

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

github-actions[bot] · Feb 29 '24 04:02

This issue was automatically closed due to inactivity.

github-actions[bot] · Mar 31 '24 04:03