k8s-device-plugin
Non-Root access to nvidia socket
1. Issue or feature description
We have a mandate that containers are not allowed to run as root in our clusters. This causes access issues when the daemonset tries to bind to /var/lib/kubelet/device-plugins/nvidia.sock. I've built our own Docker image that switches users before running the plugin, but the plugin can't access the socket because it is owned by root. Removing the user change and remaining as root fixes the problem.
2. Steps to reproduce the issue
- Change user to non-root in container before starting the plugin.
Common error checking:
- [x] The output of `nvidia-smi -a` on your host
- [x] Your docker configuration file (e.g: `/etc/docker/daemon.json`)
- [x] The k8s-device-plugin container logs
- [x] The kubelet logs on the node (e.g: `sudo journalctl -r -u kubelet`)
Additional information that might help better understand your environment and reproduce the bug:
- [x] Docker version from `docker version` --> 18.09.9
- [x] Docker command, image and tag used --> custom (downloads release tag, builds go app, runs as non-root)
- [x] Kernel version from `uname -a` --> 4.4.0-1105-aws
- [x] NVIDIA packages version from `dpkg -l '*nvidia*'` or `rpm -qa '*nvidia*'` --> nvidia-container-runtime 3.1.1-1
- [x] NVIDIA container library version from `nvidia-container-cli -V` --> 1.0.3
- [x] NVIDIA container library logs (see troubleshooting) (see below)
Here are the logs from the pod:
nvidia-device-plugin-ctr 2020/05/06 18:56:22 Loading NVML
nvidia-device-plugin-ctr 2020/05/06 18:56:22 Fetching devices.
nvidia-device-plugin-ctr 2020/05/06 18:56:22 Starting FS watcher.
nvidia-device-plugin-ctr 2020/05/06 18:56:22 Starting OS watcher.
nvidia-device-plugin-ctr 2020/05/06 18:56:22 Could not start device plugin: remove /var/lib/kubelet/device-plugins/nvidia.sock: permission denied
nvidia-device-plugin-ctr 2020/05/06 18:56:22 Could not contact Kubelet, retrying. Did you enable the device plugin feature gate?
nvidia-device-plugin-ctr 2020/05/06 18:56:22 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
nvidia-device-plugin-ctr 2020/05/06 18:56:22 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
It seems like the pod will need root access to bind to the socket, which I'm trying to avoid. Any pointers?
So the problem appears to be that the user in the container needs to match the owner on the node. Running chown on /var/lib/kubelet/* and on /var/lib/kubelet/kubelet.sock works, but the next issue is that when kubelet restarts, it deletes kubelet.sock and recreates it with root as the owner...
This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.
This issue was automatically closed due to inactivity.