guunergooner
Check repository permissions with project visibility is public
> /kind bug
> /kind cgroup
>
> ### 1. Issue or feature description
> When the server restarts, nvidia-device-plugin-daemonset cannot load the nvidia-uvm/nvidia-modeset device correctly
>
> ### 2....
@klueska @RenaudWasTaken @nvjmayo This issue can also be reproduced on a P40.
After debugging, I found that the nvidia-uvm/nvidia-modeset/nvidia-uvm-tools devices are loaded after Docker has started, so the nvidia-device-plugin container cannot load them.

* After rebooting the server, the nvidia-device-plugin container cannot...
* Adding nvidia udev rules, with docker started after systemd-udev-trigger, works around this issue

```shell
root@k8s-t4-node:~$ cat /etc/udev/rules.d/71-nvidia.rules
# Load and unload nvidia-modeset module
SUBSYSTEM=="module", ACTION=="add", DEVPATH=="/module/nvidia", RUN+="/usr/bin/nvidia-modprobe -m"
SUBSYSTEM=="module", ACTION=="remove",...
```
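
To confirm this diagnosis after a reboot, it may help to check whether the device nodes exist before the plugin container starts. The following is a minimal sketch, assuming the nodes appear under `/dev` with the names `nvidia-uvm`, `nvidia-uvm-tools`, and `nvidia-modeset`. The function name `check_nvidia_devices` and its optional directory argument are hypothetical, introduced only for illustration.

```shell
#!/bin/sh
# Report which NVIDIA device nodes are present or missing.
# The optional first argument is a hypothetical override of the device
# directory (useful for testing); it defaults to /dev.
check_nvidia_devices() {
    dev_dir="${1:-/dev}"
    missing=0
    for name in nvidia-uvm nvidia-uvm-tools nvidia-modeset; do
        if [ -e "$dev_dir/$name" ]; then
            echo "present: $dev_dir/$name"
        else
            echo "missing: $dev_dir/$name"
            missing=1
        fi
    done
    # Exit status is 0 only when every node was found.
    return "$missing"
}
```

Running `check_nvidia_devices || echo "some NVIDIA device nodes are not loaded yet"` on the node right after boot would show whether Docker started before the nodes were created.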
@klueska Thanks. This is the `/etc/nvidia-container-runtime/config.toml` configuration file on my machine. I now understand the cause of the problem, which can also be worked around with the `/etc/udev/rules.d/71-nvidia.rules` file above. ```shell...
> Yes, if you want to use 7618MiB, you should change the unit to `MiB` in https://github.com/AliyunContainerService/gpushare-device-plugin/blob/master/device-plugin-ds.yaml#L28.

* I changed the unit to MiB and recreated device-plugin-ds, and found that kubelet.service on the node reports grpc...
> I think it's due to grpc max msg size. If you'd like to fix, it should be similar to [helm/helm#3514](https://github.com/helm/helm/pull/3514).

It can't fix my problem. I reviewed the gpushare-device-plugin proj...
> I mean you can increase the default grpc max msg size in the source code of Kubelet and the device plugin to 16MB, compile them into new binaries, and then deploy....