gpu-operator
Nvidia-fabricmanager failed to start with “NV_WARN_NOTHING_TO_DO”
I have one NVLink device connecting two NVIDIA A40 graphics cards on an Ubuntu 20.04 system. I downloaded and installed the nvidia-driver-local-repo-ubuntu2004-515.105.01_1.0-1_amd64.deb driver from the official website, then installed CUDA 11.8 (cuda-repo-ubuntu2004-11-8-local_11.8.0-520.61.05-1_amd64.deb), also from the official website. After installing nvidia-fabricmanager-520_520.61.05-1_amd64.deb and nvidia-fabricmanager-dev-520_520.61.05-1_amd64.deb, I started the fabricmanager service (sudo systemctl start nvidia-fabricmanager), which reported the following error:
Job for nvidia-fabricmanager.service failed because the control process exited with error code. See "systemctl status nvidia-fabricmanager.service" and "journalctl -xe" for details.
The error details show NV_WARN_NOTHING_TO_DO, as follows:
11月 15 14:15:47 leon-NF5468M6 systemd[1]: Starting NVIDIA fabric manager service...
11月 15 14:15:47 leon-NF5468M6 nv-fabricmanager[4177]: request to query NVSwitch device information from NVSwitch driver failed with error:WARNING Nothing to do [NV_WARN_NOTHING_TO_DO]
11月 15 14:15:47 leon-NF5468M6 systemd[1]: nvidia-fabricmanager.service: Control process exited, code=exited, status=1/FAILURE
11月 15 14:15:47 leon-NF5468M6 systemd[1]: nvidia-fabricmanager.service: Failed with result 'exit-code'.
11月 15 14:15:47 leon-NF5468M6 systemd[1]: Failed to start NVIDIA fabric manager service.
Please help me, Thanks! @shivamerla
@dongkuang Just to be clear, you are not installing the driver via the container?
Can you confirm if this exists in your system?
/proc/driver/nvidia-nvswitch/devices
@tariq1890 Thank you! This directory exists, but it is empty, and I am sure I did not install the driver via a container.
Since /proc/driver/nvidia-nvswitch/devices is an empty directory, this system most likely has no NVSwitch devices.
The Fabric Manager service will not work if there are no NVSwitch devices.
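
The check described above can be scripted. This is a minimal sketch, not an official NVIDIA tool: it only counts entries under the directory mentioned in this thread, and the function names `list_nvswitch_devices` and `fabric_manager_applicable` are hypothetical. It assumes that any entry in that directory corresponds to an NVSwitch device, which is an inference from the discussion here rather than documented behavior.

```python
import os

# Path taken from this thread; populated by the NVIDIA driver on NVSwitch systems.
NVSWITCH_DEVICES_DIR = "/proc/driver/nvidia-nvswitch/devices"


def list_nvswitch_devices(devices_dir=NVSWITCH_DEVICES_DIR):
    """Return the sorted entry names under devices_dir.

    A missing path is treated the same as an empty directory.
    """
    try:
        return sorted(os.listdir(devices_dir))
    except (FileNotFoundError, NotADirectoryError):
        return []


def fabric_manager_applicable(devices_dir=NVSWITCH_DEVICES_DIR):
    """Assumption: Fabric Manager is only useful if at least one entry exists."""
    return len(list_nvswitch_devices(devices_dir)) > 0


if __name__ == "__main__":
    devices = list_nvswitch_devices()
    if devices:
        print("NVSwitch entries found:", ", ".join(devices))
    else:
        print("No NVSwitch entries found; nvidia-fabricmanager is expected to fail to start.")
```

On a system like the one described in this issue, where the directory exists but is empty, the script should report that no NVSwitch entries were found.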
Today I saw on the official NVIDIA website that the A40's ultra-fast GDDR6 memory can be scaled to 96 GB through NVLink. How do I install NVSwitch?