gpu-operator
nvidia-fabricmanager failed to start
OS version: Rocky Linux 8.6
[root@test-rocky8-kvm63 ~]# systemctl start nvidia-fabricmanager
Job for nvidia-fabricmanager.service failed because the control process exited with error code.
See "systemctl status nvidia-fabricmanager.service" and "journalctl -xe" for details.
[root@test-rocky8-kvm63 ~]# modinfo -F version nvidia
510.47.03
[root@test-rocky8-kvm63 ~]# nvidia-smi
Thu Jun 2 21:32:25 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:05:00.0 Off | N/A |
| N/A 51C P0 15W / N/A | 0MiB / 4096MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Jun 02 21:30:24 test-rocky8-kvm63 systemd[1]: Starting NVIDIA fabric manager service...
-- Subject: Unit nvidia-fabricmanager.service has begun start-up
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit nvidia-fabricmanager.service has begun starting up.
Jun 02 21:30:24 test-rocky8-kvm63 nv-fabricmanager[4857]: request to query NVSwitch device information from NVSwitch driver failed with error:WARNING Nothing to do [NV_WARN_NOTHING_TO_DO]
Jun 02 21:30:24 test-rocky8-kvm63 systemd[1]: nvidia-fabricmanager.service: Control process exited, code=exited status=1
Jun 02 21:30:24 test-rocky8-kvm63 systemd[1]: nvidia-fabricmanager.service: Failed with result 'exit-code'.
-- Subject: Unit failed
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- The unit nvidia-fabricmanager.service has entered the 'failed' state with result 'exit-code'.
Jun 02 21:30:24 test-rocky8-kvm63 systemd[1]: Failed to start NVIDIA fabric manager service.
-- Subject: Unit nvidia-fabricmanager.service has failed
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit nvidia-fabricmanager.service has failed.
--
-- The result is failed.
tail -n 100 /var/log/fabricmanager.log
Fabric Manager Log initializing at: 6/2/2022 21:13:42.301
[Jun 02 2022 21:13:42] [INFO] [tid 3740] Fabric Manager version 510.47.03 is running with the following configuration options
[Jun 02 2022 21:13:42] [INFO] [tid 3740] Logging level = 4
[Jun 02 2022 21:13:42] [INFO] [tid 3740] Logging file name/path = /var/log/fabricmanager.log
[Jun 02 2022 21:13:42] [INFO] [tid 3740] Append to log file = 1
[Jun 02 2022 21:13:42] [INFO] [tid 3740] Max Log file size = 1024 (MBs)
[Jun 02 2022 21:13:42] [INFO] [tid 3740] Use Syslog file = 0
[Jun 02 2022 21:13:42] [INFO] [tid 3740] Fabric Manager communication ports = 16000
[Jun 02 2022 21:13:42] [INFO] [tid 3740] Fabric Mode = 0
[Jun 02 2022 21:13:42] [INFO] [tid 3740] Fabric Mode Restart = 0
[Jun 02 2022 21:13:42] [INFO] [tid 3740] FM Library communication bind interface = 127.0.0.1
[Jun 02 2022 21:13:42] [INFO] [tid 3740] FM Library communication unix domain socket =
[Jun 02 2022 21:13:42] [INFO] [tid 3740] FM Library communication port number = 6666
[Jun 02 2022 21:13:42] [INFO] [tid 3740] Continue to run when facing failures = 0
[Jun 02 2022 21:13:42] [INFO] [tid 3740] Option when facing GPU to NVSwitch NVLink failure = 0
[Jun 02 2022 21:13:42] [INFO] [tid 3740] Option when facing NVSwitch to NVSwitch NVLink failure = 0
[Jun 02 2022 21:13:42] [INFO] [tid 3740] Option when facing NVSwitch failure = 0
[Jun 02 2022 21:13:42] [INFO] [tid 3740] Abort CUDA jobs when FM exits = 1
[Jun 02 2022 21:13:42] [ERROR] [tid 3740] request to query NVSwitch device information from NVSwitch driver failed with error:WARNING Nothing to do [NV_WARN_NOTHING_TO_DO]
Fabric Manager Log initializing at: 6/2/2022 21:15:32.357
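For anyone triaging this: one plausible reading of the NV_WARN_NOTHING_TO_DO error is that Fabric Manager found no NVSwitch devices to manage, which would be expected on a single GeForce card like the one in the nvidia-smi output above. A minimal sketch for checking this, assuming NVSwitch devices are listed by lspci with the class name "Bridge" under the NVIDIA vendor ID 10de. The helper name count_nvswitch is hypothetical and the match is only a heuristic:

```shell
# Count lines of lspci output (read from stdin) that look like NVSwitch devices.
# Assumption: NVSwitch devices appear as "<slot> Bridge: NVIDIA Corporation ...",
# while GPUs appear as "VGA compatible controller" or "3D controller".
count_nvswitch() {
    # grep -c prints 0 and exits non-zero when nothing matches; "|| true"
    # keeps the function usable in scripts that run with "set -e".
    grep -c ' Bridge: NVIDIA' || true
}

# Typical use on the affected host (restricts lspci to NVIDIA's vendor ID):
#   lspci -d 10de: | count_nvswitch
```

If the count is 0, the machine most likely has no NVSwitch hardware, and the nvidia-fabricmanager service probably does not need to run on it at all.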
I have exactly the same issue on a machine with 2xA10, RHEL 8.6, and driver 470.
Same here: 2xA100, Ubuntu 22.04, drivers 510 and 515.
Sorry, I missed this. I will try to reproduce it.
@sebastianohl @EliasVansteenkiste did you install the FM packages directly on the host and manage them through systemctl? With GPU Operator, we launch the FM daemon through the driver container when NVSwitch devices are detected. Can you share the driver container logs in these cases? If the driver is pre-installed on the node, can you share the system logs or the output of nvidia-bug-report.sh?
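In case it helps with collecting those driver container logs, here is a rough sketch. It assumes the operator runs in a namespace called gpu-operator, that the driver pods carry the label app=nvidia-driver-daemonset, and that the container is named nvidia-driver-ctr. All three may differ in your deployment, and the helper name driver_logs is hypothetical:

```shell
# Print recent logs from the driver container of every driver pod.
# Assumptions: namespace, pod label, and container name as described above;
# pass a different namespace as the first argument if yours differs.
driver_logs() {
    ns="${1:-gpu-operator}"
    kubectl logs -n "$ns" -l app=nvidia-driver-daemonset -c nvidia-driver-ctr --tail=200
}

# Example: driver_logs gpu-operator > driver-container.log
```

If the label does not match anything, `kubectl get pods -n <namespace>` should show the actual driver pod names to pass to `kubectl logs` directly.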
@shivamerla I installed the FM package (and the driver) directly on the host because doing this via Kubernetes did not work (I don't know why).
Here is the output of nvidia-bug-report.sh: nvidia-bug-report.log.gz
Had the same issue on my 8xA100 machine; removing and reinstalling fabricmanager using yum fixed it.
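The remove-and-reinstall approach could look roughly like the sketch below. The package name nvidia-fabric-manager is an assumption (it varies by repository and driver branch, so check `yum list installed | grep -i fabric` first), and the helper name reinstall_fabricmanager is hypothetical:

```shell
# Remove and reinstall the Fabric Manager package, then restart the service.
# Assumption: the package is called nvidia-fabric-manager unless another
# name is passed as the first argument.
reinstall_fabricmanager() {
    pkg="${1:-nvidia-fabric-manager}"
    sudo yum remove -y "$pkg" &&
    sudo yum install -y "$pkg" &&
    sudo systemctl restart nvidia-fabricmanager
}

# Example: reinstall_fabricmanager nvidia-fabric-manager
```

After reinstalling, it is worth confirming that the Fabric Manager version reported in /var/log/fabricmanager.log matches the output of `modinfo -F version nvidia`.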
My OS is AlmaLinux 9.0 and the NVIDIA driver version is 515.65.01.
Fabric Manager still fails to start. The nvidia-bug-report.sh output is attached: nvidia-bug-report.log.gz
jag@Aigen:~$ sudo service nvidia-fabricmanager start
Job for nvidia-fabricmanager.service failed because the control process exited with error code.
See "systemctl status nvidia-fabricmanager.service" and "journalctl -xe" for details.