gpu-operator

nvidia-fabricmanager failed to start

Open erictarrence opened this issue 3 years ago • 10 comments

OS version: Rocky Linux 8.6

nvidia-fabricmanager failed to start


[root@test-rocky8-kvm63 ~]# systemctl start nvidia-fabricmanager
Job for nvidia-fabricmanager.service failed because the control process exited with error code.
See "systemctl status nvidia-fabricmanager.service" and "journalctl -xe" for details.


[root@test-rocky8-kvm63 ~]# modinfo -F version nvidia
510.47.03
[root@test-rocky8-kvm63 ~]# nvidia-smi
Thu Jun  2 21:32:25 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:05:00.0 Off |                  N/A |
| N/A   51C    P0    15W /  N/A |      0MiB /  4096MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+


Jun 02 21:30:24 test-rocky8-kvm63 systemd[1]: Starting NVIDIA fabric manager service...
-- Subject: Unit nvidia-fabricmanager.service has begun start-up
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit nvidia-fabricmanager.service has begun starting up.
Jun 02 21:30:24 test-rocky8-kvm63 nv-fabricmanager[4857]: request to query NVSwitch device information from NVSwitch driver failed with error:WARNING Nothing to do [NV_WARN_NOTHING_TO_DO]
Jun 02 21:30:24 test-rocky8-kvm63 systemd[1]: nvidia-fabricmanager.service: Control process exited, code=exited status=1
Jun 02 21:30:24 test-rocky8-kvm63 systemd[1]: nvidia-fabricmanager.service: Failed with result 'exit-code'.
-- Subject: Unit failed
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- The unit nvidia-fabricmanager.service has entered the 'failed' state with result 'exit-code'.
Jun 02 21:30:24 test-rocky8-kvm63 systemd[1]: Failed to start NVIDIA fabric manager service.
-- Subject: Unit nvidia-fabricmanager.service has failed
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit nvidia-fabricmanager.service has failed.
-- 
-- The result is failed.


tail -n 100 /var/log/fabricmanager.log 
Fabric Manager Log initializing at: 6/2/2022 21:13:42.301
[Jun 02 2022 21:13:42] [INFO] [tid 3740] Fabric Manager version 510.47.03 is running with the following configuration options
[Jun 02 2022 21:13:42] [INFO] [tid 3740] Logging level = 4
[Jun 02 2022 21:13:42] [INFO] [tid 3740] Logging file name/path = /var/log/fabricmanager.log
[Jun 02 2022 21:13:42] [INFO] [tid 3740] Append to log file = 1
[Jun 02 2022 21:13:42] [INFO] [tid 3740] Max Log file size = 1024 (MBs)
[Jun 02 2022 21:13:42] [INFO] [tid 3740] Use Syslog file = 0
[Jun 02 2022 21:13:42] [INFO] [tid 3740] Fabric Manager communication ports = 16000
[Jun 02 2022 21:13:42] [INFO] [tid 3740] Fabric Mode = 0
[Jun 02 2022 21:13:42] [INFO] [tid 3740] Fabric Mode Restart = 0
[Jun 02 2022 21:13:42] [INFO] [tid 3740] FM Library communication bind interface = 127.0.0.1
[Jun 02 2022 21:13:42] [INFO] [tid 3740] FM Library communication unix domain socket = 
[Jun 02 2022 21:13:42] [INFO] [tid 3740] FM Library communication port number = 6666
[Jun 02 2022 21:13:42] [INFO] [tid 3740] Continue to run when facing failures = 0
[Jun 02 2022 21:13:42] [INFO] [tid 3740] Option when facing GPU to NVSwitch NVLink failure = 0
[Jun 02 2022 21:13:42] [INFO] [tid 3740] Option when facing NVSwitch to NVSwitch NVLink failure = 0
[Jun 02 2022 21:13:42] [INFO] [tid 3740] Option when facing NVSwitch failure = 0
[Jun 02 2022 21:13:42] [INFO] [tid 3740] Abort CUDA jobs when FM exits = 1
[Jun 02 2022 21:13:42] [ERROR] [tid 3740] request to query NVSwitch device information from NVSwitch driver failed with error:WARNING Nothing to do [NV_WARN_NOTHING_TO_DO]
Fabric Manager Log initializing at: 6/2/2022 21:15:32.357

erictarrence avatar Jun 02 '22 13:06 erictarrence

Having exactly the same issue but with a machine with 2xA10, RHEL 8.6, driver 470.

EliasVansteenkiste avatar Jun 10 '22 15:06 EliasVansteenkiste

Me too: 2xA100, Ubuntu 22.04, drivers 510 and 515.

sebastianohl avatar Jun 17 '22 12:06 sebastianohl

Sorry, missed this, will try to recreate this.

shivamerla avatar Jul 01 '22 16:07 shivamerla

@sebastianohl @EliasVansteenkiste did you install the FM packages directly on the host, and are you managing them through systemctl? With GPU Operator, we launch the FM daemon through the driver container when NVSwitch devices are detected. Can you share the driver container logs in these cases? If the driver is pre-installed on the node, can you share the system logs or the output from nvidia-bug-report.sh?
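
To check on a host whether any NVSwitch devices are visible, something like the sketch below may help. It assumes the NVSwitch driver lists its devices under /proc/driver/nvidia-nvswitch/devices; that path and the helper name has_nvswitch are assumptions for illustration, not something confirmed in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds only if the given directory exists and
# contains at least one entry. The default path is an assumption about
# where the nvidia-nvswitch driver exposes detected switches.
has_nvswitch() {
    local dir="${1:-/proc/driver/nvidia-nvswitch/devices}"
    [ -d "$dir" ] || return 1
    # Any entry in the directory counts as a detected switch.
    [ -n "$(ls -A "$dir" 2>/dev/null)" ]
}

if has_nvswitch; then
    echo "NVSwitch detected: nvidia-fabricmanager has devices to manage"
else
    echo "No NVSwitch detected: nvidia-fabricmanager has nothing to manage"
fi
```

If this reports no NVSwitch, the NV_WARN_NOTHING_TO_DO error in fabricmanager.log would be consistent with the service finding no switch to query, though that is an inference and not a confirmed diagnosis.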

shivamerla avatar Jul 05 '22 19:07 shivamerla

@shivamerla I installed the FM package (and the driver) directly on the host because it did not work when I tried it via Kubernetes (I don't know why).

here is the output of nvidia-bug-report.sh: nvidia-bug-report.log.gz

sebastianohl avatar Jul 20 '22 14:07 sebastianohl

Had the same issue on my 8xA100 machine; removing and reinstalling fabricmanager using yum fixed it.
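
A rough sketch of that reinstall, as a dry run that only prints the commands. The package name nvidia-fabric-manager and the helper fm_package_for_driver are assumptions for illustration (check `yum list installed | grep -i fabric` for the real name on your host). The sketch also assumes the Fabric Manager version should match the driver version reported by `modinfo -F version nvidia`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build a versioned package spec from a driver version.
# The package name "nvidia-fabric-manager" is an assumption, not taken
# from this thread.
fm_package_for_driver() {
    local driver_version="$1"
    printf 'nvidia-fabric-manager-%s\n' "$driver_version"
}

# Driver version as reported earlier in this thread; on a live host use:
#   driver_version="$(modinfo -F version nvidia)"
driver_version="510.47.03"
pkg="$(fm_package_for_driver "$driver_version")"

# Dry run: print the commands instead of executing them.
echo "yum remove -y nvidia-fabric-manager"
echo "yum install -y ${pkg}"
echo "systemctl restart nvidia-fabricmanager"
```

Printing the commands first makes it easy to confirm the package spec before changing anything on the node.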

Anuj-Chauhan avatar Aug 07 '22 22:08 Anuj-Chauhan

> @sebastianohl @EliasVansteenkiste did you install the FM packages directly on the host, and are you managing them through systemctl? With GPU Operator, we launch the FM daemon through the driver container when NVSwitch devices are detected. Can you share the driver container logs in these cases? If the driver is pre-installed on the node, can you share the system logs or the output from nvidia-bug-report.sh?

My OS version is AlmaLinux 9.0. The NVIDIA driver version is 515.65.01.

fabric-manager still fails to start. The nvidia-bug-report.sh output is attached: nvidia-bug-report.log.gz

erictarrence avatar Nov 02 '22 10:11 erictarrence

jag@Aigen:~$ sudo service nvidia-fabricmanager start
Job for nvidia-fabricmanager.service failed because the control process exited with error code.
See "systemctl status nvidia-fabricmanager.service" and "journalctl -xe" for details.

JohnTesla avatar Jun 29 '23 08:06 JohnTesla