open-gpu-kernel-modules

`nvidia-smi` fails with `No devices were found` on RTX 5090 / GB202

Open ikr7 opened this issue 6 months ago • 3 comments

NVIDIA Open GPU Kernel Modules Version

570.153.02

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

  • [x] I confirm that this does not happen with the proprietary driver package.

Operating System and Version

Arch Linux

Kernel Release

6.14.7-arch2-1 #1 SMP PREEMPT_DYNAMIC Thu, 22 May 2025 05:37:49 +0000 x86_64 GNU/Linux

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • [x] I am running on a stable kernel release.

Hardware: GPU

01:00.0 VGA compatible controller: NVIDIA Corporation GB202 [GeForce RTX 5090] (rev a1) (nvidia-smi fails, so this is copied from the output of lspci)

Describe the bug

When I run the nvidia-smi command without any arguments, it fails with a No devices were found error after a delay of roughly 5 seconds.

To Reproduce

  1. Run nvidia-smi in a shell

Bug Incidence

Always

nvidia-bug-report.log.gz

nvidia-bug-report.log.gz

More Info

I'm pretty sure the information below is included in nvidia-bug-report.log.gz, but here's what I've gathered so far:

dmesg:

$ dmesg -H | grep -iE 'nvrm|gsp|xid'
[  +8.679149] NVRM: GPU at PCI:0000:01:00: GPU-343788c6-573a-6251-02ae-4287220a092b
[  +0.000003] NVRM: Xid (PCI:0000:01:00): 143, Error status 0x65 while polling for FSP boot complete, 0x13, 0x56, 0x0, 0x0, 0x2
[  +0.000005] NVRM: nvCheckOkFailedNoLog: Check failed: Call timed out [NV_ERR_TIMEOUT] (0x00000065) returned from kgspWaitForGfwBootOk_HAL(pGpu, pKernelGsp) @ kernel_gsp.c:3676
[  +0.000028] NVRM: RmInitAdapter: Cannot initialize GSP firmware RM
[  +0.001812] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x62:0x65:1859)
[  +0.001485] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
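
For triage, the Xid code and the PCI address can be pulled out of NVRM lines like the ones above. This is only a minimal sketch in POSIX shell: the function name extract_xid is made up for illustration, and the pattern assumes nothing beyond the line format shown in this log.

```sh
#!/bin/sh
# extract_xid (hypothetical helper): read kernel log text on stdin and print
# "<pci-address> <xid-number>" for every NVRM Xid line it finds.
extract_xid() {
    sed -n 's/.*NVRM: Xid (\(PCI:[^)]*\)): \([0-9][0-9]*\),.*/\1 \2/p'
}

# Example (needs permission to read the kernel log):
#   dmesg | extract_xid
```

Lines without an Xid entry produce no output, so an empty result just means no Xid was logged.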

lsmod:

$ lsmod | grep -iE 'nvidia|nouv'
nvidia_drm            139264  0
nvidia_modeset       2158592  1 nvidia_drm
drm_ttm_helper         16384  1 nvidia_drm
nvidia_uvm           3940352  0
nvidia              13119488  2 nvidia_uvm,nvidia_modeset
video                  81920  1 nvidia_modeset

modinfo:

$ modinfo nvidia
filename:       /lib/modules/6.14.7-arch2-1/extramodules/nvidia.ko.zst
import_ns:      DMA_BUF
alias:          char-major-195-*
version:        570.153.02
supported:      external
license:        Dual MIT/GPL
firmware:       nvidia/570.153.02/gsp_tu10x.bin
firmware:       nvidia/570.153.02/gsp_ga10x.bin
srcversion:     9C27A8B290453A7640E09FB
alias:          pci:v000010DEd*sv*sd*bc06sc80i00*
alias:          pci:v000010DEd*sv*sd*bc03sc02i00*
alias:          pci:v000010DEd*sv*sd*bc03sc00i00*
depends:
name:           nvidia
retpoline:      Y
vermagic:       6.14.7-arch2-1 SMP preempt mod_unload
parm:           NvSwitchRegDwords:NvSwitch regkey (charp)
parm:           NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp)
parm:           NVreg_ResmanDebugLevel:int
parm:           NVreg_RmLogonRC:int
parm:           NVreg_ModifyDeviceFiles:int
parm:           NVreg_DeviceFileUID:int
parm:           NVreg_DeviceFileGID:int
parm:           NVreg_DeviceFileMode:int
parm:           NVreg_InitializeSystemMemoryAllocations:int
parm:           NVreg_UsePageAttributeTable:int
parm:           NVreg_EnablePCIeGen3:int
parm:           NVreg_EnableMSI:int
parm:           NVreg_EnableStreamMemOPs:int
parm:           NVreg_RestrictProfilingToAdminUsers:int
parm:           NVreg_PreserveVideoMemoryAllocations:int
parm:           NVreg_EnableS0ixPowerManagement:int
parm:           NVreg_S0ixPowerManagementVideoMemoryThreshold:int
parm:           NVreg_DynamicPowerManagement:int
parm:           NVreg_DynamicPowerManagementVideoMemoryThreshold:int
parm:           NVreg_EnableGpuFirmware:int
parm:           NVreg_EnableGpuFirmwareLogs:int
parm:           NVreg_OpenRmEnableUnsupportedGpus:int
parm:           NVreg_EnableUserNUMAManagement:int
parm:           NVreg_MemoryPoolSize:int
parm:           NVreg_KMallocHeapMaxSize:int
parm:           NVreg_VMallocHeapMaxSize:int
parm:           NVreg_IgnoreMMIOCheck:int
parm:           NVreg_NvLinkDisable:int
parm:           NVreg_EnablePCIERelaxedOrderingMode:int
parm:           NVreg_RegisterPCIDriver:int
parm:           NVreg_EnableResizableBar:int
parm:           NVreg_EnableDbgBreakpoint:int
parm:           NVreg_EnableNonblockingOpen:int
parm:           NVreg_RegistryDwords:charp
parm:           NVreg_RegistryDwordsPerDevice:charp
parm:           NVreg_RmMsg:charp
parm:           NVreg_GpuBlacklist:charp
parm:           NVreg_TemporaryFilePath:charp
parm:           NVreg_ExcludedGpus:charp
parm:           NVreg_DmaRemapPeerMmio:int
parm:           NVreg_RmNvlinkBandwidth:charp
parm:           NVreg_RmNvlinkBandwidthLinkCount:int
parm:           NVreg_ImexChannelCount:int
parm:           NVreg_CreateImexChannel0:int
parm:           NVreg_GrdmaPciTopoCheckOverride:int
parm:           rm_firmware_active:charp

Not sure if it's related, but I installed the latest version of linux-firmware (specifically commit 3fbaee27) from kernel.org, so I have a few files under /usr/lib/firmware/nvidia/gb202/gsp/:

$ ls -lh /usr/lib/firmware/nvidia/gb202/gsp/
total 392K
-rw-r--r-- 1 root root 195K May 24 03:14 bootloader-570.144.bin.zst
-rw-r--r-- 1 root root 196K May 24 03:14 fmc-570.144.bin.zst
lrwxrwxrwx 1 root root   35 May 24 03:14 gsp-570.144.bin.zst -> ../../ga102/gsp/gsp-570.144.bin.zst

I also tested several other combinations of driver and kernel versions, but none of them worked as expected (some of them resulted in a different error, though).

Kernel (rows) / driver (columns) | 570.133.07 | 570.144 | 570.153.02 | 575.51.02
6.14.4 | NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running. | not tested | not tested | not tested
6.14.6 | not tested | No devices were found | not tested | No devices were found
6.14.7 | not tested | NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running. | No devices were found (the reported setup) | No devices were found

ikr7 avatar May 26 '25 08:05 ikr7

@ikr7 what is the device id of your gpu? Please post the output of lspci -vs 01:00.0

The only IDs supported by the open source driver so far are 2B85, 2B87 and 2C18, 2C58 for notebooks:

src/nvidia/generated/g_nv_name_released.h:5417:    { 0x2B85, 0x0000, 0x0000, "NVIDIA GeForce RTX 5090" },
src/nvidia/generated/g_nv_name_released.h:5418:    { 0x2B87, 0x0000, 0x0000, "NVIDIA GeForce RTX 5090 D" },
src/nvidia/generated/g_nv_name_released.h:5429:    { 0x2C18, 0x0000, 0x0000, "NVIDIA GeForce RTX 5090 Laptop GPU" },
src/nvidia/generated/g_nv_name_released.h:5431:    { 0x2C58, 0x0000, 0x0000, "NVIDIA GeForce RTX 5090 Laptop GPU" },

elsaco avatar May 26 '25 18:05 elsaco

@elsaco lspci -vs 01:00.0 does not show the PCI ID; use lspci -vnns 01:00.0 instead.

foxwhite25 avatar May 26 '25 18:05 foxwhite25

@elsaco @foxwhite25

Here's the output of lspci -vnns 01:00.0 (run as root); it seems the card should be supported by the driver.

01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GB202 [GeForce RTX 5090] [10de:2b85] (rev a1) (prog-if 00 [VGA controller])
        Subsystem: ZOTAC International (MCO) Ltd. Device [19da:1761]
        Flags: bus master, fast devsel, latency 0, IRQ 68
        Memory at f8000000 (32-bit, non-prefetchable) [size=64M]
        Memory at d0000000 (64-bit, prefetchable) [size=256M]
        Memory at e0000000 (64-bit, prefetchable) [size=32M]
        I/O ports at f000 [size=128]
        Expansion ROM at 000c0000 [disabled] [size=128K]
        Capabilities: [40] Power Management version 3
        Capabilities: [48] MSI: Enable- Count=1/16 Maskable+ 64bit+
        Capabilities: [60] Express Legacy Endpoint, IntMsgNum 0
        Capabilities: [9c] Vendor Specific Information: Len=14 <?>
        Capabilities: [b0] MSI-X: Enable- Count=9 Masked-
        Capabilities: [100] Secondary PCI Express
        Capabilities: [12c] Latency Tolerance Reporting
        Capabilities: [134] Physical Resizable BAR
        Capabilities: [140] Virtual Resizable BAR
        Capabilities: [14c] Data Link Feature <?>
        Capabilities: [158] Physical Layer 16.0 GT/s <?>
        Capabilities: [188] Physical Layer 32.0 GT/s <?>
        Capabilities: [1b8] Advanced Error Reporting
        Capabilities: [200] Lane Margining at the Receiver
        Capabilities: [248] Alternative Routing-ID Interpretation (ARI)
        Capabilities: [250] Single Root I/O Virtualization (SR-IOV)
        Capabilities: [290] L1 PM Substates
        Capabilities: [2a4] Vendor Specific Information: ID=0001 Rev=1 Len=014 <?>
        Capabilities: [2bc] Power Budgeting <?>
        Capabilities: [2f4] Device Serial Number d2-ba-fa-82-95-2d-b0-48
        Kernel driver in use: nvidia
        Kernel modules: nouveau, nvidia_drm, nvidia
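
The comparison made here (the [vendor:device] pair from lspci against the RTX 5090 IDs quoted earlier from g_nv_name_released.h) can be sketched in POSIX shell. The helper names nvidia_device_id and is_listed are made up for illustration, and the ID list only contains the four entries quoted in this thread, not the driver's full table.

```sh
#!/bin/sh
# RTX 5090 device IDs quoted earlier in this thread from
# src/nvidia/generated/g_nv_name_released.h (not the full driver table).
SUPPORTED_IDS="2b85 2b87 2c18 2c58"

# nvidia_device_id (hypothetical helper): read `lspci -nn` style text on
# stdin and print the first NVIDIA (vendor 10de) device ID in lowercase.
nvidia_device_id() {
    sed -n 's/.*\[10de:\([0-9a-fA-F]\{4\}\)].*/\1/p' | head -n 1 | tr 'A-F' 'a-f'
}

# is_listed (hypothetical helper): succeed if $1 is one of SUPPORTED_IDS.
is_listed() {
    for id in $SUPPORTED_IDS; do
        if [ "$id" = "$1" ]; then
            return 0
        fi
    done
    return 1
}

# Example:
#   id=$(lspci -nns 01:00.0 | nvidia_device_id)
#   is_listed "$id" && echo "device $id is in the quoted list"
```

A match only says the ID appears in the quoted list; it does not rule out firmware, BIOS or hardware problems.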

ikr7 avatar May 26 '25 22:05 ikr7

Have you activated both Resizable BAR and Above 4G Decoding in the BIOS?
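
One way to see whether Resizable BAR is actually in effect is to look at the sizes of the 64-bit prefetchable memory regions in the lspci -v output. This is a minimal sketch in POSIX shell: prefetchable_bar_sizes is a made-up helper name, and the interpretation is an assumption: with Resizable BAR active, one prefetchable region is usually expected to be far larger than 256M, roughly the size of the card's VRAM.

```sh
#!/bin/sh
# prefetchable_bar_sizes (hypothetical helper): read `lspci -v` text on stdin
# and print the size of every 64-bit prefetchable memory region, one per line.
prefetchable_bar_sizes() {
    sed -n 's/.*Memory at [0-9a-f]* (64-bit, prefetchable) \[size=\([0-9]*[KMGT]\)].*/\1/p'
}

# Example:
#   lspci -vs 01:00.0 | prefetchable_bar_sizes
```

For the lspci output posted earlier in this thread, this prints 256M and 32M.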

While those settings were deactivated (default setting in my BIOS) I saw the same results as you:

# nvidia-smi
No devices were found

# meanwhile, output in /var/log/syslog
Jul 01 22:55:07 gpu kernel: NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
Jul 01 22:55:07 gpu kernel: NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x56:884)
Jul 01 22:55:07 gpu kernel: NVRM: The NVIDIA GPU 0000:01:00.0 (PCI ID: 10de:2b85)
                            NVRM: installed in this system requires use of the NVIDIA open kernel modules.

After activating Resizable BAR and Above 4G Decoding, I installed the driver as follows:


# from https://www.nvidia.com/de-de/drivers/details/250991/
wget "https://us.download.nvidia.com/XFree86/Linux-x86_64/575.64.05/NVIDIA-Linux-x86_64-575.64.05.run" -O "NVIDIA-Linux-x86_64-575.64.05.run"
chmod +x NVIDIA-Linux-x86_64-575.64.05.run
./NVIDIA-Linux-x86_64-575.64.05.run
# choose "MIT/GPL" for the open driver!

In addition, I edited /etc/default/grub (nano /etc/default/grub) and changed the line GRUB_CMDLINE_LINUX_DEFAULT="quiet" to GRUB_CMDLINE_LINUX_DEFAULT="quiet pci=realloc", then ran update-grub and rebooted. I'm no longer sure whether the GRUB change was needed, though.
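
The GRUB edit can also be scripted instead of done by hand in nano. A minimal sketch in POSIX shell, assuming the GRUB_CMDLINE_LINUX_DEFAULT line uses double quotes as in the example above; add_pci_realloc is a made-up helper name, and whether pci=realloc is needed at all is unconfirmed.

```sh
#!/bin/sh
# add_pci_realloc (hypothetical helper): append pci=realloc to the
# GRUB_CMDLINE_LINUX_DEFAULT line of the file given as $1.
# Does nothing if the option is already present.
add_pci_realloc() {
    file=$1
    if grep -q '^GRUB_CMDLINE_LINUX_DEFAULT=.*pci=realloc' "$file"; then
        return 0
    fi
    tmp=$(mktemp) || return 1
    # Insert the option just before the closing double quote.
    if sed 's/^\(GRUB_CMDLINE_LINUX_DEFAULT="[^"]*\)"/\1 pci=realloc"/' "$file" > "$tmp"; then
        # Write back in place so the file keeps its owner and permissions.
        cat "$tmp" > "$file"
    fi
    rm -f "$tmp"
}

# Example (as root), followed by the steps described in this comment:
#   add_pci_realloc /etc/default/grub
#   update-grub
#   reboot
```

Running the function a second time leaves the file unchanged, so it is safe to re-run.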

Anyway this has worked for me.

nvidia-smi
Sat Jul 26 15:18:39 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.64.05              Driver Version: 575.64.05      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 5090        On  |   00000000:0A:00.0 Off |                  N/A |
| 30%   42C    P8             21W /  400W |       0MiB /  32607MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Edit: updated driver recommendation

cyril23 avatar Jul 26 '25 08:07 cyril23

Thanks for your reply. However, my card also stopped outputting video, so I sent it back to the manufacturer for repair. I'll update you with more information once it's returned.

ikr7 avatar Jul 30 '25 11:07 ikr7

@cyril23 Thanks a lot! Your solution works for my case.

jiangdada1221 avatar Aug 21 '25 10:08 jiangdada1221

@cyril23 thank you for https://github.com/NVIDIA/open-gpu-kernel-modules/issues/862#issuecomment-3121487185. Installing the driver from https://www.nvidia.com/en-us/drivers/details/253003/ on Debian 12.12 and choosing "MIT/GPL" for the open driver fixed nvidia-smi for the NVIDIA GeForce RTX 5050 in a Lenovo YOGA AURA edition intel9.

nishadhka avatar Sep 15 '25 06:09 nishadhka

Got my card back from the manufacturer after repair and it works flawlessly, meaning this was a hardware malfunction. Closing the issue. Thanks y'all for the helpful comments.

ikr7 avatar Oct 11 '25 08:10 ikr7