open-gpu-kernel-modules

loading /lib/firmware/nvidia/570.133.20/gsp_ga10x.bin failed with error -4

Open liho00 opened this issue 3 months ago • 4 comments

NVIDIA Open GPU Kernel Modules Version

570.172.08

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

  • [ ] I confirm that this does not happen with the proprietary driver package.

Operating System and Version

No LSB modules are available. Description: Ubuntu 24.04.2 LTS

Kernel Release

Linux cvm 6.8.0-85-generic #85-Ubuntu SMP PREEMPT_DYNAMIC Thu Sep 18 15:26:59 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • [ ] I am running on a stable kernel release.

Hardware: GPU

nvidia-smi -L
GPU 0: NVIDIA H200 (UUID: GPU-e9c8dd70-5424-0717-37e2-0a2b101e152f)
GPU 1: NVIDIA H200 (UUID: GPU-c4690ddf-5490-8dfb-6527-a70ff5248219)
GPU 2: NVIDIA H200 (UUID: GPU-737e21a0-ace6-ffec-2f2b-b85d2bfc7649)
GPU 3: NVIDIA H200 (UUID: GPU-6bc4a1f0-256e-4106-3725-0c2942705741)
GPU 4: NVIDIA H200 (UUID: GPU-ad2b7891-04a5-7899-f311-24a1410e62ab)

It should show 8 GPUs, but only 5 are listed.

Describe the bug

[   81.335676] nvidia-modeset: Loading NVIDIA UNIX Open Kernel Mode Setting Driver for x86_64  570.172.08  Release Build  (dvs-builder@U22-I3-AF01-21-3)  Tue Jul  8 17:59:47 UTC 2025
[   81.379549] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
[   81.915785] ACPI Warning: \_SB.PCI0.S18.S00._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20230628/nsarguments-61)
[  110.628102] NVRM: nvAssertFailed: Assertion failed: 0 @ g_kern_gmmu_nvoc.h:2256
[  110.628131] NVRM: nvAssertFailed: Assertion failed: 0 @ g_kern_gmmu_nvoc.h:2256
[  112.430325] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 0
[  112.430354] nvidia 0000:01:00.0: [drm] No compatible format found
[  112.430359] nvidia 0000:01:00.0: [drm] Cannot find any crtc or sizes
[  112.431143] [drm] [nvidia-drm] [GPU ID 0x00000200] Loading driver
[  112.827594] ACPI Warning: \_SB.PCI0.S19.S00._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20230628/nsarguments-61)
[  140.713584] NVRM: nvAssertFailed: Assertion failed: 0 @ g_kern_gmmu_nvoc.h:2256
[  140.713613] NVRM: nvAssertFailed: Assertion failed: 0 @ g_kern_gmmu_nvoc.h:2256
[  142.957348] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:02:00.0 on minor 1
[  142.957378] nvidia 0000:02:00.0: [drm] No compatible format found
[  142.957383] nvidia 0000:02:00.0: [drm] Cannot find any crtc or sizes
[  142.958190] [drm] [nvidia-drm] [GPU ID 0x00000300] Loading driver
[  143.351850] ACPI Warning: \_SB.PCI0.S1A.S00._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20230628/nsarguments-61)
[  171.491409] NVRM: nvAssertFailed: Assertion failed: 0 @ g_kern_gmmu_nvoc.h:2256
[  171.491439] NVRM: nvAssertFailed: Assertion failed: 0 @ g_kern_gmmu_nvoc.h:2256
[  174.237193] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:03:00.0 on minor 2
[  174.237218] nvidia 0000:03:00.0: [drm] No compatible format found
[  174.237222] nvidia 0000:03:00.0: [drm] Cannot find any crtc or sizes
[  174.238645] [drm] [nvidia-drm] [GPU ID 0x00000400] Loading driver
[  174.634643] ACPI Warning: \_SB.PCI0.S1B.S00._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20230628/nsarguments-61)
[  202.909622] NVRM: nvAssertFailed: Assertion failed: 0 @ g_kern_gmmu_nvoc.h:2256
[  202.909657] NVRM: nvAssertFailed: Assertion failed: 0 @ g_kern_gmmu_nvoc.h:2256
[  206.198059] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:04:00.0 on minor 3
[  206.198096] nvidia 0000:04:00.0: [drm] No compatible format found
[  206.198100] nvidia 0000:04:00.0: [drm] Cannot find any crtc or sizes
[  206.198803] [drm] [nvidia-drm] [GPU ID 0x00000500] Loading driver
[  206.587998] ACPI Warning: \_SB.PCI0.S1C.S00._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20230628/nsarguments-61)
[  232.165176] systemd-journald[2045]: /var/log/journal/502ddb25b848499f9617d76c3b27b682/user-1000.journal: Journal file uses a different sequence number ID, rotating.
[  234.780660] NVRM: nvAssertFailed: Assertion failed: 0 @ g_kern_gmmu_nvoc.h:2256
[  234.780686] NVRM: nvAssertFailed: Assertion failed: 0 @ g_kern_gmmu_nvoc.h:2256
[  238.662777] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:05:00.0 on minor 4
[  238.662812] nvidia 0000:05:00.0: [drm] No compatible format found
[  238.662819] nvidia 0000:05:00.0: [drm] Cannot find any crtc or sizes
[  238.664064] [drm] [nvidia-drm] [GPU ID 0x00000600] Loading driver
[  238.717429] nvidia 0000:06:00.0: loading /lib/firmware/updates/nvidia/570.172.08/gsp_ga10x.bin failed with error -4
[  238.742589] nvidia 0000:06:00.0: loading /lib/firmware/nvidia/570.172.08/gsp_ga10x.bin failed with error -4
[  238.742605] nvidia 0000:06:00.0: Direct firmware load for nvidia/570.172.08/gsp_ga10x.bin failed with error -4
[  238.742623] NVRM: RmFetchGspRmImages: No firmware image found
[  238.742765] NVRM: GPU 0000:06:00.0: RmInitAdapter failed! (0x61:0x56:1770)
[  238.750716] NVRM: GPU 0000:06:00.0: rm_init_adapter failed, device minor number 5
[  238.759003] [drm:nv_drm_load [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000600] Failed to allocate NvKmsKapiDevice
[  238.766228] [drm:nv_drm_register_drm_device [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000600] Failed to register device
[  238.775988] [drm] [nvidia-drm] [GPU ID 0x00000700] Loading driver
[  238.827032] nvidia 0000:07:00.0: loading /lib/firmware/updates/nvidia/570.172.08/gsp_ga10x.bin failed with error -4
[  238.849396] nvidia 0000:07:00.0: loading /lib/firmware/nvidia/570.172.08/gsp_ga10x.bin failed with error -4
[  238.849414] nvidia 0000:07:00.0: Direct firmware load for nvidia/570.172.08/gsp_ga10x.bin failed with error -4
[  238.849435] NVRM: RmFetchGspRmImages: No firmware image found
[  238.849562] NVRM: GPU 0000:07:00.0: RmInitAdapter failed! (0x61:0x56:1770)
[  238.859194] NVRM: GPU 0000:07:00.0: rm_init_adapter failed, device minor number 6
[  238.868979] [drm:nv_drm_load [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000700] Failed to allocate NvKmsKapiDevice
[  238.877240] [drm:nv_drm_register_drm_device [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000700] Failed to register device
[  238.885251] [drm] [nvidia-drm] [GPU ID 0x00000800] Loading driver
[  238.944361] nvidia 0000:08:00.0: loading /lib/firmware/updates/nvidia/570.172.08/gsp_ga10x.bin failed with error -4
[  238.963752] nvidia 0000:08:00.0: loading /lib/firmware/nvidia/570.172.08/gsp_ga10x.bin failed with error -4
[  238.963760] nvidia 0000:08:00.0: Direct firmware load for nvidia/570.172.08/gsp_ga10x.bin failed with error -4
[  238.963893] NVRM: RmFetchGspRmImages: No firmware image found
[  238.963982] NVRM: GPU 0000:08:00.0: RmInitAdapter failed! (0x61:0x56:1770)
[  238.971855] NVRM: GPU 0000:08:00.0: rm_init_adapter failed, device minor number 7
[  238.979302] [drm:nv_drm_load [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000800] Failed to allocate NvKmsKapiDevice
[  238.986998] [drm:nv_drm_register_drm_device [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000800] Failed to register device
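
A minimal sketch of how a log like the one above could be summarized per PCI device. This is not an NVIDIA tool; the function name `summarize_gpu_init` and both regular expressions are assumptions based only on the message formats visible in this log, and other driver versions may word these messages differently.

```python
import re

# Assumed patterns, derived only from the message formats in the log above.
OK_RE = re.compile(r"Initialized nvidia-drm .* for (\S+) on minor (\d+)")
FAIL_RE = re.compile(
    r"NVRM: GPU (\S+): rm_init_adapter failed, device minor number (\d+)"
)


def summarize_gpu_init(dmesg_text):
    """Return (initialized, failed) lists of PCI addresses, in log order."""
    initialized, failed = [], []
    for line in dmesg_text.splitlines():
        m = OK_RE.search(line)
        if m:
            initialized.append(m.group(1))
            continue
        m = FAIL_RE.search(line)
        if m:
            failed.append(m.group(1))
    return initialized, failed


# Two lines copied from the log above, used as sample input.
SAMPLE = """\
[  238.662777] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:05:00.0 on minor 4
[  238.750716] NVRM: GPU 0000:06:00.0: rm_init_adapter failed, device minor number 5
"""

if __name__ == "__main__":
    ok, bad = summarize_gpu_init(SAMPLE)
    print("initialized:", ok)
    print("failed:", bad)
```

Feeding it the full `dmesg` output instead of `SAMPLE` would list which devices initialized and which failed.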

To Reproduce

This happens on every boot.

Bug Incidence

Once

nvidia-bug-report.log.gz

nvidia-bug-report.log.gz

More Info

No response

liho00 avatar Oct 04 '25 17:10 liho00

[  238.944361] nvidia 0000:08:00.0: loading /lib/firmware/updates/nvidia/570.172.08/gsp_ga10x.bin failed with error -4
[  238.963752] nvidia 0000:08:00.0: loading /lib/firmware/nvidia/570.172.08/gsp_ga10x.bin failed with error -4
[  238.963760] nvidia 0000:08:00.0: Direct firmware load for nvidia/570.172.08/gsp_ga10x.bin failed with error -4

do either of these exist on your filesystem?

/lib/firmware/updates/nvidia/570.172.08/gsp_ga10x.bin
/lib/firmware/nvidia/570.172.08/gsp_ga10x.bin

or, what does find /lib/firmware -name gsp_ga10x.bin return?

I suspect this is going to end up being a packaging issue.
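
The existence check asked for above can be sketched in Python as follows. The two base directories are taken from the log lines quoted above. The helper names `firmware_candidates` and `check_firmware` and the `root` parameter are hypothetical, added for illustration only. The kernel may search further locations that this sketch does not cover.

```python
import os

# Base directories as they appear in the log lines quoted above.
FIRMWARE_DIRS = ("lib/firmware/updates", "lib/firmware")


def firmware_candidates(version, name="gsp_ga10x.bin", root="/"):
    """Build the candidate firmware paths seen in the log for a driver version."""
    return [
        os.path.join(root, base, "nvidia", version, name)
        for base in FIRMWARE_DIRS
    ]


def check_firmware(version, name="gsp_ga10x.bin", root="/"):
    """Map each candidate path to its size in bytes, or None if it is absent."""
    result = {}
    for path in firmware_candidates(version, name, root):
        result[path] = os.path.getsize(path) if os.path.isfile(path) else None
    return result


if __name__ == "__main__":
    for path, size in check_firmware("570.172.08").items():
        print(path, "missing" if size is None else f"{size} bytes")
```

A path reported as missing here would point toward the packaging explanation. A path reported with a plausible size would suggest the failure lies elsewhere.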

aritger avatar Oct 06 '25 05:10 aritger

Hi @aritger

modinfo nvidia | grep '^version'
ls -l /lib/firmware/nvidia/570.172.08/
ls -l /lib/firmware/nvidia/*/gsp_ga10x.bin
version: 570.172.08
total 90580
-rw-r--r-- 1 root root 63858416 Jul 8 17:48 gsp_ga10x.bin
-rw-r--r-- 1 root root 28890200 Jul 8 17:48 gsp_tu10x.bin
-rw-r--r-- 1 root root 63858416 Jul 8 17:48 /lib/firmware/nvidia/570.172.08/gsp_ga10x.bin

I am 100% sure the files exist.

I do have gsp_ga10x.bin in the right place (/lib/firmware/nvidia/570.172.08/, size 63 MB), so the firmware isn't missing, but the driver still failed with error -4 when trying to load it. In the kernel firmware-loading context, error -4 is -EINTR ("Interrupted system call"), which usually means:

The firmware was found, but loading was interrupted or blocked.
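
As a quick cross-check of the errno mapping, here is a small Python sketch. The helper name `describe_kernel_error` is hypothetical, and the lookup uses the errno table of the platform Python runs on, so it is only meaningful when run on Linux (or another system where errno 4 is EINTR).

```python
import errno
import os


def describe_kernel_error(code):
    """Translate a negative kernel-style return code into (symbol, message)."""
    number = abs(code)
    symbol = errno.errorcode.get(number, "UNKNOWN")
    return symbol, os.strerror(number)


if __name__ == "__main__":
    # The value reported by the "failed with error -4" log lines.
    print(describe_kernel_error(-4))
```

This only decodes the number. It does not explain why the firmware load was interrupted.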

liho00 avatar Oct 06 '25 07:10 liho00

Thanks for the reply. I don't know why nvidia.ko's call to request_firmware() would encounter EINTR.

Is there anything special/unusual about your Linux distro installation or configuration? Skimming the kernel sources for the "failed with error" message, it looks like there is some handling for "partial reads". Since the firmware is pretty large, maybe there is something configured on this system to define the maximum firmware size?

Or maybe it is somehow sensitive to the number of GPUs in the system (8, in this case)?

Wild guess: does the behavior change if you set the nvidia.ko kernel module parameter NVreg_EnableNonblockingOpen to 0? E.g., modprobe nvidia NVreg_EnableNonblockingOpen=0?

aritger avatar Oct 06 '25 21:10 aritger

Hi @aritger, I'm facing a similar issue.

1. Open drivers:

[ 10.842107] NVRM: s_vbiosPatchInterfaceData: too few interface entires found for FWSEC cmd 0x15
[ 10.842112] NVRM: s_prepareForFwsec_TU102: Falcon ucode from hs
[ 10.842114] NVRM: s_prepareForFwsec_TU102: failed to prepare interface data for FWSEC cmd 0x15: 0x25
[ 10.842116] NVRM: s_prepareForFwsec_TU102: (note: VBIOS version 94.02.71.40.83)
[ 10.842119] NVRM: nvCheckOkFailedNoLog: Check failed: Invalid data passed [NV_ERR_INVALID_DATA] (0x00000025) returned from kgspPrepareForBootstrap_HAL(pGpu, pKernelGsp, KGSP_BOOT_MODE_NORMAL) @ kernel_gsp.c:3664
[ 10.842171] NVRM: RmInitAdapter: Cannot initialize GSP firmware RM
[ 10.843941] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x62:0x25:2015)
[ 10.845693] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0

2. Proprietary drivers:

nvidia-bug-report.log.gz

Could you help me understand what caused this issue, especially the "Cannot initialize GSP firmware RM" message? I have a GeForce RTX 3080.

hrushirajg23 avatar Nov 10 '25 04:11 hrushirajg23