open-gpu-kernel-modules

Nvidia RTX 5070TI | 575.64.03 drivers | Suspend Issue

Open Mario156090 opened this issue 5 months ago • 10 comments

NVIDIA Open GPU Kernel Modules Version

575.64.03

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

  • [x] I confirm that this does not happen with the proprietary driver package.

Operating System and Version

Arch Linux

Kernel Release

6.15.4-zen2-1-zen

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • [x] I am running on a stable kernel release.

Hardware: GPU

GPU 0: NVIDIA GeForce RTX 5070 Ti Laptop GPU (UUID: GPU-846a94a0-8570-27c4-9943-057ea9ee7cea)

Describe the bug

After suspending the laptop, an error appears in dmesg indicating that it cannot be suspended and then an exception related to nv_pmops_runtime_suspend is generated.

To Reproduce

Turn on the laptop, suspend it, then check dmesg.

Bug Incidence

Sometimes

nvidia-bug-report.log.gz

nvidia-bug-report.log.gz

The bug report script gets stuck; here is the image:

Image

I will attach the journal output from my system so you can read the logs.

Error logs.zip

More Info

No response

Mario156090 avatar Jul 04 '25 17:07 Mario156090

❯ cat journal.log | grep -i nvrm
jul 03 20:02:06 msi-arch kernel: NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64 575.64.03 Release Build (root@)
jul 03 20:04:13 msi-arch kernel: NVRM: testIfDsmSubFunctionEnabled: GPS ACPI DSM called before _acpiDsmSupportedFuncCacheInit subfunction = 11.
jul 03 20:04:17 msi-arch kernel: NVRM: testIfDsmSubFunctionEnabled: GPS ACPI DSM called before _acpiDsmSupportedFuncCacheInit subfunction = 11.
jul 03 20:05:27 msi-arch kernel: NVRM: testIfDsmSubFunctionEnabled: GPS ACPI DSM called before _acpiDsmSupportedFuncCacheInit subfunction = 11.
jul 03 20:05:29 msi-arch kernel: NVRM: testIfDsmSubFunctionEnabled: GPS ACPI DSM called before _acpiDsmSupportedFuncCacheInit subfunction = 11.
jul 03 20:06:05 msi-arch kernel: NVRM: testIfDsmSubFunctionEnabled: GPS ACPI DSM called before _acpiDsmSupportedFuncCacheInit subfunction = 11.
jul 03 20:06:08 msi-arch kernel: NVRM: testIfDsmSubFunctionEnabled: GPS ACPI DSM called before _acpiDsmSupportedFuncCacheInit subfunction = 11.
jul 03 20:27:20 msi-arch kernel: NVRM: testIfDsmSubFunctionEnabled: GPS ACPI DSM called before _acpiDsmSupportedFuncCacheInit subfunction = 11.
jul 04 07:37:42 msi-arch kernel: NVRM: testIfDsmSubFunctionEnabled: GPS ACPI DSM called before _acpiDsmSupportedFuncCacheInit subfunction = 11.
jul 04 11:54:53 msi-arch kernel: NVRM: testIfDsmSubFunctionEnabled: GPS ACPI DSM called before _acpiDsmSupportedFuncCacheInit subfunction = 11.
jul 04 11:54:55 msi-arch kernel: NVRM: testIfDsmSubFunctionEnabled: GPS ACPI DSM called before _acpiDsmSupportedFuncCacheInit subfunction = 11.
jul 04 11:54:55 msi-arch kernel: NVRM: testIfDsmSubFunctionEnabled: GPS ACPI DSM called before _acpiDsmSupportedFuncCacheInit subfunction = 11.
jul 04 11:54:56 msi-arch kernel: NVRM: testIfDsmSubFunctionEnabled: GPS ACPI DSM called before _acpiDsmSupportedFuncCacheInit subfunction = 11.
jul 04 11:55:00 msi-arch kernel: NVRM: testIfDsmSubFunctionEnabled: GPS ACPI DSM called before _acpiDsmSupportedFuncCacheInit subfunction = 11.
jul 04 11:55:03 msi-arch kernel: NVRM: testIfDsmSubFunctionEnabled: GPS ACPI DSM called before _acpiDsmSupportedFuncCacheInit subfunction = 11.
jul 04 11:55:11 msi-arch kernel: NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1353
jul 04 11:55:11 msi-arch kernel: NVRM: nvAssertOkFailedNoLog: Assertion failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from status @ kernel_gsp.c:4615
jul 04 11:55:11 msi-arch kernel: NVRM: nvAssertOkFailedNoLog: Assertion failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from kgspCreateRadix3(pGpu, pKernelGsp, &pKernelGsp->pSRRadix3Descriptor, NULL, NULL, gspfwSRMeta.sizeOfSuspendResumeData) @ kernel_gsp_tu102.c:1303
jul 04 11:55:28 msi-arch kernel: NVRM: Error in service of callback
jul 04 11:59:54 msi-arch kernel: NVRM: GPU 0000:01:00.0: PreserveVideoMemoryAllocations module parameter is set. System Power Management attempted without driver procfs suspend interface. Please refer to the 'Configuring Power Management Support' section in the driver README.
jul 04 11:59:55 msi-arch kernel: NVRM: GPU 0000:01:00.0: PreserveVideoMemoryAllocations module parameter is set. System Power Management attempted without driver procfs suspend interface. Please refer to the 'Configuring Power Management Support' section in the driver README.
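
The last two messages say that suspend was attempted without going through the driver's procfs suspend interface. As a rough sketch, the check below reports whether the systemd sleep units that normally drive that interface are enabled. It assumes the driver package installed `nvidia-suspend.service`, `nvidia-hibernate.service` and `nvidia-resume.service` (packaging can differ between distributions); the function name `check_nvidia_pm_units` is made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: report whether the NVIDIA systemd sleep units are enabled.
# Assumption: the driver package ships the three units listed below.

check_nvidia_pm_units() {
    local unit state missing=0
    for unit in nvidia-suspend.service nvidia-hibernate.service nvidia-resume.service; do
        # "systemctl is-enabled" prints the unit state and exits non-zero
        # when the unit is disabled or cannot be found.
        state=$(systemctl is-enabled "$unit" 2>/dev/null) || missing=1
        printf '%s: %s\n' "$unit" "${state:-not-found}"
    done
    # Non-zero return means at least one unit is not enabled.
    return "$missing"
}
```

Running `check_nvidia_pm_units` prints one line per unit. If any of them is not enabled, `sudo systemctl enable` on that unit is the usual next step, but the driver README section named in the log message remains the authoritative reference.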

Mario156090 avatar Jul 04 '25 23:07 Mario156090

It happened again. This time I managed to capture the bug report, and I attached it to the original post.

Image

Mario156090 avatar Jul 05 '25 13:07 Mario156090

see https://github.com/NVIDIA/open-gpu-kernel-modules/issues/887#issuecomment-3054482747

lumingzh avatar Jul 11 '25 23:07 lumingzh

Got this as well. I think one of the newer Linux kernels released last week broke suspend (on Arch, beginning with 6.15.5, I believe), or possibly the linux-firmware update that arrived around the same time. As of 6.15.6 the issue still persists.

The laptop now suspends for ~10s before waking itself up again. Hibernate works okay.

[  751.267398] PM: suspend entry (s2idle)
[  751.653756] Filesystems sync: 0.386 seconds
[  751.749605] Bluetooth: hci0: Invalid exception type 04
[  751.754904] Freezing user space processes
[  751.756938] Freezing user space processes completed (elapsed 0.002 seconds)
[  751.756944] OOM killer disabled.
[  751.756945] Freezing remaining freezable tasks
[  751.757963] Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
[  751.757970] printk: Suspending console(s) (use no_console_suspend to debug)
[  751.985523] ACPI: EC: interrupt blocked
[  764.820176] ACPI: EC: interrupt unblocked
[  764.838179] ACPI Warning: Time parameter 250 us > 100 us violating ACPI spec, please fix the firmware. (20240827/exsystem-142)
[  765.225486] nvidia 0000:02:00.0: Enabling HDA controller
[  765.360392] pci 0000:00:08.0: Setting to D3hot
[  765.377666] spd5118 5-0051: Failed to write b = 0: -6
[  765.377672] spd5118 5-0051: PM: dpm_run_callback(): spd5118_resume [spd5118] returns -6
[  765.377685] spd5118 5-0051: PM: failed to resume async: error -6
[  765.378005] spd5118 5-0053: Failed to write b = 0: -6
[  765.378012] spd5118 5-0053: PM: dpm_run_callback(): spd5118_resume [spd5118] returns -6
[  765.378021] spd5118 5-0053: PM: failed to resume async: error -6
[  765.386423] nvme nvme1: D3 entry latency set to 10 seconds
[  765.391382] nvme nvme0: D3 entry latency set to 10 seconds
[  765.399188] nvme nvme0: 24/0/0 default/read/poll queues
[  765.438618] Bluetooth: hci0: Invalid exception type 04
[  765.483940] nvme nvme1: 15/0/0 default/read/poll queues
[  765.525282] OOM killer enabled.
[  765.525284] Restarting tasks ... done.
[  765.528107] random: crng reseeded on system resumption
[  765.646798] PM: suspend exit

(Didn't find much in the logs though)
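
For what it's worth, the gap between the two EC lines in the trace above gives a rough measure of how long the machine actually stayed suspended. A small sketch, assuming the printk timestamps keep advancing during s2idle (which may not hold on every system); `suspend_window` is a made-up helper name:

```shell
#!/usr/bin/env bash
# Sketch: print the seconds between "EC: interrupt blocked" and
# "EC: interrupt unblocked" in dmesg-style output with [ seconds ] stamps.

suspend_window() {
    # Split on the square brackets so that field 2 is the timestamp.
    awk -F'[][]' '
        /EC: interrupt blocked/   { start = $2 + 0 }
        /EC: interrupt unblocked/ { if (start) printf "%.1f\n", $2 - start }
    ' "$@"
}
```

With the log above saved to a file, `suspend_window dmesg.txt` prints 12.8, in line with the roughly 10 s suspend described earlier.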

Additionally, it seems DP-out (via USB-C / TB5 on my laptop) was also broken by the new kernel. Right now only HDMI-out works.

sk0rabu avatar Jul 12 '25 03:07 sk0rabu

Hi all, thanks for reporting the issue. Could you please apply the patch below and see if it fixes the issue? https://github.com/NVIDIA/open-gpu-kernel-modules/commit/c7e72135da83ff027755b4a61a3ff09a32fe00c3
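
One possible way to try the commit, sketched under the assumption that you build the modules from a git checkout of this repository at the tag matching your installed driver. The helper names `commit_patch_url` and `apply_commit` are made up for illustration; the only GitHub behavior relied on is that a commit URL with `.patch` appended returns the commit as a plain patch.

```shell
#!/usr/bin/env bash
# Sketch: fetch a GitHub commit as a patch and apply it to a source checkout.

commit_patch_url() {
    # $1: full commit URL; GitHub serves the patch at "<url>.patch".
    printf '%s.patch\n' "$1"
}

apply_commit() {
    # $1: commit URL, $2: path to the open-gpu-kernel-modules checkout
    local patch
    patch=$(mktemp) || return 1
    curl -fsSL "$(commit_patch_url "$1")" -o "$patch" || return 1
    # Dry run first: this fails if the change is already present
    # in the tree or does not fit the checked-out version.
    git -C "$2" apply --check "$patch" && git -C "$2" apply "$patch"
}
```

For example, `apply_commit https://github.com/NVIDIA/open-gpu-kernel-modules/commit/c7e72135da83ff027755b4a61a3ff09a32fe00c3 ~/open-gpu-kernel-modules`, followed by rebuilding and reinstalling the modules as described in the repository README. If your distribution builds the modules through its own packaging, the patch has to go into that build instead.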

amrit1711 avatar Jul 14 '25 09:07 amrit1711

Hi All, Thanks for reporting issue, could you please apply below patch and see if it fixes the issue. c7e7213

I don't know how to try that patch, but since 575.64.05 I think the bug has been fixed, because there are no more errors.

Image

Mario156090 avatar Jul 25 '25 17:07 Mario156090

All clear for me too - I believe an intermediate kernel update might also have played a part.

sk0rabu avatar Jul 26 '25 11:07 sk0rabu

@sk0rabu The patch @amrit1711 proposed seems to have made it into a release before .05 and likely fixed the issue. Either that, or the patch no longer applies fully on .05: when I tried applying it on Gentoo, it gave an error saying there's nothing to patch(?!).

Cloudwalk9 avatar Jul 31 '25 10:07 Cloudwalk9

@amrit1711 Well, after some time I am getting an error after suspend again:

This happened when I suspended the laptop while a game was using the GPU. After that, RTD3 no longer worked and the GPU stayed powered on until I rebooted or powered off the laptop.

I will attach the nvidia bug report zip.

nvidia-bug-report.log.gz

[ 9.098158] NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64 575.64.05 Release Build (root@)
[ 11.078195] NVRM: testIfDsmSubFunctionEnabled: GPS ACPI DSM called before _acpiDsmSupportedFuncCacheInit subfunction = 11.
[ 13.258310] NVRM: testIfDsmSubFunctionEnabled: GPS ACPI DSM called before _acpiDsmSupportedFuncCacheInit subfunction = 11.
[ 8885.158530] NVRM: _kdispHandleAwakenChnMask: seeing an awaken in channel 0 without an associated awaken event
[ 8894.997915] NVRM: Error in service of callback
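
To confirm that RTD3 has stopped working as described above, the kernel's runtime power management state for the GPU can be read from sysfs. A minimal sketch, assuming the GPU sits at PCI address 0000:01:00.0 as in the earlier NVRM messages; `gpu_runtime_status` is a made-up helper, and its second argument exists only so the sysfs root can be overridden:

```shell
#!/usr/bin/env bash
# Sketch: print the kernel's runtime PM status for a PCI device.

gpu_runtime_status() {
    # $1: PCI address, e.g. 0000:01:00.0
    # $2: sysfs root, defaults to /sys
    local root=${2:-/sys}
    cat "$root/bus/pci/devices/$1/power/runtime_status"
}
```

`gpu_runtime_status 0000:01:00.0` should print `suspended` when the GPU is idle and RTD3 is working, and `active` while the GPU is powered. The driver also reports its own view in `/proc/driver/nvidia/gpus/0000:01:00.0/power`.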

Mario156090 avatar Jul 31 '25 13:07 Mario156090

I have the same issue on 575.64.05 - same behavior as http://github.com/NVIDIA/open-gpu-kernel-modules/issues/896#issuecomment-3064558041

francoism90 avatar Jul 31 '25 18:07 francoism90