
RMAPI_GPU_LOCK_INTERNAL alloc requested without holding the RMAPI lock

FlyGoat opened this issue 2 years ago • 7 comments

NVIDIA Open GPU Kernel Modules Version

545.29.06

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

  • [X] I confirm that this does not happen with the proprietary driver package.

Operating System and Version

Arch Linux

Kernel Release

6.6.7

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • [X] I am running on a stable kernel release.

Hardware: GPU

NVIDIA GeForce RTX 4060 (10de:28e0)

Describe the bug

dmesg is spammed with:

 NVRM rmapiAllocWithSecInfo: RMAPI_GPU_LOCK_INTERNAL alloc requested without holding the RMAPI lock

To Reproduce

Boot on such a system and check dmesg.
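
For example, to check for the message (the grep pattern is taken from the line quoted above):

$ sudo dmesg | grep 'RMAPI_GPU_LOCK_INTERNAL'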

Bug Incidence

Always

nvidia-bug-report.log.gz

More Info

I tried to debug the issue by adding an os_stack_trace() call right after the point where the message is printed, and got the following backtrace:

[  854.764480] NVRM rmapiAllocWithSecInfo: RMAPI_GPU_LOCK_INTERNAL alloc requested without holding the RMAPI lock
[  854.764483] CPU: 15 PID: 5542 Comm: kworker/15:0 Tainted: G           OE      6.6.7-arch1-1 #1 4505c4baa0b3d7c4037b0e8f5402626fa360717f
[  854.764486] Hardware name: ASUSTeK COMPUTER INC. ROG Zephyrus G14 GA402XV_GA402XV/GA402XV, BIOS GA402XV.313 08/10/2023
[  854.764487] Workqueue: pm pm_runtime_work
[  854.764490] Call Trace:
[  854.764491]  <TASK>
[  854.764492]  dump_stack_lvl+0x47/0x60
[  854.764496]  rmapiAllocWithSecInfo+0x306/0x410 [nvidia 17e5ed799529ff7ce72e190f3462b0a04291b9d9]
[  854.764574]  ? srso_alias_return_thunk+0x5/0x7f
[  854.764575]  ? __kmem_cache_alloc_node+0x1a6/0x340
[  854.764577]  ? os_alloc_mem+0xc8/0xe0 [nvidia 17e5ed799529ff7ce72e190f3462b0a04291b9d9]
[  854.764631]  ? os_alloc_mem+0xc8/0xe0 [nvidia 17e5ed799529ff7ce72e190f3462b0a04291b9d9]
[  854.764684]  ? srso_alias_return_thunk+0x5/0x7f
[  854.764686]  ? __kmalloc+0x50/0x150
[  854.764689]  rmapiAlloc+0x27/0x40 [nvidia 17e5ed799529ff7ce72e190f3462b0a04291b9d9]
[  854.764758]  memdescSendMemDescToGSP+0x171/0x2c0 [nvidia 17e5ed799529ff7ce72e190f3462b0a04291b9d9]
[  854.764837]  ? memdescSendMemDescToGSP+0x120/0x2c0 [nvidia 17e5ed799529ff7ce72e190f3462b0a04291b9d9]
[  854.764912]  fbsrCopyMemoryMemDesc_GM107+0x46a/0xe80 [nvidia 17e5ed799529ff7ce72e190f3462b0a04291b9d9]
[  854.764988]  ? _issueRpcAndWait+0x3c/0x210 [nvidia 17e5ed799529ff7ce72e190f3462b0a04291b9d9]
[  854.765074]  _memmgrWalkHeap+0x156/0x680 [nvidia 17e5ed799529ff7ce72e190f3462b0a04291b9d9]
[  854.765146]  memmgrSavePowerMgmtState_KERNEL+0x18b/0x320 [nvidia 17e5ed799529ff7ce72e190f3462b0a04291b9d9]
[  854.765218]  gpuPowerManagementEnter.constprop.0+0x6a/0x2e0 [nvidia 17e5ed799529ff7ce72e190f3462b0a04291b9d9]
[  854.765295]  ? srso_alias_return_thunk+0x5/0x7f
[  854.765298]  gpuEnterStandby_IMPL+0x109/0x280 [nvidia 17e5ed799529ff7ce72e190f3462b0a04291b9d9]
[  854.765369]  ? srso_alias_return_thunk+0x5/0x7f
[  854.765372]  RmPowerManagementInternal+0x113/0x1a0 [nvidia 17e5ed799529ff7ce72e190f3462b0a04291b9d9]
[  854.765451]  RmGcxPowerManagement+0x2fc/0x360 [nvidia 17e5ed799529ff7ce72e190f3462b0a04291b9d9]
[  854.765522]  ? rmGpuLocksAcquire+0xbb/0x130 [nvidia 17e5ed799529ff7ce72e190f3462b0a04291b9d9]
[  854.765601]  rm_transition_dynamic_power+0x83/0x122 [nvidia 17e5ed799529ff7ce72e190f3462b0a04291b9d9]
[  854.765671]  ? srso_alias_return_thunk+0x5/0x7f
[  854.765676]  nv_pmops_runtime_suspend+0x6f/0x100 [nvidia 17e5ed799529ff7ce72e190f3462b0a04291b9d9]
[  854.765726]  pci_pm_runtime_suspend+0x67/0x1e0
[  854.765728]  ? __pfx_pci_pm_runtime_suspend+0x10/0x10
[  854.765730]  __rpm_callback+0x41/0x170
[  854.765732]  ? __pfx_pci_pm_runtime_suspend+0x10/0x10
[  854.765734]  rpm_callback+0x5d/0x70
[  854.765736]  ? __pfx_pci_pm_runtime_suspend+0x10/0x10
[  854.765738]  rpm_suspend+0x120/0x6a0
[  854.765740]  ? __pfx_pci_pm_runtime_idle+0x10/0x10
[  854.765742]  pm_runtime_work+0x84/0xb0
[  854.765745]  process_one_work+0x171/0x340
[  854.765747]  worker_thread+0x27b/0x3a0
[  854.765749]  ? __pfx_worker_thread+0x10/0x10
[  854.765750]  kthread+0xe5/0x120
[  854.765752]  ? __pfx_kthread+0x10/0x10
[  854.765754]  ret_from_fork+0x31/0x50
[  854.765756]  ? __pfx_kthread+0x10/0x10
[  854.765758]  ret_from_fork_asm+0x1b/0x30
[  854.765762]  </TASK>

The backtrace is almost identical each time the message is printed.
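
For reference, a minimal sketch of the debug change described above, assuming the message is printed via NV_PRINTF from rmapiAllocWithSecInfo() (the exact file and print call are assumptions inferred from the "NVRM rmapiAllocWithSecInfo:" dmesg prefix; os_stack_trace() is the existing helper mentioned above):

// Hypothetical context: inside rmapiAllocWithSecInfo(), at the point where
// the RMAPI lock check fails and the warning is printed.
NV_PRINTF(LEVEL_WARNING,
          "RMAPI_GPU_LOCK_INTERNAL alloc requested without holding the RMAPI lock\n");
os_stack_trace(); // added for debugging: dumps the current kernel call stack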

FlyGoat avatar Dec 17 '23 15:12 FlyGoat

Can confirm the issue on NixOS as well. It has been happening with earlier kernel versions on my system too; my current setup is:

Hardware: GeForce 3080 Ti (mobile)
Driver version: 545.29.06
Kernel: 6.6.8

In my case this mostly happens when waking from suspend, or after the machine has been sitting around unused for some time.
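
Since the backtrace above comes from the runtime-suspend path, the message may correlate with the GPU entering runtime suspend. One way to watch for that is the device's runtime PM status in sysfs (the PCI address here is an example; check lspci for your GPU's address):

$ cat /sys/bus/pci/devices/0000:01:00.0/power/runtime_status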

relief-melone avatar Jan 04 '24 10:01 relief-melone

Same here, I got a hang right after updating to 545 today.

Possible Fix

After the error message above appeared, I found a line at the very end: hdaudio hdaudioC0D2: Unable to configure, disabling. After searching online, I tried the fix below and was able to log back into Ubuntu normally.

# use recovery mode and enter a root shell,
# then edit /etc/default/grub (e.g. with nano) and make the change below

# original
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash nomodeset"

# changed: drop nomodeset
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"

# then run `update-grub` and `reboot`
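
After rebooting, the change can be verified by checking that nomodeset no longer appears on the kernel command line:

$ cat /proc/cmdline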

TZECHIN6 avatar Jan 12 '24 15:01 TZECHIN6

Nvidia folks, any chance we can get this fixed in the next release?

FlyGoat avatar Jan 31 '24 12:01 FlyGoat

This bug is not fun! At a minimum it's an indefinite delay, and at worst it's a crash on shutdown with no ability to open a TTY and save the day.

This gets spammed in dmesg:

NVRM rmapiAllocWithSecInfo: RMAPI_GPU_LOCK_INTERNAL alloc requested without holding the RMAPI lock

I also get this spam in dmesg:

[  122.596895] NVRM nbsiReadRegistryDword: osReadRegistryDword called in Sleep path can cause excessive delays!
[  122.596903] NVRM nvAssertFailedNoLog: Assertion failed: 0 @ nbsi_osrg.c:107

$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX Open Kernel Module for x86_64  545.23.08  Release Build  (dvs-builder@U16-I3-A16-1-1)

apolopena avatar Feb 14 '24 07:02 apolopena

Hi, thanks for the report! This is tracked internally as bug 4074148. It's hard to say which release will get the fix, due to schedules and release branching.

For whatever it's worth, the root cause of the print from that particular call stack (rm_power_management()) was found to be "harmless", except for the print spam. Any other issues you are seeing are likely to be independent and deserve a separate bug report.

mtijanic avatar Feb 14 '24 17:02 mtijanic

Still happening with 550.78.

[   15.007542] NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64  550.78  Release Build  (dvs-builder@U16-I1-N08-06-4)  Sun Apr 14 06:38:24 UTC 2024
[   15.302560] NVRM: testIfDsmSubFunctionEnabled: GPS ACPI DSM called before _acpiDsmSupportedFuncCacheInit subfunction = 11.
[   36.803975] NVRM: rmapiAllocWithSecInfo: RMAPI_GPU_LOCK_INTERNAL alloc requested without holding the RMAPI lock

scaronni avatar Apr 27 '24 19:04 scaronni

I can confirm the issue on kernel 6.11.5 too. I had filed a bug report on Debian about a side effect related to this issue: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1086638#34. The data and data1 files there contain the logs to verify it.

fritzmatias avatar Nov 13 '24 12:11 fritzmatias