open-gpu-kernel-modules

System freeze on 580.82.07 due to GSP Timeout

Open JonasGeiping opened this issue 2 months ago • 4 comments

NVIDIA Open GPU Kernel Modules Version

580.82.07

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

  • [ ] I confirm that this does not happen with the proprietary driver package.

Operating System and Version

Pop!_OS 24.04 LTS

Kernel Release

6.16.3-76061603-generic

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • [x] I am running on a stable kernel release.

Hardware: GPU

NVIDIA GeForce RTX 5070 Ti

Describe the bug

Even low activity, such as browsing in Chrome, leads after a few minutes to a complete freeze of the GUI, from which the system cannot recover.

The error log is:

Oct 03 17:59:17 mnemosyne kernel: NVRM: GPU at PCI:0000:01:00: GPU-1139a0a3-3982-ae1d-ccd6-8538802f3bdc
Oct 03 17:59:17 mnemosyne kernel: NVRM: GPU Board Serial Number: 0
Oct 03 17:59:17 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 62, 322bff9a 0000b3b8 00000000 206a4b08 206a3cc8 206a3e36 206a2284 206a2ad6
Oct 03 17:59:22 mnemosyne systemd[1]: bluetooth.service - Bluetooth service was skipped because of an unmet condition check (ConditionPathIsDirectory=/sys/class/bluetooth).
Oct 03 17:59:47 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 45, pid=2171, name=xdg-desktop-por, channel 0x0000000a
Oct 03 17:59:47 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 45, pid=2145, name=cosmic-app-libr, channel 0x0000000b
Oct 03 17:59:47 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 45, pid=2168, name=cosmic-files-ap, channel 0x0000000c
Oct 03 17:59:47 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 45, pid=2156, name=cosmic-workspac, channel 0x0000000d
Oct 03 17:59:47 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 45, pid=2171, name=xdg-desktop-por, channel 0x0000000e
Oct 03 17:59:47 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 45, pid=2171, name=xdg-desktop-por, channel 0x0000000f
Oct 03 17:59:47 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 45, pid=2156, name=cosmic-workspac, channel 0x00000010
Oct 03 17:59:47 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 45, pid=2156, name=cosmic-workspac, channel 0x00000011
Oct 03 17:59:47 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 45, pid=2145, name=cosmic-app-libr, channel 0x00000012
Oct 03 17:59:47 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 45, pid=2145, name=cosmic-app-libr, channel 0x00000013
Oct 03 17:59:47 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 45, pid=2168, name=cosmic-files-ap, channel 0x00000014
Oct 03 18:00:02 mnemosyne kernel: NVRM: _kgspLogXid119: ********************************* GSP Timeout **********************************
Oct 03 18:00:02 mnemosyne kernel: NVRM: _kgspLogXid119: Note: Please also check logs above.
Oct 03 18:00:02 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 119, Timeout after 45s of waiting for RPC response from GPU0 GSP! Expected function 76 (GSP_RM_CONTROL) sequence 6554 (0x20800a56 0x5c).
Oct 03 18:00:02 mnemosyne kernel: NVRM:     task watchdog timeout @ pc:0x1a82424, partition:4#0, task:3
Oct 03 18:00:02 mnemosyne kernel: NVRM:     Reported by libos partition:4#5 kernel v3.1 [0] @ ts:712
Oct 03 18:00:02 mnemosyne kernel: NVRM:     RISC-V CSR State:
Oct 03 18:00:02 mnemosyne kernel: NVRM:         sstatus:0x0000000200000020  sscratch:0xffffffffa3015960     sie:0x0000000000000220  sip:0x0000000000000020
Oct 03 18:00:02 mnemosyne kernel: NVRM:         sepc:0x0000000001a82424     stval:0x0000000000000000  scause:0x8000000000000005
Oct 03 18:00:02 mnemosyne kernel: NVRM:     RISC-V GPR State:
Oct 03 18:00:02 mnemosyne kernel: NVRM:         ra:0x00000000018d5f70   sp:0x00000007f640efc0   gp:0x0000000000000000   tp:0x0000000000000000
Oct 03 18:00:02 mnemosyne kernel: NVRM:         a0:0x0000000000000050   a1:0x0000000000000050   a2:0x00000007f640f180   a3:0x0000000000000000
Oct 03 18:00:02 mnemosyne kernel: NVRM:         a4:0x0000000000000000   a5:0x00000000018d4cd4   a6:0x0007ffffffffffff   a7:0x0000002000000000
Oct 03 18:00:02 mnemosyne kernel: NVRM:         s0:0x00000007f640f030   s1:0x0000000004161d18   s2:0x0000000000000040   s3:0x00000007ef381a90
Oct 03 18:00:02 mnemosyne kernel: NVRM:         s4:0x0000000000000000   s5:0x0000000004160fe0   s6:0x0000000004162908   s7:0x0000000001b44a02
Oct 03 18:00:02 mnemosyne kernel: NVRM:         s8:0x0000000004160fe0   s9:0x0000000004160fe0  s10:0x000000a4cc8260c0  s11:0x0000000004030c90
Oct 03 18:00:02 mnemosyne kernel: NVRM:         t0:0x0000000000000005   t1:0x0000000001a81ce0   t2:0x0000000000000000   t3:0x0007ffffffffffff
Oct 03 18:00:02 mnemosyne kernel: NVRM:         t4:0x0000000006c8c350   t5:0x0000000020000000   t6:0x000000000419b570
Oct 03 18:00:02 mnemosyne kernel: NVRM:     Stack Trace:
Oct 03 18:00:02 mnemosyne kernel: NVRM:         0x0000000001a82424
Oct 03 18:00:02 mnemosyne kernel: NVRM:         0x00000000018ce6bc
Oct 03 18:00:02 mnemosyne kernel: NVRM:         0x0000000001b40d72
Oct 03 18:00:02 mnemosyne kernel: NVRM:         0x00000000015875cc
Oct 03 18:00:02 mnemosyne kernel: NVRM:         0x0000000001b44a02
Oct 03 18:00:02 mnemosyne kernel: NVRM:         0x0000000001b788e4
Oct 03 18:00:02 mnemosyne kernel: NVRM:         0x00000000017214c4
Oct 03 18:00:02 mnemosyne kernel: NVRM:         0x00000000017051fc
Oct 03 18:00:02 mnemosyne kernel: NVRM:         0x000000000171baa2
Oct 03 18:00:02 mnemosyne kernel: NVRM:         0x0000000001b325c4
Oct 03 18:00:02 mnemosyne kernel: NVRM:         0x0000000001432a72
Oct 03 18:00:02 mnemosyne kernel: NVRM:         0x000000000184040a
Oct 03 18:00:02 mnemosyne kernel: NVRM:         0x0000000001841e82
Oct 03 18:00:02 mnemosyne kernel: NVRM:         0x000000000184b00a
Oct 03 18:00:02 mnemosyne kernel: NVRM:         0x0000000001b3cde4
Oct 03 18:00:02 mnemosyne kernel: NVRM:         0x0000000001b4a732
Oct 03 18:00:02 mnemosyne kernel: NVRM:         0x0000000001b4acb0
Oct 03 18:00:02 mnemosyne kernel: NVRM:         0x0000000001b09226
Oct 03 18:00:02 mnemosyne kernel: NVRM:         0x000000000182f3d4
Oct 03 18:00:02 mnemosyne kernel: NVRM:         0x0000000001830c26
Oct 03 18:00:02 mnemosyne kernel: NVRM:         0x000000000182d55a
Oct 03 18:00:02 mnemosyne kernel: NVRM:         0x0000000001586dac
Oct 03 18:00:02 mnemosyne kernel: NVRM:         0x0000000001b846f8
Oct 03 18:00:02 mnemosyne kernel: NVRM:         0x0000000001b8c3c8
Oct 03 18:00:02 mnemosyne kernel: NVRM:         0x0000000001a8aa88
Oct 03 18:00:02 mnemosyne kernel: NVRM:         0x0000000001bee264
Oct 03 18:00:02 mnemosyne kernel: NVRM:         0x0000000001a86c3e
Oct 03 18:00:02 mnemosyne kernel: NVRM:     PC Trace:
Oct 03 18:00:02 mnemosyne kernel: NVRM:         0x0000000001a82424  0x000000000010013e  0x0000000001a82424  0x00000000018d5f6e  0x00000000018ce6b8
Oct 03 18:00:02 mnemosyne kernel: NVRM:         0x00000000018ccf48  0x00000000018ce644  0x0000000001b40d6e  0x00000000015875c8  0x0000000001b449fe
Oct 03 18:00:02 mnemosyne kernel: NVRM:         0x00000000018cec5a  0x0000000001b449ec  0x0000000001bd2f14  0x0000000001b449be  0x0000000001b788e0
Oct 03 18:00:02 mnemosyne kernel: NVRM:         0x00000000014382e0  0x0000000001b3f940  0x0000000001438318  0x0000000001b78978  0x0000000001712e1a
Oct 03 18:00:02 mnemosyne kernel: NVRM:         0x0000000001b788e8  0x0000000001b44a1c  0x000000000158767c  0x00000000018ce8de  0x0000000001a822f0
Oct 03 18:00:02 mnemosyne kernel: NVRM:         0x0000000001a820f2  0x0000000001a821d8  0x00000000018ce95a  0x00000000018ccc68  0x00000000018ce94a
Oct 03 18:00:02 mnemosyne kernel: NVRM:         0x00000000018cc804  0x00000000018ce938  0x0000000001a822f0  0x0000000001a820f2  0x0000000001a821d8
Oct 03 18:00:02 mnemosyne kernel: NVRM:         0x00000000018ce91c  0x00000000018ccc68  0x00000000018ce90c  0x00000000018ccf48  0x00000000018ce888
Oct 03 18:00:02 mnemosyne kernel: NVRM:         0x00000000018ccf48  0x00000000018ce8ee  0x0000000001587652  0x00000000018ce8de
Oct 03 18:00:02 mnemosyne kernel: NVRM:     Local I/O Register State:
Oct 03 18:00:02 mnemosyne kernel: NVRM:         0x01450800:0x00000000   0x01450900:0xbadf5100   0x01450a00:0x00000000   0x01450c00:0x00000000
Oct 03 18:00:02 mnemosyne kernel: NVRM:         0x01454a00:0x810490d2   0x01454b00:0x010800d0   0x01454c00:0x00080000   0x01400200:0x00000000
Oct 03 18:00:02 mnemosyne kernel: NVRM: GPU0 GSP RPC buffer contains function 4100 (RC_TRIGGERED) sequence 0 and data 0x0000000000000001 0x000000000000002d.
Oct 03 18:00:02 mnemosyne kernel: NVRM: GPU0 RPC history (CPU -> GSP):
Oct 03 18:00:02 mnemosyne kernel: NVRM:     entry function                     sequence data0              data1              ts_start           ts_end             duration actively_polling
Oct 03 18:00:02 mnemosyne kernel: NVRM:      0    76   GSP_RM_CONTROL              6554 0x0000000020800a56 0x000000000000005c 0x000640432f1f20c6 0x0000000000000000          y
Oct 03 18:00:02 mnemosyne kernel: NVRM:     -1    76   GSP_RM_CONTROL              6553 0x000000002080a0d1 0x00000000000007e8 0x000640432f1e7421 0x000640432f1e7f67   2886us
Oct 03 18:00:02 mnemosyne kernel: NVRM:     -2    76   GSP_RM_CONTROL              6552 0x000000002080a0d1 0x00000000000007e8 0x000640432f1d3338 0x000640432f1d3f1b   3043us
Oct 03 18:00:02 mnemosyne kernel: NVRM:     -3    76   GSP_RM_CONTROL              6551 0x000000002080a0d1 0x00000000000007e8 0x000640432f18d3b7 0x000640432f18df87   3024us
Oct 03 18:00:02 mnemosyne kernel: NVRM:     -4    76   GSP_RM_CONTROL              6550 0x000000002080a0d1 0x00000000000007e8 0x000640432f170018 0x000640432f170ca7   3215us
Oct 03 18:00:02 mnemosyne kernel: NVRM:     -5    76   GSP_RM_CONTROL              6549 0x000000002080a0d1 0x00000000000007e8 0x000640432f0b5d07 0x000640432f0b5fda    723us
Oct 03 18:00:02 mnemosyne kernel: NVRM:     -6    76   GSP_RM_CONTROL              6548 0x000000002080a0d1 0x00000000000007e8 0x000640432f0a171b 0x000640432f0a1eef   2004us
Oct 03 18:00:02 mnemosyne kernel: NVRM:     -7    76   GSP_RM_CONTROL              6547 0x000000002080a0d1 0x00000000000007e8 0x000640432f05b7aa 0x000640432f05bf78   1998us
Oct 03 18:00:02 mnemosyne kernel: NVRM: GPU0 RPC event history (CPU <- GSP):
Oct 03 18:00:02 mnemosyne kernel: NVRM:     entry function                     sequence data0              data1              ts_start           ts_end             duration during_incomplete_rpc
Oct 03 18:00:02 mnemosyne kernel: NVRM:      0    4100 RC_TRIGGERED                   0 0x0000000000000001 0x000000000000002d 0x0006404330e9f7c1 0x0006404330e9f7c6      5us y
Oct 03 18:00:02 mnemosyne kernel: NVRM:     -1    4102 OS_ERROR_LOG                   0 0x0000000000000000 0x0000000000000000 0x0006404330e9ee71 0x0006404330e9ee74      3us y
Oct 03 18:00:02 mnemosyne kernel: NVRM:     -2    4100 RC_TRIGGERED                   0 0x0000000000000001 0x000000000000002d 0x0006404330e9dd78 0x0006404330e9dd7b      3us y
Oct 03 18:00:02 mnemosyne kernel: NVRM:     -3    4102 OS_ERROR_LOG                   0 0x0000000000000000 0x0000000000000000 0x0006404330e9d4b8 0x0006404330e9d4ba      2us y
Oct 03 18:00:02 mnemosyne kernel: NVRM:     -4    4100 RC_TRIGGERED                   0 0x0000000000000001 0x000000000000002d 0x0006404330e9c1ec 0x0006404330e9c1f0      4us y
Oct 03 18:00:02 mnemosyne kernel: NVRM:     -5    4102 OS_ERROR_LOG                   0 0x0000000000000000 0x0000000000000000 0x0006404330e9b8ef 0x0006404330e9b8f1      2us y
Oct 03 18:00:02 mnemosyne kernel: NVRM:     -6    4100 RC_TRIGGERED                   0 0x0000000000000001 0x000000000000002d 0x0006404330e9a735 0x0006404330e9a739      4us y
Oct 03 18:00:02 mnemosyne kernel: NVRM:     -7    4102 OS_ERROR_LOG                   0 0x0000000000000000 0x0000000000000000 0x0006404330e99e53 0x0006404330e99e55      2us y
Oct 03 18:00:02 mnemosyne kernel: CPU: 8 UID: 0 PID: 1185 Comm: nv_queue Tainted: G        W  OE       6.16.3-76061603-generic #202508231538~1759252525~24.04~c08ae99 PREEMPT(voluntary)
Oct 03 18:00:02 mnemosyne kernel: Tainted: [W]=WARN, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
Oct 03 18:00:02 mnemosyne kernel: Hardware name: ASUS System Product Name/TUF GAMING B850-PLUS WIFI, BIOS 1028 04/29/2025
Oct 03 18:00:02 mnemosyne kernel: Call Trace:
Oct 03 18:00:02 mnemosyne kernel:  <TASK>
Oct 03 18:00:02 mnemosyne kernel:  dump_stack_lvl+0x76/0xa0
Oct 03 18:00:02 mnemosyne kernel:  dump_stack+0x10/0x20
Oct 03 18:00:02 mnemosyne kernel:  os_dump_stack+0xe/0x20 [nvidia]
Oct 03 18:00:02 mnemosyne kernel:  _kgspRpcRecvPoll+0x650/0x820 [nvidia]
Oct 03 18:00:02 mnemosyne kernel:  _issueRpcAndWait+0xdd/0x970 [nvidia]
Oct 03 18:00:02 mnemosyne kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
Oct 03 18:00:02 mnemosyne kernel:  ? osGetCurrentThread+0x26/0x60 [nvidia]
Oct 03 18:00:02 mnemosyne kernel:  ? rmDeviceGpuLockIsOwner+0x29/0x90 [nvidia]
Oct 03 18:00:02 mnemosyne kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
Oct 03 18:00:02 mnemosyne kernel:  ? os_mem_set+0x14/0x20 [nvidia]
Oct 03 18:00:02 mnemosyne kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
Oct 03 18:00:02 mnemosyne kernel:  rpcRmApiControl_GSP+0x76f/0x940 [nvidia]
Oct 03 18:00:02 mnemosyne kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
Oct 03 18:00:02 mnemosyne kernel:  gpuLogOobXidMessage_KERNEL+0x10e/0x140 [nvidia]
Oct 03 18:00:02 mnemosyne kernel:  nvErrorLog2+0xa6/0x100 [nvidia]
Oct 03 18:00:02 mnemosyne kernel:  nvErrorLog_va+0x42/0x50 [nvidia]
Oct 03 18:00:02 mnemosyne kernel:  _gpuRefreshRecoveryActionInLock+0x134/0x170 [nvidia]
Oct 03 18:00:02 mnemosyne kernel:  ? _gpuRefreshRecoveryActionInLock+0xf7/0x170 [nvidia]
Oct 03 18:00:02 mnemosyne kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
Oct 03 18:00:02 mnemosyne kernel:  rm_execute_work_item+0x121/0x1d0 [nvidia]
Oct 03 18:00:02 mnemosyne kernel:  os_execute_work_item+0x28/0x90 [nvidia]
Oct 03 18:00:02 mnemosyne kernel:  _main_loop+0x7e/0x140 [nvidia]
Oct 03 18:00:02 mnemosyne kernel:  ? __pfx__main_loop+0x10/0x10 [nvidia]
Oct 03 18:00:02 mnemosyne kernel:  kthread+0x10a/0x230
Oct 03 18:00:02 mnemosyne kernel:  ? __pfx_kthread+0x10/0x10
Oct 03 18:00:02 mnemosyne kernel:  ret_from_fork+0x121/0x140
Oct 03 18:00:02 mnemosyne kernel:  ? __pfx_kthread+0x10/0x10
Oct 03 18:00:02 mnemosyne kernel:  ret_from_fork_asm+0x1a/0x30
Oct 03 18:00:02 mnemosyne kernel:  </TASK>
Oct 03 18:00:02 mnemosyne kernel: NVRM: _kgspLogXid119: ********************************************************************************
Oct 03 18:00:02 mnemosyne kernel: NVRM: _issueRpcAndWait: rpcRecvPoll timedout for fn 76 sequence 6554!
Oct 03 18:00:02 mnemosyne kernel: NVRM: nvCheckOkFailedNoLog: Check failed: Call timed out [NV_ERR_TIMEOUT] (0x00000065) returned from pRmApi->Control(pRmApi, pGpu->hInternalClient, pGpu->hInternalSubdevice, NV2080_CTRL_CMD_INTERNAL_LOG_OO>
Oct 03 18:00:02 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 154, GPU recovery action changed from 0x0 (None) to 0x1 (GPU Reset Required)
Oct 03 18:00:17 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 109, pid=2168, name=cosmic-files-ap, channel 0x00000014, errorString CTX SWITCH TIMEOUT, Info 0x184003
Oct 03 18:00:22 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 109, pid=2168, name=cosmic-files-ap, channel 0x00000014, errorString CTX SWITCH TIMEOUT, Info 0x184003
Oct 03 18:00:26 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 109, pid=2168, name=cosmic-files-ap, channel 0x00000014, errorString CTX SWITCH TIMEOUT, Info 0x184003
Oct 03 18:00:31 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 109, pid=2168, name=cosmic-files-ap, channel 0x00000014, errorString CTX SWITCH TIMEOUT, Info 0x184003
Oct 03 18:00:35 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 109, pid=2168, name=cosmic-files-ap, channel 0x00000014, errorString CTX SWITCH TIMEOUT, Info 0x184003
Oct 03 18:00:40 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 109, pid=2168, name=cosmic-files-ap, channel 0x00000014, errorString CTX SWITCH TIMEOUT, Info 0x184003
Oct 03 18:00:44 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 109, pid=2168, name=cosmic-files-ap, channel 0x00000014, errorString CTX SWITCH TIMEOUT, Info 0x184003
Oct 03 18:00:47 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 119, Timeout after 45s of waiting for RPC response from GPU0 GSP! Expected function 76 (GSP_RM_CONTROL) sequence 6555 (0x2080a0d1 0x7e8).
Oct 03 18:00:47 mnemosyne kernel: NVRM: _issueRpcAndWait: rpcRecvPoll timedout for fn 76 sequence 6555!
Oct 03 18:00:48 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 109, pid=2168, name=cosmic-files-ap, channel 0x00000014, errorString CTX SWITCH TIMEOUT, Info 0x184003
Oct 03 18:00:53 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 109, pid=2168, name=cosmic-files-ap, channel 0x00000014, errorString CTX SWITCH TIMEOUT, Info 0x184003
Oct 03 18:00:57 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 109, pid=2168, name=cosmic-files-ap, channel 0x00000014, errorString CTX SWITCH TIMEOUT, Info 0x184003
Oct 03 18:01:02 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 109, pid=2168, name=cosmic-files-ap, channel 0x00000014, errorString CTX SWITCH TIMEOUT, Info 0x184003
Oct 03 18:01:06 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 109, pid=2168, name=cosmic-files-ap, channel 0x00000014, errorString CTX SWITCH TIMEOUT, Info 0x184003
Oct 03 18:01:10 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 109, pid=2168, name=cosmic-files-ap, channel 0x00000014, errorString CTX SWITCH TIMEOUT, Info 0x184003
Oct 03 18:01:15 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 109, pid=2168, name=cosmic-files-ap, channel 0x00000014, errorString CTX SWITCH TIMEOUT, Info 0x184003
Oct 03 18:01:19 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 109, pid=2168, name=cosmic-files-ap, channel 0x00000014, errorString CTX SWITCH TIMEOUT, Info 0x184003
Oct 03 18:01:24 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 109, pid=2168, name=cosmic-files-ap, channel 0x00000014, errorString CTX SWITCH TIMEOUT, Info 0x184003
Oct 03 18:01:28 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 109, pid=2168, name=cosmic-files-ap, channel 0x00000014, errorString CTX SWITCH TIMEOUT, Info 0x184003
Oct 03 18:01:32 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 119, pid=1154, name=nvidia-modeset/, Timeout after 45s of waiting for RPC response from GPU0 GSP! Expected function 76 (GSP_RM_CONTROL) sequence 6556 (0x20802801 0x4).
Oct 03 18:01:32 mnemosyne kernel: NVRM: nvAssertFailedNoLog: Assertion failed: Back to back GSP RPC timeout detected! GPU marked for reset @ kernel_gsp.c:2366
Oct 03 18:01:32 mnemosyne kernel: NVRM: _issueRpcAndWait: rpcRecvPoll timedout for fn 76 sequence 6556!
Oct 03 18:01:32 mnemosyne kernel: NVRM: nvAssertFailedNoLog: Assertion failed: (status == NV_OK) || (status == NV_ERR_GPU_IN_FULLCHIP_RESET) @ mem.c:180
Oct 03 18:01:32 mnemosyne kernel: NVRM: nvAssertFailedNoLog: Assertion failed: (status == NV_OK) || (status == NV_ERR_GPU_IN_FULLCHIP_RESET) @ vaspace_api.c:573
Oct 03 18:01:32 mnemosyne kernel: NVRM: nvAssertFailedNoLog: Assertion failed: (status == NV_OK) || (status == NV_ERR_GPU_IN_FULLCHIP_RESET) @ mem.c:180
Oct 03 18:01:32 mnemosyne kernel: NVRM: nvAssertFailedNoLog: Assertion failed: (status == NV_OK) || (status == NV_ERR_GPU_IN_FULLCHIP_RESET) @ vaspace_api.c:573
Oct 03 18:01:32 mnemosyne kernel: NVRM: nvAssertFailedNoLog: Assertion failed: (status == NV_OK) || (status == NV_ERR_GPU_IN_FULLCHIP_RESET) @ rs_client.c:844
Oct 03 18:01:32 mnemosyne kernel: NVRM: nvAssertFailedNoLog: Assertion failed: (status == NV_OK) || (status == NV_ERR_GPU_IN_FULLCHIP_RESET) @ rs_server.c:259

[many more failing assertions follow]
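As a reading aid for the "RPC history" tables in the log above, here is a small sketch of the timestamp arithmetic. It assumes the `ts_start`/`ts_end` columns are microsecond counters; that is an assumption, not something the log states, although subtracting them does reproduce the printed `duration` column for completed RPCs. The variable and function names are made up for illustration.

```python
# Timestamps copied verbatim from the RPC history tables in the log above.
# ASSUMPTION: ts_start / ts_end are microsecond counters (consistent with the
# printed "duration" column, but not documented in this report).

completed_rpcs = {
    # sequence: (ts_start, ts_end, duration printed in the log, in us)
    6553: (0x000640432F1E7421, 0x000640432F1E7F67, 2886),
    6552: (0x000640432F1D3338, 0x000640432F1D3F1B, 3043),
}


def duration_us(ts_start: int, ts_end: int) -> int:
    """Elapsed time between two timestamps, in the counter's own unit."""
    return ts_end - ts_start


for seq, (start, end, printed) in completed_rpcs.items():
    print(f"sequence {seq}: computed {duration_us(start, end)} us, log says {printed} us")

# Sequence 6554 never completed (its ts_end is 0 in the log), so compare its
# start time with the end of the most recent RC_TRIGGERED event instead.
stalled_start = 0x000640432F1F20C6      # GSP_RM_CONTROL, sequence 6554
latest_event_end = 0x0006404330E9F7C6   # RC_TRIGGERED, entry 0

stalled_for_s = duration_us(stalled_start, latest_event_end) / 1_000_000
print(f"sequence 6554 had been pending for about {stalled_for_s:.2f} s at that event")
```

Under that assumption, the RPC with sequence 6554 was already outstanding for roughly half a minute when the last RC_TRIGGERED event arrived, which fits the order of the journal timestamps (Xid 62, then the Xid 45 burst, then the Xid 119 timeout).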

To Reproduce

Low activity use of the system with the 580.82 driver.

Bug Incidence

Sometimes

nvidia-bug-report.log.gz

nvidia-bug-report.log.gz is the output of sudo nvidia-bug-report.sh, added for completeness. I don't think it covers the freeze, because it was generated after a subsequent reboot.
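Since the bug report was generated after a reboot, the kernel messages from the boot that actually froze have to come from the journal. A minimal sketch of how that could be done, assuming systemd-journald keeps a persistent journal on this machine (not confirmed in this report); the helper names are hypothetical:

```python
import subprocess


def previous_boot_kernel_log_cmd(boots_back: int = 1) -> list[str]:
    """Build a journalctl command for the kernel log of an earlier boot.

    -k limits output to kernel messages; -b -N selects the boot N boots
    before the current one (-b -1 is the previous boot).
    """
    return ["journalctl", "-k", "-b", f"-{boots_back}", "--no-pager"]


def nvidia_lines(lines: list[str]) -> list[str]:
    """Keep only lines emitted by the NVIDIA kernel modules (NVRM / nvidia-modeset)."""
    return [ln for ln in lines if "NVRM:" in ln or "nvidia-modeset:" in ln]


def read_previous_boot_nvidia_log(boots_back: int = 1) -> list[str]:
    """Run journalctl and return the NVIDIA-related kernel lines.

    Requires journalctl on PATH and enough privileges to read the system journal.
    """
    result = subprocess.run(
        previous_boot_kernel_log_cmd(boots_back),
        capture_output=True,
        text=True,
        check=True,
    )
    return nvidia_lines(result.stdout.splitlines())
```

Calling `read_previous_boot_nvidia_log()` right after rebooting from a freeze should return lines like the ones quoted above, which can then be attached alongside the bug report archive.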

More Info

Switching to tty3 also fails (unclear if related), which makes recovery hard:

Oct 03 18:01:32 mnemosyne systemd[1]: Started getty@tty3.service - Getty on tty3.
Oct 03 18:01:32 mnemosyne kernel: fbcon: Taking over console
Oct 03 18:01:32 mnemosyne kernel: nvidia-modeset: ERROR: GPU:0: The requested configuration of display devices (Asustek Computer Inc VG27AQL3A (DP-0)) is not supported on this GPU.
Oct 03 18:01:32 mnemosyne kernel: Console: switching to colour frame buffer device 160x45
Oct 03 18:01:32 mnemosyne kernel: nvidia-modeset: ERROR: GPU:0: The requested configuration of display devices (Asustek Computer Inc VG27AQL3A (DP-0)) is not supported on this GPU.
Oct 03 18:01:32 mnemosyne kernel: nvidia-modeset: ERROR: GPU:0: The requested configuration of display devices (Asustek Computer Inc VG27AQL3A (DP-0)) is not supported on this GPU.
Oct 03 18:01:32 mnemosyne kernel: nvidia-modeset: ERROR: GPU:0: The requested configuration of display devices (Asustek Computer Inc VG27AQL3A (DP-0)) is not supported on this GPU.
Oct 03 18:01:32 mnemosyne kernel: NVRM: nvAssertFailedNoLog: Assertion failed: NV_OK == status @ gpu_vaspace.c:5187
Oct 03 18:01:32 mnemosyne kernel: NVRM: nvAssertFailedNoLog: Assertion failed: NV_OK == status @ gpu_vaspace.c:4129
Oct 03 18:01:32 mnemosyne kernel: NVRM: nvAssertFailedNoLog: Assertion failed: NV_OK == status @ gpu_vaspace.c:611
Oct 03 18:01:32 mnemosyne kernel: NVRM: vaspaceapiConstruct_IMPL: Could not construct VA space. Status 62
Oct 03 18:01:32 mnemosyne kernel: NVRM: nvAssertFailedNoLog: Assertion failed: (status == NV_OK) || (status == NV_ERR_GPU_IN_FULLCHIP_RESET) @ mem.c:180
Oct 03 18:01:33 mnemosyne kernel: NVRM: vaspaceapiConstruct_IMPL: Could not construct VA space. Status 62
[... more assertions failing ...]
Oct 03 18:01:34 mnemosyne kernel: NVRM: nvAssertFailedNoLog: Assertion failed: (status == NV_OK) || (status == NV_ERR_GPU_IN_FULLCHIP_RESET) @ rs_server.c:1375
Oct 03 18:01:35 mnemosyne systemd[1]: run-user-1000-gvfs.mount: Deactivated successfully.
Oct 03 18:01:35 mnemosyne systemd[1]: run-user-1000-doc.mount: Deactivated successfully.
Oct 03 18:01:35 mnemosyne kernel: nvidia-modeset: ERROR: GPU:0: The requested configuration of display devices (Asustek Computer Inc VG27AQL3A (DP-0)) is not supported on this GPU.
Oct 03 18:01:36 mnemosyne kernel: NVRM: nvAssertFailedNoLog: Assertion failed: NV_OK == status @ gpu_vaspace.c:5187
Oct 03 18:01:36 mnemosyne kernel: NVRM: nvAssertFailedNoLog: Assertion failed: NV_OK == status @ gpu_vaspace.c:4129
Oct 03 18:01:36 mnemosyne kernel: NVRM: nvAssertFailedNoLog: Assertion failed: NV_OK == status @ gpu_vaspace.c:611
Oct 03 18:01:36 mnemosyne kernel: NVRM: vaspaceapiConstruct_IMPL: Could not construct VA space. Status 62
Oct 03 18:01:36 mnemosyne kernel: nvidia-modeset: ERROR: GPU:0: The requested configuration of display devices (Asustek Computer Inc VG27AQL3A (DP-0)) is not supported on this GPU.
Oct 03 18:01:36 mnemosyne kernel: nvidia-modeset: ERROR: GPU:0: The requested configuration of display devices (Asustek Computer Inc VG27AQL3A (DP-0)) is not supported on this GPU.
Oct 03 18:01:36 mnemosyne kernel: nvidia-modeset: ERROR: GPU:0: The requested configuration of display devices (Asustek Computer Inc VG27AQL3A (DP-0)) is not supported on this GPU.
Oct 03 18:01:37 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 109, channel 0x00000014, errorString CTX SWITCH TIMEOUT, Info 0x184003
Oct 03 18:01:37 mnemosyne kernel: nvidia 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x7a53f058 flags=0x0020]
Oct 03 18:01:37 mnemosyne kernel: nvidia 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x7a53f058 flags=0x0020]
Oct 03 18:01:37 mnemosyne kernel: NVRM: nvCheckFailedNoLog: Check failed: pKernelChannel != NULL @ kernel_gsp.c:588
Oct 03 18:01:37 mnemosyne kernel: NVRM: _kgspProcessRpcEvent: Failed to process received event 0x1004 (RC_TRIGGERED) from GPU0: status=0x21
Oct 03 18:01:41 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 109, channel 0x00000014, errorString CTX SWITCH TIMEOUT, Info 0x184003
Oct 03 18:01:41 mnemosyne kernel: nvidia 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x7a53f058 flags=0x0020]
Oct 03 18:01:41 mnemosyne kernel: nvidia 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x7a53f058 flags=0x0020]
Oct 03 18:01:41 mnemosyne kernel: NVRM: nvCheckFailedNoLog: Check failed: pKernelChannel != NULL @ kernel_gsp.c:588
Oct 03 18:01:41 mnemosyne kernel: NVRM: _kgspProcessRpcEvent: Failed to process received event 0x1004 (RC_TRIGGERED) from GPU0: status=0x21
Oct 03 18:01:44 mnemosyne kernel: nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000ca7e:6:0:0x00000062
Oct 03 18:01:46 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 109, channel 0x00000014, errorString CTX SWITCH TIMEOUT, Info 0x184003
Oct 03 18:01:46 mnemosyne kernel: nvidia 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x7a53f058 flags=0x0020]
Oct 03 18:01:46 mnemosyne kernel: nvidia 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x7a53f058 flags=0x0020]
Oct 03 18:01:46 mnemosyne kernel: NVRM: nvCheckFailedNoLog: Check failed: pKernelChannel != NULL @ kernel_gsp.c:588
Oct 03 18:01:46 mnemosyne kernel: NVRM: _kgspProcessRpcEvent: Failed to process received event 0x1004 (RC_TRIGGERED) from GPU0: status=0x21
Oct 03 18:01:50 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 109, channel 0x00000014, errorString CTX SWITCH TIMEOUT, Info 0x184003
Oct 03 18:01:50 mnemosyne kernel: nvidia 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x7a53f058 flags=0x0020]
Oct 03 18:01:50 mnemosyne kernel: nvidia 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x7a53f058 flags=0x0020]
Oct 03 18:01:50 mnemosyne kernel: NVRM: nvCheckFailedNoLog: Check failed: pKernelChannel != NULL @ kernel_gsp.c:588
Oct 03 18:01:50 mnemosyne kernel: NVRM: _kgspProcessRpcEvent: Failed to process received event 0x1004 (RC_TRIGGERED) from GPU0: status=0x21
Oct 03 18:01:54 mnemosyne kernel: nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000ca7e:6:0:0x00000062
Oct 03 18:01:54 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 109, channel 0x00000014, errorString CTX SWITCH TIMEOUT, Info 0x184003
Oct 03 18:01:54 mnemosyne kernel: nvidia 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x7a53f058 flags=0x0020]
Oct 03 18:01:54 mnemosyne kernel: nvidia 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x7a53f058 flags=0x0020]
Oct 03 18:01:54 mnemosyne kernel: NVRM: nvCheckFailedNoLog: Check failed: pKernelChannel != NULL @ kernel_gsp.c:588
Oct 03 18:01:54 mnemosyne kernel: NVRM: _kgspProcessRpcEvent: Failed to process received event 0x1004 (RC_TRIGGERED) from GPU0: status=0x21
Oct 03 18:01:59 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 109, channel 0x00000014, errorString CTX SWITCH TIMEOUT, Info 0x184003
Oct 03 18:01:59 mnemosyne kernel: nvidia 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x7a53f058 flags=0x0020]
Oct 03 18:01:59 mnemosyne kernel: nvidia 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x7a53f058 flags=0x0020]
Oct 03 18:01:59 mnemosyne kernel: NVRM: nvCheckFailedNoLog: Check failed: pKernelChannel != NULL @ kernel_gsp.c:588
Oct 03 18:01:59 mnemosyne kernel: NVRM: _kgspProcessRpcEvent: Failed to process received event 0x1004 (RC_TRIGGERED) from GPU0: status=0x21
Oct 03 18:02:02 mnemosyne kernel: nvidia-modeset: ERROR: GPU:0: The requested configuration of display devices (Asustek Computer Inc VG27AQL3A (DP-0)) is not supported on this GPU.
Oct 03 18:02:03 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 109, channel 0x00000014, errorString CTX SWITCH TIMEOUT, Info 0x184003
Oct 03 18:02:03 mnemosyne kernel: nvidia 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x7a53f058 flags=0x0020]
Oct 03 18:02:03 mnemosyne kernel: nvidia 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x7a53f058 flags=0x0020]
Oct 03 18:02:03 mnemosyne kernel: NVRM: nvCheckFailedNoLog: Check failed: pKernelChannel != NULL @ kernel_gsp.c:588
Oct 03 18:02:03 mnemosyne kernel: NVRM: _kgspProcessRpcEvent: Failed to process received event 0x1004 (RC_TRIGGERED) from GPU0: status=0x21
Oct 03 18:02:04 mnemosyne kernel: nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000ca7e:6:0:0x00000062
Oct 03 18:02:06 mnemosyne kernel: NVRM: nvAssertFailedNoLog: Assertion failed: (status == NV_OK) || (status == NV_ERR_GPU_IN_FULLCHIP_RESET) @ rs_client.c:844
Oct 03 18:02:06 mnemosyne kernel: NVRM: nvAssertFailedNoLog: Assertion failed: (status == NV_OK) || (status == NV_ERR_GPU_IN_FULLCHIP_RESET) @ rs_server.c:259
Oct 03 18:02:06 mnemosyne kernel: NVRM: nvAssertFailedNoLog: Assertion failed: (status == NV_OK) || (status == NV_ERR_GPU_IN_FULLCHIP_RESET) @ rs_server.c:1375
Oct 03 18:02:06 mnemosyne kernel: NVRM: nvAssertFailedNoLog: Assertion failed: (status == NV_OK) || (status == NV_ERR_GPU_IN_FULLCHIP_RESET) @ rs_client.c:844
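To get an overview of logs like the ones above, the Xid events can be tallied by code. The following is a rough sketch that relies only on the `NVRM: Xid (PCI:...): <code>,` line format visible in this report; the regular expression and function name are illustrative, not part of any NVIDIA tool.

```python
import re
from collections import Counter

# Matches e.g. "NVRM: Xid (PCI:0000:01:00): 45, pid=2171, ..."
XID_RE = re.compile(r"NVRM: Xid \((?P<pci>[^)]+)\): (?P<code>\d+),")


def tally_xids(lines):
    """Count how often each Xid code appears in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = XID_RE.search(line)
        if match:
            counts[int(match.group("code"))] += 1
    return counts


# A few lines copied from the log in this report.
sample_lines = [
    "Oct 03 17:59:17 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 62, 322bff9a 0000b3b8 00000000 206a4b08 206a3cc8 206a3e36 206a2284 206a2ad6",
    "Oct 03 17:59:47 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 45, pid=2171, name=xdg-desktop-por, channel 0x0000000a",
    "Oct 03 18:00:02 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 154, GPU recovery action changed from 0x0 (None) to 0x1 (GPU Reset Required)",
    "Oct 03 18:00:17 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 109, pid=2168, name=cosmic-files-ap, channel 0x00000014, errorString CTX SWITCH TIMEOUT, Info 0x184003",
    "Oct 03 18:00:02 mnemosyne kernel: NVRM: _issueRpcAndWait: rpcRecvPoll timedout for fn 76 sequence 6554!",
]

for code, count in sorted(tally_xids(sample_lines).items()):
    print(f"Xid {code}: {count} occurrence(s)")
```

Run over the full journal output, this makes the sequence in this report easier to see: a single Xid 62, a burst of Xid 45, the Xid 119 timeouts, then repeated Xid 109 entries.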

JonasGeiping, Oct 03 '25 16:10

I assume this is related to https://github.com/NVIDIA/open-gpu-kernel-modules/issues/446, which was reported against version 525.85.05.

JonasGeiping, Oct 03 '25 16:10

Same issue here. Please let me know if you find a solution.

ALIENvsROBOT, Oct 13 '25 15:10

I had a similar problem after using the GPU for 45 seconds with ComfyUI. The whole system froze and the device no longer shows up in nvidia-smi; it only works again after a reboot. I have an NVIDIA Pro 6000 Blackwell and paid a huge amount of money for it, only to get driver errors and not be able to use it. I tried the 580, 575, and 570 (open) drivers, and all of them fail eventually.

ALIENvsROBOT, Oct 13 '25 19:10

I uploaded my failure report, which captured the failure completely, in issue #949.

ALIENvsROBOT, Oct 14 '25 09:10