System freeze on 580.82.07 due to GSP Timeout
NVIDIA Open GPU Kernel Modules Version
580.82.07
Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.
- [ ] I confirm that this does not happen with the proprietary driver package.
Operating System and Version
Pop!_OS 24.04 LTS
Kernel Release
6.16.3-76061603-generic
Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.
- [x] I am running on a stable kernel release.
Hardware: GPU
NVIDIA GeForce RTX 5070 Ti
Describe the bug
Even light activity, such as browsing in Chrome, leads after a few minutes to a complete freeze of the GUI, from which the system cannot recover.
The error log is:
Oct 03 17:59:17 mnemosyne kernel: NVRM: GPU at PCI:0000:01:00: GPU-1139a0a3-3982-ae1d-ccd6-8538802f3bdc
Oct 03 17:59:17 mnemosyne kernel: NVRM: GPU Board Serial Number: 0
Oct 03 17:59:17 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 62, 322bff9a 0000b3b8 00000000 206a4b08 206a3cc8 206a3e36 206a2284 206a2ad6
Oct 03 17:59:22 mnemosyne systemd[1]: bluetooth.service - Bluetooth service was skipped because of an unmet condition check (ConditionPathIsDirectory=/sys/class/bluetooth).
Oct 03 17:59:47 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 45, pid=2171, name=xdg-desktop-por, channel 0x0000000a
Oct 03 17:59:47 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 45, pid=2145, name=cosmic-app-libr, channel 0x0000000b
Oct 03 17:59:47 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 45, pid=2168, name=cosmic-files-ap, channel 0x0000000c
Oct 03 17:59:47 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 45, pid=2156, name=cosmic-workspac, channel 0x0000000d
Oct 03 17:59:47 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 45, pid=2171, name=xdg-desktop-por, channel 0x0000000e
Oct 03 17:59:47 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 45, pid=2171, name=xdg-desktop-por, channel 0x0000000f
Oct 03 17:59:47 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 45, pid=2156, name=cosmic-workspac, channel 0x00000010
Oct 03 17:59:47 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 45, pid=2156, name=cosmic-workspac, channel 0x00000011
Oct 03 17:59:47 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 45, pid=2145, name=cosmic-app-libr, channel 0x00000012
Oct 03 17:59:47 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 45, pid=2145, name=cosmic-app-libr, channel 0x00000013
Oct 03 17:59:47 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 45, pid=2168, name=cosmic-files-ap, channel 0x00000014
Oct 03 18:00:02 mnemosyne kernel: NVRM: _kgspLogXid119: ********************************* GSP Timeout **********************************
Oct 03 18:00:02 mnemosyne kernel: NVRM: _kgspLogXid119: Note: Please also check logs above.
Oct 03 18:00:02 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 119, Timeout after 45s of waiting for RPC response from GPU0 GSP! Expected function 76 (GSP_RM_CONTROL) sequence 6554 (0x20800a56 0x5c).
Oct 03 18:00:02 mnemosyne kernel: NVRM: task watchdog timeout @ pc:0x1a82424, partition:4#0, task:3
Oct 03 18:00:02 mnemosyne kernel: NVRM: Reported by libos partition:4#5 kernel v3.1 [0] @ ts:712
Oct 03 18:00:02 mnemosyne kernel: NVRM: RISC-V CSR State:
Oct 03 18:00:02 mnemosyne kernel: NVRM: sstatus:0x0000000200000020 sscratch:0xffffffffa3015960 sie:0x0000000000000220 sip:0x0000000000000020
Oct 03 18:00:02 mnemosyne kernel: NVRM: sepc:0x0000000001a82424 stval:0x0000000000000000 scause:0x8000000000000005
Oct 03 18:00:02 mnemosyne kernel: NVRM: RISC-V GPR State:
Oct 03 18:00:02 mnemosyne kernel: NVRM: ra:0x00000000018d5f70 sp:0x00000007f640efc0 gp:0x0000000000000000 tp:0x0000000000000000
Oct 03 18:00:02 mnemosyne kernel: NVRM: a0:0x0000000000000050 a1:0x0000000000000050 a2:0x00000007f640f180 a3:0x0000000000000000
Oct 03 18:00:02 mnemosyne kernel: NVRM: a4:0x0000000000000000 a5:0x00000000018d4cd4 a6:0x0007ffffffffffff a7:0x0000002000000000
Oct 03 18:00:02 mnemosyne kernel: NVRM: s0:0x00000007f640f030 s1:0x0000000004161d18 s2:0x0000000000000040 s3:0x00000007ef381a90
Oct 03 18:00:02 mnemosyne kernel: NVRM: s4:0x0000000000000000 s5:0x0000000004160fe0 s6:0x0000000004162908 s7:0x0000000001b44a02
Oct 03 18:00:02 mnemosyne kernel: NVRM: s8:0x0000000004160fe0 s9:0x0000000004160fe0 s10:0x000000a4cc8260c0 s11:0x0000000004030c90
Oct 03 18:00:02 mnemosyne kernel: NVRM: t0:0x0000000000000005 t1:0x0000000001a81ce0 t2:0x0000000000000000 t3:0x0007ffffffffffff
Oct 03 18:00:02 mnemosyne kernel: NVRM: t4:0x0000000006c8c350 t5:0x0000000020000000 t6:0x000000000419b570
Oct 03 18:00:02 mnemosyne kernel: NVRM: Stack Trace:
Oct 03 18:00:02 mnemosyne kernel: NVRM: 0x0000000001a82424
Oct 03 18:00:02 mnemosyne kernel: NVRM: 0x00000000018ce6bc
Oct 03 18:00:02 mnemosyne kernel: NVRM: 0x0000000001b40d72
Oct 03 18:00:02 mnemosyne kernel: NVRM: 0x00000000015875cc
Oct 03 18:00:02 mnemosyne kernel: NVRM: 0x0000000001b44a02
Oct 03 18:00:02 mnemosyne kernel: NVRM: 0x0000000001b788e4
Oct 03 18:00:02 mnemosyne kernel: NVRM: 0x00000000017214c4
Oct 03 18:00:02 mnemosyne kernel: NVRM: 0x00000000017051fc
Oct 03 18:00:02 mnemosyne kernel: NVRM: 0x000000000171baa2
Oct 03 18:00:02 mnemosyne kernel: NVRM: 0x0000000001b325c4
Oct 03 18:00:02 mnemosyne kernel: NVRM: 0x0000000001432a72
Oct 03 18:00:02 mnemosyne kernel: NVRM: 0x000000000184040a
Oct 03 18:00:02 mnemosyne kernel: NVRM: 0x0000000001841e82
Oct 03 18:00:02 mnemosyne kernel: NVRM: 0x000000000184b00a
Oct 03 18:00:02 mnemosyne kernel: NVRM: 0x0000000001b3cde4
Oct 03 18:00:02 mnemosyne kernel: NVRM: 0x0000000001b4a732
Oct 03 18:00:02 mnemosyne kernel: NVRM: 0x0000000001b4acb0
Oct 03 18:00:02 mnemosyne kernel: NVRM: 0x0000000001b09226
Oct 03 18:00:02 mnemosyne kernel: NVRM: 0x000000000182f3d4
Oct 03 18:00:02 mnemosyne kernel: NVRM: 0x0000000001830c26
Oct 03 18:00:02 mnemosyne kernel: NVRM: 0x000000000182d55a
Oct 03 18:00:02 mnemosyne kernel: NVRM: 0x0000000001586dac
Oct 03 18:00:02 mnemosyne kernel: NVRM: 0x0000000001b846f8
Oct 03 18:00:02 mnemosyne kernel: NVRM: 0x0000000001b8c3c8
Oct 03 18:00:02 mnemosyne kernel: NVRM: 0x0000000001a8aa88
Oct 03 18:00:02 mnemosyne kernel: NVRM: 0x0000000001bee264
Oct 03 18:00:02 mnemosyne kernel: NVRM: 0x0000000001a86c3e
Oct 03 18:00:02 mnemosyne kernel: NVRM: PC Trace:
Oct 03 18:00:02 mnemosyne kernel: NVRM: 0x0000000001a82424 0x000000000010013e 0x0000000001a82424 0x00000000018d5f6e 0x00000000018ce6b8
Oct 03 18:00:02 mnemosyne kernel: NVRM: 0x00000000018ccf48 0x00000000018ce644 0x0000000001b40d6e 0x00000000015875c8 0x0000000001b449fe
Oct 03 18:00:02 mnemosyne kernel: NVRM: 0x00000000018cec5a 0x0000000001b449ec 0x0000000001bd2f14 0x0000000001b449be 0x0000000001b788e0
Oct 03 18:00:02 mnemosyne kernel: NVRM: 0x00000000014382e0 0x0000000001b3f940 0x0000000001438318 0x0000000001b78978 0x0000000001712e1a
Oct 03 18:00:02 mnemosyne kernel: NVRM: 0x0000000001b788e8 0x0000000001b44a1c 0x000000000158767c 0x00000000018ce8de 0x0000000001a822f0
Oct 03 18:00:02 mnemosyne kernel: NVRM: 0x0000000001a820f2 0x0000000001a821d8 0x00000000018ce95a 0x00000000018ccc68 0x00000000018ce94a
Oct 03 18:00:02 mnemosyne kernel: NVRM: 0x00000000018cc804 0x00000000018ce938 0x0000000001a822f0 0x0000000001a820f2 0x0000000001a821d8
Oct 03 18:00:02 mnemosyne kernel: NVRM: 0x00000000018ce91c 0x00000000018ccc68 0x00000000018ce90c 0x00000000018ccf48 0x00000000018ce888
Oct 03 18:00:02 mnemosyne kernel: NVRM: 0x00000000018ccf48 0x00000000018ce8ee 0x0000000001587652 0x00000000018ce8de
Oct 03 18:00:02 mnemosyne kernel: NVRM: Local I/O Register State:
Oct 03 18:00:02 mnemosyne kernel: NVRM: 0x01450800:0x00000000 0x01450900:0xbadf5100 0x01450a00:0x00000000 0x01450c00:0x00000000
Oct 03 18:00:02 mnemosyne kernel: NVRM: 0x01454a00:0x810490d2 0x01454b00:0x010800d0 0x01454c00:0x00080000 0x01400200:0x00000000
Oct 03 18:00:02 mnemosyne kernel: NVRM: GPU0 GSP RPC buffer contains function 4100 (RC_TRIGGERED) sequence 0 and data 0x0000000000000001 0x000000000000002d.
Oct 03 18:00:02 mnemosyne kernel: NVRM: GPU0 RPC history (CPU -> GSP):
Oct 03 18:00:02 mnemosyne kernel: NVRM: entry function sequence data0 data1 ts_start ts_end duration actively_polling
Oct 03 18:00:02 mnemosyne kernel: NVRM: 0 76 GSP_RM_CONTROL 6554 0x0000000020800a56 0x000000000000005c 0x000640432f1f20c6 0x0000000000000000 y
Oct 03 18:00:02 mnemosyne kernel: NVRM: -1 76 GSP_RM_CONTROL 6553 0x000000002080a0d1 0x00000000000007e8 0x000640432f1e7421 0x000640432f1e7f67 2886us
Oct 03 18:00:02 mnemosyne kernel: NVRM: -2 76 GSP_RM_CONTROL 6552 0x000000002080a0d1 0x00000000000007e8 0x000640432f1d3338 0x000640432f1d3f1b 3043us
Oct 03 18:00:02 mnemosyne kernel: NVRM: -3 76 GSP_RM_CONTROL 6551 0x000000002080a0d1 0x00000000000007e8 0x000640432f18d3b7 0x000640432f18df87 3024us
Oct 03 18:00:02 mnemosyne kernel: NVRM: -4 76 GSP_RM_CONTROL 6550 0x000000002080a0d1 0x00000000000007e8 0x000640432f170018 0x000640432f170ca7 3215us
Oct 03 18:00:02 mnemosyne kernel: NVRM: -5 76 GSP_RM_CONTROL 6549 0x000000002080a0d1 0x00000000000007e8 0x000640432f0b5d07 0x000640432f0b5fda 723us
Oct 03 18:00:02 mnemosyne kernel: NVRM: -6 76 GSP_RM_CONTROL 6548 0x000000002080a0d1 0x00000000000007e8 0x000640432f0a171b 0x000640432f0a1eef 2004us
Oct 03 18:00:02 mnemosyne kernel: NVRM: -7 76 GSP_RM_CONTROL 6547 0x000000002080a0d1 0x00000000000007e8 0x000640432f05b7aa 0x000640432f05bf78 1998us
Oct 03 18:00:02 mnemosyne kernel: NVRM: GPU0 RPC event history (CPU <- GSP):
Oct 03 18:00:02 mnemosyne kernel: NVRM: entry function sequence data0 data1 ts_start ts_end duration during_incomplete_rpc
Oct 03 18:00:02 mnemosyne kernel: NVRM: 0 4100 RC_TRIGGERED 0 0x0000000000000001 0x000000000000002d 0x0006404330e9f7c1 0x0006404330e9f7c6 5us y
Oct 03 18:00:02 mnemosyne kernel: NVRM: -1 4102 OS_ERROR_LOG 0 0x0000000000000000 0x0000000000000000 0x0006404330e9ee71 0x0006404330e9ee74 3us y
Oct 03 18:00:02 mnemosyne kernel: NVRM: -2 4100 RC_TRIGGERED 0 0x0000000000000001 0x000000000000002d 0x0006404330e9dd78 0x0006404330e9dd7b 3us y
Oct 03 18:00:02 mnemosyne kernel: NVRM: -3 4102 OS_ERROR_LOG 0 0x0000000000000000 0x0000000000000000 0x0006404330e9d4b8 0x0006404330e9d4ba 2us y
Oct 03 18:00:02 mnemosyne kernel: NVRM: -4 4100 RC_TRIGGERED 0 0x0000000000000001 0x000000000000002d 0x0006404330e9c1ec 0x0006404330e9c1f0 4us y
Oct 03 18:00:02 mnemosyne kernel: NVRM: -5 4102 OS_ERROR_LOG 0 0x0000000000000000 0x0000000000000000 0x0006404330e9b8ef 0x0006404330e9b8f1 2us y
Oct 03 18:00:02 mnemosyne kernel: NVRM: -6 4100 RC_TRIGGERED 0 0x0000000000000001 0x000000000000002d 0x0006404330e9a735 0x0006404330e9a739 4us y
Oct 03 18:00:02 mnemosyne kernel: NVRM: -7 4102 OS_ERROR_LOG 0 0x0000000000000000 0x0000000000000000 0x0006404330e99e53 0x0006404330e99e55 2us y
Oct 03 18:00:02 mnemosyne kernel: CPU: 8 UID: 0 PID: 1185 Comm: nv_queue Tainted: G W OE 6.16.3-76061603-generic #202508231538~1759252525~24.04~c08ae99 PREEMPT(voluntary)
Oct 03 18:00:02 mnemosyne kernel: Tainted: [W]=WARN, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
Oct 03 18:00:02 mnemosyne kernel: Hardware name: ASUS System Product Name/TUF GAMING B850-PLUS WIFI, BIOS 1028 04/29/2025
Oct 03 18:00:02 mnemosyne kernel: Call Trace:
Oct 03 18:00:02 mnemosyne kernel: <TASK>
Oct 03 18:00:02 mnemosyne kernel: dump_stack_lvl+0x76/0xa0
Oct 03 18:00:02 mnemosyne kernel: dump_stack+0x10/0x20
Oct 03 18:00:02 mnemosyne kernel: os_dump_stack+0xe/0x20 [nvidia]
Oct 03 18:00:02 mnemosyne kernel: _kgspRpcRecvPoll+0x650/0x820 [nvidia]
Oct 03 18:00:02 mnemosyne kernel: _issueRpcAndWait+0xdd/0x970 [nvidia]
Oct 03 18:00:02 mnemosyne kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Oct 03 18:00:02 mnemosyne kernel: ? osGetCurrentThread+0x26/0x60 [nvidia]
Oct 03 18:00:02 mnemosyne kernel: ? rmDeviceGpuLockIsOwner+0x29/0x90 [nvidia]
Oct 03 18:00:02 mnemosyne kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Oct 03 18:00:02 mnemosyne kernel: ? os_mem_set+0x14/0x20 [nvidia]
Oct 03 18:00:02 mnemosyne kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Oct 03 18:00:02 mnemosyne kernel: rpcRmApiControl_GSP+0x76f/0x940 [nvidia]
Oct 03 18:00:02 mnemosyne kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Oct 03 18:00:02 mnemosyne kernel: gpuLogOobXidMessage_KERNEL+0x10e/0x140 [nvidia]
Oct 03 18:00:02 mnemosyne kernel: nvErrorLog2+0xa6/0x100 [nvidia]
Oct 03 18:00:02 mnemosyne kernel: nvErrorLog_va+0x42/0x50 [nvidia]
Oct 03 18:00:02 mnemosyne kernel: _gpuRefreshRecoveryActionInLock+0x134/0x170 [nvidia]
Oct 03 18:00:02 mnemosyne kernel: ? _gpuRefreshRecoveryActionInLock+0xf7/0x170 [nvidia]
Oct 03 18:00:02 mnemosyne kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Oct 03 18:00:02 mnemosyne kernel: rm_execute_work_item+0x121/0x1d0 [nvidia]
Oct 03 18:00:02 mnemosyne kernel: os_execute_work_item+0x28/0x90 [nvidia]
Oct 03 18:00:02 mnemosyne kernel: _main_loop+0x7e/0x140 [nvidia]
Oct 03 18:00:02 mnemosyne kernel: ? __pfx__main_loop+0x10/0x10 [nvidia]
Oct 03 18:00:02 mnemosyne kernel: kthread+0x10a/0x230
Oct 03 18:00:02 mnemosyne kernel: ? __pfx_kthread+0x10/0x10
Oct 03 18:00:02 mnemosyne kernel: ret_from_fork+0x121/0x140
Oct 03 18:00:02 mnemosyne kernel: ? __pfx_kthread+0x10/0x10
Oct 03 18:00:02 mnemosyne kernel: ret_from_fork_asm+0x1a/0x30
Oct 03 18:00:02 mnemosyne kernel: </TASK>
Oct 03 18:00:02 mnemosyne kernel: NVRM: _kgspLogXid119: ********************************************************************************
Oct 03 18:00:02 mnemosyne kernel: NVRM: _issueRpcAndWait: rpcRecvPoll timedout for fn 76 sequence 6554!
Oct 03 18:00:02 mnemosyne kernel: NVRM: nvCheckOkFailedNoLog: Check failed: Call timed out [NV_ERR_TIMEOUT] (0x00000065) returned from pRmApi->Control(pRmApi, pGpu->hInternalClient, pGpu->hInternalSubdevice, NV2080_CTRL_CMD_INTERNAL_LOG_OO>
Oct 03 18:00:02 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 154, GPU recovery action changed from 0x0 (None) to 0x1 (GPU Reset Required)
Oct 03 18:00:17 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 109, pid=2168, name=cosmic-files-ap, channel 0x00000014, errorString CTX SWITCH TIMEOUT, Info 0x184003
Oct 03 18:00:22 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 109, pid=2168, name=cosmic-files-ap, channel 0x00000014, errorString CTX SWITCH TIMEOUT, Info 0x184003
Oct 03 18:00:26 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 109, pid=2168, name=cosmic-files-ap, channel 0x00000014, errorString CTX SWITCH TIMEOUT, Info 0x184003
Oct 03 18:00:31 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 109, pid=2168, name=cosmic-files-ap, channel 0x00000014, errorString CTX SWITCH TIMEOUT, Info 0x184003
Oct 03 18:00:35 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 109, pid=2168, name=cosmic-files-ap, channel 0x00000014, errorString CTX SWITCH TIMEOUT, Info 0x184003
Oct 03 18:00:40 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 109, pid=2168, name=cosmic-files-ap, channel 0x00000014, errorString CTX SWITCH TIMEOUT, Info 0x184003
Oct 03 18:00:44 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 109, pid=2168, name=cosmic-files-ap, channel 0x00000014, errorString CTX SWITCH TIMEOUT, Info 0x184003
Oct 03 18:00:47 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 119, Timeout after 45s of waiting for RPC response from GPU0 GSP! Expected function 76 (GSP_RM_CONTROL) sequence 6555 (0x2080a0d1 0x7e8).
Oct 03 18:00:47 mnemosyne kernel: NVRM: _issueRpcAndWait: rpcRecvPoll timedout for fn 76 sequence 6555!
Oct 03 18:00:48 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 109, pid=2168, name=cosmic-files-ap, channel 0x00000014, errorString CTX SWITCH TIMEOUT, Info 0x184003
Oct 03 18:00:53 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 109, pid=2168, name=cosmic-files-ap, channel 0x00000014, errorString CTX SWITCH TIMEOUT, Info 0x184003
Oct 03 18:00:57 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 109, pid=2168, name=cosmic-files-ap, channel 0x00000014, errorString CTX SWITCH TIMEOUT, Info 0x184003
Oct 03 18:01:02 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 109, pid=2168, name=cosmic-files-ap, channel 0x00000014, errorString CTX SWITCH TIMEOUT, Info 0x184003
Oct 03 18:01:06 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 109, pid=2168, name=cosmic-files-ap, channel 0x00000014, errorString CTX SWITCH TIMEOUT, Info 0x184003
Oct 03 18:01:10 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 109, pid=2168, name=cosmic-files-ap, channel 0x00000014, errorString CTX SWITCH TIMEOUT, Info 0x184003
Oct 03 18:01:15 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 109, pid=2168, name=cosmic-files-ap, channel 0x00000014, errorString CTX SWITCH TIMEOUT, Info 0x184003
Oct 03 18:01:19 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 109, pid=2168, name=cosmic-files-ap, channel 0x00000014, errorString CTX SWITCH TIMEOUT, Info 0x184003
Oct 03 18:01:24 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 109, pid=2168, name=cosmic-files-ap, channel 0x00000014, errorString CTX SWITCH TIMEOUT, Info 0x184003
Oct 03 18:01:28 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 109, pid=2168, name=cosmic-files-ap, channel 0x00000014, errorString CTX SWITCH TIMEOUT, Info 0x184003
Oct 03 18:01:32 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 119, pid=1154, name=nvidia-modeset/, Timeout after 45s of waiting for RPC response from GPU0 GSP! Expected function 76 (GSP_RM_CONTROL) sequence 6556 (0x20802801 0x4).
Oct 03 18:01:32 mnemosyne kernel: NVRM: nvAssertFailedNoLog: Assertion failed: Back to back GSP RPC timeout detected! GPU marked for reset @ kernel_gsp.c:2366
Oct 03 18:01:32 mnemosyne kernel: NVRM: _issueRpcAndWait: rpcRecvPoll timedout for fn 76 sequence 6556!
Oct 03 18:01:32 mnemosyne kernel: NVRM: nvAssertFailedNoLog: Assertion failed: (status == NV_OK) || (status == NV_ERR_GPU_IN_FULLCHIP_RESET) @ mem.c:180
Oct 03 18:01:32 mnemosyne kernel: NVRM: nvAssertFailedNoLog: Assertion failed: (status == NV_OK) || (status == NV_ERR_GPU_IN_FULLCHIP_RESET) @ vaspace_api.c:573
Oct 03 18:01:32 mnemosyne kernel: NVRM: nvAssertFailedNoLog: Assertion failed: (status == NV_OK) || (status == NV_ERR_GPU_IN_FULLCHIP_RESET) @ mem.c:180
Oct 03 18:01:32 mnemosyne kernel: NVRM: nvAssertFailedNoLog: Assertion failed: (status == NV_OK) || (status == NV_ERR_GPU_IN_FULLCHIP_RESET) @ vaspace_api.c:573
Oct 03 18:01:32 mnemosyne kernel: NVRM: nvAssertFailedNoLog: Assertion failed: (status == NV_OK) || (status == NV_ERR_GPU_IN_FULLCHIP_RESET) @ rs_client.c:844
Oct 03 18:01:32 mnemosyne kernel: NVRM: nvAssertFailedNoLog: Assertion failed: (status == NV_OK) || (status == NV_ERR_GPU_IN_FULLCHIP_RESET) @ rs_server.c:259
[many more failing assertions follow]
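One detail in the "RPC history (CPU -> GSP)" table above can be checked by hand: for the completed entries, ts_end minus ts_start equals the logged duration if the timestamps are read as microsecond counters. That unit is an inference from the numbers in this log, not something the driver documents here. The entry for sequence 6554 has ts_end 0, which fits the RPC that never returned. A minimal sketch of that check, with the rows copied from the log and a hypothetical helper name:

```python
# Rows copied from the "RPC history (CPU -> GSP)" table in the log:
# (sequence, ts_start, ts_end, duration in us as logged; None if not logged)
ROWS = [
    (6554, 0x000640432F1F20C6, 0x0000000000000000, None),  # still in flight
    (6553, 0x000640432F1E7421, 0x000640432F1E7F67, 2886),
    (6552, 0x000640432F1D3338, 0x000640432F1D3F1B, 3043),
    (6549, 0x000640432F0B5D07, 0x000640432F0B5FDA, 723),
]


def duration_us(ts_start, ts_end):
    """Return ts_end - ts_start, or None when the RPC never completed.

    Assumes (from the matching numbers only) that both timestamps are
    microsecond counters and that ts_end == 0 means "no response yet".
    """
    if ts_end == 0:
        return None
    return ts_end - ts_start


if __name__ == "__main__":
    for seq, start, end, logged in ROWS:
        print(seq, duration_us(start, end), logged)
```

Under that reading, every RPC before 6554 completed within a few milliseconds, so the hang starts with control call 0x20800a56 rather than building up gradually.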
To Reproduce
Light use of the system with the 580.82.07 driver.
Bug Incidence
Sometimes
nvidia-bug-report.log.gz
nvidia-bug-report.log.gz: this is the output of sudo nvidia-bug-report.sh, added for completeness. I don't think it covers the freeze, because it was generated after a subsequent reboot.
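Because the attached report was generated after the reboot, the kernel messages from the boot that froze may still be retrievable from the systemd journal, assuming persistent journaling is enabled on the machine. The sketch below only wraps the standard journalctl options -k (kernel messages) and -b -1 (previous boot); the function names are hypothetical.

```python
import subprocess


def previous_boot_kernel_log_cmd(boot_offset=-1):
    """Build a journalctl command for the kernel messages of an earlier boot.

    boot_offset=-1 selects the boot before the current one.
    """
    return ["journalctl", "--no-pager", "-k", "-b", str(boot_offset)]


def read_previous_boot_kernel_log(boot_offset=-1):
    """Run journalctl and return its text output.

    Assumes journalctl is installed and the journal is persistent;
    otherwise the previous boot is not available and the call fails.
    """
    result = subprocess.run(
        previous_boot_kernel_log_cmd(boot_offset),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    print(" ".join(previous_boot_kernel_log_cmd()))
```

Saving that output to a file right after the forced reboot would give a log that covers the freeze itself.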
More Info
Switching to tty3 also fails (unclear if related), which makes recovery hard:
Oct 03 18:01:32 mnemosyne systemd[1]: Started getty@tty3.service - Getty on tty3.
Oct 03 18:01:32 mnemosyne kernel: fbcon: Taking over console
Oct 03 18:01:32 mnemosyne kernel: nvidia-modeset: ERROR: GPU:0: The requested configuration of display devices (Asustek Computer Inc VG27AQL3A (DP-0)) is not supported on this GPU.
Oct 03 18:01:32 mnemosyne kernel: Console: switching to colour frame buffer device 160x45
Oct 03 18:01:32 mnemosyne kernel: nvidia-modeset: ERROR: GPU:0: The requested configuration of display devices (Asustek Computer Inc VG27AQL3A (DP-0)) is not supported on this GPU.
Oct 03 18:01:32 mnemosyne kernel: nvidia-modeset: ERROR: GPU:0: The requested configuration of display devices (Asustek Computer Inc VG27AQL3A (DP-0)) is not supported on this GPU.
Oct 03 18:01:32 mnemosyne kernel: nvidia-modeset: ERROR: GPU:0: The requested configuration of display devices (Asustek Computer Inc VG27AQL3A (DP-0)) is not supported on this GPU.
Oct 03 18:01:32 mnemosyne kernel: NVRM: nvAssertFailedNoLog: Assertion failed: NV_OK == status @ gpu_vaspace.c:5187
Oct 03 18:01:32 mnemosyne kernel: NVRM: nvAssertFailedNoLog: Assertion failed: NV_OK == status @ gpu_vaspace.c:4129
Oct 03 18:01:32 mnemosyne kernel: NVRM: nvAssertFailedNoLog: Assertion failed: NV_OK == status @ gpu_vaspace.c:611
Oct 03 18:01:32 mnemosyne kernel: NVRM: vaspaceapiConstruct_IMPL: Could not construct VA space. Status 62
Oct 03 18:01:32 mnemosyne kernel: NVRM: nvAssertFailedNoLog: Assertion failed: (status == NV_OK) || (status == NV_ERR_GPU_IN_FULLCHIP_RESET) @ mem.c:180
Oct 03 18:01:33 mnemosyne kernel: NVRM: vaspaceapiConstruct_IMPL: Could not construct VA space. Status 62
[... more assertions failing ...]
Oct 03 18:01:34 mnemosyne kernel: NVRM: nvAssertFailedNoLog: Assertion failed: (status == NV_OK) || (status == NV_ERR_GPU_IN_FULLCHIP_RESET) @ rs_server.c:1375
Oct 03 18:01:35 mnemosyne systemd[1]: run-user-1000-gvfs.mount: Deactivated successfully.
Oct 03 18:01:35 mnemosyne systemd[1]: run-user-1000-doc.mount: Deactivated successfully.
Oct 03 18:01:35 mnemosyne kernel: nvidia-modeset: ERROR: GPU:0: The requested configuration of display devices (Asustek Computer Inc VG27AQL3A (DP-0)) is not supported on this GPU.
Oct 03 18:01:36 mnemosyne kernel: NVRM: nvAssertFailedNoLog: Assertion failed: NV_OK == status @ gpu_vaspace.c:5187
Oct 03 18:01:36 mnemosyne kernel: NVRM: nvAssertFailedNoLog: Assertion failed: NV_OK == status @ gpu_vaspace.c:4129
Oct 03 18:01:36 mnemosyne kernel: NVRM: nvAssertFailedNoLog: Assertion failed: NV_OK == status @ gpu_vaspace.c:611
Oct 03 18:01:36 mnemosyne kernel: NVRM: vaspaceapiConstruct_IMPL: Could not construct VA space. Status 62
Oct 03 18:01:36 mnemosyne kernel: nvidia-modeset: ERROR: GPU:0: The requested configuration of display devices (Asustek Computer Inc VG27AQL3A (DP-0)) is not supported on this GPU.
Oct 03 18:01:36 mnemosyne kernel: nvidia-modeset: ERROR: GPU:0: The requested configuration of display devices (Asustek Computer Inc VG27AQL3A (DP-0)) is not supported on this GPU.
Oct 03 18:01:36 mnemosyne kernel: nvidia-modeset: ERROR: GPU:0: The requested configuration of display devices (Asustek Computer Inc VG27AQL3A (DP-0)) is not supported on this GPU.
Oct 03 18:01:37 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 109, channel 0x00000014, errorString CTX SWITCH TIMEOUT, Info 0x184003
Oct 03 18:01:37 mnemosyne kernel: nvidia 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x7a53f058 flags=0x0020]
Oct 03 18:01:37 mnemosyne kernel: nvidia 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x7a53f058 flags=0x0020]
Oct 03 18:01:37 mnemosyne kernel: NVRM: nvCheckFailedNoLog: Check failed: pKernelChannel != NULL @ kernel_gsp.c:588
Oct 03 18:01:37 mnemosyne kernel: NVRM: _kgspProcessRpcEvent: Failed to process received event 0x1004 (RC_TRIGGERED) from GPU0: status=0x21
Oct 03 18:01:41 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 109, channel 0x00000014, errorString CTX SWITCH TIMEOUT, Info 0x184003
Oct 03 18:01:41 mnemosyne kernel: nvidia 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x7a53f058 flags=0x0020]
Oct 03 18:01:41 mnemosyne kernel: nvidia 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x7a53f058 flags=0x0020]
Oct 03 18:01:41 mnemosyne kernel: NVRM: nvCheckFailedNoLog: Check failed: pKernelChannel != NULL @ kernel_gsp.c:588
Oct 03 18:01:41 mnemosyne kernel: NVRM: _kgspProcessRpcEvent: Failed to process received event 0x1004 (RC_TRIGGERED) from GPU0: status=0x21
Oct 03 18:01:44 mnemosyne kernel: nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000ca7e:6:0:0x00000062
Oct 03 18:01:46 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 109, channel 0x00000014, errorString CTX SWITCH TIMEOUT, Info 0x184003
Oct 03 18:01:46 mnemosyne kernel: nvidia 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x7a53f058 flags=0x0020]
Oct 03 18:01:46 mnemosyne kernel: nvidia 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x7a53f058 flags=0x0020]
Oct 03 18:01:46 mnemosyne kernel: NVRM: nvCheckFailedNoLog: Check failed: pKernelChannel != NULL @ kernel_gsp.c:588
Oct 03 18:01:46 mnemosyne kernel: NVRM: _kgspProcessRpcEvent: Failed to process received event 0x1004 (RC_TRIGGERED) from GPU0: status=0x21
Oct 03 18:01:50 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 109, channel 0x00000014, errorString CTX SWITCH TIMEOUT, Info 0x184003
Oct 03 18:01:50 mnemosyne kernel: nvidia 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x7a53f058 flags=0x0020]
Oct 03 18:01:50 mnemosyne kernel: nvidia 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x7a53f058 flags=0x0020]
Oct 03 18:01:50 mnemosyne kernel: NVRM: nvCheckFailedNoLog: Check failed: pKernelChannel != NULL @ kernel_gsp.c:588
Oct 03 18:01:50 mnemosyne kernel: NVRM: _kgspProcessRpcEvent: Failed to process received event 0x1004 (RC_TRIGGERED) from GPU0: status=0x21
Oct 03 18:01:54 mnemosyne kernel: nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000ca7e:6:0:0x00000062
Oct 03 18:01:54 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 109, channel 0x00000014, errorString CTX SWITCH TIMEOUT, Info 0x184003
Oct 03 18:01:54 mnemosyne kernel: nvidia 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x7a53f058 flags=0x0020]
Oct 03 18:01:54 mnemosyne kernel: nvidia 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x7a53f058 flags=0x0020]
Oct 03 18:01:54 mnemosyne kernel: NVRM: nvCheckFailedNoLog: Check failed: pKernelChannel != NULL @ kernel_gsp.c:588
Oct 03 18:01:54 mnemosyne kernel: NVRM: _kgspProcessRpcEvent: Failed to process received event 0x1004 (RC_TRIGGERED) from GPU0: status=0x21
Oct 03 18:01:59 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 109, channel 0x00000014, errorString CTX SWITCH TIMEOUT, Info 0x184003
Oct 03 18:01:59 mnemosyne kernel: nvidia 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x7a53f058 flags=0x0020]
Oct 03 18:01:59 mnemosyne kernel: nvidia 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x7a53f058 flags=0x0020]
Oct 03 18:01:59 mnemosyne kernel: NVRM: nvCheckFailedNoLog: Check failed: pKernelChannel != NULL @ kernel_gsp.c:588
Oct 03 18:01:59 mnemosyne kernel: NVRM: _kgspProcessRpcEvent: Failed to process received event 0x1004 (RC_TRIGGERED) from GPU0: status=0x21
Oct 03 18:02:02 mnemosyne kernel: nvidia-modeset: ERROR: GPU:0: The requested configuration of display devices (Asustek Computer Inc VG27AQL3A (DP-0)) is not supported on this GPU.
Oct 03 18:02:03 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 109, channel 0x00000014, errorString CTX SWITCH TIMEOUT, Info 0x184003
Oct 03 18:02:03 mnemosyne kernel: nvidia 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x7a53f058 flags=0x0020]
Oct 03 18:02:03 mnemosyne kernel: nvidia 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x7a53f058 flags=0x0020]
Oct 03 18:02:03 mnemosyne kernel: NVRM: nvCheckFailedNoLog: Check failed: pKernelChannel != NULL @ kernel_gsp.c:588
Oct 03 18:02:03 mnemosyne kernel: NVRM: _kgspProcessRpcEvent: Failed to process received event 0x1004 (RC_TRIGGERED) from GPU0: status=0x21
Oct 03 18:02:04 mnemosyne kernel: nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000ca7e:6:0:0x00000062
Oct 03 18:02:06 mnemosyne kernel: NVRM: nvAssertFailedNoLog: Assertion failed: (status == NV_OK) || (status == NV_ERR_GPU_IN_FULLCHIP_RESET) @ rs_client.c:844
Oct 03 18:02:06 mnemosyne kernel: NVRM: nvAssertFailedNoLog: Assertion failed: (status == NV_OK) || (status == NV_ERR_GPU_IN_FULLCHIP_RESET) @ rs_server.c:259
Oct 03 18:02:06 mnemosyne kernel: NVRM: nvAssertFailedNoLog: Assertion failed: (status == NV_OK) || (status == NV_ERR_GPU_IN_FULLCHIP_RESET) @ rs_server.c:1375
Oct 03 18:02:06 mnemosyne kernel: NVRM: nvAssertFailedNoLog: Assertion failed: (status == NV_OK) || (status == NV_ERR_GPU_IN_FULLCHIP_RESET) @ rs_client.c:844
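Both log excerpts are dominated by a handful of Xid codes (62, 45, 119, 154, 109), which are easy to lose among the assertion failures. A minimal sketch for tallying them from journal text follows; the regular expression is written against the line format seen in this report, and the helper name is hypothetical.

```python
import re
from collections import Counter

# Matches lines such as "NVRM: Xid (PCI:0000:01:00): 109, pid=2168, ..."
XID_RE = re.compile(r"NVRM: Xid \(PCI:[0-9a-fA-F:.]+\): (\d+),")


def tally_xids(log_text):
    """Count how often each Xid code appears in the given journal text."""
    counts = Counter()
    for line in log_text.splitlines():
        match = XID_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


# A few lines copied from the logs in this report.
SAMPLE = """\
Oct 03 17:59:17 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 62, 322bff9a 0000b3b8 00000000 206a4b08 206a3cc8 206a3e36 206a2284 206a2ad6
Oct 03 17:59:47 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 45, pid=2171, name=xdg-desktop-por, channel 0x0000000a
Oct 03 18:00:02 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 154, GPU recovery action changed from 0x0 (None) to 0x1 (GPU Reset Required)
Oct 03 18:00:17 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 109, pid=2168, name=cosmic-files-ap, channel 0x00000014, errorString CTX SWITCH TIMEOUT, Info 0x184003
Oct 03 18:01:37 mnemosyne kernel: NVRM: Xid (PCI:0000:01:00): 109, channel 0x00000014, errorString CTX SWITCH TIMEOUT, Info 0x184003
"""

if __name__ == "__main__":
    for code, count in sorted(tally_xids(SAMPLE).items()):
        print(f"Xid {code}: {count}")
```

Run over the full journal, the ordering of the first occurrences (62, then 45, then 119) is probably more useful for triage than the raw counts, since the later 109 entries repeat every few seconds.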
I assume this is related to https://github.com/NVIDIA/open-gpu-kernel-modules/issues/446, which was reported against version 525.85.05.
Same issue here. Please let me know if you find a solution.
I had a similar problem after using the GPU for 45 seconds with ComfyUI. The whole system froze and the device no longer showed up in nvidia-smi; it only works again after a reboot. I have an NVIDIA PRO 6000 Blackwell and paid a huge amount of money for it, only to get driver errors that leave it unusable. I tried the 580, 575, and 570 (open) drivers, and all of them eventually fail.
I uploaded my failure report, which captured the failure perfectly, in issue #949.