GPU lost from the bus [NV_ERR_GPU_IS_LOST][NV_ERR_GPU_IN_FULLCHIP_RESET] / Unable to access opcode bytes @ Zotac RTX 4090
NVIDIA Open GPU Kernel Modules Version
575.64.03
Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.
- [ ] I confirm that this does not happen with the proprietary driver package.
Operating System and Version
Ubuntu 24.04.3 LTS
Kernel Release
6.14.0-27-generic
Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.
- [x] I am running on a stable kernel release.
Hardware: GPU
NVIDIA GeForce RTX 4090
Describe the bug
I run a CUDA program and within about an hour the driver crashes. The GPUs have enough power and the system is not overheating (it's watercooled). I've tried different kernel versions, different driver versions, a CMOS reset, and default BIOS settings as well as various other options, and the same issue happens every time. It happens with both the proprietary and the open kernel driver. When the driver crashes, the screen drops to a blank text console with the message: "nvidia-modeset: ERROR: GPU: Error while waiting for GPU progress".
Here are some stack traces.
Call Trace 1
NVRM: VM: nv_free_pages: 0x1
NVRM: VM: nv_free_pages:3890: 0x00000000a5cab6f8, 1 page(s), count = 1, page_table = 0x000000002de50b39
NVRM: VM: nv_free_system_pages: 1 pages
NVRM: VM: nvidia_vma_release:101: 0x775fb8c90000 - 0x775fb8ca0000, 0x00010000 bytes @ 0x0000000000000000, 0x0000000000000000, 0x00000000a14af54d
NVRM: VM: nvidia_vma_release:101: 0x775fb8d82000 - 0x775fb8d92000, 0x00010000 bytes @ 0x0000000000000000, 0x0000000000000000, 0x00000000274e7f81
NVRM: VM: nvidia_vma_release:101: 0x775fb8d92000 - 0x775fb8da2000, 0x00010000 bytes @ 0x0000000000000000, 0x0000000000000000, 0x000000009357fad0
NVRM: VM: nvidia_vma_release:101: 0x775fbd00d000 - 0x775fbd00e000, 0x00001000 bytes @ 0x0000000000000000, 0x00000000727b7b44, 0x0000000034deee6b
NVRM: VM: nv_alloc_release:1766: 0x00000000727b7b44, 1 page(s), count = 13, page_table = 0x0000000036527f65
NVRM: VM: nvidia_vma_release:101: 0x775fbd00e000 - 0x775fbd01e000, 0x00010000 bytes @ 0x0000000000000000, 0x0000000000000000, 0x0000000060415cca
NVRM: VM: nvidia_vma_release:101: 0x775fbf2fb000 - 0x775fbf2fc000, 0x00001000 bytes @ 0x0000000000000000, 0x00000000855cd83a, 0x00000000b5dddb94
NVRM: VM: nv_alloc_release:1766: 0x00000000855cd83a, 1 page(s), count = 13, page_table = 0x000000005a97ce7a
NVRM: VM: nvidia_vma_release:101: 0x775fbf2fc000 - 0x775fbf2fd000, 0x00001000 bytes @ 0x0000000000000000, 0x00000000c4abedf2, 0x000000001cf6f6b6
NVRM: VM: nv_alloc_release:1766: 0x00000000c4abedf2, 1 page(s), count = 13, page_table = 0x000000003ebe2652
NVRM: VM: nvidia_vma_release:101: 0x775fbf2fd000 - 0x775fbf2fe000, 0x00001000 bytes @ 0x0000000000000000, 0x00000000e88005e7, 0x00000000df9a72b8
NVRM: VM: nv_alloc_release:1766: 0x00000000e88005e7, 1 page(s), count = 13, page_table = 0x000000008757d474
NVRM: VM: nvidia_vma_release:101: 0x775fbf2fe000 - 0x775fbf2ff000, 0x00001000 bytes @ 0x0000000000000000, 0x0000000081132f33, 0x000000008790cb97
NVRM: VM: nv_alloc_release:1766: 0x0000000081132f33, 1 page(s), count = 13, page_table = 0x00000000ae5fdcec
NVRM: VM: nvidia_vma_release:101: 0x775fbf2ff000 - 0x775fbf30f000, 0x00010000 bytes @ 0x0000000000000000, 0x0000000000000000, 0x00000000d0779534
WARNING: CPU: 12 PID: 200421 at nvidia/nv.c:5039 nvidia_dev_put+0xb1/0xc0 [nvidia]
Modules linked in: iptable_filter xt_comment iptable_nat nf_conntrack_netlink veth xt_MASQUERADE bridge stp llc xt_set ip_set xfrm_user xfrm_algo snd_seq_dummy snd_hrtimer nvidia_uvm(OE) overlay qrtr ip6t_REJECT xt_hl ip6t_rt ipt_REJECT xt_LOG nf_log_syslog nft_limit xt_limit xt_addrtype xt_tcpudp xt_conntrack nft_compat binfmt_misc nls_iso8859_1 ipmi_ssif nvidia_drm(POE) nvidia_modeset(OE) snd_hda_codec_realtek snd_hda_codec_generic snd_hda_scodec_component snd_hda_codec_hdmi amd_atl intel_rapl_msr intel_rapl_common snd_hda_intel amd64_edac edac_mce_amd snd_intel_dspcfg snd_intel_sdw_acpi snd_usb_audio kvm_amd snd_hda_codec nvidia(OE) snd_usbmidi_lib snd_hda_core snd_ump snd_hwdep kvm snd_pcm spd5118 irqbypass snd_seq_midi snd_seq_midi_event polyval_clmulni polyval_generic ghash_clmulni_intel snd_rawmidi sha256_ssse3 sha1_ssse3 snd_seq aesni_intel crypto_simd mfd_aaeon eeepc_wmi cryptd asus_wmi snd_seq_device sparse_keymap snd_timer wmi_bmof platform_profile rapl drm_ttm_helper mc snd acpi_ipmi ttm
i2c_piix4 ipmi_si video soundcore ccp k10temp i2c_smbus ipmi_devintf joydev input_leds ipmi_msghandler gpio_amdpt mac_hid nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_masq nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 sch_fq_codel nf_tables msr parport_pc ppdev lp parport efi_pstore nfnetlink dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic xor raid6_pq dm_mirror dm_region_hash dm_log cdc_ether usbnet uas usb_storage mii hid_generic nvme i40e ahci thunderbolt libahci nvme_core nvme_auth libie wmi ucsi_acpi typec_ucsi typec usbhid hid
CPU: 12 UID: 1000 PID: 200421 Comm: pool-gnome-cale Tainted: P OE 6.14.0-27-generic #27~24.04.1-Ubuntu
Tainted: [P]=PROPRIETARY_MODULE, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
Hardware name: ASUS System Product Name/Pro WS WRX90E-SAGE SE, BIOS 1203 07/18/2025
RIP: 0010:nvidia_dev_put+0xb1/0xc0 [nvidia]
NVRM: nvidia_close on GPU with minor number 255
NVRM: nvidia_ctl_close
Code: 31 d2 31 f6 31 ff e9 29 0d 08 dd 48 c7 c7 f0 35 6b c1 e8 f2 1e 46 de 5b 41 5c 41 5d 5d 31 c0 31 d2 31 f6 31 ff e9 0a 0d 08 dd <0f> 0b eb c2 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90
RSP: 0018:ff53a0042213f910 EFLAGS: 00010202
RAX: 0000000000000026 RBX: ff391ece5bb18000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ff53a0042213f860
RBP: ff53a0042213f928 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ff391ece5bb186a8
R13: 0000000000000000 R14: ff391ece55d8d6a0 R15: ffffffffc16b3740
FS: 0000000000000000(0000) GS:ff391f4bbca00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00006204224ec800 CR3: 00000002af4a9005 CR4: 0000000000f71ef0
PKRU: 55555554
NVRM: VM: nv_free_pages: 0x1
Call Trace:
NVRM: VM: nv_free_pages:3890: 0x000000009818b312, 1 page(s), count = 1, page_table = 0x00000000b24f713
NVRM: VM: nv_free_system_pages: 1 pages
<TASK>
nvidia_close+0x1a2/0x270 [nvidia]
__fput+0xea/0x2d0
____fput+0x15/0x20
task_work_run+0x5d/0xa0
do_exit+0x26c/0x4e0
do_group_exit+0x34/0x90
get_signal+0x8cb/0x8d0
arch_do_signal_or_restart+0x39/0x110
syscall_exit_to_user_mode+0x146/0x1d0
do_syscall_64+0x8a/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? arch_exit_to_user_mode_prepare.isra.0+0x22/0xd0
? srso_alias_return_thunk+0x5/0xfbef5
? syscall_exit_to_user_mode+0x38/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? aa_file_perm+0x13b/0x2d0
? srso_alias_return_thunk+0x5/0xfbef5
? eventfd_read+0xdc/0x200
? security_file_permission+0x36/0x60
? srso_alias_return_thunk+0x5/0xfbef5
? srso_alias_return_thunk+0x5/0xfbef5
? vfs_read+0x2a8/0x390
? srso_alias_return_thunk+0x5/0xfbef5
? ksys_read+0x9d/0xf0
? srso_alias_return_thunk+0x5/0xfbef5
? arch_exit_to_user_mode_prepare.isra.0+0x22/0xd0
? srso_alias_return_thunk+0x5/0xfbef5
? syscall_exit_to_user_mode+0x38/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? do_futex+0x18e/0x260
? srso_alias_return_thunk+0x5/0xfbef5
? __x64_sys_futex+0x12a/0x200
? srso_alias_return_thunk+0x5/0xfbef5
? arch_exit_to_user_mode_prepare.isra.0+0x22/0xd0
? srso_alias_return_thunk+0x5/0xfbef5
? arch_exit_to_user_mode_prepare.isra.0+0x22/0xd0
? srso_alias_return_thunk+0x5/0xfbef5
? syscall_exit_to_user_mode+0x38/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? syscall_exit_to_user_mode+0x38/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? irqentry_exit_to_user_mode+0x2d/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? irqentry_exit+0x43/0x50
? srso_alias_return_thunk+0x5/0xfbef5
? exc_page_fault+0x96/0x1e0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x79004a91b4cd
Code: Unable to access opcode bytes at 0x79004a91b4a3.
RSP: 002b:000079000dbfb7a0 EFLAGS: 00000293 ORIG_RAX: 0000000000000007
RAX: fffffffffffffdfc RBX: 00007900040029d0 RCX: 000079004a91b4cd
RDX: 00000000ffffffff RSI: 0000000000000001 RDI: 000078fffc000bb0
RBP: 000079000dbfb7c0 R08: 0000000000000000 R09: 000000007fffffff
R10: 00007900040029d0 R11: 0000000000000293 R12: 000000007fffffff
R13: 000079004af52c10 R14: 0000000000000001 R15: 000078fffc000bb0
Call Trace 2
Another one happened in the same second as the first one.
------------[ cut here ]------------
NVRM: nvidia_close on GPU with minor number 4
NVRM: nvidia_close on GPU with minor number 0
NVRM: nvidia_close on GPU with minor number 2
NVRM: nvidia_close on GPU with minor number 1
NVRM: nvidia_close on GPU with minor number 4
NVRM: nvidia_close on GPU with minor number 0
NVRM: nvidia_close on GPU with minor number 2
NVRM: nvidia_close on GPU with minor number 1
NVRM: nvidia_close on GPU with minor number 4
NVRM: nvidia_close on GPU with minor number 0
NVRM: nvidia_close on GPU with minor number 1
NVRM: nvidia_close on GPU with minor number 4
NVRM: nvidia_close on GPU with minor number 1
NVRM: nvidia_close on GPU with minor number 255
NVRM: nvidia_close on GPU with minor number 4
NVRM: nvidia_ctl_close
WARNING: CPU: 28 PID: 6705 at nvidia/nv.c:5039 nvidia_dev_put+0xb1/0xc0 [nvidia]
NVRM: nvidia_close on GPU with minor number 1
NVRM: nvidia_close on GPU with minor number 4
NVRM: VM: nv_free_pages: 0x1
NVRM: VM: nv_free_pages:3890: 0x00000000c2b1c48f, 1 page(s), count = 1, page_table = 0x00000000bba9b7e5
NVRM: VM: nv_free_system_pages: 1 pages
Code: 16 f0 c5 fa 7f 07 c5 fa 7f 4c 17 f0 c3 62 e1 fe 28 6f 06 62 e1 fe 28 6f 4c 16 ff 62 e1 fe 28 7f 07 62 e1 fe 28 7f 4c 17 ff c3 <48> 8b 4c 16 f8 48 8b 36 48 89 37 48 89 4c 17 f8 c3 62 e1 fe 48 6f
CPU: 28 UID: 1000 PID: 6705 Comm: xdg-desktop-por Tainted: P W OE 6.14.0-27-generic #27~24.04.1-Ubuntu
Tainted: [P]=PROPRIETARY_MODULE, [W]=WARN, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
Hardware name: ASUS System Product Name/Pro WS WRX90E-SAGE SE, BIOS 1203 07/18/2025
RIP: 0010:nvidia_dev_put+0xb1/0xc0 [nvidia]
Code: 31 d2 31 f6 31 ff e9 29 0d 08 dd 48 c7 c7 f0 35 6b c1 e8 f2 1e 46 de 5b 41 5c 41 5d 5d 31 c0 31 d2 31 f6 31 ff e9 0a 0d 08 dd <0f> 0b eb c2 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90
RSP: 0018:ff53a004226b79a0 EFLAGS: 00010202
RAX: 0000000000000026 RBX: ff391ece5bb18000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ff53a004226b78f0
RBP: ff53a004226b79b8 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ff391ece5bb186a8
R13: 0000000000000000 R14: ff391ece55d8d6a0 R15: ffffffffc16b3740
FS: 0000000000000000(0000) GS:ff391f4bbd200000(0000) knlGS:0000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007dee4d1166e0 CR3: 000000014b673003 CR4: 0000000000f71ef0
PKRU: 55555554
Call Trace:
<TASK>
nvidia_close+0x1a2/0x270 [nvidia]
__fput+0xea/0x2d0
____fput+0x15/0x20
task_work_run+0x5d/0xa0
do_exit+0x26c/0x4e0
do_group_exit+0x34/0x90
get_signal+0x8cb/0x8d0
arch_do_signal_or_restart+0x39/0x110
syscall_exit_to_user_mode+0x146/0x1d0
do_syscall_64+0x8a/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? do_futex+0x18e/0x260
? srso_alias_return_thunk+0x5/0xfbef5
? __x64_sys_futex+0x12a/0x200
? srso_alias_return_thunk+0x5/0xfbef5
? fput+0x157/0x190
? srso_alias_return_thunk+0x5/0xfbef5
? srso_alias_return_thunk+0x5/0xfbef5
? arch_exit_to_user_mode_prepare.isra.0+0x22/0xd0
? srso_alias_return_thunk+0x5/0xfbef5
? syscall_exit_to_user_mode+0x38/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? do_filp_open+0xd4/0x1a0
? srso_alias_return_thunk+0x5/0xfbef5
? putname+0x60/0x80
? srso_alias_return_thunk+0x5/0xfbef5
? do_sys_openat2+0x9f/0xe0
? srso_alias_return_thunk+0x5/0xfbef5
? arch_exit_to_user_mode_prepare.isra.0+0x22/0xd0
? srso_alias_return_thunk+0x5/0xfbef5
? syscall_exit_to_user_mode+0x38/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? arch_exit_to_user_mode_prepare.isra.0+0x22/0xd0
? srso_alias_return_thunk+0x5/0xfbef5
? syscall_exit_to_user_mode+0x38/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? exc_page_fault+0x96/0x1e0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7c37fa92725d
Code: Unable to access opcode bytes at 0x7c37fa927233.
RSP: 002b:00007c37d61fe808 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
RAX: fffffffffffffe00 RBX: 00006077beef6450 RCX: 00007c37fa92725d
RDX: 0000000000000000 RSI: 0000000000000080 RDI: 00006077beef6460
RBP: 00007c37d61fe840 R08: 0000000000000007 R09: 00007c37d00047e0
R10: 0000000000000000 R11: 0000000000000246 R12: 00007c37d61ff648
R13: 0000000000000000 R14: 00006077beef6460 R15: 0000000000000002
</TASK>
Call Trace 3
Another one happened in the same second.
RIP: 0010:nvidia_dev_put+0xb1/0xc0 [nvidia]
PKRU: 55555554
Call Trace:
<TASK>
nvidia_close+0x1a2/0x270 [nvidia]
__fput+0xea/0x2d0
____fput+0x15/0x20
task_work_run+0x5d/0xa0
do_exit+0x26c/0x4e0
do_group_exit+0x34/0x90
get_signal+0x8cb/0x8d0
arch_do_signal_or_restart+0x39/0x110
syscall_exit_to_user_mode+0x146/0x1d0
do_syscall_64+0x8a/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? syscall_exit_to_user_mode+0x38/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? __pfx_pollwake+0x10/0x10
? srso_alias_return_thunk+0x5/0xfbef5
? rseq_get_rseq_cs+0x22/0x260
? srso_alias_return_thunk+0x5/0xfbef5
? rseq_ip_fixup+0x8f/0x1f0
? srso_alias_return_thunk+0x5/0xfbef5
? arch_exit_to_user_mode_prepare.isra.0+0xc8/0xd0
? srso_alias_return_thunk+0x5/0xfbef5
? syscall_exit_to_user_mode+0x38/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? irqentry_exit_to_user_mode+0x2d/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? irqentry_exit+0x43/0x50
? srso_alias_return_thunk+0x5/0xfbef5
? exc_page_fault+0x96/0x1e0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x775fc631b4cd
Code: Unable to access opcode bytes at 0x775fc631b4a3.
RSP: 002b:0000775fa9e6ab60 EFLAGS: 00000293 ORIG_RAX: 0000000000000007
RAX: fffffffffffffdfc RBX: 000064dd00ef2670 RCX: 0000775fc631b4cd
RDX: 00000000ffffffff RSI: 0000000000000001 RDI: 0000775f98000da0
RBP: 0000775fa9e6ab80 R08: 0000000000000000 R09: 000000007fffffff
R10: 000064dd00ef2670 R11: 0000000000000293 R12: 000000007fffffff
R13: 0000775fc6752c10 R14: 0000000000000001 R15: 0000775f98000da0
</TASK>
Call Trace 4
And another one, same time.
Call Trace:
<TASK>
nvidia_close+0x1a2/0x270 [nvidia]
__fput+0xea/0x2d0
____fput+0x15/0x20
task_work_run+0x5d/0xa0
do_exit+0x26c/0x4e0
do_group_exit+0x34/0x90
get_signal+0x8cb/0x8d0
arch_do_signal_or_restart+0x39/0x110
syscall_exit_to_user_mode+0x146/0x1d0
do_syscall_64+0x8a/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? do_futex+0x18e/0x260
? srso_alias_return_thunk+0x5/0xfbef5
? __x64_sys_futex+0x12a/0x200
? do_syscall_64+0x8a/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? do_futex+0x18e/0x260
? srso_alias_return_thunk+0x5/0xfbef5
? srso_alias_return_thunk+0x5/0xfbef5
? arch_exit_to_user_mode_prepare.isra.0+0x22/0xd0
? srso_alias_return_thunk+0x5/0xfbef5
? syscall_exit_to_user_mode+0x38/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? syscall_exit_to_user_mode+0x38/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? irqentry_exit+0x43/0x50
? srso_alias_return_thunk+0x5/0xfbef5
? sysvec_apic_timer_interrupt+0x57/0xc0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7539ec298d71
Code: Unable to access opcode bytes at 0x7539ec298d47.
Call Trace 5
Then this one a second later.
Call Trace:
<TASK>
nvidia_close+0x1a2/0x270 [nvidia]
__fput+0xea/0x2d0
____fput+0x15/0x20
task_work_run+0x5d/0xa0
do_exit+0x26c/0x4e0
do_group_exit+0x34/0x90
get_signal+0x8cb/0x8d0
arch_do_signal_or_restart+0x39/0x110
syscall_exit_to_user_mode+0x146/0x1d0
do_syscall_64+0x8a/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? arch_exit_to_user_mode_prepare.isra.0+0x22/0xd0
? srso_alias_return_thunk+0x5/0xfbef5
? syscall_exit_to_user_mode+0x38/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? aa_file_perm+0x13b/0x2d0
? srso_alias_return_thunk+0x5/0xfbef5
? eventfd_read+0xdc/0x200
? security_file_permission+0x36/0x60
? srso_alias_return_thunk+0x5/0xfbef5
? srso_alias_return_thunk+0x5/0xfbef5
? vfs_read+0x2a8/0x390
? srso_alias_return_thunk+0x5/0xfbef5
? ksys_read+0x9d/0xf0
? srso_alias_return_thunk+0x5/0xfbef5
? arch_exit_to_user_mode_prepare.isra.0+0x22/0xd0
? srso_alias_return_thunk+0x5/0xfbef5
? syscall_exit_to_user_mode+0x38/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? do_futex+0x18e/0x260
? srso_alias_return_thunk+0x5/0xfbef5
? __x64_sys_futex+0x12a/0x200
? srso_alias_return_thunk+0x5/0xfbef5
? arch_exit_to_user_mode_prepare.isra.0+0x22/0xd0
? srso_alias_return_thunk+0x5/0xfbef5
? arch_exit_to_user_mode_prepare.isra.0+0x22/0xd0
? srso_alias_return_thunk+0x5/0xfbef5
? syscall_exit_to_user_mode+0x38/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? syscall_exit_to_user_mode+0x38/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? irqentry_exit_to_user_mode+0x2d/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? irqentry_exit+0x43/0x50
? srso_alias_return_thunk+0x5/0xfbef5
? exc_page_fault+0x96/0x1e0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x79004a91b4cd
Code: Unable to access opcode bytes at 0x79004a91b4a3.
RSP: 002b:000079000dbfb7a0 EFLAGS: 00000293 ORIG_RAX: 0000000000000007
RAX: fffffffffffffdfc RBX: 00007900040029d0 RCX: 000079004a91b4cd
RDX: 00000000ffffffff RSI: 0000000000000001 RDI: 000078fffc000bb0
RBP: 000079000dbfb7c0 R08: 0000000000000000 R09: 000000007fffffff
R10: 00007900040029d0 R11: 0000000000000293 R12: 000000007fffffff
R13: 000079004af52c10 R14: 0000000000000001 R15: 000078fffc000bb0
</TASK>
Call Trace 6
And another one:
Call Trace:
<TASK>
nvidia_close+0x1a2/0x270 [nvidia]
__fput+0xea/0x2d0
____fput+0x15/0x20
task_work_run+0x5d/0xa0
do_exit+0x26c/0x4e0
do_group_exit+0x34/0x90
get_signal+0x8cb/0x8d0
arch_do_signal_or_restart+0x39/0x110
syscall_exit_to_user_mode+0x146/0x1d0
do_syscall_64+0x8a/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? switch_fpu_return+0x50/0xe0
? srso_alias_return_thunk+0x5/0xfbef5
? arch_exit_to_user_mode_prepare.isra.0+0xc8/0xd0
? srso_alias_return_thunk+0x5/0xfbef5
? syscall_exit_to_user_mode+0x38/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? __futex_wait+0x160/0x1d0
? __pfx_futex_wake_mark+0x10/0x10
? srso_alias_return_thunk+0x5/0xfbef5
? hrtimer_cancel+0x15/0x50
? srso_alias_return_thunk+0x5/0xfbef5
? futex_wait+0x85/0x130
? srso_alias_return_thunk+0x5/0xfbef5
? rseq_get_rseq_cs+0x22/0x260
? srso_alias_return_thunk+0x5/0xfbef5
? rseq_ip_fixup+0x8f/0x1f0
? do_futex+0x105/0x260
? srso_alias_return_thunk+0x5/0xfbef5
? restore_fpregs_from_fpstate+0x3d/0xd0
? srso_alias_return_thunk+0x5/0xfbef5
? switch_fpu_return+0x50/0xe0
? srso_alias_return_thunk+0x5/0xfbef5
? arch_exit_to_user_mode_prepare.isra.0+0xc8/0xd0
? srso_alias_return_thunk+0x5/0xfbef5
? syscall_exit_to_user_mode+0x38/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? syscall_exit_to_user_mode+0x38/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? arch_exit_to_user_mode_prepare.isra.0+0xc8/0xd0
? srso_alias_return_thunk+0x5/0xfbef5
? syscall_exit_to_user_mode+0x38/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? irqentry_exit_to_user_mode+0x2d/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? irqentry_exit+0x43/0x50
? srso_alias_return_thunk+0x5/0xfbef5
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x78480392725d
Code: Unable to access opcode bytes at 0x784803927233.
RSP: 002b:000078470dbfe948 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
RAX: fffffffffffffe00 RBX: 00007847e4f2c000 RCX: 000078480392725d
RDX: 0000000000000000 RSI: 0000000000000189 RDI: 00007847e4f2c040
RBP: 00007847e4f2c040 R08: 0000000000000000 R09: ffffffffffffffff
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007847e4f2c150 R14: 0000000000000000 R15: 0000000000000000
</TASK>
Call Trace 7
And another one:
Call Trace:
<TASK>
nvidia_close+0x1a2/0x270 [nvidia]
__fput+0xea/0x2d0
____fput+0x15/0x20
task_work_run+0x5d/0xa0
do_exit+0x26c/0x4e0
? srso_alias_return_thunk+0x5/0xfbef5
do_group_exit+0x34/0x90
get_signal+0x8cb/0x8d0
arch_do_signal_or_restart+0x39/0x110
syscall_exit_to_user_mode+0x146/0x1d0
do_syscall_64+0x8a/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? schedule+0x3f/0xd0
? srso_alias_return_thunk+0x5/0xfbef5
? futex_wait_queue+0x69/0xa0
? srso_alias_return_thunk+0x5/0xfbef5
? __futex_wait+0x160/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? timerqueue_del+0x31/0x50
? srso_alias_return_thunk+0x5/0xfbef5
? __remove_hrtimer+0x52/0xb0
? srso_alias_return_thunk+0x5/0xfbef5
? hrtimer_try_to_cancel.part.0+0x55/0xf0
? srso_alias_return_thunk+0x5/0xfbef5
? hrtimer_cancel+0x21/0x50
? srso_alias_return_thunk+0x5/0xfbef5
? futex_wait+0x85/0x130
? srso_alias_return_thunk+0x5/0xfbef5
? rseq_get_rseq_cs+0x22/0x260
? srso_alias_return_thunk+0x5/0xfbef5
? srso_alias_return_thunk+0x5/0xfbef5
? futex_wake+0x89/0x190
? srso_alias_return_thunk+0x5/0xfbef5
? do_futex+0x18e/0x260
? srso_alias_return_thunk+0x5/0xfbef5
? __x64_sys_futex+0x12a/0x200
? srso_alias_return_thunk+0x5/0xfbef5
? arch_exit_to_user_mode_prepare.isra.0+0x22/0xd0
? srso_alias_return_thunk+0x5/0xfbef5
? syscall_exit_to_user_mode+0x38/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? restore_fpregs_from_fpstate+0x3d/0xd0
? srso_alias_return_thunk+0x5/0xfbef5
? switch_fpu_return+0x50/0xe0
? srso_alias_return_thunk+0x5/0xfbef5
? arch_exit_to_user_mode_prepare.isra.0+0xc8/0xd0
? srso_alias_return_thunk+0x5/0xfbef5
? irqentry_exit_to_user_mode+0x2d/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? irqentry_exit+0x43/0x50
? srso_alias_return_thunk+0x5/0xfbef5
? sysvec_call_function_single+0x57/0xc0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7158b9e98d71
Code: Unable to access opcode bytes at 0x7158b9e98d47
Call Trace 8
And another one:
Call Trace:
<TASK>
nvidia_close+0x1a2/0x270 [nvidia]
__fput+0xea/0x2d0
____fput+0x15/0x20
task_work_run+0x5d/0xa0
do_exit+0x26c/0x4e0
do_group_exit+0x34/0x90
get_signal+0x8cb/0x8d0
arch_do_signal_or_restart+0x39/0x110
syscall_exit_to_user_mode+0x146/0x1d0
do_syscall_64+0x8a/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? do_futex+0x18e/0x260
? srso_alias_return_thunk+0x5/0xfbef5
? __x64_sys_futex+0x12a/0x200
? srso_alias_return_thunk+0x5/0xfbef5
? fput+0x157/0x190
? srso_alias_return_thunk+0x5/0xfbef5
? srso_alias_return_thunk+0x5/0xfbef5
? arch_exit_to_user_mode_prepare.isra.0+0x22/0xd0
? srso_alias_return_thunk+0x5/0xfbef5
? syscall_exit_to_user_mode+0x38/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? do_filp_open+0xd4/0x1a0
? srso_alias_return_thunk+0x5/0xfbef5
? putname+0x60/0x80
? srso_alias_return_thunk+0x5/0xfbef5
? do_sys_openat2+0x9f/0xe0
? srso_alias_return_thunk+0x5/0xfbef5
? arch_exit_to_user_mode_prepare.isra.0+0x22/0xd0
? srso_alias_return_thunk+0x5/0xfbef5
? syscall_exit_to_user_mode+0x38/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? arch_exit_to_user_mode_prepare.isra.0+0x22/0xd0
? srso_alias_return_thunk+0x5/0xfbef5
? syscall_exit_to_user_mode+0x38/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? exc_page_fault+0x96/0x1e0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7c37fa92725d
Code: Unable to access opcode bytes at 0x7c37fa927233
Call Trace 9
And another one:
Call Trace:
<TASK>
nvidia_close+0x1a2/0x270 [nvidia]
__fput+0xea/0x2d0
____fput+0x15/0x20
task_work_run+0x5d/0xa0
do_exit+0x26c/0x4e0
do_group_exit+0x34/0x90
__x64_sys_exit_group+0x18/0x20
x64_sys_call+0x1666/0x2650
do_syscall_64+0x7e/0x170
? do_syscall_64+0x8a/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? unix_seqpacket_recvmsg+0x43/0x70
? srso_alias_return_thunk+0x5/0xfbef5
? sock_recvmsg+0xde/0xf0
? filp_flush+0x8d/0xb0
? srso_alias_return_thunk+0x5/0xfbef5
? ____sys_recvmsg+0x111/0x230
? syscall_exit_to_user_mode+0x38/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? ___sys_recvmsg+0x9c/0xf0
? do_sys_openat2+0x9f/0xe0
? srso_alias_return_thunk+0x5/0xfbef5
? rseq_get_rseq_cs+0x22/0x260
? srso_alias_return_thunk+0x5/0xfbef5
? rseq_ip_fixup+0x8f/0x1f0
? srso_alias_return_thunk+0x5/0xfbef5
? restore_fpregs_from_fpstate+0x3d/0xd0
? srso_alias_return_thunk+0x5/0xfbef5
? switch_fpu_return+0x50/0xe0
? srso_alias_return_thunk+0x5/0xfbef5
? arch_exit_to_user_mode_prepare.isra.0+0xc8/0xd0
? srso_alias_return_thunk+0x5/0xfbef5
? syscall_exit_to_user_mode+0x38/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? srso_alias_return_thunk+0x5/0xfbef5
? arch_exit_to_user_mode_prepare.isra.0+0x22/0xd0
? srso_alias_return_thunk+0x5/0xfbef5
? syscall_exit_to_user_mode+0x38/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? syscall_exit_to_user_mode+0x38/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x70d49aeee21d
Code: Unable to access opcode bytes at 0x70d49aeee1f3.
There are 6 more stack traces, which I can post if relevant.
To Reproduce
I run a CUDA program and within about an hour it crashes; that's the only way I have to reproduce it. Running a burn test can eventually trigger it as well.
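While reproducing, the kernel log can be streamed in another terminal so that the first Xid / "fallen off the bus" message is captured even if Xorg dies right afterwards (plain log-collection commands, nothing driver-specific):
$ sudo dmesg -wT | grep -iE 'xid|nvrm|fallen off the bus' | tee xid-watch.log
$ journalctl -k -f -o short-precise | tee kern-follow.log    # alternative via journald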
Bug Incidence
Always
nvidia-bug-report.log.gz
I can't attach an nvidia-bug-report: when the driver crashes, Xorg immediately gets a SEGV and I can't do anything; the only option is to reboot (e.g. via the Magic SysRq key).
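A previous boot's kernel log can still be pulled after the reboot, and the report script can be run from the fresh session (it won't reflect the crashed GPU state, only what was logged); both are standard commands:
$ journalctl -k -b -1 --no-pager > kernel-log-previous-boot.txt
$ sudo nvidia-bug-report.sh    # writes nvidia-bug-report.log.gz to the current directory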
More Info
$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX Open Kernel Module for x86_64 575.64.03 Debug Build (kenorb@3XS) Sun 10 Aug 12:36:54 BST 2025
GCC version: gcc version 13.3.0 (Ubuntu 13.3.0-6ubuntu2~24.04)
$ uname -r
6.14.0-27-generic
$ lsmod | rg nvidia
nvidia_uvm 2150400 0
nvidia_drm 135168 58
nvidia_modeset 2101248 18 nvidia_drm
nvidia 14741504 535 nvidia_uvm,nvidia_modeset
drm_ttm_helper 16384 2 nvidia_drm
video 77824 2 asus_wmi,nvidia_modeset
$ modinfo nvidia | head
filename: /lib/modules/6.14.0-27-generic/updates/dkms/nvidia.ko.zst
import_ns: DMA_BUF
alias: char-major-195-*
version: 575.64.03
supported: external
license: Dual MIT/GPL
firmware: nvidia/575.64.03/gsp_tu10x.bin
firmware: nvidia/575.64.03/gsp_ga10x.bin
srcversion: 8DBF4ED3568DB8FEA5B7834
alias: pci:v000010DEd*sv*sd*bc06sc80i00*
$ modinfo nvidia_modeset | head
filename: /lib/modules/6.14.0-27-generic/updates/dkms/nvidia-modeset.ko.zst
version: 575.64.03
supported: external
license: Dual MIT/GPL
srcversion: 4E29AA9F8BB75D880663278
depends: video,nvidia
name: nvidia_modeset
retpoline: Y
vermagic: 6.14.0-27-generic SMP preempt mod_unload modversions
parm: output_rounding_fix:bool
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 24.04.3 LTS
Release: 24.04
Codename: noble
$ nvidia-smi pci -i 0,1
GPU 0: NVIDIA GeForce RTX 4090 (UUID: GPU-329ede61-7982-f4f6-...-...)
GPU 1: NVIDIA GeForce RTX 4090 (UUID: GPU-0535a00b-ecd6-8908-...-...)
$ cat /etc/modprobe.d/nvidia.conf
# /etc/modprobe.d/nvidia.conf
options nvidia NVreg_DynamicPowerManagement=0
options nvidia NVreg_EnableGpuFirmwareLogs=1 # Increased verbosity for debugging (set 2 for more)
options nvidia NVreg_EnablePCIeGen3=0 # Allow auto-negotiation to avoid PCIe issues
options nvidia NVreg_EnableResizableBar=1 # Keep if BIOS/GPU supports; test with 0 if issues
options nvidia NVreg_EnableStreamMemOPs=1 # Optional, keep for CUDA workloads
options nvidia NVreg_InitializeSystemMemoryAllocations=1
options nvidia NVreg_PreserveVideoMemoryAllocations=1
options nvidia NVreg_ResmanDebugLevel=1 # Increased verbosity for debugging (set 2 for more)
options nouveau modeset=0
blacklist nouveau
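For reference, after editing this file I'd regenerate the initramfs and reboot, then check whether the options are actually active via /proc (the keys there appear without the NVreg_ prefix):
$ sudo update-initramfs -u && sudo reboot
$ grep -E 'DynamicPowerManagement|PreserveVideoMemoryAllocations|ResmanDebugLevel' /proc/driver/nvidia/params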
$ sudo nvflash --index 0 --version
NVIDIA Firmware Update Utility (Version 5.867.0)
Reading EEPROM (this operation may take up to 30 seconds)
Redundant Firmware : Instance 0 (Identical)
Sign-On Message : PG139 SKU 332 VGA BIOS
Build GUID : 23B22FE99D20451B9D3059226648D230
Build Number : 32193434
IFR Subsystem ID : 19DA-1675
Subsystem Vendor ID : 0x19DA
Subsystem ID : 0x1675
Version : 95.02.3C.40.1B
Image Hash : N/A
Product Name : GPU Board
Device Name(s) : Graphics Device
Board ID : 0x0475
Vendor ID : 0x10DE
Device ID : 0x2684
Hierarchy ID : Normal Board
Chip SKU : 301-0
Project : G139-0332
Build Date : 12/13/22
Modification Date : 05/05/23
UEFI Version : 0x7000B ( x64 )
UEFI Variant ID : 0x000000000000000B ( Unknown )
UEFI Signer(s) : Microsoft Corporation UEFI CA 2011
XUSB-FW Version ID : N/A
XUSB-FW Build Time : N/A
InfoROM Version : G002.0000.00.03
InfoROM Backup : Present
License Placeholder : Present
GPU Mode : N/A
CEC OTA-signed Blob : Not Present
# Note: All GPUs have a matching VBIOS and it's the latest available from Zotac (from 13/12/22).
I've commented out the nvidia NVreg_EnablePCIeGen3=0 line, disabled NVreg_PreserveVideoMemoryAllocations, and enabled NVreg_DynamicPowerManagement in /etc/modprobe.d/nvidia.conf, so:
options nvidia NVreg_DynamicPowerManagement=1 # Enabling allows the driver to throttle power during stalls, potentially preventing hangs
#options nvidia NVreg_EnablePCIeGen3=0 # Allow auto-negotiation to avoid PCIe issues
options nvidia NVreg_PreserveVideoMemoryAllocations=0 # Preserves video memory across suspend/resume but can interfere with driver cleanup during hangs or exits (#472)
I also reset the BIOS to defaults; before that I was trying to force Gen3 to fix the crashes, but it didn't help.
After the above changes the system is stable for now (it survived 2 days of uptime without a crash). I had applied the previous settings because of other crashes, but let's see how it goes now.
Update: the system worked for a week and then started crashing again, several times in the same day. The configuration didn't change; nothing was changed or updated.
Similar call traces:
Call Trace:
<TASK>
nvidia_close+0x1a2/0x270 [nvidia]
__fput+0xea/0x2d0
____fput+0x15/0x20
task_work_run+0x5d/0xa0
do_exit+0x26c/0x4e0
do_group_exit+0x34/0x90
__x64_sys_exit_group+0x18/0x20
x64_sys_call+0x1666/0x2650
do_syscall_64+0x7e/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? syscall_exit_to_user_mode+0x38/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? filp_flush+0x8d/0xb0
? srso_alias_return_thunk+0x5/0xfbef5
? arch_exit_to_user_mode_prepare.isra.0+0x22/0xd0
? srso_alias_return_thunk+0x5/0xfbef5
? syscall_exit_to_user_mode+0x38/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? __sys_sendmsg+0x8d/0xf0
? srso_alias_return_thunk+0x5/0xfbef5
? arch_exit_to_user_mode_prepare.isra.0+0x22/0xd0
? srso_alias_return_thunk+0x5/0xfbef5
? syscall_exit_to_user_mode+0x38/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? arch_exit_to_user_mode_prepare.isra.0+0x22/0xd0
? srso_alias_return_thunk+0x5/0xfbef5
? syscall_exit_to_user_mode+0x38/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? syscall_exit_to_user_mode+0x38/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? syscall_exit_to_user_mode+0x38/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
Call Trace:
<TASK>
nvidia_close+0x1a2/0x270 [nvidia]
__fput+0xea/0x2d0
____fput+0x15/0x20
task_work_run+0x5d/0xa0
do_exit+0x26c/0x4e0
do_group_exit+0x34/0x90
get_signal+0x8cb/0x8d0
arch_do_signal_or_restart+0x39/0x110
syscall_exit_to_user_mode+0x146/0x1d0
do_syscall_64+0x8a/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? do_futex+0x18e/0x260
? srso_alias_return_thunk+0x5/0xfbef5
? __x64_sys_futex+0x12a/0x200
? __x64_sys_futex+0x12a/0x200
? srso_alias_return_thunk+0x5/0xfbef5
? srso_alias_return_thunk+0x5/0xfbef5
? arch_exit_to_user_mode_prepare.isra.0+0x22/0xd0
? srso_alias_return_thunk+0x5/0xfbef5
? syscall_exit_to_user_mode+0x38/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? irqentry_exit+0x43/0x50
? srso_alias_return_thunk+0x5/0xfbef5
? sysvec_call_function+0x57/0xc0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x72bd0971b4cd
Code: Unable to access opcode bytes at 0x72bd0971b4a3.
RSP: 002b:000072bcf8bfe820 EFLAGS: 00000293 ORIG_RAX: 0000000000000007
RAX: fffffffffffffdfc RBX: 00005d273e525050 RCX: 000072bd0971b4cd
RDX: 00000000ffffffff RSI: 0000000000000001 RDI: 000072bcb8000b90
RBP: 000072bcf8bfe840 R08: 0000000000000000 R09: 000000007fffffff
R10: 00005d273e525050 R11: 0000000000000293 R12: 000000007fffffff
R13: 000072bd0ae2dc10 R14: 0000000000000001 R15: 000072bcb8000b90
...
Call Trace:
<TASK>
nvidia_close+0x1a2/0x270 [nvidia]
__fput+0xea/0x2d0
____fput+0x15/0x20
task_work_run+0x5d/0xa0
do_exit+0x26c/0x4e0
do_group_exit+0x34/0x90
__x64_sys_exit_group+0x18/0x20
x64_sys_call+0x1666/0x2650
do_syscall_64+0x7e/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? switch_fpu_return+0x50/0xe0
? srso_alias_return_thunk+0x5/0xfbef5
? arch_exit_to_user_mode_prepare.isra.0+0xc8/0xd0
? srso_alias_return_thunk+0x5/0xfbef5
? syscall_exit_to_user_mode+0x38/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? sysvec_apic_timer_interrupt+0x57/0xc0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7a0410210c31
Code: Unable to access opcode bytes at 0x7a0410210c07.
RSP: 002b:00007ffde0e62428 EFLAGS: 00000202 ORIG_RAX: 00000000000000e7
RAX: ffffffffffffffda RBX: 00007ffde0e62430 RCX: 00007a0410210c31
RDX: 000000000000003c RSI: 00000000000000e7 RDI: 0000000000000001
RBP: 00007ffde0e624c0 R08: ffffffffffffe5f0 R09: aaaaaaaaaaaaaa00
R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000004e92
R13: 0000347c010a99c0 R14: 00007ffde0e62450 R15: 0000347c010a9200
</TASK>
---[ end trace 0000000000000000 ]---
NVRM: nvidia_close on GPU with minor number 3
NVRM: nvidia_close on GPU with minor number 3
NVRM: nvidia_close on GPU with minor number 255
NVRM: nvidia_ctl_close
...
It crashed several times in the same day.
Call Trace:
<TASK>
nvidia_close+0x1a2/0x270 [nvidia]
__fput+0xea/0x2d0
__fput_sync+0x59/0x80
__x64_sys_close+0x3d/0x90
x64_sys_call+0x1a2d/0x2650
do_syscall_64+0x7e/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? __slab_free+0xdf/0x280
? __fput+0x1a2/0x2d0
? srso_alias_return_thunk+0x5/0xfbef5
? kmem_cache_free+0x3c4/0x470
? srso_alias_return_thunk+0x5/0xfbef5
? __fput+0x1a2/0x2d0
? srso_alias_return_thunk+0x5/0xfbef5
? srso_alias_return_thunk+0x5/0xfbef5
? arch_exit_to_user_mode_prepare.isra.0+0x22/0xd0
? srso_alias_return_thunk+0x5/0xfbef5
? syscall_exit_to_user_mode+0x38/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? syscall_exit_to_user_mode+0x38/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? __fput+0x1a2/0x2d0
? srso_alias_return_thunk+0x5/0xfbef5
? srso_alias_return_thunk+0x5/0xfbef5
? arch_exit_to_user_mode_prepare.isra.0+0x22/0xd0
? srso_alias_return_thunk+0x5/0xfbef5
? syscall_exit_to_user_mode+0x38/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7c1682f1671c
Code: 0f 05 48 3d 00 f0 ff ff 77 3c c3 0f 1f 00 55 48 89 e5 48 83 ec 10 89 7d fc e8 40 1e f8 ff 8b 7d fc 89 c2 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 2c 89 d7 89 45 fc e8 a2 1e f8 ff 8b 45 fc c9
RSP: 002b:00007fff04eb5250 EFLAGS: 00000293 ORIG_RAX: 0000000000000
RAX: ffffffffffffffda RBX: 000059543d2d0040 RCX: 00007c1682f1671c
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000050
RBP: 00007fff04eb5260 R08: 000059543b779010 R09: 0000000000030000
R10: 000059543d2cfb70 R11: 0000000000000293 R12: 000059543d32bab8
R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000000
</TASK>
---[ end trace 0000000000000000 ]---
NVRM: ioctl(0x29, 0x4eb5230, 0x10)
NVRM: nvAssertFailedNoLog: Assertion failed: (status == NV_OK) || (status == NV_ERR_GPU_IN_FULLCHIP_RESET) @ mem.c:179
NVRM: nvAssertFailedNoLog: Assertion failed: (status == NV_OK) || (status == NV_ERR_GPU_IN_FULLCHIP_RESET) @ vaspace_api.c:538
NVRM: nvAssertFailedNoLog: Assertion failed: (status == NV_OK) || (status == NV_ERR_GPU_IN_FULLCHIP_RESET) @ mem.c:179
NVRM: nvAssertFailedNoLog: Assertion failed: (status == NV_OK) || (status == NV_ERR_GPU_IN_FULLCHIP_RESET) @ rs_client.c:844
NVRM: nvAssertFailedNoLog: Assertion failed: (status == NV_OK) || (status == NV_ERR_GPU_IN_FULLCHIP_RESET) @ rs_server.c:259
...
Call Trace:
<TASK>
nvUvmInterfaceUnregisterGpu+0x2d/0x90 [nvidia]
uvm_gpu_release_locked+0x6d/0x70 [nvidia_uvm]
uvm_va_space_destroy+0x5dc/0x780 [nvidia_uvm]
uvm_release.isra.0+0x7f/0x180 [nvidia_uvm]
uvm_release_entry.part.0.isra.0+0x54/0xa0 [nvidia_uvm]
uvm_release_entry+0x2d/0x40 [nvidia_uvm]
__fput+0xea/0x2d0
____fput+0x15/0x20
task_work_run+0x5d/0xa0
do_exit+0x26c/0x4e0
do_group_exit+0x34/0x90
get_signal+0x8cb/0x8d0
arch_do_signal_or_restart+0x39/0x110
syscall_exit_to_user_mode+0x146/0x1d0
do_syscall_64+0x8a/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? ksys_read+0x9d/0xf0
? srso_alias_return_thunk+0x5/0xfbef5
? arch_exit_to_user_mode_prepare.isra.0+0x22/0xd0
? srso_alias_return_thunk+0x5/0xfbef5
? syscall_exit_to_user_mode+0x38/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? arch_exit_to_user_mode_prepare.isra.0+0x22/0xd0
? srso_alias_return_thunk+0x5/0xfbef5
? irqentry_exit_to_user_mode+0x2d/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? irqentry_exit+0x43/0x50
? srso_alias_return_thunk+0x5/0xfbef5
? exc_page_fault+0x96/0x1e0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x76bd1f71b4cd
Code: Unable to access opcode bytes at 0x76bd1f71b4a3.
Call Trace:
<TASK>
nvidia_close+0x1a2/0x270 [nvidia]
NVRM: nvAssertFailedNoLog: Assertion failed: (status == NV_OK) || (status == NV_ERR_GPU_IN_FULLCHIP_RESET) @ rs_server.c:259
NVRM: rmapiFreeWithSecInfo: Nv01Free: free failed; status: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000f)
NVRM: rmapiFreeWithSecInfo: Nv01Free: client:0xc1d00035 object:0xcaf4bb69
__fput+0xea/0x2d0
____fput+0x15/0x20
task_work_run+0x5d/0xa0
do_exit+0x26c/0x4e0
do_group_exit+0x34/0x90
get_signal+0x8cb/0x8d0
? srso_alias_return_thunk+0x5/0xfbef5
arch_do_signal_or_restart+0x39/0x110
syscall_exit_to_user_mode+0x146/0x1d0
do_syscall_64+0x8a/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? __seccomp_filter+0x368/0x570
? srso_alias_return_thunk+0x5/0xfbef5
? srso_alias_return_thunk+0x5/0xfbef5
? __task_pid_nr_ns+0x6f/0xd0
? srso_alias_return_thunk+0x5/0xfbef5
? arch_exit_to_user_mode_prepare.isra.0+0x22/0xd0
? srso_alias_return_thunk+0x5/0xfbef5
? syscall_exit_to_user_mode+0x38/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? arch_exit_to_user_mode_prepare.isra.0+0x22/0xd0
? srso_alias_return_thunk+0x5/0xfbef5
? syscall_exit_to_user_mode+0x38/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? __count_memcg_events+0xd3/0x1a0
? srso_alias_return_thunk+0x5/0xfbef5
? count_memcg_events.constprop.0+0x2a/0x50
? srso_alias_return_thunk+0x5/0xfbef5
? handle_mm_fault+0x1df/0x2d0
? srso_alias_return_thunk+0x5/0xfbef5
? do_user_addr_fault+0x5d5/0x870
? srso_alias_return_thunk+0x5/0xfbef5
? arch_exit_to_user_mode_prepare.isra.0+0x22/0xd0
? srso_alias_return_thunk+0x5/0xfbef5
? irqentry_exit_to_user_mode+0x2d/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? irqentry_exit+0x43/0x50
? srso_alias_return_thunk+0x5/0xfbef5
? exc_page_fault+0x96/0x1e0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x77b169aaa83f
Potentially related:
- GH-776 (NV_ERR_GPU_IS_LOST)
- There is no newer VBIOS update (version 95.02.3C.40.1B installed from 12/13/22). All GPUs have matching VBIOS.
- Tried changing NVreg_EnablePCIeGen3, NVreg_EnableResizableBar=0 and NVreg_DynamicPowerManagement=0 in various combinations, but it didn't make any difference.
- Tried different PCIe Gen 3, 4, 5 modes (via the modprobe option and in the BIOS); it didn't help.
Another crash today:
NVRM: VM: nv_free_pages: 0x54
NVRM: VM: nv_free_pages:3890: 0x0000000040de6c3f, 84 page(s), count = 1, page_table = 0x00000000edfafb31
NVRM: VM: nv_free_system_pages: 84 pages
NVRM: uvmTerminateAccessCntrBuffer_IMPL: Unloading UVM Access counters failed (status=0x0000000f), proceeding...
NVRM: VM: nv_free_pages: 0x20
NVRM: VM: nv_free_pages:3890: 0x00000000710d37a2, 32 page(s), count = 1, page_table = 0x0000000095f66c3f
NVRM: VM: nv_free_system_pages: 32 pages
NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 10!
NVRM: rpcRmApiFree_GSP: GspRmFree failed: hClient=0xc1d00035; hObject=0xcaf00001; paramsStatus=0x00000000; status=0x0000000f
NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 10!
NVRM: rpcRmApiFree_GSP: GspRmFree failed: hClient=0xc1d00035; hObject=0xcaf00000; paramsStatus=0x00000000; status=0x0000000f
------------[ cut here ]------------
CPU: 19 UID: 1000 PID: 20832 Comm: cuda-EvtHandlr Tainted: G OE 6.14.0-28-generic #28~24.04.1-Ubuntu
Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
Hardware name: ASUS System Product Name/Pro WS WRX90E-SAGE SE, BIOS 1203 07/18/2025
RIP: 0010:nvidia_dev_put_uuid+0x55/0x60 [nvidia]
Code: de 4c 89 e7 e8 6a f1 25 00 85 c0 75 1d 48 8d bb a8 06 00 00 e8 ec c1 9c eb 5b 41 5c 5d 31 c0 31 d2 31 f6 31 ff e9 06 a0 5e ea <0f> 0b eb df 0f 1f 80 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90
RSP: 0018:ff6ea2e8cbfa7850 EFLAGS: 00010202
RAX: 0000000000000026 RBX: ff38c605367c7000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ff6ea2e8cbfa77a0
RBP: ff6ea2e8cbfa7860 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: ff6ea2e8c8f22140 R14: ff38c6149df3c000 R15: ff38c6149fede000
FS: 0000000000000000(0000) GS:ff38c6827cd80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000076b8c5dc3000 CR3: 000000019898e004 CR4: 0000000000f71ef0
PKRU: 55555554
Call Trace:
<TASK>
nvUvmInterfaceUnregisterGpu+0x2d/0x90 [nvidia]
uvm_gpu_release_locked+0x6d/0x70 [nvidia_uvm]
uvm_va_space_destroy+0x5dc/0x780 [nvidia_uvm]
uvm_release.isra.0+0x7f/0x180 [nvidia_uvm]
uvm_release_entry.part.0.isra.0+0x54/0xa0 [nvidia_uvm]
uvm_release_entry+0x2d/0x40 [nvidia_uvm]
__fput+0xea/0x2d0
____fput+0x15/0x20
task_work_run+0x5d/0xa0
do_exit+0x26c/0x4e0
do_group_exit+0x34/0x90
get_signal+0x8cb/0x8d0
arch_do_signal_or_restart+0x39/0x110
syscall_exit_to_user_mode+0x146/0x1d0
do_syscall_64+0x8a/0x170
? rseq_ip_fixup+0x8f/0x1f0
? srso_alias_return_thunk+0x5/0xfbef5
? arch_exit_to_user_mode_prepare.isra.0+0xc8/0xd0
? srso_alias_return_thunk+0x5/0xfbef5
? syscall_exit_to_user_mode+0x38/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? syscall_exit_to_user_mode+0x38/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? __pfx_pollwake+0x10/0x10
? __pfx_pollwake+0x10/0x10
? srso_alias_return_thunk+0x5/0xfbef5
? rseq_get_rseq_cs+0x22/0x260
? __pfx_pollwake+0x10/0x10
? srso_alias_return_thunk+0x5/0xfbef5
? rseq_ip_fixup+0x8f/0x1f0
? srso_alias_return_thunk+0x5/0xfbef5
? arch_exit_to_user_mode_prepare.isra.0+0xc8/0xd0
? srso_alias_return_thunk+0x5/0xfbef5
? syscall_exit_to_user_mode+0x38/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? arch_exit_to_user_mode_prepare.isra.0+0xc8/0xd0
? srso_alias_return_thunk+0x5/0xfbef5
? syscall_exit_to_user_mode+0x38/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? irqentry_exit+0x43/0x50
? srso_alias_return_thunk+0x5/0xfbef5
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f216131b4cd
Code: Unable to access opcode bytes at 0x7f216131b4a3.
RSP: 002b:00007f1c729996f0 EFLAGS: 00000293 ORIG_RAX: 0000000000000007
RAX: fffffffffffffdfc RBX: 0000000000000000 RCX: 00007f216131b4cd
RDX: 0000000000000064 RSI: 000000000000000a RDI: 00007f1c6c000c20
RBP: 00007f1c72999710 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000064
R13: 00007f1c72999810 R14: 00007f1c6c000c20 R15: 000059cfe72e8b40
</TASK>
---[ end trace 0000000000000000 ]---
NVRM: VM: nv_free_pages: 0x1
NVRM: VM: nv_free_pages:3890: 0x000000006aea10f9, 1 page(s), count = 1, page_table = 0x000000008035a772
NVRM: VM: nv_free_system_pages: 1 pages
NVRM: VM: nv_free_pages: 0x1
NVRM: VM: nv_free_pages:3890: 0x00000000817aad26, 1 page(s), count = 1, page_table = 0x0000000047f43ffa
NVRM: VM: nv_free_system_pages: 1 pages
NVRM: VM: nv_free_pages: 0x5
NVRM: VM: nv_free_pages:3890: 0x00000000ba440fe1, 5 page(s), count = 1, page_table = 0x00000000354abc97
NVRM: VM: nv_free_contig_pages: 5 pages
Here is what Grok says:
The primary cause appears to be PCIe link instability, leading to the GPU becoming disconnected from the bus (NV_ERR_GPU_IS_LOST). This triggers failures in driver functions like uvmTerminateAccessCntrBuffer_IMPL, which attempts to disable access counters but proceeds despite the error, resulting in system crashes or freezes under load. Key reasons based on logs, hardware, and similar reports:
- PCIe Signal Integrity Issues: With 5x RTX 4090 GPUs on a Threadripper PRO platform, high PCIe generations (Gen4/5) and multiple cards can cause signal degradation, especially in watercooled setups where cable routing or slot usage might exacerbate noise. The "Error while waiting for GPU progress" and "GPU lost from the bus" indicate link drops during intensive CUDA operations using all GPUs.
- Driver/Kernel Handling: Using open kernel modules (version 575.64.03 on kernel 6.14), the driver detects the loss but can't recover cleanly, leading to assertions and warnings (e.g., nvidia_dev_put_uuid at nv.c:5115). The "Unable to access opcode bytes" suggests potential memory corruption from the sudden disconnect.
- Power and Configuration Factors: Even with 2x 2000W PSUs, power spikes from multiple 4090s (up to 450W each) combined with the 350W CPU could strain VRM or PCIe power delivery. BIOS settings like Resizable BAR, Dynamic Power Management, and PCIe Gen modes (which you've tried) influence this, but incomplete tuning (e.g., redriver parameters) may be the gap. Random intervals (stable for a week or frequent crashes) point to marginal stability under varying loads.
- AMD-Specific Quirks: Threadripper's many PCIe lanes (up to 128) and NUMA architecture can cause IOMMU or DF (Data Fabric) sync floods/errors if not optimized. Your config (e.g., NVreg_EnablePCIeGen3=auto, pcie_aspm=off) helps but doesn't address all, like redriver gain/EQ or C-states.
| Error Message | Context | Likely Cause |
| --- | --- | --- |
| nvidia-modeset: ERROR: GPU: Error while waiting for GPU progress | Occurs during CUDA workloads on all GPUs, leading to Xorg freeze or exit. | PCIe bus error or GPU hang due to high load in multi-GPU setup; potential driver-kernel incompatibility or hardware instability. |
| NV_ERR_GPU_IS_LOST (0x0000000f) | Seen in dmesg logs during crashes, with assertions in rs_server.c and rmapiFreeWithSecInfo. | GPU lost from bus, often PCIe link issues (e.g., lane errors, gen negotiation failures) or power fluctuations under load. |
| WARNING: CPU: X PID: Y at nvidia/nv.c:5039 nvidia_dev_put+0xb1/0xc0 [nvidia] | Repeated in kernel logs during driver unload/close. | Driver reference count mismatch or cleanup failure, tied to GPU reset/loss events. |
| uvm_gpu_release_locked+0x6d/0x70 [nvidia_uvm] | UVM (Unified Virtual Memory) errors during VA space destroy. | CUDA memory management failure post-GPU hang, exacerbated by multi-GPU access. |
I've tried the following BIOS changes:
- Advanced > PCI Subsystem Settings: Re-Size BAR Support: [Disabled], including options nvidia NVreg_EnableResizableBar=0 in nvidia.conf
- Advanced > PCI Subsystem Settings: SR-IOV Support: [Disabled] (unless needed for virtualization)
- Advanced > AMD CBS > CPU Common Options: Global C-state Control: [Disabled] (reduces power-saving transitions that might cause instability). Power Supply Idle Control: [Typical Current Idle].
- Advanced > AMD CBS > NBIO Common Options: PCIe Ten Bit Tag Support: [Enabled]. Data Link Feature Cap: [Enabled]. NBIO RAS Control: [MCA] (for better error reporting). NBIO SyncFlood Generation: [Disabled] (to avoid floods). PCIe Aer Reporting Mechanism: [OS First].
Also appended pcie_port_pm=off amd_iommu=off to /etc/default/grub, so the line looks like:
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash initcall_debug nvidia-drm.modeset=1 loglevel=7 pci=pcie_bus_safe pcie_aspm=off pcie_port_pm=off amd_iommu=off"
then ran sudo update-grub.
However, the above changes didn't help.
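To rule out a stale grub config, the active command line can be double-checked after reboot:
$ cat /proc/cmdline    # should include pci=pcie_bus_safe pcie_aspm=off pcie_port_pm=off amd_iommu=off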
I've also noticed these errors in kern.log:
$ rg -C5 'Failed to query display' kern.log
NVRM: nvidia_close on GPU with minor number 3
message repeated 2 times: [ NVRM: nvidia_close on GPU with minor number 3]
nvidia-modeset: ERROR: GPU:3: Failed to query display engine channel state: 0x0000c67e:6:0:0x0000000f
NVRM: nvidia_close on GPU with minor number 3
message repeated 2 times: [ NVRM: nvidia_close on GPU with minor number 3]
NVRM: nvidia_close on GPU with minor number 4
message repeated 2 times: [ NVRM: nvidia_close on GPU with minor number 4]
NVRM: nvidia_close on GPU with minor number 3
NVRM: ioctl(0x29, 0x11bf2e90, 0x10)
NVRM: VM: nv_free_pages: 0x400
NVRM: VM: nv_free_pages:3890: 0x0000000005718e41, 1024 page(s), count = 1, page_table = 0x0000000017312432
NVRM: VM: nv_free_system_pages: 1024 pages
NVRM: ioctl(0x58, 0x11bf2e40, 0x30)
nvidia-modeset: ERROR: GPU:3: Failed to query display engine channel state: 0x0000c67e:0:0:0x0000000f
NVRM: ioctl(0x29, 0x11bf2e80, 0x10)
nvidia-modeset: ERROR: GPU:3: Failed to query display engine channel state: 0x0000c67e:2:0:0x0000000f
NVRM: ioctl(0x4f, 0x11bf2e80, 0x20)
nvidia-modeset: ERROR: GPU:3: Failed to query display engine channel state: 0x0000c67e:4:0:0x0000000f
nvidia-modeset: ERROR: GPU:3: Failed to query display engine channel state: 0x0000c67e:6:0:0x0000000f
nvidia-modeset: WARNING: NVKMS Assert @src/nvkms-rm.c:1994:nvRmBeginEndModeset(): '!"Failed NV0073_CTRL_CMD_SPECIFIC_DISPLAY_CHANGE"'
nvidia-modeset: WARNING: GPU:3: NV0073_CTRL_CMD_SYSTEM_VRR_DISPLAY_INFO failed
nvidia-modeset: WARNING: NVKMS Assert @src/nvkms-hdmi.c:1145:RmSetELDAudioCaps(): '!"NV2080_CTRL_CMD_OS_UNIX_AUDIO_DYNAMIC_POWER failed"'
nvidia-modeset: ERROR: GPU:3: NvRmControl(NV0073_CTRL_CMD_DFP_SET_ELD_AUDIO_CAPS) failedreturn status = 15...
nvidia-modeset: WARNING: NVKMS Assert @src/nvkms-hdmi.c:1166:RmSetELDAudioCaps(): '!"NV2080_CTRL_CMD_OS_UNIX_AUDIO_DYNAMIC_POWER failed"'
message repeated 2 times: [ NVRM: nvidia_close on GPU with minor number 0]
NVRM: nvidia_close on GPU with minor number 2
message repeated 2 times: [ NVRM: nvidia_close on GPU with minor number 2]
NVRM: nvidia_close on GPU with minor number 4
NVRM: nvidia_close on GPU with minor number 4
nvidia-modeset: ERROR: GPU:3: Failed to query display engine channel state: 0x0000c67e:0:0:0x0000000f
nvidia-modeset: ERROR: GPU:3: Failed to query display engine channel state: 0x0000c67e:2:0:0x0000000f
nvidia-modeset: ERROR: GPU:3: Failed to query display engine channel state: 0x0000c67e:4:0:0x0000000f
nvidia-modeset: ERROR: GPU:3: Failed to query display engine channel state: 0x0000c67e:6:0:0x0000000f
nvidia-modeset: WARNING: NVKMS Assert @src/nvkms-rm.c:1994:nvRmBeginEndModeset(): '!"Failed NV0073_CTRL_CMD_SPECIFIC_DISPLAY_CHANGE"'
nvidia-modeset: WARNING: GPU:3: NV0073_CTRL_CMD_SYSTEM_VRR_DISPLAY_INFO failed
nvidia-modeset: WARNING: NVKMS Assert @src/nvkms-hdmi.c:1145:RmSetELDAudioCaps(): '!"NV2080_CTRL_CMD_OS_UNIX_AUDIO_DYNAMIC_POWER failed"'
nvidia-modeset: ERROR: GPU:3: NvRmControl(NV0073_CTRL_CMD_DFP_SET_ELD_AUDIO_CAPS) failedreturn status = 15...
nvidia-modeset: WARNING: NVKMS Assert @src/nvkms-hdmi.c:1166:RmSetELDAudioCaps(): '!"NV2080_CTRL_CMD_OS_UNIX_AUDIO_DYNAMIC_POWER failed"'
NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.
NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
NVRM: kgspRcAndNotifyAllChannels_IMPL: RC all channels for critical error 79.
NVRM: _threadNodeCheckTimeout: API_GPU_ATTACHED_SANITY_CHECK failed!
message repeated 23 times: [ NVRM: _threadNodeCheckTimeout: API_GPU_ATTACHED_SANITY_CHECK failed!]
nvidia-modeset: ERROR: GPU:3: Failed to query display engine channel state: 0x0000c67e:6:0:0x0000000f
NVRM: _threadNodeCheckTimeout: API_GPU_ATTACHED_SANITY_CHECK failed!
message repeated 4 times: [ NVRM: _threadNodeCheckTimeout: API_GPU_ATTACHED_SANITY_CHECK failed!]
NVRM: ioctl(0x29, 0xfb5b5fb0, 0x10)
NVRM: _threadNodeCheckTimeout: API_GPU_ATTACHED_SANITY_CHECK failed!
message repeated 8 times: [ NVRM: _threadNodeCheckTimeout: API_GPU_ATTACHED_SANITY_CHECK failed!]
NVRM: nvAssertOkFailedNoLog: Assertion failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from pRmApi->Control(pRmApi, retainedChannel->session->handle, retainedChannel->rmSubDevice->subDeviceHandle, NV2080_CTRL_CMD_GPU_EVICT_CTX, &params, sizeof(params)) @ nv_gpu_ops.c:10473
NVRM: nvAssertOkFailedNoLog: Assertion failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from pRmApi->Control(pRmApi, RES_GET_CLIENT_HANDLE(pKernelChannel), RES_GET_HANDLE(pKernelChannel), NVA06F_CTRL_CMD_STOP_CHANNEL, &stopChannelParams, sizeof(stopChannelParams)) @ nv_gpu_ops.c:10453
NVRM: nvAssertOkFailedNoLog: Assertion failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from pRmApi->Control(pRmApi, retainedChannel->session->handle, retainedChannel->rmSubDevice->subDeviceHandle, NV2080_CTRL_CMD_GPU_EVICT_CTX, &params, sizeof(params)) @ nv_gpu_ops.c:10473
NVRM: nvAssertOkFailedNoLog: Assertion failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from pRmApi->Control(pRmApi, RES_GET_CLIENT_HANDLE(pKernelChannel), RES_GET_HANDLE(pKernelChannel), NVA06F_CTRL_CMD_STOP_CHANNEL, &stopChannelParams, sizeof(stopChannelParams)) @ nv_gpu_ops.c:10453
NVRM: nvAssertOkFailedNoLog: Assertion failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from pRmApi->Control(pRmApi, retainedChannel->session->handle, retainedChannel->rmSubDevice->subDeviceHandle, NV2080_CTRL_CMD_GPU_EVICT_CTX, &params, sizeof(params)) @ nv_gpu_ops.c:10473
nvidia-modeset: ERROR: GPU:3: Failed to query display engine channel state: 0x0000c67e:0:0:0x0000000f
NVRM: nvAssertOkFailedNoLog: Assertion failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from pRmApi->Control(pRmApi, RES_GET_CLIENT_HANDLE(pKernelChannel), RES_GET_HANDLE(pKernelChannel), NVA06F_CTRL_CMD_STOP_CHANNEL, &stopChannelParams, sizeof(stopChannelParams)) @ nv_gpu_ops.c:10453
NVRM: nvAssertOkFailedNoLog: Assertion failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from pRmApi->Control(pRmApi, retainedChannel->session->handle, retainedChannel->rmSubDevice->subDeviceHandle, NV2080_CTRL_CMD_GPU_EVICT_CTX, &params, sizeof(params)) @ nv_gpu_ops.c:10473
nvidia-modeset: ERROR: GPU:3: Failed to query display engine channel state: 0x0000c67e:2:0:0x0000000f
NVRM: nvAssertOkFailedNoLog: Assertion failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from pRmApi->Control(pRmApi, RES_GET_CLIENT_HANDLE(pKernelChannel), RES_GET_HANDLE(pKernelChannel), NVA06F_CTRL_CMD_STOP_CHANNEL, &stopChannelParams, sizeof(stopChannelParams)) @ nv_gpu_ops.c:10453
NVRM: nvAssertOkFailedNoLog: Assertion failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from pRmApi->Control(pRmApi, retainedChannel->session->handle, retainedChannel->rmSubDevice->subDeviceHandle, NV2080_CTRL_CMD_GPU_EVICT_CTX, &params, sizeof(params)) @ nv_gpu_ops.c:10473
nvidia-modeset: ERROR: GPU:3: Failed to query display engine channel state: 0x0000c67e:4:0:0x0000000f
NVRM: nvAssertOkFailedNoLog: Assertion failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from pRmApi->Control(pRmApi, RES_GET_CLIENT_HANDLE(pKernelChannel), RES_GET_HANDLE(pKernelChannel), NVA06F_CTRL_CMD_STOP_CHANNEL, &stopChannelParams, sizeof(stopChannelParams)) @ nv_gpu_ops.c:10453
NVRM: nvAssertOkFailedNoLog: Assertion failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from pRmApi->Control(pRmApi, retainedChannel->session->handle, retainedChannel->rmSubDevice->subDeviceHandle, NV2080_CTRL_CMD_GPU_EVICT_CTX, &params, sizeof(params)) @ nv_gpu_ops.c:10473
nvidia-modeset: ERROR: GPU:3: Failed to query display engine channel state: 0x0000c67e:6:0:0x0000000f
NVRM: nvAssertOkFailedNoLog: Assertion failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from pRmApi->Control(pRmApi, RES_GET_CLIENT_HANDLE(pKernelChannel), RES_GET_HANDLE(pKernelChannel), NVA06F_CTRL_CMD_STOP_CHANNEL, &stopChannelParams, sizeof(stopChannelParams)) @ nv_gpu_ops.c:10453
NVRM: nvAssertOkFailedNoLog: Assertion failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from pRmApi->Control(pRmApi, retainedChannel->session->handle, retainedChannel->rmSubDevice->subDeviceHandle, NV2080_CTRL_CMD_GPU_EVICT_CTX, &params, sizeof(params)) @ nv_gpu_ops.c:10473
nvidia-modeset: WARNING: NVKMS Assert @src/nvkms-rm.c:1994:nvRmBeginEndModeset(): '!"Failed NV0073_CTRL_CMD_SPECIFIC_DISPLAY_CHANGE"'
NVRM: nvAssertOkFailedNoLog: Assertion failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from pRmApi->Control(pRmApi, RES_GET_CLIENT_HANDLE(pKernelChannel), RES_GET_HANDLE(pKernelChannel), NVA06F_CTRL_CMD_STOP_CHANNEL, &stopChannelParams, sizeof(stopChannelParams)) @ nv_gpu_ops.c:10453
NVRM: nvAssertOkFailedNoLog: Assertion failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from pRmApi->Control(pRmApi, retainedChannel->session->handle, retainedChannel->rmSubDevice->subDeviceHandle, NV2080_CTRL_CMD_GPU_EVICT_CTX, &params, sizeof(params)) @ nv_gpu_ops.c:10473
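To catch the moment the link actually drops, the PCIe link state and AER messages can be watched alongside the CUDA run; 0000:01:00.0 below is the GPU that fell off the bus in the log above (adjust the address per GPU):
$ nvidia-smi --query-gpu=index,pci.bus_id,pcie.link.gen.current,pcie.link.width.current,power.draw,temperature.gpu --format=csv -l 5
$ sudo lspci -vvv -s 0000:01:00.0 | grep -E 'LnkCap:|LnkSta:'
$ sudo dmesg -wT | grep -iE 'aer|pcieport|fallen off the bus'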
I have a similar issue with one of my 3090s (Founders Edition). It will disconnect when loading distributed wan or during NCCL LLM inference. I was going to try a different PCIe slot; I've already tried a different riser cable. Only one GPU does this. A suspend/resume then hangs, and I have to reboot the system, not even unpower the cards.
If it happened to me so many times I'd be going crazy too. Never seen this on the proprietary driver.