open-gpu-kernel-modules
Nvidia 560.28.03-1 throwing kernel stack trace with linux kernels from 6.10.3 up to 6.10.9 or newer
NVIDIA Open GPU Kernel Modules Version
560.35.03
Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.
- [x] I confirm that this does not happen with the proprietary driver package.
Operating System and Version
Debian GNU/Linux trixie/sid
Kernel Release
6.10.9
Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.
- [x] I am running on a stable kernel release.
Hardware: GPU
NVIDIA GeForce RTX 4090 Laptop GPU
Describe the bug
I am getting many warnings with stack traces in dmesg, and the kernel is marked as tainted, with the latest NVIDIA driver 560.28.03-1 and Linux kernel 6.10.3 on a Debian GNU/Linux setup (for the full log, see the nvidia-bug-report.log.gz attached to this report).
Short summary:
- The error messages are consistently related to the function follow_pte+0x1de/0x200.
- In the call traces, we can see NVIDIA-related functions being called: nv_revoke_gpu_mappings+0x67/0xb0 [nvidia], RmHandleIdleSustained+0x39/0x130 [nvidia], rm_execute_work_item+0xe0/0x150 [nvidia].
- The module list shows the NVIDIA modules loaded: nvidia_uvm(OE) nvidia_drm(OE) nvidia_modeset(OE) nvidia(OE). The (OE) suffix indicates out-of-tree (O) and unsigned (E) modules, and the NVIDIA modules are the only OE modules I have.
- The error is occurring in a kernel thread named "nv_queue", which is likely an NVIDIA driver thread.
- The warnings are being triggered at include/linux/rwsem.h:80, which suggests there might be an issue with how the NVIDIA driver is handling read-write semaphores in the kernel.
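For readers unfamiliar with the taint letters in the "Tainted: G W OE" field of the trace and the (OE) suffix in the module list, here is a minimal Python sketch that decodes them. The table covers only the letters that appear in this report, with meanings as described in the kernel's tainted-kernels documentation; the helper name `decode_taint` is made up for illustration.

```python
# Taint letters seen in this report. Meanings follow the kernel's
# tainted-kernels documentation; this is not the full flag table.
TAINT_FLAGS = {
    "G": "no proprietary module loaded ('P' would appear otherwise)",
    "W": "a kernel warning (WARN) has already been issued",
    "O": "an out-of-tree module was loaded",
    "E": "an unsigned module was loaded",
}


def decode_taint(flags: str) -> list[str]:
    """Describe each known taint letter; spaces and unknown letters are skipped."""
    return [f"{c}: {TAINT_FLAGS[c]}" for c in flags if c in TAINT_FLAGS]


# The taint field from the trace in this report.
for line in decode_taint("G W OE"):
    print(line)
```

The W flag is set by the warning itself, so it says nothing about the cause. The O and E flags are expected for any externally built, unsigned driver.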
To Reproduce
Boot a 6.10.9 kernel with the latest official NVIDIA driver and check the dmesg log.
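To make the dmesg check less manual, here is a rough Python sketch that pulls the frames attributed to the [nvidia] module out of a saved log. The regular expression and the helper name `nvidia_frames` are assumptions for illustration, based only on the frame format in the trace pasted in this report. Frames prefixed with `?` are skipped because the kernel's unwinder marks those as unreliable.

```python
import re

# Matches a frame such as "[ 50.485532] nv_revoke_gpu_mappings+0x67/0xb0 [nvidia]".
# Group 1: optional "? " marker, group 2: symbol name, group 3: module name.
FRAME_RE = re.compile(
    r"\]\s+(\?\s+)?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?"
)


def nvidia_frames(log_text):
    """Return the reliable (non-'?') stack frames that belong to the nvidia module."""
    frames = []
    for line in log_text.splitlines():
        m = FRAME_RE.search(line)
        if m and m.group(3) == "nvidia" and not m.group(1):
            frames.append(m.group(2))
    return frames


# Excerpt copied from the trace in this report.
sample = """\
[ 50.485530] unmap_mapping_range+0x111/0x140
[ 50.485532] nv_revoke_gpu_mappings+0x67/0xb0 [nvidia]
[ 50.485584] RmHandleIdleSustained+0x39/0x130 [nvidia]
[ 50.485678] ? gpumgrGetGpu+0x69/0xa0 [nvidia]
"""
print(nvidia_frames(sample))
```

In practice the log text could come from a file saved with `dmesg > dmesg.txt`. Reading dmesg may require root on some systems.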
Bug Incidence
Always
nvidia-bug-report.log.gz
The nvidia-bug-report.log.gz above includes this trace, but I am also pasting it here for convenience:
[ 50.485511] CPU: 14 PID: 1229 Comm: nv_queue Tainted: G W OE 6.10.9-amd64 #1 Debian 6.10.9-1
[ 50.485511] Hardware name: LENOVO 83AG/LNVNB161216, BIOS MHCN42WW 03/25/2024
[ 50.485511] RIP: 0010:follow_pte+0x20b/0x220
[ 50.485512] Code: 00 00 00 c0 eb 8b 49 8b 3c 24 e8 00 bf 91 00 e8 bb 5e e1 ff bd ea ff ff ff 5b 89 e8 5d 41 5c 41 5d 41 5e 41 5f c3 cc cc cc cc <0f> 0b e9 1e fe ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 90
[ 50.485513] RSP: 0018:ffffab47c10afb60 EFLAGS: 00010246
[ 50.485513] RAX: 0000000000000000 RBX: 00007fcbe8b8e000 RCX: ffffab47c10afba0
[ 50.485514] RDX: ffffab47c10afb98 RSI: 00007fcbe8b8e000 RDI: ffff9c8870728e70
[ 50.485514] RBP: ffffab47c10afbe0 R08: ffffab47c10afd38 R09: 0000000000000000
[ 50.485515] R10: 000000008040003c R11: 0000000000000000 R12: ffffab47c10afba0
[ 50.485515] R13: ffffab47c10afb98 R14: ffff9c8874afb180 R15: 0000000000000000
[ 50.485516] FS: 0000000000000000(0000) GS:ffff9c97b3300000(0000) knlGS:0000000000000000
[ 50.485516] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 50.485517] CR2: 00007f80c9c7b6b4 CR3: 00000001a2ce6000 CR4: 0000000000f50ef0
[ 50.485517] PKRU: 55555554
[ 50.485517] Call Trace:
[ 50.485518] <TASK>
[ 50.485518] ? __warn+0x80/0x120
[ 50.485519] ? follow_pte+0x20b/0x220
[ 50.485520] ? report_bug+0x164/0x190
[ 50.485521] ? handle_bug+0x3c/0x80
[ 50.485522] ? exc_invalid_op+0x17/0x70
[ 50.485523] ? asm_exc_invalid_op+0x1a/0x20
[ 50.485524] ? follow_pte+0x20b/0x220
[ 50.485525] follow_phys+0x4b/0x110
[ 50.485526] untrack_pfn+0x57/0x120
[ 50.485528] unmap_single_vma+0xa6/0xe0
[ 50.485529] zap_page_range_single+0x122/0x1d0
[ 50.485530] unmap_mapping_range+0x111/0x140
[ 50.485532] nv_revoke_gpu_mappings+0x67/0xb0 [nvidia]
[ 50.485584] RmHandleIdleSustained+0x39/0x130 [nvidia]
[ 50.485678] ? gpumgrGetGpu+0x69/0xa0 [nvidia]
[ 50.485781] rm_execute_work_item+0xe0/0x150 [nvidia]
[ 50.485882] ? os_execute_work_item+0x19/0x80 [nvidia]
[ 50.485934] _main_loop+0x8f/0x150 [nvidia]
[ 50.485991] ? __pfx__main_loop+0x10/0x10 [nvidia]
[ 50.486046] kthread+0xcf/0x100
[ 50.486048] ? __pfx_kthread+0x10/0x10
[ 50.486049] ret_from_fork+0x31/0x50
[ 50.486049] ? __pfx_kthread+0x10/0x10
[ 50.486050] ret_from_fork_asm+0x1a/0x30
[ 50.486051] </TASK>
[ 50.486052] ---[ end trace 0000000000000000 ]---
More Info
No response