driver 575.51.03 crashes almost immediately
NVIDIA Open GPU Kernel Modules Version
575.51.03
Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.
- [x] I confirm that this does not happen with the proprietary driver package.
Operating System and Version
Fedora Linux 41 (Workstation Edition)
Kernel Release
Linux winston 6.14.4-200.fc41.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Apr 25 15:45:16 UTC 2025 x86_64 GNU/Linux
Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.
- [x] I am running on a stable kernel release.
Hardware: GPU
GPU 0: NVIDIA RTX A5000 Laptop GPU (UUID: GPU-0946ec38-a612-7b0e-8df3-0e64fe5b111e)
Describe the bug
The kernel driver crashes almost immediately; nvidia-smi -q -x then reports "GPU requires reset" in most fields, and nvidia-smi -r reports "Resetting GPU 00000000:01:00.0 is not supported."
Here is my kernel log (I grepped for lines containing "nv", so some unrelated lines are included):
nv_ok_crash.txt
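For reference, a command along these lines produces that kind of filtered log (a sketch, not necessarily the exact invocation I used):

# collect kernel messages for the current boot and keep lines mentioning "nv"
journalctl -k -b --no-pager | grep -i nv > nv_ok_crash.txt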
I have the following non-standard, NVIDIA-related repositories configured on my system:
[nvidia-container-toolkit]
name=nvidia-container-toolkit
baseurl=https://nvidia.github.io/libnvidia-container/stable/rpm/$basearch
repo_gpgcheck=1
gpgcheck=0
enabled=1
gpgkey=https://nvidia.github.io/libnvidia-container/gpgkey
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt
[nvidia-container-toolkit-experimental]
name=nvidia-container-toolkit-experimental
baseurl=https://nvidia.github.io/libnvidia-container/experimental/rpm/$basearch
repo_gpgcheck=1
gpgcheck=0
enabled=0
gpgkey=https://nvidia.github.io/libnvidia-container/gpgkey
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt
[cuda-fedora41-x86_64]
name=cuda-fedora41-x86_64
baseurl=https://developer.download.nvidia.com/compute/cuda/repos/fedora41/x86_64
enabled=1
gpgcheck=1
gpgkey=https://developer.download.nvidia.com/compute/cuda/repos/fedora41/x86_64/D42D0685.pub
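To double-check which of these repos the installed driver packages actually come from, something like the following can be used (the package name globs are just examples):

# list installed NVIDIA/CUDA packages with the repo each was installed from
dnf list installed '*nvidia*' '*cuda*'
# show which of the above repos are currently enabled
dnf repolist enabled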
To Reproduce
Happens immediately after logging into my desktop (perhaps earlier, hard to verify).
I am not running my desktop on the NVIDIA GPU; the card is mostly used for offloading graphics and compute.
I am polling nvidia-smi roughly once a minute from a task bar widget to display core, memory, and power usage, which could be related to the problem.
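For context, the widget runs a query roughly like the one below once a minute (the exact fields my script requests may differ):

# one-shot query of GPU utilization, memory use, and power draw as plain CSV
nvidia-smi --query-gpu=utilization.gpu,memory.used,power.draw --format=csv,noheader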
Bug Incidence
Always
nvidia-bug-report.log.gz
More Info
No response
Just verified by disabling my status bar script: invoking nvidia-smi is not required to trigger this bug; it happens regardless.
Switching back to 570.133.20 solves the issue, so I guess this is a regression:
NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64 570.133.20 Release Build works fine
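To confirm which kernel module build is actually loaded after switching drivers, something like:

# report the loaded NVIDIA kernel module version
cat /proc/driver/nvidia/version
modinfo nvidia | grep -i ^version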
Any fan monitor/control software? During boot-up I'm seeing:
May 04 15:39:34 cachyos kernel: NVRM: nvGpuOpsReportFatalError: uvm encountered global fatal error 0x60, requiring os reboot to recover.
May 04 15:39:34 cachyos kernel: NVRM: GPU at PCI:0000:03:00: GPU-3257796a-90b3-1ff9-ff72-5c3e77f1f78b
May 04 15:39:34 cachyos kernel: NVRM: Xid (PCI:0000:03:00): 154, GPU recovery action changed from 0x0 (None) to 0x2 (Node Reboot Required)
May 04 15:39:34 cachyos kernel: NVRM: nvGpuOpsReportFatalError: uvm encountered global fatal error 0x60, requiring os reboot to recover.
... but only if CoolerControl launches at just the right time. If it is delayed by several seconds, all is well.
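In case it helps others reproduce the timing dependence, a systemd drop-in along these lines can add that delay (assuming the service is named coolercontrold.service; adjust to match your setup, then run systemctl daemon-reload):

# /etc/systemd/system/coolercontrold.service.d/delay.conf
[Service]
ExecStartPre=/usr/bin/sleep 10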
Any fan monitor/control software?
Not that I'm aware of
I was wrong. I've disabled CoolerControl, yet CUDA is still broken.
kernel: NVRM: nvGpuOpsReportFatalError: uvm encountered global fatal error 0x60, requiring os reboot to recover.
If I run:
sudo rmmod nvidia_uvm
sudo modprobe nvidia_uvm
...then start my CUDA app within 1-2 seconds, it will work about 50% of the time. I suspect something on my system is polling the GPU and breaking this.
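To check whether anything else is holding the GPU device nodes open (and possibly re-initializing nvidia_uvm before my app starts), something like:

# list processes that currently have the NVIDIA device nodes open
sudo lsof /dev/nvidia* 2>/dev/null
# or, equivalently:
sudo fuser -v /dev/nvidia*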
Edit: Figured it out. The culprit was the kernel parameter init_on_alloc=0.
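For anyone hitting the same thing: the active parameters can be checked with cat /proc/cmdline, and init_on_alloc=0 removed from the bootloader config (the grubby invocation below is Fedora-specific; on other distros edit your GRUB configuration instead):

# verify whether init_on_alloc=0 is currently in effect
cat /proc/cmdline
# remove it from all installed kernels' boot entries (Fedora/grubby)
sudo grubby --update-kernel=ALL --remove-args="init_on_alloc=0"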