open-gpu-kernel-modules

driver 575.51.03 crashes almost immediately

Open de-wim opened this issue 8 months ago • 5 comments

NVIDIA Open GPU Kernel Modules Version

575.51.03

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

  • [x] I confirm that this does not happen with the proprietary driver package.

Operating System and Version

Fedora Linux 41 (Workstation Edition)

Kernel Release

Linux winston 6.14.4-200.fc41.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Apr 25 15:45:16 UTC 2025 x86_64 GNU/Linux

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • [x] I am running on a stable kernel release.

Hardware: GPU

GPU 0: NVIDIA RTX A5000 Laptop GPU (UUID: GPU-0946ec38-a612-7b0e-8df3-0e64fe5b111e)

Describe the bug

The kernel driver crashes almost immediately; afterwards, nvidia-smi -q -x reports "GPU requires reset" in most fields.

nvidia-smi -r reports "Resetting GPU 00000000:01:00.0 is not supported."
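To gauge how widespread the "GPU requires reset" symptom is, one could count the affected fields in the XML output. The following is only a minimal sketch and not part of the original report: the helper names and the sample XML are illustrative assumptions. Only the `nvidia-smi -q -x` invocation and the "GPU requires reset" string come from the report above.

```python
import subprocess
import xml.etree.ElementTree as ET

# The string the report says appears in most fields.
RESET_MARKER = "GPU requires reset"


def fields_requiring_reset(xml_text):
    """Return the tag names of all leaf elements whose text is the reset marker."""
    root = ET.fromstring(xml_text)
    return [
        elem.tag
        for elem in root.iter()
        if len(elem) == 0 and (elem.text or "").strip() == RESET_MARKER
    ]


def query_nvidia_smi():
    """Run `nvidia-smi -q -x` and return its XML output as text.

    check=False because the tool may exit non-zero when the GPU is unhealthy.
    """
    result = subprocess.run(
        ["nvidia-smi", "-q", "-x"], capture_output=True, text=True, check=False
    )
    return result.stdout


# Illustrative sample only; NOT taken from the attached bug report.
SAMPLE = """<nvidia_smi_log>
  <gpu id="00000000:01:00.0">
    <product_name>NVIDIA RTX A5000 Laptop GPU</product_name>
    <fan_speed>GPU requires reset</fan_speed>
    <temperature>
      <gpu_temp>GPU requires reset</gpu_temp>
    </temperature>
  </gpu>
</nvidia_smi_log>"""

if __name__ == "__main__":
    print(fields_requiring_reset(SAMPLE))
```

On an affected machine, `fields_requiring_reset(query_nvidia_smi())` would list the fields that report the reset condition.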

Here is my kernel log (I grepped for lines containing "nv", so some unrelated lines are included): nv_ok_crash.txt
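A narrower filter than a plain match on "nv" might reduce the unrelated lines. The sketch below is an assumption about what a tighter filter could look like, not what was used to produce the attached log: it keeps lines containing the `NVRM:` prefix (seen in the driver messages quoted in this thread) or a word starting with `nvidia`. Apart from the first entry, the sample lines are made up for illustration.

```python
import re

# "NVRM:" is the prefix on the driver messages quoted in this thread.
# Matching words that start with "nvidia" is an extra assumption meant to
# catch module names.
DRIVER_LINE = re.compile(r"\bNVRM:|\bnvidia")


def driver_lines(lines):
    """Keep only lines that look like NVIDIA driver messages."""
    return [line for line in lines if DRIVER_LINE.search(line)]


# Hypothetical sample; only the first line's message text appears in this thread.
SAMPLE_LOG = [
    "kernel: NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64 570.133.20 Release Build",
    "kernel: nvme nvme0: pci function 0000:02:00.0",
    "kernel: nvidia_uvm: module loaded",
    "kernel: thermal: Invalid critical threshold",
]

if __name__ == "__main__":
    for line in driver_lines(SAMPLE_LOG):
        print(line)
```

A plain match on "nv" would also keep the `nvme` and `Invalid` lines above; this filter drops them.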

I have the following non-standard, NVIDIA-related repositories enabled on my system:

[nvidia-container-toolkit]
name=nvidia-container-toolkit
baseurl=https://nvidia.github.io/libnvidia-container/stable/rpm/$basearch
repo_gpgcheck=1
gpgcheck=0
enabled=1
gpgkey=https://nvidia.github.io/libnvidia-container/gpgkey
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt

[nvidia-container-toolkit-experimental]
name=nvidia-container-toolkit-experimental
baseurl=https://nvidia.github.io/libnvidia-container/experimental/rpm/$basearch
repo_gpgcheck=1
gpgcheck=0
enabled=0
gpgkey=https://nvidia.github.io/libnvidia-container/gpgkey
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt

[cuda-fedora41-x86_64]
name=cuda-fedora41-x86_64
baseurl=https://developer.download.nvidia.com/compute/cuda/repos/fedora41/x86_64
enabled=1
gpgcheck=1
gpgkey=https://developer.download.nvidia.com/compute/cuda/repos/fedora41/x86_64/D42D0685.pub

To Reproduce

Happens immediately after logging into my desktop (perhaps earlier, hard to verify).

I am not running my desktop on the NVIDIA GPU; instead, the card is mostly used for offloading graphics and compute.

I am polling nvidia-smi about once a minute in a task bar widget to display core, memory and power usage information, which could be related to the problem.

Bug Incidence

Always

nvidia-bug-report.log.gz

nvidia-bug-report.log.gz

More Info

No response

de-wim avatar May 03 '25 15:05 de-wim

Just verified by disabling my status bar script: invoking nvidia-smi is not required to trigger this bug; it happens regardless.

de-wim avatar May 03 '25 15:05 de-wim

Switching back to 570.133.20 solves the issue, so I guess this is a regression. This build works fine:

NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64 570.133.20 Release Build

de-wim avatar May 04 '25 07:05 de-wim

Any fan monitor/control software? During boot-up I'm seeing:

May 04 15:39:34 cachyos kernel: NVRM: nvGpuOpsReportFatalError: uvm encountered global fatal error 0x60, requiring os reboot to recover.
May 04 15:39:34 cachyos kernel: NVRM: GPU at PCI:0000:03:00: GPU-3257796a-90b3-1ff9-ff72-5c3e77f1f78b
May 04 15:39:34 cachyos kernel: NVRM: Xid (PCI:0000:03:00): 154, GPU recovery action changed from 0x0 (None) to 0x2 (Node Reboot Required)
May 04 15:39:34 cachyos kernel: NVRM: nvGpuOpsReportFatalError: uvm encountered global fatal error 0x60, requiring os reboot to recover.

... but only if CoolerControl launches at just the right time. If delayed by several seconds then all is well.
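For anyone comparing logs across machines, the Xid line above can be pulled apart mechanically. This is a rough sketch that assumes the line layout matches the single Xid example quoted in this comment; the regular expression and the function name are illustrative, and other Xid messages may be formatted differently.

```python
import re

# Pattern modelled on the one Xid line quoted in this thread (an assumption;
# other driver versions or Xid types may not match it).
XID_LINE = re.compile(r"NVRM: Xid \((PCI:[0-9A-Fa-f:.]+)\): (\d+), (.*)")


def parse_xid(line):
    """Return device, Xid number and message for an Xid log line, else None."""
    match = XID_LINE.search(line)
    if match is None:
        return None
    return {
        "device": match.group(1),
        "xid": int(match.group(2)),
        "message": match.group(3),
    }


if __name__ == "__main__":
    example = (
        "May 04 15:39:34 cachyos kernel: NVRM: Xid (PCI:0000:03:00): 154, "
        "GPU recovery action changed from 0x0 (None) to 0x2 (Node Reboot Required)"
    )
    print(parse_xid(example))
```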

richardm1 avatar May 04 '25 23:05 richardm1

Any fan monitor/control software?

Not that I'm aware of

de-wim avatar May 05 '25 07:05 de-wim

I was wrong. I've disabled CoolerControl, yet CUDA is still broken.

kernel: NVRM: nvGpuOpsReportFatalError: uvm encountered global fatal error 0x60, requiring os reboot to recover.

If I run:

sudo rmmod nvidia_uvm
sudo modprobe nvidia_uvm

...and then start my CUDA app within 1-2 seconds, it works about 50% of the time. I suspect something on my system is polling the GPU and breaking this.

Edit: Figured it out. The culprit was the kernel parameter init_on_alloc=0.
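To check whether a system boots with the same parameter, the running kernel's command line can be inspected. This is a minimal sketch under stated assumptions: the function names are illustrative, the parser splits on whitespace only (so quoted values containing spaces are not handled), and it assumes the last occurrence of a parameter is the one that takes effect.

```python
def get_kernel_param(cmdline, name):
    """Return the value of a kernel parameter, or None if it is absent.

    A bare flag without "=" yields an empty string. If the parameter is
    repeated, the last occurrence is returned (an assumption about precedence).
    """
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if key == name:
            value = val if sep else ""
    return value


def read_cmdline(path="/proc/cmdline"):
    """Read the command line the running Linux kernel was booted with."""
    with open(path) as handle:
        return handle.read().strip()


if __name__ == "__main__":
    setting = get_kernel_param(read_cmdline(), "init_on_alloc")
    if setting == "0":
        print("init_on_alloc=0 is set; this matched the failing setup in this thread")
    else:
        print(f"init_on_alloc on the command line: {setting!r}")
```

A result of `None` only means the parameter was not passed at boot; the effective default then depends on how the kernel was configured.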

richardm1 avatar May 11 '25 13:05 richardm1