open-gpu-kernel-modules

GPU resumes to D0 immediately after ending transition to D3Cold (RTD3)

Open imaGuru opened this issue 5 months ago • 39 comments

NVIDIA Open GPU Kernel Modules Version

575.64.03 575.64.05-1 580.76.05

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

  • [ ] I confirm that this does not happen with the proprietary driver package.

Operating System and Version

Arch Linux

Kernel Release

6.15.7-arch1-1 #1 SMP PREEMPT_DYNAMIC Thu, 17 Jul 2025 21:05:29 +0000 x86_64 GNU/Linux
6.12.39-1
6.16.3-arch1-1 #1 SMP PREEMPT_DYNAMIC Sat, 23 Aug 2025 15:32:49 +0000 x86_64 GNU/Linux

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • [x] I am running on a stable kernel release.

Hardware: GPU

NVIDIA GeForce RTX 4060 Laptop GPU (UUID: GPU-a0338cc2-52ca-fc18-d87f-0a5df1c5ca22)

Describe the bug

Laptop: Lenovo Legion Slim 5 16ahp9 CPU: AMD Ryzen 7 8845HS w/ Radeon 780M Graphics

The dGPU can enter the suspended state and turn off its memory, but it won't stay cold: it is woken up immediately after finishing the transition to D3cold. After about 15s the GPU goes to sleep again, only to repeat the cycle. This behaviour persists even after killing every graphical interface (gdm/gnome/wayland/xorg), when no applications are using the GPU.

The nvidia_drm module can be removed with modprobe -r when gnome (wayland/xorg) is not running. After removal the GPU suspends normally. Starting GDM and gnome with nvidia_drm removed works - running apps on the dGPU also seems to work. Enabling nvidia-persistenced or nvidia-powerd, or modprobing nvidia_drm, brings back the bug with its 15s on/off cycle.

Suspending the system to RAM and waking the laptop up allows the GPU to power down correctly once, for an indefinite amount of time, until the first wakeup. After that the bug manifests as usual.

These look like the same or very similar issues:

  • https://forums.developer.nvidia.com/t/nvidia-gpu-fails-to-power-off-prime-razer-blade-14-2022/250023/40
  • https://forums.developer.nvidia.com/t/4070-555-and-560-drivers-wont-stay-in-d3cold-lenovo-legion-slim-5/302967
  • https://www.markwatkinson.com/knowledge/linux/nvidia-dgpu-power/#known-issues - exactly the same issue

The best workaround so far lets the dGPU stay in D3Cold, although with video memory left powered in self-refresh mode. It combines the following modprobe options and udev rules:

options nvidia NVreg_DynamicPowerManagement=0x02
options nvidia NVreg_DynamicPowerManagementVideoMemoryThreshold=0
# Enable runtime PM for NVIDIA VGA/3D controller devices on driver bind
ACTION=="bind", SUBSYSTEM=="pci", ATTR{vendor}=="0x10de", ATTR{class}=="0x030000", TEST=="power/control", ATTR{power/control}="auto"
ACTION=="bind", SUBSYSTEM=="pci", ATTR{vendor}=="0x10de", ATTR{class}=="0x030200", TEST=="power/control", ATTR{power/control}="auto"

# Disable runtime PM for NVIDIA VGA/3D controller devices on driver unbind
ACTION=="unbind", SUBSYSTEM=="pci", ATTR{vendor}=="0x10de", ATTR{class}=="0x030000", TEST=="power/control", ATTR{power/control}="on"
ACTION=="unbind", SUBSYSTEM=="pci", ATTR{vendor}=="0x10de", ATTR{class}=="0x030200", TEST=="power/control", ATTR{power/control}="on"
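Whether the udev rules above took effect can be checked by reading each NVIDIA device's power/control attribute, which should read "auto" once runtime PM is enabled. A minimal sketch, assuming the standard sysfs layout under /sys/bus/pci/devices; nv_power_control is a hypothetical helper name, and it lists every function with vendor 0x10de (including the audio function), not only the VGA/3D classes matched by the rules:

```shell
# Hypothetical helper: print "<pci-address> <power/control>" for each NVIDIA PCI function.
# The optional first argument overrides the sysfs root (useful for testing).
nv_power_control() {
    sys_root="${1:-/sys/bus/pci/devices}"
    for dev in "$sys_root"/*; do
        # Skip entries without a readable vendor file (or an unmatched glob).
        [ -r "$dev/vendor" ] || continue
        # 0x10de is the NVIDIA vendor ID used in the udev rules above.
        [ "$(cat "$dev/vendor")" = "0x10de" ] || continue
        control=$(cat "$dev/power/control" 2>/dev/null || echo unknown)
        printf '%s %s\n' "${dev##*/}" "$control"
    done
}
```

Running nv_power_control with no arguments after the driver has bound should show "auto" for the GPU; "on" would mean runtime PM is still disabled for that device.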

Windows doesn't have this issue.

To Reproduce

  1. Boot into gdm/gnome (wayland or xorg). The GPU goes into the cold state and wakes up while booting into the GUI (runtime_suspended_time > 0). If the GPU has not been in d3cold yet, let it go into d3cold and wake it up afterwards (e.g. with nvidia-smi).
  2. Wait 15s for the GPU to go to sleep.
  3. The GPU wakes up immediately after suspending.
  4. The cycle continues every 15s; the GPU can no longer stay in d3cold for longer than 1s.
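The cycle in these steps can be observed by sampling the kernel's runtime PM attributes in sysfs, which should not itself wake the GPU. A minimal sketch, assuming the dGPU sits at PCI address 0000:01:00.0 (the address used later in this thread); sample_rtd3 is a hypothetical helper name, not part of any NVIDIA tool:

```shell
# Hypothetical helper: print runtime PM status and accumulated suspended time
# (milliseconds) once per second for a PCI device.
#   $1: sysfs device directory (default: the dGPU address assumed above)
#   $2: number of samples to take (default: 30)
sample_rtd3() {
    dev_dir="${1:-/sys/bus/pci/devices/0000:01:00.0}"
    samples="${2:-30}"
    i=0
    while [ "$i" -lt "$samples" ]; do
        status=$(cat "$dev_dir/power/runtime_status" 2>/dev/null || echo unknown)
        suspended=$(cat "$dev_dir/power/runtime_suspended_time" 2>/dev/null || echo unknown)
        printf '%s status=%s suspended_ms=%s\n' "$i" "$status" "$suspended"
        i=$((i + 1))
        # Sleep only between samples, not after the last one.
        if [ "$i" -lt "$samples" ]; then
            sleep 1
        fi
    done
}
```

With the bug present, one would expect the status to flip back to active about a second after each suspend, and suspended_ms to grow by roughly 1000 per 15s cycle instead of increasing steadily.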

Bug Incidence

Always

nvidia-bug-report.log.gz

nvidia-bug-report.log.gz

More Info

No response

imaGuru avatar Jul 20 '25 12:07 imaGuru

I have the same issue, and it causes my laptop to overheat! I had placed it in my bag thinking the laptop was asleep, and it almost set fire to the bag. This should be fixed ASAP.

francoism90 avatar Jul 25 '25 14:07 francoism90

This sounds like it's similar in nature to the problem I am trying to fix here: https://github.com/NVIDIA/open-gpu-kernel-modules/issues/759#issuecomment-2565928704

If you load nvidia_drm with the kernel parameters modeset=1 fbdev=0, are you able to reproduce the issue?

Binary-Eater avatar Jul 25 '25 20:07 Binary-Eater

Unfortunately, yes. The issue persists, even with fbdev=0.

> cat /sys/module/nvidia_drm/parameters/modeset
Y
> cat /sys/module/nvidia_drm/parameters/fbdev 
N

Arch Wiki mentions:

Note: Kernels officially supported by Arch enable simpledrm, while NVIDIA driver requires efifb or vesafb when nvidia_drm.fbdev is disabled/unavailable (version < 545).

I do not know if this has any impact on your suggestion.

The only semi-workable solution so far is suspending the laptop: after wakeup the first d3cold state holds correctly, subsequent ones do not.

imaGuru avatar Jul 26 '25 11:07 imaGuru

Thanks @imaGuru. That suggests to me that there may be other paths in nvidia-drm where we may be making allocations that have certain requirements, preventing the GPU from being able to stay in RTD3. It would help me if we could conduct the following test.

  1. Reload nvidia-drm with modeset=0 parameter
  2. Launch GNOME (using Xorg)
  3. Re-conduct the experiment to see if the GPU now stays in RTD3

If my theory is correct, I would expect this to enable the GPU to stay in RTD3.

Binary-Eater avatar Jul 26 '25 18:07 Binary-Eater

Hey. Unfortunately, still the same result - changing modeset or fbdev doesn't have any effect on my issue (d3cold holds only for 1s).

I did, however, check the issue with external monitors. After connecting an external monitor and disconnecting it, the GPU cannot enter d3cold at all (not even the broken 1s d3colds). Reloading nvidia_drm (the modeset and fbdev values don't matter) fixes that - the GPU can enter d3cold again, but because of my issue each one only lasts 1s :(

imaGuru avatar Jul 28 '25 08:07 imaGuru

Thanks @imaGuru ,

That's quite peculiar. The reason I wanted to run this experiment is that with modeset=0, nvidia-drm does almost nothing. Would you mind sharing an nvidia-bug-report.sh dump in the repro-ing case with modeset=0?

Binary-Eater avatar Jul 28 '25 13:07 Binary-Eater

Sure, here you go modeset=0:

nvidia-bug-report.log.gz

To recap: the first d3cold after boot or laptop suspend to RAM holds, subsequent ones do not (they last only 1s before a wakeup). Removing nvidia_drm and stopping nvidia_powerd allows the GPU to stay in d3cold after the first wakeup. Removing just nvidia_drm with nvidia_powerd left running keeps the GPU in the wakeup cycle.

imaGuru avatar Jul 29 '25 09:07 imaGuru

@imaGuru If I understand you correctly, it works when setting modeset=0?

I notice mine is set like this: https://rpmfusion.org/Howto/NVIDIA?highlight=%28%5CbCategoryHowto%5Cb%29#OSTree_.28Silverblue.2FKinoite.2Fetc.29

I'll try with modeset=0 to verify if this solves the issue, because it wakes my laptop every time after going to sleep (NVIDIA Optimus powered).

francoism90 avatar Jul 30 '25 12:07 francoism90

@francoism90 no, modeset or fbdev do not change anything for me. I think I have tried every possible configuration of the NVIDIA driver, xorg and wayland; nothing works, unfortunately. RTD3 works only the first time it triggers; after the first wakeup the GPU does enter D3cold but can never stay cold longer than 1s. The result is that it constantly powers off and on and seems to waste even more power, so it's better to disable RTD3 or take great care not to wake the GPU after its first sleep since boot/laptop suspend.

imaGuru avatar Jul 30 '25 16:07 imaGuru

@imaGuru Confirmation here - modeset=0 doesn't change anything.

Weirdly, it seems to sleep fine when connected to my dock (USB-C). But when I unplug it and close the laptop, it wakes up again (without any sign that the screen is back on).

francoism90 avatar Jul 31 '25 18:07 francoism90

@Binary-Eater anything interesting in the logs? The weird thing is that the problem persists even when nvidia_drm is unloaded but nvidia-powerd is running. That would suggest to me that the problem is somewhere deeper in the driver? Maybe the dGPU is improperly put to sleep and after wakeup something is not set up right?

It seems that the new Lenovo Legions suffer from this. My old y540 with a 1660ti and my legion 5 with a 3050ti both worked fine, and their GPUs stay in d3cold with no problem on Linux.

I also get these errors from nvidia-powerd (maybe related to some improper power management of the card?):

sie 04 10:31:12 sanctuary540 nvidia-powerd[895]: ERROR! Error in processParam call : -1 16   13
sie 04 10:31:12 sanctuary540 nvidia-powerd[895]: ERROR! Error in processParam call : -1 16   13
sie 04 10:31:12 sanctuary540 nvidia-powerd[895]: ERROR! Error in processParam call : -1 16   13

imaGuru avatar Aug 04 '25 08:08 imaGuru

@imaGuru I don't think the powerd daemon is compatible; you can also find this error on the issue tracker. I've disabled that service for now.

Sleep doesn't work at all. It's also dangerous, because you get the impression it goes to sleep, but it wakes up when you put your laptop in your bag.

francoism90 avatar Aug 04 '25 09:08 francoism90

@francoism90 Can you link the issue about nvidia-powerd not being compatible? Searching for "Error in processParam call" doesn't bring anything up in the issue tracker or on Google. For me, aside from these errors, nvidia-powerd seems to work - depending on the laptop mode I get: quiet 55W, balanced 60W and performance 115W (you can check in nvidia-smi).

As for laptop sleep/suspend to ram, for me, it usually works. However sometimes the GPU gets "stuck" and I get a black screen of death right before suspend or shutdown and laptop refuses to shutdown. Only way to power off is by holding the power button. Is this what you are experiencing also?

imaGuru avatar Aug 04 '25 10:08 imaGuru

@imaGuru I cannot find it either, maybe the comment has been removed?

This is where I found out about your issue: https://github.com/NVIDIA/open-gpu-kernel-modules/issues/432

I still have it when enabling nvidia-powerd, so you may need to create an issue:

aug 04 13:18:46 lenovo systemd[1]: Started nvidia-powerd.service - nvidia-powerd service.
aug 04 13:18:46 lenovo nvidia-powerd[376985]: nvidia-powerd version:2.0 (build 1)
aug 04 13:18:46 lenovo nvidia-powerd[376985]: ERROR! Failed to allocate GPU device handle 0x59
aug 04 13:18:46 lenovo systemd[1]: nvidia-powerd.service: Deactivated successfully.

It now fails, I think that's what it has to do now?

Edit: https://github.com/NVIDIA/open-gpu-kernel-modules/issues/432#issuecomment-1380446995

francoism90 avatar Aug 04 '25 11:08 francoism90

The only semi-workable solution so far is suspending the laptop: after wakeup the first d3cold state holds correctly, subsequent ones do not.

Weird, for me D3Cold seems to hold correctly indefinitely after a single suspend-wakeup cycle even after the GPU is woken up again by applications.

I have powerd enabled, but persistenced disabled. However, after manually starting persistenced after boot, the GPU still seems to go back into D3Cold properly even after being woken up. Same laptop and driver version, but on kernel 6.15.9 on Fedora 42.

I do not have the same issue with the Proprietary driver of the same version. nvidia-bug-report.log.gz

EDIT: After some testing, it seems like removing nvidia_drm did indeed work to prevent wakeups to D0, but the GPU oscillated between D3Hot and D3Cold instead of D0 and D3Cold.

None of the nvidia services were running.

Mulukulum avatar Aug 17 '25 06:08 Mulukulum

Updating the kernel and driver does not fix the issue (kernel: 6.16.1-arch1-1, driver: 580.76.05). No changes :/

Weird, for me D3Cold seems to hold correctly indefinitely after a single suspend-wakeup cycle even after the GPU is woken up again by applications.

@Mulukulum Indeed very weird... From your debug log it looks like you have the same laptop (although with windows preinstalled). Same VBIOS also. Your BIOS though is a bit older than mine (v1.20, mine 1.23). I did however update BIOS recently to see if it would fix the issue - it didn't :( .

Just to make sure: after laptop suspend your GPU works completely fine without any issue? Enters D3Cold and stays that way even after multiple wakeups to D0 ( watch -n1 cat /sys/bus/pci/devices/0000:01:00.0/power/runtime_suspended_time increases steadily when in D3Cold)?

imaGuru avatar Aug 17 '25 13:08 imaGuru

Just to make sure: after laptop suspend your GPU works completely fine without any issue? Enters D3Cold and stays that way even after multiple wakeups to D0 ( watch -n1 cat /sys/bus/pci/devices/0000:01:00.0/power/runtime_suspended_time increases steadily when in D3Cold)?

Correct, it works just fine without any issue and the suspended time value increases correctly. I used vkcube and nvidia-smi to wake up the GPU. I also tried running a render using handbrake. The GPU is able to go back to sleep after waking up in all of the methods.

It's just that I have to suspend once when starting up the laptop.

I haven't updated my BIOS since I got my laptop. I would consider doing it if I can find an archive of all available BIOSes released for my machine, but I'm unable to find it on the Lenovo website. There is an option you have to enable in the advanced configuration menu for downgrading your BIOS. You could try that, but since this issue doesn't happen on the Proprietary Kernel Module, I don't see why the BIOS would matter here.

Mulukulum avatar Aug 17 '25 13:08 Mulukulum

You could try that, but since this issue doesn't happen on the Open Kernel Module, I don't see why the BIOS would matter here.

Did you mean the "proprietary driver"? For me it happens on both: open and closed source drivers. It does however seem to be some kind of software issue, because your laptop "almost" works... (maybe some archlinux patches). Do you have any specific nvidia configurations? Which desktop environment do you use?

It seems that you can change the version number in this link to download older versions of the BIOS (link is from this site ) https://download.lenovo.com/consumer/mobiles/nrcn24ww.exe

I will try downgrading the BIOS sometime later and see if it has any effect.

imaGuru avatar Aug 17 '25 14:08 imaGuru

Yes, sorry, I meant the proprietary driver. I've edited my original comment. I use GNOME (Fedora 42), but the issue also occurs on KDE (Bazzite) and even without a DE altogether.

Curious that the issue still occurs on the proprietary drivers for you. I'm not sure how it works on Arch but Fedora's (more specifically, RPM Fusion's) Nvidia package made me think I was installing the nonfree proprietary drivers package but actually installs the free package.

EDIT: There was no difference between the new and old BIOSes for me.

@imaGuru These are the only "changes" I've made to my nvidia configuration. Kernel 6.15.9

/etc/modprobe.d/nvidia.conf

options nvidia NVreg_DynamicPowerManagementVideoMemoryThreshold=0 
options nvidia NVreg_EnableS0ixPowerManagement=1
options nvidia NVreg_S0ixPowerManagementVideoMemoryThreshold=0

Mulukulum avatar Aug 17 '25 14:08 Mulukulum

@Mulukulum How did you install the driver? I'm also on Fedora Kinoite, and they should be the closed ones.

Bazzite uses a different upstream for nvidia, not rpmfusion (if I'm correct).

francoism90 avatar Aug 17 '25 15:08 francoism90

@francoism90 This comment details how to switch to the closed drivers. https://discussion.fedoraproject.org/t/testers-required-nvidia-driver-in-rpmfusion-nonfree-updates-testing/154781/38

Mulukulum avatar Aug 17 '25 15:08 Mulukulum

@Mulukulum Thanks! Going to try, and check if this actually fixes the issue.

Hmm, so this https://github.com/CheariX/silverblue-akmods-keys/issues/14, isn't needed anymore?

Edit: that doesn't work for me, it keeps stating:

$ modinfo -l nvidia
Dual MIT/GPL

francoism90 avatar Aug 17 '25 16:08 francoism90

Is the kmod actually getting rebuilt?

Maybe try

sudo akmods --rebuild --force --akmod nvidia

Mulukulum avatar Aug 17 '25 17:08 Mulukulum

@Mulukulum Hmm, it seems to fail:

2025/08/17 20:03:02 akmods: Installing newly built rpms
2025/08/17 20:03:02 akmods: DNF not found, using YUM instead.
/usr/sbin/akmods: line 362: yum: command not found
2025/08/17 20:03:02 akmods: Could not install newly built RPMs. You can find them and the logfile in:

francoism90 avatar Aug 17 '25 18:08 francoism90

@Mulukulum I managed to recreate your state. Now after suspending laptop to ram, D3Cold works even after GPU wakeups and connecting/disconnecting external display. I made a couple of changes so I will have to track down what exactly is necessary.

List of changes:

  • xorg for gdm and gnome-shell instead of wayland (gdm/gnome-shell freezes otherwise when I connect external display to HDMI)
  • early KMS with nvidia modules in initramfs
  • blacklisting nouveau in kernel parameters
  • removing nouveau from initramfs
  • modprobe.d/nvidia.conf with:
options nvidia NVreg_DynamicPowerManagementVideoMemoryThreshold=0 
options nvidia NVreg_EnableS0ixPowerManagement=1
options nvidia NVreg_S0ixPowerManagementVideoMemoryThreshold=0

EDIT: Corrected a mistake in my configuration: I thought I had NVreg_DynamicPowerManagementVideoMemoryThreshold=512 but in fact it was 0.

imaGuru avatar Aug 18 '25 08:08 imaGuru

Could you check if the issue still occurs on the Proprietary Modules?

I haven't had any issue with external monitors on Gnome-Wayland (I checked and I am running wayland) connected via HDMI (driven by the dGPU), but I don't use external monitors very often. I double checked and yes I did indeed have nouveau blacklisted.

It does seem like both fbdev and modeset parameters are enabled by default on boot.

mukul@16AHP9:~$ cat /proc/driver/nvidia/params | grep VideoMemory
PreserveVideoMemoryAllocations: 1
S0ixPowerManagementVideoMemoryThreshold: 0
DynamicPowerManagementVideoMemoryThreshold: 0

Mulukulum avatar Aug 18 '25 09:08 Mulukulum

Could you check if the issue still occurs on the Proprietary Modules?

I'm on 580.76.05 downloaded directly from NVIDIA - proprietary installation (license: NVIDIA) and the issue persists. However these packages are from pacman (maybe they change something, will have to check that later):

local/cuda 12.9.1-2
local/cudnn 9.11.0.98-3
local/egl-gbm 1.1.2.1-1
local/egl-wayland 4:1.1.20-1
local/egl-x11 1.0.3-1
local/lib32-nvidia-utils 580.76.05-1
local/libnvidia-container 1.17.8-1
local/libva-nvidia-driver 0.0.14-1
local/libvdpau 1.5-3
local/libxnvctrl 580.76.05-1
local/linux-firmware-nvidia 20250808-1
local/nvidia-prime 1.0-5
local/nvidia-settings 580.76.05-1
local/nvidia-utils 580.76.05-4
local/nvtop 3.2.0-1
local/opencl-nvidia 580.76.05-4

Weirdly I do have the same output as you for nvidia params:

cat /proc/driver/nvidia/params | grep VideoMemory
PreserveVideoMemoryAllocations: 1
S0ixPowerManagementVideoMemoryThreshold: 0
DynamicPowerManagementVideoMemoryThreshold: 0

The documentation says that the thresholds are set to 200 by default. systemd-analyze cat-config modprobe.d shows only my conf file, with values of 512, so why is it 0? Maybe because the nvidia modules are loaded at the initramfs stage without kernel params, and 0 is treated as undefined, meaning 200? I will have to investigate further.

Relevant modprobe conf files from systemd-analyze:

# /usr/lib/modprobe.d/nvidia-sleep.conf
# https://download.nvidia.com/XFree86/Linux-x86_64/560.35.03/README/powermanagement.html#PreserveAllVide719f0
# Save and restore all video memory allocations.
options nvidia NVreg_PreserveVideoMemoryAllocations=1
#
# The destination should not be using tmpfs, so we prefer
# /var/tmp instead of /tmp
options nvidia NVreg_TemporaryFilePath=/var/tmp

# /usr/lib/modprobe.d/nvidia-utils.conf
blacklist nouveau
blacklist nova_core
blacklist nova_drm

# /etc/modprobe.d/nvidia.conf
options nvidia NVreg_DynamicPowerManagementVideoMemoryThreshold=512 
options nvidia NVreg_EnableS0ixPowerManagement=1
options nvidia NVreg_S0ixPowerManagementVideoMemoryThreshold=512

Kernel params:

GRUB_CMDLINE_LINUX_DEFAULT="... nvidia_drm.modeset=1 nvidia_drm.fbdev=1 nouveau.blacklist=1 ... "
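To see whether a value from modprobe.d actually reached the loaded module, the configured value can be compared with what the driver reports in /proc/driver/nvidia/params. A minimal sketch, assuming the "Name: value" line format shown in the grep output above; nv_param is a hypothetical helper name:

```shell
# Hypothetical helper: print the value the loaded nvidia module reports for one
# parameter. Names appear in the params file without the NVreg_ prefix.
#   $1: parameter name, e.g. DynamicPowerManagementVideoMemoryThreshold
#   $2: params file (default: /proc/driver/nvidia/params)
nv_param() {
    key="$1"
    params_file="${2:-/proc/driver/nvidia/params}"
    # Split each line on ": " and match the name exactly, so that
    # S0ixPowerManagementVideoMemoryThreshold does not match the Dynamic one.
    awk -F': *' -v k="$key" '$1 == k { print $2 }' "$params_file"
}
```

For example, nv_param DynamicPowerManagementVideoMemoryThreshold printing 0 while the conf file says 512 would point at the conf file not being applied when the module was loaded (for instance because it is missing from the initramfs, which is an assumption and not something confirmed here).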

imaGuru avatar Aug 18 '25 09:08 imaGuru

Ok. Here is what I've got:

NVreg_DynamicPowerManagementVideoMemoryThreshold=0 disables turning off the video memory (memory will be in self refresh mode and still draw some power even when the GPU is in D3Cold). This is the only change that was necessary to achieve the same state as yours @Mulukulum . Due to misconfiguration on my part this parameter was not loaded correctly. Right now I have:

# /etc/modprobe.d/nvidia.conf
options nvidia NVreg_DynamicPowerManagementVideoMemoryThreshold=0
options nvidia NVreg_EnableS0ixPowerManagement=1
options nvidia NVreg_S0ixPowerManagementVideoMemoryThreshold=1024

This configuration makes the GPU oscillate between waking up and powering down after boot. After suspending the laptop to RAM, the GPU can stay in d3cold correctly, but without turning off the video memory.

With NVreg_DynamicPowerManagementVideoMemoryThreshold set to its default value of 200, the GPU can power down and turn off video memory, but only after a laptop suspend to RAM and only the first time.

@Mulukulum Can you verify that you have no issue with RTD3 on the proprietary driver with this configuration:

options nvidia NVreg_DynamicPowerManagementVideoMemoryThreshold=200 # or 512
options nvidia NVreg_EnableS0ixPowerManagement=1
options nvidia NVreg_S0ixPowerManagementVideoMemoryThreshold=1024

Specifically can you confirm that the GPU suspends and turns off the video memory (can be seen in the output of cat /proc/driver/nvidia/gpus/0000\:01\:00.0/power )?
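The video memory state can also be pulled out of that file in a script. A minimal sketch, assuming the file contains a line of the form "Video Memory: Off" at the start of a line (the layout is an assumption here, as is the default PCI address); nv_video_memory_state is a hypothetical helper name:

```shell
# Hypothetical helper: print the reported video memory state for a GPU.
#   $1: power file (default: the dGPU address used in this thread)
nv_video_memory_state() {
    power_file="${1:-/proc/driver/nvidia/gpus/0000:01:00.0/power}"
    # Match only the line that starts with "Video Memory:"; indented capability
    # lines such as " Video Memory Off: Supported" are deliberately ignored.
    sed -n 's/^Video Memory:[[:space:]]*//p' "$power_file"
}
```

Note that reading files under /proc/driver/nvidia may itself touch the driver, so this is better suited to a one-off check than to a tight polling loop.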

imaGuru avatar Aug 18 '25 13:08 imaGuru

The memory does say "Off" and it goes to D3Cold after a suspend but with this particular configuration

options nvidia NVreg_DynamicPowerManagementVideoMemoryThreshold=200 # or 512
options nvidia NVreg_EnableS0ixPowerManagement=1
options nvidia NVreg_S0ixPowerManagementVideoMemoryThreshold=1024

Every single GPU access triggers the bug again and it doesn't go back to D3Cold in the proprietary drivers. I was able to get it working with my older config though.

Mulukulum avatar Aug 19 '25 05:08 Mulukulum

Hi @imaGuru @Mulukulum, thank you for sharing the latest test results. Would you mind sharing a bug report from the latest released driver in the repro state?

amrit1711 avatar Aug 22 '25 09:08 amrit1711