open-gpu-kernel-modules

Not entering D3cold on 4070 laptop (Lenovo Legion Slim 5 16APH8)

Open ngbomford opened this issue 1 year ago • 18 comments

NVIDIA Open GPU Kernel Modules Version

565.57.01

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

  • [x] I confirm that this does not happen with the proprietary driver package.

Operating System and Version

Arch Linux

Kernel Release

6.11.6-arch1-1

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • [x] I am running on a stable kernel release.

Hardware: GPU

NVIDIA GeForce RTX 4070 Laptop GPU

Describe the bug

Hi everyone,

After switching to the open driver I noticed that my laptop (Lenovo Legion Slim 5 16APH8) does not enter D3cold. No processes are accessing the GPU at the time, as this is right after a fresh reboot. Running "watch -n 1 cat /sys/class/drm/card*/device/power_state" shows that it stays in the D0 state almost constantly: it switches to D3cold for a second or so, but immediately changes back to D0.

This issue is not present in the closed driver 565.57.01, after a reboot the laptop changes to D3cold after less than 30 seconds.

I only noticed because power consumption was higher than normal while on battery.
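For anyone trying to capture the same pattern, here is a minimal sketch that logs only the state changes with a timestamp instead of redrawing the screen like watch does. The function name log_power_transitions and its arguments are made up for illustration; the only thing taken from the report is the power_state sysfs file.

```shell
# Print a timestamped line whenever the contents of a power_state file change.
# Arguments: <file> [interval_seconds] [max_polls]
# max_polls=0 (the default) means "run until interrupted".
# The function body runs in a subshell so its variables stay local.
log_power_transitions() (
    file=$1
    interval=${2:-1}
    max_polls=${3:-0}
    prev=""
    polls=0
    while :; do
        # Fall back to a marker string if the file cannot be read.
        cur=$(cat "$file" 2>/dev/null) || cur="unreadable"
        if [ "$cur" != "$prev" ]; then
            printf '%s %s\n' "$(date +%T)" "$cur"
            prev=$cur
        fi
        polls=$((polls + 1))
        if [ "$max_polls" -gt 0 ] && [ "$polls" -ge "$max_polls" ]; then
            break
        fi
        sleep "$interval"
    done
)
```

Example call, assuming the dGPU happens to be card1 on your machine (check which card it is first): log_power_transitions /sys/class/drm/card1/device/power_state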

To Reproduce

Install nvidia-open-dkms package on Arch Linux

Bug Incidence

Always

nvidia-bug-report.log.gz

nvidia-bug-report.log.gz

More Info

No response

ngbomford avatar Nov 05 '24 23:11 ngbomford

Hey there, thanks for the report!

This issue is not present in the closed driver 565.57.01, after a reboot the laptop changes to D3cold after less than 30 seconds.

Looking at the attached log I see some logs from this run but I can't tell if it was with GSP enabled or disabled. Can you confirm this? If unsure, you can boot with the proprietary driver in the no-repro mode and just run nvidia-smi -q | grep GSP. Or you can run nvidia-bug-report.sh while in the no-repro mode and attach those logs as well.

From the logs we do see some errors that could be relevant. Could I maybe also trouble you to reload the open driver once with NVreg_RmMsg=":" and attach those logs too?

Thanks again!

mtijanic avatar Nov 06 '24 09:11 mtijanic

Thanks for the reply.

Looking at the attached log I see some logs from this run but I can't tell if it was with GSP enabled or disabled. Can you confirm this? If unsure, you can boot with the proprietary driver in the no-repro mode and just run nvidia-smi -q | grep GSP. Or you can run nvidia-bug-report.sh while in the no-repro mode and attach those logs as well.

From the proprietary driver, it looks like GSP firmware is enabled:

nvidia-smi -q | grep GSP

GSP Firmware Version : 565.57.01
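If you want just the value rather than the whole line, something like the following could work. It is only a sketch: the helper name gsp_firmware_version is hypothetical, and it assumes nothing beyond the "GSP Firmware Version : <value>" line shown above.

```shell
# Read `nvidia-smi -q` output on stdin and print the value that follows
# "GSP Firmware Version :". Prints nothing if no such line is present.
gsp_firmware_version() {
    awk '/GSP Firmware Version/ {
        v = substr($0, index($0, ":") + 1)   # everything after the first colon
        gsub(/^[ \t]+|[ \t]+$/, "", v)       # trim surrounding whitespace
        print v
        exit
    }'
}
```

Usage: nvidia-smi -q | gsp_firmware_version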

From the logs we do see some errors that could be relevant. Could I maybe also trouble you to reload the open driver once with NVreg_RmMsg=":" and attach those logs too?

I've attached the logs with NVreg_RmMsg=":" from the open driver.

nvidia-bug-report.log.gz

Thanks for looking into this.

ngbomford avatar Nov 06 '24 10:11 ngbomford

I have the same issue on a 2019 Razer Blade with 2070 Max-Q.

I used to disable GSP firmware with proprietary drivers, with the following line in /etc/modprobe.d/nvidia.conf:

options nvidia "NVreg_EnableGpuFirmware=0"

This was required in order to enter D3Cold state. Now, nvidia-open does not support disabling GSP Firmware anymore, and I cannot enter D3Cold.

cat /proc/driver/nvidia/gpus/0000:01:00.0/power

Runtime D3 status:          Enabled (coarse-grained)
Video Memory:               Active

GPU Hardware Support:
 Video Memory Self Refresh: Not Supported
 Video Memory Off:          Supported

S0ix Power Management:
 Platform Support:          Not Supported
 Status:                    Disabled

Notebook Dynamic Boost:     Not Supported
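To compare this file across driver versions without eyeballing it, a small field extractor may help. This is a sketch under the assumption that every line has the "Label: value" shape shown above; nvidia_power_field is a hypothetical helper name.

```shell
# Read the contents of /proc/driver/nvidia/gpus/<addr>/power on stdin and
# print the value of one field, e.g. "Runtime D3 status" or "Video Memory".
# $1: field label without the trailing colon.
nvidia_power_field() {
    awk -v key="$1" '{
        pos = index($0, ":")
        if (pos == 0) next                      # skip blank / label-less lines
        label = substr($0, 1, pos - 1)
        gsub(/^[ \t]+|[ \t]+$/, "", label)      # trim the label
        if (label == key) {
            val = substr($0, pos + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", val)    # trim the value
            print val
            exit
        }
    }'
}
```

Usage: nvidia_power_field "Runtime D3 status" < /proc/driver/nvidia/gpus/0000:01:00.0/power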

cat /sys/class/drm/card[0,1]/device/power_state

D0
D0

nvidia-smi

Wed Jul 16 10:43:10 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.64                 Driver Version: 575.64         CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 2070 ...    On  |   00000000:01:00.0 Off |                  N/A |
| N/A   50C    P8              6W /   50W |       4MiB /   8192MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            3543      G   /usr/bin/gnome-shell                      1MiB |
+-----------------------------------------------------------------------------------------+

uname -a

Linux nicolas-blade 6.15.6-1-MANJARO #1 SMP PREEMPT_DYNAMIC Thu, 10 Jul 2025 15:38:04 +0000 x86_64 GNU/Linux

Any help would be appreciated.

NicolasThierion avatar Jul 16 '25 08:07 NicolasThierion

I own a Lenovo laptop, and the problem seems to be BIOS (firmware) related. You can trick your BIOS into entering the 'Advanced Configuration mode' (use Google), and you'll notice that D3cold is actually disabled there.

It works for me, after enabling D3cold and MUX-control in the BIOS:

$ cat /sys/class/drm/card*/device/power_state
D0
D3cold
$ cat /sys/bus/pci/devices/0000:01:00.0/power/runtime_status 
suspended
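The two checks above can be wrapped into one helper that also prints power/control, which has to read "auto" for the kernel to runtime-suspend the device at all. A rough sketch; runtime_pm_summary is a made-up name, and the PCI address in the usage line is simply the one from this comment.

```shell
# Print the runtime PM attributes of one PCI device directory.
# $1: device directory, e.g. /sys/bus/pci/devices/0000:01:00.0
# The body runs in a subshell so the loop variables stay local.
runtime_pm_summary() (
    dev=$1
    for attr in control runtime_status; do
        if [ -r "$dev/power/$attr" ]; then
            printf '%s=%s\n' "$attr" "$(cat "$dev/power/$attr")"
        else
            printf '%s=unavailable\n' "$attr"
        fi
    done
)
```

Usage: runtime_pm_summary /sys/bus/pci/devices/0000:01:00.0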

I don't know if Lenovo somehow enables this after you install Windows, which might be why it doesn't work on Linux out of the box.

francoism90 avatar Jul 25 '25 14:07 francoism90

I have a similar issue on the 16AHP9 on Version 575.64.05. The GPU does enter D3Cold, but it instantly wakes up to D0 again. By default, on my BIOS D3Cold and the MUX configuration were enabled correctly.

The really weird part of this is that this issue goes away ENTIRELY after suspending the laptop and waking it up again. So my current workaround is to put the laptop into a suspend state after boot, and it'll work just fine after that.

Here's my bug report file:

nvidia-bug-report.log.gz

This seems related to #905

Mulukulum avatar Aug 17 '25 05:08 Mulukulum

I have a similar issue on the 16AHP9 on Version 575.64.05.

The really weird part of this is that this issue goes away ENTIRELY after suspending the laptop and waking it up again.

Yes, this is the issue. Once it starts, suspending and then resuming will bring it back to D3cold, but if anything accesses the dGPU, the bug starts all over again.
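To put a number on "wakes up again" when comparing before and after a suspend, one option is to sample power_state once per second into a file and count how often consecutive samples differ. A minimal sketch; count_state_changes is a hypothetical helper, and the sampling itself is left to you.

```shell
# Read one sampled power state per line on stdin (e.g. D0, D3cold, D0, ...)
# and print how many times the state changed between consecutive samples.
count_state_changes() {
    awk 'NR > 1 && $0 != prev { n++ }
         { prev = $0 }
         END { print n + 0 }'
}
```

A GPU that settles into D3cold should show only a small, stable count over a minute of samples, while the bouncing described here should keep increasing it.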

ngbomford avatar Aug 17 '25 06:08 ngbomford

Weird, for me the bug doesn't re-trigger after accessing the GPU.

Mulukulum avatar Aug 17 '25 06:08 Mulukulum

I have the same bug, even with the latest closed driver.

Could this be a kernel bug instead?

francoism90 avatar Aug 17 '25 09:08 francoism90

I do not have the same problem with the closed driver, it happens only on the open driver. I suspect these are different issues.

Mulukulum avatar Aug 17 '25 11:08 Mulukulum

I do not have the same problem with the closed driver, it happens only on the open driver. I suspect these are different issues.

Interesting, I have the same problem with both open and closed driver, all driver versions, including the latest 580.

Just curious, does your display have HDR? The 16APH models with HDR are different: they don't support Advanced Optimus under Windows. I'm wondering if it's related to the hardware wiring somehow.

ngbomford avatar Aug 18 '25 03:08 ngbomford

@ngbomford Nope, my display doesn't have HDR. Are you sure you actually have the proprietary modules installed?

The RPMFusion packages were named as though they were proprietary, but they actually installed the open modules. Could you double-check your version using sudo dmesg | grep -i nvidia?

If you're on the open modules you'll see something like NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64
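The same check can be scripted. This sketch looks only for the "Open Kernel Module" load message quoted above; the function name nvidia_module_flavor and the "not-open" / "unknown" labels are my own, and treating any other NVRM load line as the proprietary module is an assumption.

```shell
# Read kernel log text on stdin and classify the loaded NVIDIA module.
# Prints "open" if the open-module load message is present, "not-open" if
# some other "NVRM: loading NVIDIA UNIX" line is present (assumed to be the
# proprietary module), and "unknown" if neither is found.
nvidia_module_flavor() {
    awk '
        /NVRM: loading NVIDIA UNIX Open Kernel Module/ { print "open"; found = 1; exit }
        /NVRM: loading NVIDIA UNIX/ { other = 1 }
        END { if (!found) print (other ? "not-open" : "unknown") }
    '
}
```

Usage: sudo dmesg | nvidia_module_flavor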

Mulukulum avatar Aug 18 '25 05:08 Mulukulum

@ngbomford Nope, my display doesn't have HDR. Are you sure you actually have the proprietary modules installed?

Yes, and I've alternated between the two, same result, currently running proprietary.

pacman -Q | grep nvidia

lib32-nvidia-utils 580.76.05-1
linux-firmware-nvidia 20250808-1
nvidia-dkms 580.76.05-4
nvidia-prime 1.0-5
nvidia-settings 580.76.05-1
nvidia-utils 580.76.05-4
opencl-nvidia 580.76.05-4

modinfo -l nvidia

NVIDIA

ngbomford avatar Aug 18 '25 06:08 ngbomford

@ngbomford , could you check this comment and see if you're able to get the bug to trigger only once per boot.

It sounds like this and #905 are the same issue, so merging it into a single issue might be a good idea.

Mulukulum avatar Aug 18 '25 10:08 Mulukulum

@ngbomford , could you check this comment and see if you're able to get the bug to trigger only once per boot.

Interesting, I now have fully functional D3cold with proprietary driver using the following config:

options nvidia NVreg_DynamicPowerManagementVideoMemoryThreshold=0
options nvidia NVreg_EnableS0ixPowerManagement=1
options nvidia NVreg_S0ixPowerManagementVideoMemoryThreshold=1024

Just to test, I woke the GPU up 10 times and let it go back to D3cold, which it did successfully each time. I also ran systemctl suspend a few times to make sure that it eventually goes back to D3cold and stays there.

In addition, I plugged and unplugged the power and tried sleeping several times both on and off the charger; again, it always went back to D3cold.

Previously with proprietary drivers newer than 565.57.01 I could never get this to work at all.

ngbomford avatar Aug 19 '25 05:08 ngbomford

Interesting, I now have fully functional D3cold with proprietary driver using the following config:

@ngbomford Hi. What exactly do you mean by "fully functional"?

imaGuru avatar Aug 25 '25 09:08 imaGuru

Interesting, I now have fully functional D3cold with proprietary driver using the following config:

@ngbomford Hi. What exactly do you mean by "fully functional"?

@imaGuru "fully functional D3cold" as stated above.

ngbomford avatar Aug 25 '25 09:08 ngbomford

@ngbomford I'm asking because it is not entirely clear from your post above. So just to be sure: with the configuration above, can your GPU enter D3cold and stay there straight from boot, or only after system suspend? Are you using drivers from the NVIDIA installer or from the pacman package nvidia/nvidia-open? Does changing to nvidia-open affect the issue? If you change the threshold to 200 or something larger, does the GPU oscillate between D0 and D3cold on boot, sleep correctly the first time after system suspend, and then show the bug again after a wakeup (e.g. by nvidia-smi)? If possible, can you attach the output of cat /proc/driver/nvidia/params and systemd-analyze cat-config modprobe.d | grep nvidia?

imaGuru avatar Aug 25 '25 09:08 imaGuru

@ngbomford I'm asking because it is not entirely clear from your post above. So just to be sure: with the configuration above, can your GPU enter D3cold and stay there straight from boot, or only after system suspend? Are you using drivers from the NVIDIA installer or from the pacman package nvidia/nvidia-open? Does changing to nvidia-open affect the issue? If you change the threshold to 200 or something larger, does the GPU oscillate between D0 and D3cold on boot, sleep correctly the first time after system suspend, and then show the bug again after a wakeup (e.g. by nvidia-smi)? If possible, can you attach the output of cat /proc/driver/nvidia/params and systemd-analyze cat-config modprobe.d | grep nvidia?

@imaGuru Switches to D3cold 10-15 seconds after boot and always functions correctly, even after suspending / unplugging power, powering back up it switches to d3cold again perfectly. Running anything that accesses the dGPU wakes it up, after the app closes it goes back to D3cold correctly. I've tried everything to trigger the bug, but can't with this config.

The pacman nvidia (closed) package works correctly with my config. Using the nvidia-open package results in d3cold-d3 alternating, and I've tried so many different configs to make it work that I gave up.

If I change the threshold to 200 or anything else, it results in d3-d3cold alternating.

/proc/driver/nvidia/params

ResmanDebugLevel: 4294967295
RmLogonRC: 1
ModifyDeviceFiles: 1
DeviceFileUID: 0
DeviceFileGID: 0
DeviceFileMode: 438
InitializeSystemMemoryAllocations: 1
UsePageAttributeTable: 4294967295
EnableMSI: 1
EnablePCIeGen3: 0
MemoryPoolSize: 0
KMallocHeapMaxSize: 0
VMallocHeapMaxSize: 0
IgnoreMMIOCheck: 0
EnableStreamMemOPs: 0
EnableUserNUMAManagement: 1
NvLinkDisable: 0
RmProfilingAdminOnly: 1
PreserveVideoMemoryAllocations: 1
EnableS0ixPowerManagement: 1
S0ixPowerManagementVideoMemoryThreshold: 1024
DynamicPowerManagement: 3
DynamicPowerManagementVideoMemoryThreshold: 0
RegisterPCIDriver: 1
EnablePCIERelaxedOrderingMode: 0
EnableResizableBar: 0
EnableGpuFirmware: 18
EnableGpuFirmwareLogs: 2
RmNvlinkBandwidthLinkCount: 0
EnableDbgBreakpoint: 0
OpenRmEnableUnsupportedGpus: 1
DmaRemapPeerMmio: 1
ImexChannelCount: 2048
CreateImexChannel0: 0
GrdmaPciTopoCheckOverride: 0
CoherentGPUMemoryMode: ""
RegistryDwords: ""
RegistryDwordsPerDevice: ""
RmMsg: ""
GpuBlacklist: ""
TemporaryFilePath: "/var/tmp"
ExcludedGpus: ""
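When checking whether the modprobe options actually reached the loaded module, it may be easier to query single values from this file than to read the whole dump. A minimal sketch; nvreg_param is a hypothetical helper, and it assumes the "Name: value" layout shown above (names appear without the NVreg_ prefix).

```shell
# Print the value of one parameter from the NVIDIA params file.
# $1: parameter name as it appears in the file, e.g. DynamicPowerManagement
# $2: optional path, defaulting to /proc/driver/nvidia/params
nvreg_param() {
    awk -F': ' -v key="$1" '$1 == key { print $2; exit }' \
        "${2:-/proc/driver/nvidia/params}"
}
```

Usage: nvreg_param DynamicPowerManagementVideoMemoryThreshold should print 0 with the config from this thread, and nvreg_param EnableS0ixPowerManagement should print 1.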

systemd-analyze cat-config modprobe.d | grep nvidia

# /usr/lib/modprobe.d/nvidia-sleep.conf
# https://download.nvidia.com/XFree86/Linux-x86_64/560.35.03/README/powermanagement.html#PreserveAllVide719f0
options nvidia NVreg_PreserveVideoMemoryAllocations=1
options nvidia NVreg_TemporaryFilePath=/var/tmp
# /usr/lib/modprobe.d/nvidia-utils.conf
# /etc/modprobe.d/nvidia.conf
options nvidia NVreg_DynamicPowerManagementVideoMemoryThreshold=0
options nvidia NVreg_EnableS0ixPowerManagement=1
options nvidia NVreg_S0ixPowerManagementVideoMemoryThreshold=1024
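Since several files can set options for the same module, it can help to list each option next to the file it came from. The sketch below assumes the layout shown above, where each file's contents are preceded by a "# /path/to/file" header line; nvidia_options_by_file is a made-up name.

```shell
# Read `systemd-analyze cat-config modprobe.d` output on stdin and print
# one "<file>: <option>" line for every option set on the nvidia module.
nvidia_options_by_file() {
    awk '
        /^# \// { file = $2; next }            # header line naming the file
        $1 == "options" && $2 == "nvidia" {
            for (i = 3; i <= NF; i++) print file ": " $i
        }
    '
}
```

Usage: systemd-analyze cat-config modprobe.d | nvidia_options_by_file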

ngbomford avatar Aug 29 '25 01:08 ngbomford