open-gpu-kernel-modules

[555.42.02] D3cold on Turing Mobile not working with kernel 6.9.2. Works with closed driver.

Open dagbdagb opened this issue 1 year ago • 13 comments

NVIDIA Open GPU Kernel Modules Version

550.78

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

  • [X] I confirm that this does not happen with the proprietary driver package.

Operating System and Version

Gentoo Linux x86_64 6.7.9-gentoo

Kernel Release

6.7.9-gentoo, own config

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • [X] I am running on a stable kernel release.

Hardware: GPU

NVIDIA GeForce RTX 2070 with Max-Q Design

Describe the bug

I noticed my laptop was slightly warmer than expected while running 6.8.9-gentoo. A number of reboots later, I can state that:

dagb@gillette:~ (20:32) $ cat /proc/driver/nvidia/gpus/0000\:01\:00.0/power 
Runtime D3 status:          Not supported
Video Memory:               Active

GPU Hardware Support:
 Video Memory Self Refresh: Not Supported
 Video Memory Off:          Supported

... is the result if the nvidia-drivers package is built with the kernel-open USE flag on Gentoo, running gentoo-sources-6.7.9.

If built with -kernel-open (the leading '-' means the flag is disabled), I have fine-grained control again.

HOWEVER, please also note: I also tried both variants (open/closed kernel driver) on 6.8.9, and there I get 'Not supported' in both cases.

I have not bisected the issue to a particular kernel version. I just happened to have 6.7.9 on disk.
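
For anyone who wants to script this check instead of reading the file by eye, here is a minimal Python sketch. It assumes the `Key: Value` layout shown in the output above. `parse_power_report` and `runtime_d3_status` are hypothetical helper names, not part of any NVIDIA tooling, and the PCI address is simply the one from this report.

```python
from pathlib import Path
from typing import Dict, Optional


def parse_power_report(text: str) -> Dict[str, str]:
    """Turn 'Key:   Value' lines from the power file into a flat dict.

    Section headers such as 'GPU Hardware Support:' have no value and are
    skipped. Keys are not namespaced by section, so a key that appears in
    two sections keeps only its last value.
    """
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        value = value.strip()
        if sep and value:
            fields[key.strip()] = value
    return fields


def runtime_d3_status(text: str) -> Optional[str]:
    """Return the 'Runtime D3 status' value, or None if the line is absent."""
    return parse_power_report(text).get("Runtime D3 status")


if __name__ == "__main__":
    # Assumption: PCI address as in this report; adjust for your system.
    power_file = Path("/proc/driver/nvidia/gpus/0000:01:00.0/power")
    if power_file.exists():
        print(runtime_d3_status(power_file.read_text()))
```

With the output quoted above, `runtime_d3_status` would return `Not supported`.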

To Reproduce

  • run gentoo
  • install gentoo-sources-6.7.9
  • build/install kernel
  • build/install nvidia-drivers:
[ebuild   R    ] x11-drivers/nvidia-drivers-550.78:0/550::gentoo  USE="X kernel-open modules static-libs strip tools wayland -dist-kernel -modules-compress -modules-sign -persistenced -powerd" ABI_X86="(64) -32" 0 KiB
  • driver options:

blacklist nouveau
options nvidia-drm modeset=0
options nvidia NVreg_DeviceFileGID=27
options nvidia NVreg_DeviceFileMode=432
options nvidia NVreg_DeviceFileUID=0
options nvidia NVreg_ModifyDeviceFiles=1
options nvidia NVreg_TemporaryFilePath=/var/tmp
options nvidia NVreg_PreserveVideoMemoryAllocations=0
options nvidia NVreg_DynamicPowerManagement=0x02
options nvidia NVreg_EnableS0ixPowerManagement=1
options nvidia NVreg_UsePageAttributeTable=1
alias char-major-195 nvidia
alias /dev/nvidiactl char-major-195
remove nvidia modprobe -r --ignore-remove nvidia-drm nvidia-modeset nvidia-uvm nvidia


  • udev rules:

dagb@gillette:~ (20:45) $ cat /etc/udev/rules.d/80-nvidia-pm.rules
# Remove NVIDIA USB xHCI Host Controller devices, if present
ACTION=="add", SUBSYSTEM=="pci", ATTR{vendor}=="0x10de", ATTR{class}=="0x0c0330", ATTR{remove}="1"

# Remove NVIDIA USB Type-C UCSI devices, if present
ACTION=="add", SUBSYSTEM=="pci", ATTR{vendor}=="0x10de", ATTR{class}=="0x0c8000", ATTR{remove}="1"

# Remove NVIDIA Audio devices, if present
ACTION=="add", SUBSYSTEM=="pci", ATTR{vendor}=="0x10de", ATTR{class}=="0x040300", ATTR{remove}="1"

# Enable runtime PM for NVIDIA VGA/3D controller devices on driver bind
ACTION=="bind", SUBSYSTEM=="pci", ATTR{vendor}=="0x10de", ATTR{class}=="0x030000", TEST=="power/control", ATTR{power/control}="auto"
ACTION=="bind", SUBSYSTEM=="pci", ATTR{vendor}=="0x10de", ATTR{class}=="0x030200", TEST=="power/control", ATTR{power/control}="auto"

# Disable runtime PM for NVIDIA VGA/3D controller devices on driver unbind
ACTION=="unbind", SUBSYSTEM=="pci", ATTR{vendor}=="0x10de", ATTR{class}=="0x030000", TEST=="power/control", ATTR{power/control}="on"
ACTION=="unbind", SUBSYSTEM=="pci", ATTR{vendor}=="0x10de", ATTR{class}=="0x030200", TEST=="power/control", ATTR{power/control}="on"
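
To double-check which NVreg values actually end up in the modprobe configuration listed under the driver options above, something like the following could be used. This is only a sketch: `nvidia_module_options` is a made-up helper, it assumes each setting sits on its own `options <module> NAME=VALUE` line, and it matches module names literally (modprobe itself treats `-` and `_` in module names as equivalent).

```python
import shlex
from typing import Dict


def nvidia_module_options(config_text: str, module: str = "nvidia") -> Dict[str, str]:
    """Collect NAME=VALUE pairs from 'options <module> ...' lines.

    shlex handles both the bare form (NVreg_X=1) and the quoted form
    ("NVreg_X=1"), and drops '#' comments.
    """
    options: Dict[str, str] = {}
    for line in config_text.splitlines():
        tokens = shlex.split(line, comments=True)
        # Only 'options <module> NAME=VALUE ...' lines are of interest.
        if len(tokens) < 3 or tokens[0] != "options" or tokens[1] != module:
            continue
        for token in tokens[2:]:
            name, sep, value = token.partition("=")
            if sep:
                options[name] = value
    return options
```

It can be fed the text of a file under /etc/modprobe.d or the output of `modprobe nvidia --showconfig`. For the configuration in this report, `nvidia_module_options(text)["NVreg_DynamicPowerManagement"]` would be `0x02`.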




Bug Incidence

Always

nvidia-bug-report.log.gz

[nvidia-bug-report.log.gz](https://github.com/NVIDIA/open-gpu-kernel-modules/files/15287343/nvidia-bug-report.log.gz)


More Info

I *think* 6.6.30 works with both open and closed kernel driver. Will give it another spin to verify, and update this ticket with the result.

dagbdagb • May 12 '24 18:05

Right. So 6.6.30 also fails with the open driver. If this was always the case, then the bug appears to be with the closed driver, and whatever we are looking for happened between kernel 6.7.9 and 6.8.9. Sigh. And the entire ticket belongs somewhere else, I presume?

dagbdagb • May 12 '24 19:05

> And the entire ticket belongs somewhere else, I presume?

Yes, here: https://forums.developer.nvidia.com/c/gpu-graphics/linux/148

ttabi • May 12 '24 22:05

Seeing how this is still open, I might as well continue here.

In light of this driver being considered as the default for the Linux NVIDIA drivers, I would like to point out that in order to get RTD3/D3cold working with my Turing 2070 mobile, I must:

  • use the proprietary kernel driver
  • disable loading the GPU firmware

Any other combination ends up with "Runtime D3 status: Not supported".

This applies to kernel version 6.9.2-gentoo and nvidia-drivers 555.42.02.

I will happily provide an updated nvidia-bug-report.log.gz if required. If so, let me know if you want it with a particular combo of driver and driver options enabled.

dagbdagb • Jun 01 '24 18:06

Since you're on Gentoo, can you try a 6.1.x kernel? (Especially 6.1.92, since that version works for me.)

I seem to have some issues with D3cold as well.

XutaxKamay • Jun 21 '24 23:06

I can, but is there any point to it? 6.1 is a longterm kernel, sure. But so is 6.6, which is way more recent. Also, try what exactly? Open driver with GPU firmware loading? Does this combo enable D3cold for you? And if so, does it still enter D3cold after a suspend cycle?

dagbdagb • Jun 22 '24 09:06

Hey there, sorry for the late reply! In the driver readme kernel_open section it says:

Known Issues The following are some known limitations of the open kernel modules versus the proprietary kernel modules with GSP firmware mode disabled: ...

  • Run Time D3 (RTD3) is only supported on Ampere and above GPUs.

This isn't a "bug that needs fixing" kind of issue, it's more of a "feature is entirely missing and needs to be coded from scratch". Unlike Ampere+, the proprietary non-GSP implementation of Turing RTD3 doesn't map well to GSP and would require a large effort to enable. I can't give any ETA or anything, but considering that this was never a default-enabled feature even on proprietary, I imagine the priority is gonna be lower than other regressions.

In the meantime, you might want to stay with the proprietary driver with GSP disabled if this is a dealbreaker for you.

Thanks for understanding.

mtijanic • Jun 24 '24 13:06

> Hey there, sorry for the late reply! In the driver readme kernel_open section it says:
>
> Known Issues The following are some known limitations of the open kernel modules versus the proprietary kernel modules with GSP firmware mode disabled: ...
>
>   • Run Time D3 (RTD3) is only supported on Ampere and above GPUs.
>
> This isn't a "bug that needs fixing" kind of issue, it's more of a "feature is entirely missing and needs to be coded from scratch". Unlike Ampere+, the proprietary non-GSP implementation of Turing RTD3 doesn't map well to GSP and would require a large effort to enable. I can't give any ETA or anything, but considering that this was never a default-enabled feature even on proprietary, I imagine the priority is gonna be lower than other regressions.
>
> In the meantime, you might want to stay with the proprietary driver with GSP disabled if this is a dealbreaker for you.
>
> Thanks for understanding.

I see.

The effort required is with the firmware, is that it? And yes, dropping the laptop's power consumption by 5-6 W is fairly essential, both for the heat and the fan noise.

Any chance of nvidia publishing a live list of items being worked on / prioritized for the next driver release?

dagbdagb • Jun 24 '24 14:06

> Any chance of nvidia publishing a live list of items being worked on / prioritized for the next driver release?

Honestly? No, no chance. That information is hard enough to come by even internally, and aside from that, historically we've had a very bad time when publicly shared ETAs slip even by just a few days.

I'm afraid the only straight answer you're gonna get is roughly: "Known issue. Not easy fix. No ETA. Low priority. Here's a workaround (proprietary+disable GSP)". Anything else I could say would be so full of weasel words that it might as well be left unsaid.

Sorry, I know it's not what you want to hear, but it is what it is.

mtijanic • Jun 24 '24 16:06

> Sorry, I know it's not what you want to hear, but it is what it is.

You're right, @mtijanic . Hate the message, appreciate the messenger.

So, to sum it up:

  • the GSP firmware for Turing does not support RTD3 at all
  • the non-GSP way of enabling RTD3 for Turing does not match well with GSP
  • RTD3 on Turing requires the proprietary driver, with GSP disabled
  • this may possibly never be fixed

Bah.

For anyone else finding this: Even with the proprietary driver and GSP disabled, RTD3 on Turing is finicky. A suspend/resume cycle may in some cases cause the card to not enter D3cold again.

dagbdagb • Jun 25 '24 06:06

Edit: This seems to be a weird sysfs thing; I was looking at the wrong file (/sys/class/drm/card1/device/power/runtime_status [correct] vs /sys/class/drm/card1/power/runtime_status [reports something else, apparently]). Runtime PM is indeed enabled, but doesn't work for... reasons?
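
A small Python sketch of reading the right file, in case it helps anyone avoid the same mix-up. It assumes the `card1` name and the path layout from the edit above. `pci_runtime_status` is a hypothetical helper, and the `sysfs_root` parameter exists only so the function can be pointed at a test directory.

```python
from pathlib import Path
from typing import Optional


def pci_runtime_status(card: str = "card1", sysfs_root: Path = Path("/sys")) -> Optional[str]:
    """Read runtime_status of the PCI device behind a DRM card.

    Note the 'device' component: without it the path refers to the DRM
    class device itself rather than to the underlying PCI device.
    """
    status_file = sysfs_root / "class" / "drm" / card / "device" / "power" / "runtime_status"
    try:
        return status_file.read_text().strip()
    except OSError:
        # Card not present, or the file is not readable.
        return None


if __name__ == "__main__":
    print(pci_runtime_status())
```

A value of `suspended` means the device is currently runtime-suspended; `active` means it is powered up.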

Moar Edit: If you value your battery life, do not set nvidia_drm.fbdev=1.

Original: Unless I'm missing something critical (which I may well be), this issue now seems to affect the proprietary kernel modules as well. I've been running an Nvidia-driven display on my hybrid-GPU laptop until very recently, so I can't say exactly when things changed, but here's what I'm currently seeing on the v555.58.02 proprietary modules:

$ modinfo nvidia | rg license
license:        NVIDIA

$ modprobe nvidia --showconfig | rg NVreg
options nvidia NVreg_PreserveVideoMemoryAllocations=1
options nvidia "NVreg_EnableGpuFirmware=0"
options nvidia "NVreg_DynamicPowerManagement=0x02"

$ cat /sys/class/drm/card1/power/runtime_status
unsupported

And, indeed, the GPU stays in D0 even when it has been able to switch to D3Cold previously (unplugged from wall power, no external display connected, no programs using it).

Is this a known/expected regression?

LRitzdorf • Aug 21 '24 00:08

@LRitzdorf

> Is this a known/expected regression?

I don't think so, with options nvidia "NVreg_EnableGpuFirmware=0"; can you verify it is actually disabled? Run:

nvidia-smi -q | grep GSP

If it gives you N/A it's disabled, and if it gives a version number then that param had no effect.
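
The same check as a Python sketch, for use in a script. It assumes the `GSP Firmware Version : <value>` line format that `nvidia-smi -q` prints; `gsp_firmware_version` and `gsp_enabled` are hypothetical helper names.

```python
import re
from typing import Optional


def gsp_firmware_version(smi_output: str) -> Optional[str]:
    """Extract the value of the 'GSP Firmware Version' line, if present."""
    match = re.search(
        r"^\s*GSP Firmware Version\s*:\s*(\S+)\s*$", smi_output, re.MULTILINE
    )
    return match.group(1) if match else None


def gsp_enabled(smi_output: str) -> Optional[bool]:
    """True if a version number is reported, False for N/A, None if unknown."""
    version = gsp_firmware_version(smi_output)
    if version is None:
        return None
    return version != "N/A"
```

The input would typically come from something like `subprocess.run(["nvidia-smi", "-q"], capture_output=True, text=True).stdout`.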

Anyway, if it is actually disabled, please shoot a bug report to [email protected], since it has nothing to do with this repo here.

mtijanic • Aug 21 '24 11:08

This information was rather annoying to find (thanks Arch Wiki for actually linking to it). Moving back to the proprietary kernel driver for the foreseeable future.

qwertychouskie • Oct 15 '25 22:10

@mtijanic

> can you verify it is actually disabled?

With nvidia-open, GSP firmware cannot be disabled:

$ modinfo nvidia | rg license
license:        Dual MIT/GPL
$ modprobe nvidia --showconfig | rg NVreg
options nvidia NVreg_PreserveVideoMemoryAllocations=1
options nvidia NVreg_TemporaryFilePath=/var/tmp
options nvidia "NVreg_DynamicPowerManagement=0x02"
options nvidia NVreg_EnableGpuFirmware=0
$ nvidia-smi -q | grep GSP
    GSP Firmware Version                  : 580.105.08
$ cat /proc/driver/nvidia/gpus/0000:01:00.0/power
Runtime D3 status:          Not supported
Video Memory:               Active

GPU Hardware Support:
 Video Memory Self Refresh: Not Supported
 Video Memory Off:          Supported

S0ix Power Management:
 Platform Support:          Not Supported
 Status:                    Disabled

Notebook Dynamic Boost:     Not Supported

With the closed driver, GSP firmware is disabled:

$ modinfo nvidia | rg license
license:        NVIDIA
$ modprobe nvidia --showconfig | rg NVreg
options nvidia NVreg_PreserveVideoMemoryAllocations=1
options nvidia NVreg_TemporaryFilePath=/var/tmp
options nvidia "NVreg_DynamicPowerManagement=0x02"
options nvidia NVreg_EnableGpuFirmware=0
$ nvidia-smi -q | grep GSP
    GSP Firmware Version                  : N/A
$ cat /proc/driver/nvidia/gpus/0000\:01\:00.0/power
Runtime D3 status:          Enabled (fine-grained)
Video Memory:               Active

GPU Hardware Support:
 Video Memory Self Refresh: Supported
 Video Memory Off:          Supported

S0ix Power Management:
 Platform Support:          Not Supported
 Status:                    Disabled

Notebook Dynamic Boost:     Not Supported
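
To tell the two cases above apart in a script, the `license:` line from `modinfo nvidia` can be used. A minimal Python sketch, assuming only the two license strings seen in this thread; `kernel_module_flavor` is a hypothetical helper name.

```python
from typing import Optional


def kernel_module_flavor(modinfo_output: str) -> Optional[str]:
    """Classify the loaded nvidia module by its 'license:' line.

    Assumption: 'Dual MIT/GPL' marks the open kernel modules and 'NVIDIA'
    the proprietary ones, as in the outputs in this thread. Anything else
    yields None.
    """
    for line in modinfo_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "license":
            license_name = value.strip()
            if license_name == "Dual MIT/GPL":
                return "open"
            if license_name == "NVIDIA":
                return "proprietary"
            return None
    return None
```

Going by the two outputs above, `open` went together with "Runtime D3 status: Not supported" on this Turing GPU, while `proprietary` with GSP disabled gave "Enabled (fine-grained)".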

Markus00000 • Nov 21 '25 07:11