
RTD3 doesn't allow the GPU to sleep after a monitor has been plugged and unplugged on PRIME reverse sync

Aetherall opened this issue 11 months ago • 16 comments

NVIDIA Open GPU Kernel Modules Version

NVRM version: NVIDIA UNIX Open Kernel Module for x86_64 565.77 srcversion: 0BDAE46B2642DAFAAF16C9C

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

  • [ ] I confirm that this does not happen with the proprietary driver package.

Operating System and Version

Nixos unstable

Kernel Release

Linux 6.12.6 NixOS SMP PREEMPT_DYNAMIC Thu Dec 19 17:13:24 UTC 2024 x86_64 GNU/Linux

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • [x] I am running on a stable kernel release.

Hardware: GPU

GPU 0: NVIDIA GeForce RTX 3080 Ti Laptop GPU

Describe the bug

GPU power state locked in D0 after a monitor has been plugged into and unplugged from the GPU.

I am using reverse sync on an Advanced Optimus laptop to allow hotplugging monitors into the GPU's HDMI/DP ports.

After a clean boot, and before plugging an external monitor, the nvidia gpu is in D3cold state.

> cat /sys/class/drm/card*/device/power_state
D0
D3cold

I can power on the nvidia card using nvidia-smi or by running glxinfo, and the power state will switch to D0 before going back to sleep as intended. The issue arises when a monitor is plugged into the GPU: this wakes the GPU so reverse PRIME can render to the external display. However, after unplugging the monitor, the power state never leaves D0 (with no process running on the GPU).

I found related posts on the nvidia forums, https://forums.developer.nvidia.com/t/nvidia-dgpu-in-hybrid-optimus-laptop-not-powering-down-after-unplugging-external-monitor/318196 https://forums.developer.nvidia.com/t/565-release-feedback-discussion/310777/154 https://forums.developer.nvidia.com/t/565-release-feedback-discussion/310777/41 and most importantly https://forums.developer.nvidia.com/t/bug-linux-driver-fails-to-remove-framebuffer-device-when-hdmi-cable-plugged-out/316645

In the last one, gm151 noticed that the framebuffer created when plugging the monitor is never cleaned up.

When the cable is plugged in a new framebuffer device is created as it should, however when the cable is plugged out, the device is NOT removed even with no clients using it. This has several negative consequences:

  • If the virtual console is remapped to the new framebuffer, then after plugging out, the console is NOT remapped back to the integrated GPU. (This can be inhibited by passing fbcon=map:0, however this does not help the framebuffer to get removed)
  • The DGPU device fails to enter the D3cold state and keeps consuming power.
    Here are some facts from the kernel’s sysfs. Note this is WITHOUT any graphical environment running (pure text console), ruling out the graphical env as a culprit. Also, a drm journal log was emitted on plug ([drm] fb1: nvidia-drmdrmfb frame buffer device) but nothing on unplug.

I am facing the same situation and did some more testing.

I can indeed see that the frame buffer at /dev/fb1 ( /sys/class/graphics/fb1 ) is created when the monitor is plugged in and not removed on unplug.

Reloading the nvidia-drm module allows the GPU to go back to sleep: modprobe -r nvidia-drm && modprobe nvidia-drm. I notice that the ghost framebuffer is removed afterwards; maybe that is what allows RTD3 to kick in.

dumping the framebuffers using cat /sys/kernel/debug/dri/12{8,9}/framebuffer shows that:

  • on fresh boot with d3cold and no monitor, the nvidia related dri framebuffer is empty (empty file) whereas the other contains a framebuffer allocated by fbcon

  • after plugging in a monitor, the nvidia dri framebuffer now contains a framebuffer allocated by fbcon too, with a layer size corresponding to the monitor.

  • after unplugging the monitor, the framebuffer does not go back to an empty file, and stays allocated by fbcon

I tried every combination of kernel parameters / module options to no avail. The latest set I tried was:

options nvidia NVreg_EnableGpuFirmware=0 NVreg_DynamicPowerManagementVideoMemoryThreshold=0 NVreg_DynamicPowerManagement=0x02 NVreg_UsePageAttributeTable=1 NVreg_InitializeSystemMemoryAllocations=0 NVreg_PreserveVideoMemoryAllocations=1
options nvidia-drm modeset=1 fbdev=1

I also tried several linux kernel versions.

It seems like the code responsible for fbdev moved recently in the linux kernel and in the open-gpu-kernel-modules. I saw changes related to hotplug events, so maybe we are missing a handler to clean up the framebuffers?

Thanks !

To Reproduce

  • on an Advanced Optimus laptop in hybrid mode
  • unplug external monitors
  • boot to a tty with reverse prime and fine-grained power control enabled
  • cat /sys/class/drm/card*/device/power_state to check, and wait until the gpu is D3cold
  • ls /dev/fb* -> should show only fb0 ( integrated graphics -> internal monitor )
  • nvidia-smi -> wakes the gpu
  • cat /sys/class/drm/card*/device/power_state to check, and wait until the gpu is D3cold again
  • here we know that RTD3 works
  • plug in an external monitor
  • cat /sys/class/drm/card*/device/power_state to check that the gpu is D0
  • here we know the gpu can wake for a monitor
  • unplug the external monitor
  • cat /sys/class/drm/card*/device/power_state to check and wait until the gpu is D3cold <- never happens
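The wait-and-check steps above can be scripted. Below is a minimal sketch; the sysfs path, the card number (card1), and the 60-second timeout are assumptions to adjust for your system:

```shell
#!/bin/sh
# wait_for_state FILE STATE TIMEOUT_SECS
# Polls FILE once per second until its content is exactly STATE.
# Returns 0 on success, 1 if TIMEOUT_SECS elapse first.
wait_for_state() {
    file=$1; state=$2; timeout=$3
    i=0
    while [ "$i" -lt "$timeout" ]; do
        # -x: match the whole line, -F: fixed string, -q: quiet
        grep -qxF "$state" "$file" && return 0
        sleep 1
        i=$((i + 1))
    done
    return 1
}

# Example (assumed card numbering): wait up to 60s for the dGPU to reach D3cold
# wait_for_state /sys/class/drm/card1/device/power_state D3cold 60 \
#     && echo "RTD3 OK" \
#     || echo "GPU stuck in D0"
```

Running this after the unplug step makes the regression easy to spot: on an affected setup the final wait times out.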

Bug Incidence

Always

nvidia-bug-report.log.gz

I have tested 20+ module option combinations; which options are the most interesting to generate the report with?

More Info

No response

Aetherall avatar Dec 29 '24 19:12 Aetherall

Hi @Aetherall,

We are tracking this in bug 5034343 internally. I am very interested in this issue myself. My original opinion on this issue is that it should have persisted across various kernel versions, independent of the recent fbdev API refactors.

It seems like the code responsible for fbdev moved recently in the linux kernel and in the open-gpu-kernel-modules. I saw changes related to hotplug events, so maybe we are missing a handler to clean up the framebuffers?

My expectation is that the hotplug event handler is enumerated by the fbdev core API. I have an explanation of this here.

If you use an LTS kernel like 6.6, are you saying that you do not have this issue? I would be surprised if that were the case.

If so, could you provide two bug collection reports? One with the 6.6 kernel and one with the 6.12 kernel, using the same driver version.

If you would not mind following up with abchauhan on the NVIDIA forum post, that would be appreciated as well. That way, I can be provided with a repro setup. In theory, I should be able to reproduce this with my work laptop, but it helps with our overall process.

Binary-Eater avatar Dec 30 '24 21:12 Binary-Eater

Hi @Binary-Eater thanks for the followup !

I initially had the issue on LTS kernel, and later upgraded to 6.12 to see if it would fix the issue. I have not reverted back as this kernel version contains other unrelated improvements I want to keep.

I will follow up with abchauhan ASAP and provide the gz logfile and a reproducible environment; however, I won't be available over New Year's Eve, so it might take a few days.

Meanwhile here is my nvidia nixos configuration if you want to reproduce it on the same os as well.

{
  config,
  pkgs,
  lib,
  ...
}: {
  boot.kernelPackages = pkgs.linuxPackages_latest;
  hardware.graphics.enable = true;
  powerManagement.enable = true;

  services.auto-cpufreq.settings = {
    battery = {
      governor = "powersave";
      turbo = "auto";
    };
    charger = {
      governor = "performance";
      turbo = "auto";
    };
  };

  hardware.nvidia = {
    modesetting.enable = true;

    powerManagement.enable = true;

    dynamicBoost.enable = true;
    nvidiaPersistenced = true;

    open = true;
    nvidiaSettings = true;
    package = config.boot.kernelPackages.nvidiaPackages.beta;
  };

  services.udev.extraRules = ''
    # Create consistent gpu devices symlinks
    ACTION=="bind", SUBSYSTEM=="pci", ATTRS{vendor}=="0x8086", ATTR{class}=="0x030000", RUN+="${pkgs.coreutils-full}/bin/ln -s /dev/dri/by-path/pci-0000:00:02.0-card /dev/gpu_intel"
    ACTION=="bind", SUBSYSTEM=="pci", ATTRS{vendor}=="0x10de", ATTR{class}=="0x030000", RUN+="${pkgs.coreutils-full}/bin/ln -s /dev/dri/by-path/pci-0000:01:00.0-card /dev/gpu_nvidia"
  '';

  services.xserver.videoDrivers = ["nvidia"];

  environment.sessionVariables.AQ_DRM_DEVICES = "/dev/gpu_nvidia";
  environment.sessionVariables.VK_ICD_FILENAMES = "/run/opengl-driver/share/vulkan/icd.d/nvidia_icd.x86_64.json";
  environment.sessionVariables.GBM_BACKEND = "nvidia-drm";
  environment.sessionVariables.LIBVA_DRIVER_NAME = "nvidia";
  environment.sessionVariables.__GLX_VENDOR_LIBRARY_NAME = "nvidia";

  specialisation = {
    powersave.configuration = {
      system.nixos.tags = ["powersave"]; # this specialisation has the RTD3 issue
      hardware.nvidia = {
        powerManagement.enable = true;
        powerManagement.finegrained = true;
        prime = {
          offload.enable = true;
          offload.enableOffloadCmd = true;
          reverseSync.enable = true;
          intelBusId = "PCI:0:2:0";
          nvidiaBusId = "PCI:1:0:0";
        };
      };
      environment.sessionVariables.AQ_DRM_DEVICES = lib.mkForce "/dev/gpu_intel:/dev/gpu_nvidia";
      environment.sessionVariables.VK_ICD_FILENAMES = lib.mkForce "";
      environment.sessionVariables.GBM_BACKEND = lib.mkForce "";
      environment.sessionVariables.LIBVA_DRIVER_NAME = lib.mkForce "";
      environment.sessionVariables.__GLX_VENDOR_LIBRARY_NAME = lib.mkForce "";
    };
  };
}

Happy new year !

Aetherall avatar Dec 31 '24 16:12 Aetherall

Yeah I'm seeing this on an RTX 4060 mobile. Maybe nvidia-smi --gpu-reset can sort of mitigate this problem?

Kimiblock avatar Jan 18 '25 03:01 Kimiblock

To provide a quick update, I believe I have brainstormed a solution to the problem seen here. I have not yet verified it resolves the issue, but theoretically it should enable the GPU to now sleep when all DRM connectors are disconnected due to hotplugs.

Binary-Eater avatar Jan 25 '25 19:01 Binary-Eater

I am having the same issue. My laptop and linux specs:

OS: Arch Linux x86_64
Host: ROG Zephyrus G15 GA503QS_GA503QS (1.0)
Kernel: Linux 6.13.2-arch1-1
DE: KDE Plasma 6.2.5
WM: KWin (Wayland)
CPU: AMD Ryzen 9 5900HS (16) @ 4.68 
GPU 1: NVIDIA GeForce RTX 3080 Mobile / Max-Q 8GB/16GB [Discrete]
GPU 2: AMD Radeon Vega Series / Radeon Vega Mobile Series [Integrated]

The dGPU can sleep after a reboot, but when I go from plugged into my monitors to unplugged, it won't sleep until I restart the laptop.

HunterWhiteDev avatar Feb 16 '25 20:02 HunterWhiteDev

Will Ampere GPUs enter RTD3 after a sleep/wakeup cycle if:

  • the GPU has been awake after initial power-up
  • no monitor has been attached yet

I know Turing GPUs do not, and was informed this was an issue with the (Turing-specific) firmware used with the open source modules. Would be very satisfying to know that the Turing issue isn't related to firmware which is unlikely to be fixed....

dagbdagb avatar Feb 19 '25 11:02 dagbdagb

Putting the comment here as well from the linked issue.

Hi there, the issue I'm facing has existed since I first got my laptop. Ryzen 9 7945HX + RTX 4060, Lenovo Legion Pro 5.

The problem is that when I try to sleep with an external monitor plugged in (it doesn't matter whether it is the built-in HDMI port or a Type-C to DisplayPort adapter), the monitor signal itself wakes up the laptop again. Other devices power off for a brief second (tested with a mouse and an external SSD), but after a second the machine wakes up.

These are the options:

options nvidia_drm modeset=1
options nvidia_drm fbdev=1
options nvidia NVreg_EnableGpuFirmware=1
options nvidia NVreg_PreserveVideoMemoryAllocations=1
options nvidia NVreg_TemporaryFilePath=/tmp/nvidia
options nvidia NVreg_DynamicPowerManagement=0x02
options nvidia NVreg_DynamicPowerManagementVideoMemoryThreshold=0

To reproduce: plug a monitor into the HDMI port or a Type-C to DisplayPort adapter and try to put the machine to sleep.

(screenshot of acpi_listen output attached)

This is what acpi_listen tells me is happening. The laptop has a MUX switch; the iGPU is disabled in the BIOS.

AleksandarBayrev avatar Feb 19 '25 11:02 AleksandarBayrev

@Binary-Eater Is this still something you're brain storming on solving? Just curious as this issue still bugs me

HunterWhiteDev avatar Jul 26 '25 01:07 HunterWhiteDev

@HunterWhiteDev unfortunately, the changes are somewhat involved. Still working through them.

Does loading nvidia-drm with the fbdev=0 parameter not work around the issue for you in the meantime? If so, the parameter can be persisted across reboots.
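For anyone wanting to persist the workaround across reboots, a minimal sketch of a modprobe configuration file follows (the file name is an arbitrary choice):

```
# /etc/modprobe.d/nvidia-drm-fbdev.conf
# Disable nvidia-drm fbdev emulation as a workaround for the
# framebuffer not being released on monitor unplug
options nvidia-drm fbdev=0
```

Depending on the distribution, the initramfs may need to be regenerated for the option to take effect at boot if nvidia-drm is loaded early.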

Binary-Eater avatar Jul 26 '25 07:07 Binary-Eater

Driver version 535.247.01 on kernel 6.14.6 enables D3cold for Turing graphics. Even after a suspend cycle. I am using these options at the moment:

options nvidia-drm modeset=0
options nvidia NVreg_DeviceFileGID=27
options nvidia NVreg_DeviceFileMode=432
options nvidia NVreg_DeviceFileUID=0
options nvidia NVreg_ModifyDeviceFiles=1
options nvidia NVreg_TemporaryFilePath=/var/tmp
options nvidia NVreg_EnableResizableBar=1
options nvidia NVreg_PreserveVideoMemoryAllocations=1
options nvidia NVreg_DynamicPowerManagement=0x02
options nvidia NVreg_EnableS0ixPowerManagement=0
options nvidia NVreg_UsePageAttributeTable=1
options nvidia NVreg_EnablePCIeGen3=1
options nvidia NVreg_EnableGpuFirmware=0
options nvidia NVreg_OpenRmEnableUnsupportedGpus=1
options nvidia NVreg_NvLinkDisable=1

Edit: I see this comment was unrelated to the version at hand. Apologies, I am subscribed to a number of RTD3 tickets and mistook this for another issue.

I am still leaving my comment in case it is of use to another soul. Never forget denvercoder9.

dagbdagb avatar Jul 26 '25 08:07 dagbdagb

@HunterWhiteDev unfortunately, the changes are somewhat involved. Still working through them.

Does loading nvidia-drm with the fbdev=0 parameter not work around the issue for you in the meantime? If so, the parameter can be persisted across reboots.

This does seem to work for me actually! Haven't tested it too much, but setting fbdev=0 seems to allow my dGPU to suspend when I unplug my monitors. Thanks!

HunterWhiteDev avatar Jul 26 '25 13:07 HunterWhiteDev

Thanks @HunterWhiteDev. The changes I am proposing will enable the same while leaving fbdev=1.

Binary-Eater avatar Jul 26 '25 18:07 Binary-Eater

nvidia-drm.fbdev=0 solves the issue for me as well

mradalbert avatar Aug 10 '25 08:08 mradalbert

True for me too. Though my laptop won't properly suspend anymore and REISUB doesn't work at all.

Kimiblock avatar Aug 10 '25 16:08 Kimiblock

@Binary-Eater Loading nvidia-drm with fbdev=0 works for me too with RTX 5080 laptop GPU and 580 driver. Suspend works normally as well, for some reason. Thanks a lot.

Kelsios avatar Nov 17 '25 14:11 Kelsios

We have a fix for this with fbdev=1 on the way. Sorry this took so long.

Binary-Eater avatar Nov 17 '25 15:11 Binary-Eater