egl needs an early out to prevent waking the dGPU unnecessarily

Open flukejones opened this issue 1 year ago • 37 comments

On hybrid laptops from the last two or three years, notably those with NVIDIA RTX 20xx GPUs and newer, the machines tend to have a deeper suspend function which puts the dGPU into a very low power state when unused.

Combined with glvnd, this introduces a lag of 1-2 seconds while the dGPU wakes in response to queries, even if it remains unused and the iGPU is used instead. For example, opening the Nautilus file manager is delayed by 1-2 s while the dGPU wakes. For a lot of apps that use glvnd this ends up being bad UX.

A lot of folks are working around this with __EGL_VENDOR_LIBRARY_FILENAMES=/usr/share/glvnd/egl_vendor.d/50_mesa.json.
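A minimal sketch of that workaround as a reusable shell function. The name `mesa_only` and the `MESA_EGL_JSON` override are hypothetical, and the default path is simply the one quoted above, which may differ between distributions. Note that an app started this way cannot see the NVIDIA EGL vendor library at all.

```shell
# Hypothetical helper: run a command with only the Mesa EGL vendor library
# visible to glvnd. Set MESA_EGL_JSON if the file lives somewhere else.
mesa_only() {
    __EGL_VENDOR_LIBRARY_FILENAMES="${MESA_EGL_JSON:-/usr/share/glvnd/egl_vendor.d/50_mesa.json}" "$@"
}

# Example: mesa_only nautilus
```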

I reported this here some time ago

flukejones avatar Sep 29 '23 09:09 flukejones

yeah, it hurts battery life too

having the gpu wake up and blast its fans every time an app is opened

CosmicFusion avatar Oct 18 '23 18:10 CosmicFusion

This should be fixed by https://github.com/NVIDIA/egl-wayland/commit/ba6c38ad74cf0ef6ec4d7934f68c17a7a2d460ca

erik-kz avatar Oct 18 '23 18:10 erik-kz

This should be fixed by ba6c38a

Seems a bit hit and miss, but this is likely due to how some apps (like Firefox, VS Code, Geary, Evolution) handle GPU stuff. These apps still wake the GPU, but other apps like Nautilus no longer do.

flukejones avatar Oct 19 '23 20:10 flukejones

  • Nautilus 45 still opens the GPU with the latest egl-wayland

FiestaLake avatar Oct 20 '23 04:10 FiestaLake

Seems a bit hit and miss, but this is likely due to how some apps (like Firefox, VS Code, Geary, Evolution) handle GPU stuff. These apps still wake the GPU, but other apps like Nautilus no longer do.

Nautilus 45 still opens the GPU with the latest egl-wayland

I see the same behaviour as in the first comment with applications such as VSCode (even when using the Wayland backend), but not as in the last one: GTK4 apps that were previously problematic, such as Nautilus, no longer start the GPU or show the noticeable spin-up delay. I confirmed this by monitoring the dGPU state using watch cat /sys/class/drm/card*/device/power_state.
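For anyone who wants labelled output instead of the raw `cat`, here is a rough sketch of the same check as a shell function. The function name and the optional directory argument are hypothetical (the argument only exists so the function can be pointed at a test directory); the sysfs path is the one from the command above.

```shell
# Print the power state of every DRM card as "cardN: state".
# The optional first argument replaces /sys/class/drm (hypothetical test hook).
print_gpu_power_states() {
    drm_root="${1:-/sys/class/drm}"
    for state_file in "$drm_root"/card*/device/power_state; do
        # Skip the unexpanded glob pattern when no card exposes power_state.
        [ -r "$state_file" ] || continue
        card_dir="${state_file%/device/power_state}"
        printf '%s: %s\n' "${card_dir##*/}" "$(cat "$state_file")"
    done
}

# Example: poll once per second until interrupted.
# while sleep 1; do print_gpu_power_states; done
```

A sleeping dGPU should show up as D3cold and an awake one as D0, so starting an app while the loop runs shows whether that app woke the card.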

Might be worth mentioning for completeness that if the app in question is running in Flatpak, it's not yet fixed, likely because the newest release of this library hasn't landed in the base runtimes yet.

Gert-dev avatar Oct 20 '23 09:10 Gert-dev

  • Nautilus 45 still opens the GPU with the latest egl-wayland

https://youtu.be/gKYoFEvtUJ4

FiestaLake avatar Oct 20 '23 10:10 FiestaLake

Yeah, anything with Flatpak would need an update to its runtime environment to pick up an updated egl-wayland library.

It might be possible to work around that by using flatpak override --filesystem to map the host's copy of libnvidia-egl-wayland.so.1 through to the container, though at that point it's probably easier to just use the __EGL_VENDOR_LIBRARY_FILENAMES workaround instead.

For other applications, if the app itself (or some other library) tries to call eglQueryDevicesEXT on its own, then it would run into the same problem. Firefox might do that, but I couldn't say for sure -- I think the last time I looked at Firefox's GL code was before Wayland even existed. It would surprise me if something like Geary or Evolution did that, though.

kbrenneman avatar Oct 20 '23 11:10 kbrenneman

Now that I think about it, if an application calls eglGetDisplay(NULL), or eglGetPlatformDisplay with EGL_PLATFORM_DEVICE_EXT or EGL_PLATFORM_SURFACELESS_MESA then that would also cause the NVIDIA GPU to wake up.

All of those would produce a headless EGLDisplay, without a windowing system associated with it. And without a windowing system, the driver has no way to know which device is driving the desktop.

kbrenneman avatar Oct 20 '23 11:10 kbrenneman

https://youtu.be/gKYoFEvtUJ4

That's indeed weird - for me it doesn't bring the dGPU out of the D3Cold state. Since I'm assuming Nautilus isn't the experimental Flatpak version, could it be that you have some kind of specific configuration in place that makes the NVIDIA GPU your primary (card0) one? I notice that for me the NVIDIA dGPU is card1 and the Intel iGPU is card0. Not sure if this has an impact anywhere.

For other applications, if the app itself (or some other library) tries to call eglQueryDevicesEXT on its own, then it would run into the same problem. ...

That indeed makes sense, I assume in these cases we'd need to create the relevant issue reports for those projects separately since this is out of egl-wayland's hands?

Firefox and Electron make some sense because IIRC they also handle some iGPU/dGPU 'placement' for things such as WebGL, so it wouldn't surprise me if the underlying code is also querying the available GPUs for that.

I'm also wondering, though, if these specific remaining issues are then also a problem for hybrid GPU setups with an AMD or even Intel dGPU? I have none to test currently, but it might be interesting to mention in upstream reports and make it more testable for developers.

Gert-dev avatar Oct 20 '23 13:10 Gert-dev

That indeed makes sense, I assume in these cases we'd need to create the relevant issue reports for those projects separately since this is out of egl-wayland's hands?

Most likely, yes. If an app actually does just need to do offscreen rendering, though, then there isn't really a good way to do that without running into this. Either it calls something like eglGetDisplay(NULL) and lets the implementation pick a device (which would result in the NVIDIA driver waking up a GPU), or it would use EGL_EXT_platform_device or EGL_EXT_explicit_device, which would require calling eglQueryDevicesEXT anyway.

I'm also wondering, though, if these specific remaining issues are then also a problem for hybrid GPU setups with an AMD or even Intel dGPU? I have none to test currently, but it might be interesting to mention in upstream reports and make it more testable for developers.

Hard to say. If the driver for the dGPU is Mesa, then it would depend on how Mesa handles device enumeration and selection internally.

kbrenneman avatar Oct 20 '23 14:10 kbrenneman

I wonder if the GPU offloading configuration proposal for libglvnd could help with this?

Most of the design for that would be about right, but I'll have to think about whether I could tweak that interface to avoid unnecessary internal eglQueryDevicesEXT calls.

kbrenneman avatar Oct 20 '23 14:10 kbrenneman

https://youtu.be/gKYoFEvtUJ4

That's indeed weird - for me it doesn't bring the dGPU out of the D3Cold state. Since I'm assuming Nautilus isn't the experimental Flatpak version, could it be that you have some kind of specific configuration in place that makes the NVIDIA GPU your primary (card0) one? I notice that for me the NVIDIA dGPU is card1 and the Intel iGPU is card0. Not sure if this has an impact anywhere.

Yes, it's the native nautilus package from Arch. In my case, most of the time the NVIDIA dGPU is card0 and the AMD iGPU is card1, though sometimes the order is reversed. I haven't made any configuration changes.
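Since the card numbering differs between the machines discussed here, a quick way to see which GPU is card0 and which is card1 is to read the PCI vendor ID from sysfs. The sketch below assumes the usual vendor IDs (0x10de for NVIDIA, 0x8086 for Intel, 0x1002 for AMD); the function names and the optional directory argument are hypothetical.

```shell
# Map a PCI vendor ID, as found in sysfs, to a readable name.
pci_vendor_name() {
    case "$1" in
        0x10de) echo NVIDIA ;;
        0x8086) echo Intel ;;
        0x1002) echo AMD ;;
        *)      echo "unknown ($1)" ;;
    esac
}

# Print "cardN: vendor" for every DRM card.
# The optional first argument replaces /sys/class/drm (hypothetical test hook).
list_card_vendors() {
    drm_root="${1:-/sys/class/drm}"
    for vendor_file in "$drm_root"/card*/device/vendor; do
        [ -r "$vendor_file" ] || continue
        card_dir="${vendor_file%/device/vendor}"
        printf '%s: %s\n' "${card_dir##*/}" "$(pci_vendor_name "$(cat "$vendor_file")")"
    done
}
```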

FiestaLake avatar Oct 21 '23 06:10 FiestaLake

It just occurred to me that the NVIDIA GBM library has the same problem of calling eglQueryDevicesEXT right away to try to find a matching device, so anything that tries to use EGL_KHR_platform_gbm would run into this as well. I'd be surprised if any application actually used both EGL_KHR_platform_gbm and EGL_KHR_platform_wayland, though.

But, disabling one or both of the wayland and GBM platform libraries would be a way to determine if the application is doing something directly to access an NVIDIA device, or if that's still coming from one of the platform libraries.

The __EGL_EXTERNAL_PLATFORM_CONFIG_DIRS and __EGL_EXTERNAL_PLATFORM_CONFIG_FILENAMES environment variables can control which platform libraries get loaded, like so:

# Disable all platform libraries
__EGL_EXTERNAL_PLATFORM_CONFIG_DIRS=/some/nonexistent/path /path/to/program
# Only load the GBM platform library
__EGL_EXTERNAL_PLATFORM_CONFIG_FILENAMES=/usr/share/egl/egl_external_platform.d/15_nvidia_gbm.json /path/to/program
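If the same test needs repeating across several applications, the two invocations can be wrapped in small shell functions. This is only a convenience sketch: the function names are hypothetical, and the environment variables are the ones named above.

```shell
# Run a command with no EGL external platform libraries loaded.
# Pointing the config dir at a path that does not exist leaves nothing to load.
run_without_platform_libs() {
    __EGL_EXTERNAL_PLATFORM_CONFIG_DIRS=/some/nonexistent/path "$@"
}

# Run a command with exactly one platform library config file loaded.
# Usage: run_with_platform_lib /path/to/config.json program [args...]
run_with_platform_lib() {
    platform_json="$1"
    shift
    __EGL_EXTERNAL_PLATFORM_CONFIG_FILENAMES="$platform_json" "$@"
}
```

If the dGPU still wakes with every platform library disabled, the application (or another library it uses) is most likely reaching the NVIDIA device on its own.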

kbrenneman avatar Oct 21 '23 13:10 kbrenneman

I'd be surprised if any application actually used both EGL_KHR_platform_gbm and EGL_KHR_platform_wayland, though.

I believe recent versions of WebKit will do this. The web process uses GBM while the GUI process uses Wayland or X11.

erik-kz avatar Oct 21 '23 19:10 erik-kz

Hi, I've noticed that some apps break when the ICD JSON file order workaround is applied. Either they don't open at all: [image: qflipper]

Or they are partially broken, with some UI elements not displayed: [image: egl_wa]

Temporarily removing the workaround makes everything work again (except that the NVIDIA GPU wakes up): [image: egl_no_wa]

Is it something related to those apps/flatpak runtime? Or is it also a bug in EGL?

marcinx64 avatar Oct 24 '23 12:10 marcinx64

Is it something related to those apps/flatpak runtime? Or is it also a bug in EGL?

That depends -- what are the contents of that egl_vendor.d directory?

kbrenneman avatar Oct 24 '23 13:10 kbrenneman

Right now it looks like this (those are copies from default directory on host):

ls ~/.local/usr/share/glvnd/egl_vendor.d/
50_mesa.json 60_nvidia.json

Basically there is no difference if I use "__EGL_VENDOR_LIBRARY_FILENAMES" and specify mesa ICD json file first, or use "__EGL_VENDOR_LIBRARY_DIRS" and point to another dir with changed filename for nvidia (10_nvidia.json -> 60_nvidia.json), the issue is the same.
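For reference, the directory-based variant of the workaround can be sketched as follows. It assumes the host's files are named 50_mesa.json and 10_nvidia.json as in this thread and that glvnd picks vendor files up in file name order; the function name is hypothetical.

```shell
# Copy the glvnd EGL vendor JSON files into a private directory, renaming the
# NVIDIA file so that it sorts after Mesa's.
# Usage: make_reordered_vendor_dir SRC_DIR DEST_DIR
make_reordered_vendor_dir() {
    src_dir="$1"
    dest_dir="$2"
    mkdir -p "$dest_dir" || return 1
    cp "$src_dir/50_mesa.json" "$dest_dir/50_mesa.json" || return 1
    cp "$src_dir/10_nvidia.json" "$dest_dir/60_nvidia.json" || return 1
}

# Example:
# make_reordered_vendor_dir /usr/share/glvnd/egl_vendor.d "$HOME/.local/usr/share/glvnd/egl_vendor.d"
# __EGL_VENDOR_LIBRARY_DIRS="$HOME/.local/usr/share/glvnd/egl_vendor.d" /path/to/program
```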

marcinx64 avatar Oct 24 '23 13:10 marcinx64

I'd need to know more about what the application is trying to do to be sure, but my best guess is that it's using an offscreen EGLDisplay, but there's something in Mesa that it can't cope with. Calling something like eglGetDisplay(NULL) will generally hand back an EGLDisplay from whatever vendor library is first.

If you use __EGL_VENDOR_LIBRARY_FILENAMES to limit it to only load Mesa, do you get the same problem?

kbrenneman avatar Oct 25 '23 21:10 kbrenneman

If you use __EGL_VENDOR_LIBRARY_FILENAMES to limit it to only load Mesa, do you get the same problem?

Tried, unfortunately it is the same behaviour as using __EGL_VENDOR_LIBRARY_DIRS or __EGL_VENDOR_LIBRARY_FILENAMES "reversed".

I'd need to know more about what the application is trying to do to be sure

I can help with this if I know what you want to check. Is there any specific command output you need? My system is:

  • Fedora Silverblue 39
  • Kernel 6.5.6
  • NVIDIA driver 535.113.01
  • egl-wayland 1.1.12

marcinx64 avatar Oct 27 '23 20:10 marcinx64

Tried, unfortunately it is the same behaviour as using __EGL_VENDOR_LIBRARY_DIRS or __EGL_VENDOR_LIBRARY_FILENAMES "reversed".

That's enough to confirm my guess: With Mesa as the first (or only) vendor library, the application ends up using Mesa, and something in Mesa is either failing, missing, or behaving in a way that the application can't cope with. It's probably either a simple app bug or some feature that the app needs which Mesa doesn't have.

Either way, though, that means the problem is outside egl-wayland or the nvidia driver.

kbrenneman avatar Oct 27 '23 20:10 kbrenneman

Using the search functionality in gnome shell wakes the gpu up. I kid you not.

lmao.webm

The sudden spikes in power consumption I kept experiencing might be explained by this...

jrelvas-ipc avatar Oct 31 '23 16:10 jrelvas-ipc

Using the search functionality in gnome shell wakes the gpu up. I kid you not.

Is that with the current version of egl-wayland?

It wouldn't surprise me if the search function spawned a new wayland client process, and if that's all it is, then commit ba6c38a should fix it.

kbrenneman avatar Oct 31 '23 16:10 kbrenneman

egl-wayland package is version 1.1.12-3.fc39. Is this the latest version?

jrelvas-ipc avatar Oct 31 '23 16:10 jrelvas-ipc

No, 1.1.13 is the one that has the fix for this: https://github.com/NVIDIA/egl-wayland/releases/tag/1.1.13
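A small sketch for checking an installed package version against 1.1.13. It relies on `sort -V` from GNU coreutils and treats everything after the first "-" as a distribution release suffix, which fits the Fedora version strings in this thread but is an assumption for other distributions; both function names are hypothetical.

```shell
# Return success if version $1 is greater than or equal to version $2.
# Relies on "sort -V" (version sort), which GNU coreutils provides.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Return success if a package version string such as "1.1.12-3.fc39" carries an
# upstream version of at least 1.1.13. The release suffix is ignored.
has_dgpu_wake_fix() {
    version_ge "${1%%-*}" 1.1.13
}

# Example (Fedora):
# has_dgpu_wake_fix "$(rpm -q --qf '%{VERSION}' egl-wayland)" && echo "fix included"
```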

kbrenneman avatar Oct 31 '23 16:10 kbrenneman

I can attest to 1.1.13 not fixing GNOME shell (45) search waking up the dGPU for me, but, since GNOME uses search providers (GNOME characters, nautilus, ...), it seems likely that one or more of those providers are contributing to the problem by hitting one of the aforementioned paths (by accident or by underlying code being called indirectly).

Gert-dev avatar Oct 31 '23 18:10 Gert-dev

Using the search functionality of gnome shell no longer wakes up the GPU for me on egl-wayland-1.1.13-1.fc39

Fix appears to work as advertised. @kbrenneman

jrelvas-ipc avatar Nov 15 '23 20:11 jrelvas-ipc

I've reported the wake up issue on Flatpak programs to upstream: https://gitlab.com/freedesktop-sdk/freedesktop-sdk/-/issues/1683

jrelvas-ipc avatar Dec 12 '23 09:12 jrelvas-ipc

I'm also wondering, though, if these specific remaining issues are then also a problem for hybrid GPU setups with an AMD or even Intel dGPU? I have none to test currently, but it might be interesting to mention in upstream reports and make it more testable for developers.

Hard to say. If the driver for the dGPU is Mesa, then it would depend on how Mesa handles device enumeration and selection internally.

For me, nouveau behaves the same as the NVIDIA proprietary driver here (experiencing wakeups with Chromium/-based apps, neofetch, GNOME Settings -> About panel), so it's worth noting that it's an issue on that side of the fence as well.

retrixe avatar Dec 21 '23 20:12 retrixe

https://gitlab.com/freedesktop-sdk/freedesktop-sdk/-/issues/1683#note_1713305231

Freedesktop upstream says that they don't ship egl-wayland separately; the binary provided by the NVIDIA driver package is used, which is currently still at 1.1.12.

This is why flatpak programs continue to be affected by this bug.

jrelvas-ipc avatar Jan 04 '24 03:01 jrelvas-ipc

@erik-kz Is egl-wayland 1.1.13 going to be included with the next nvidia driver major release? If not, is there any timeline to do so? Asking to see if it's worth the trouble for freedesktop's runtime to package it separately.

jrelvas-ipc avatar Jan 12 '24 10:01 jrelvas-ipc