Resizing Wayland windows is either laggy or causes the application to freeze on GNOME + AMD

Open Friz64 opened this issue 1 year ago • 11 comments

Here's a recording:

https://github.com/glfw/glfw/assets/20155479/5a096b57-586a-46c7-bb2d-666b66f83d61

This video shows the following tests:

  1. Rapidly resizing a non-GLFW Wayland window, in this case my terminal, to demonstrate expected behavior
  2. Rapidly resizing the triangle-vulkan test, which shows the window lagging significantly behind the mouse cursor while resizing
  3. Rapidly resizing the title test, which shows the window becoming increasingly laggy until the GNOME shell crashes

The larger the window size, the stronger the effect.

My GPU is the AMD Radeon RX 580.

I am running Arch Linux, although I also reproduced this on an OpenSUSE Tumbleweed GNOME live-boot.

I couldn't reproduce this on an OpenSUSE Tumbleweed KDE Plasma live-boot, nor on my Fedora laptop running GNOME Wayland, which has Intel integrated graphics.

Friz64 avatar Feb 21 '24 23:02 Friz64

I saw something like point 3 before the recent changes but have been unable to reproduce it since. If possible, please run the title test with WAYLAND_DEBUG=1 and post the log of it below.

elmindreda avatar Feb 23 '24 10:02 elmindreda

Done! title-wayland-debug.txt

journalctl shows this:

Feb 23 14:46:31 arch gnome-shell[63235]: WL: error in client communication (pid 65121)
Feb 23 14:46:31 arch kernel: gnome-shell[63235]: segfault at 10 ip 0000791274a80c44 sp 00007ffe902d18c8 error 4 in libwayland-server.so.0.22.0[791274a7e000+8000] likely on CPU 15 (core 3, socket 0)
Feb 23 14:46:31 arch kernel: Code: 00 00 0f 1f 40 00 f3 0f 1e fa 66 48 0f 6e c7 0f 16 47 08 0f 11 06 48 89 77 08 48 8b 46 08 48 89 30 c3 0f 1f 40 00 f3 0f 1e fa <8b> 47 10 48 8b 57 40 3d ff ff ff fe 77 2e 48 83 c2 30 48 8b 4a 10
Feb 23 14:46:31 arch systemd[1]: Started Process Core Dump (PID 65144/UID 0).
Feb 23 14:46:32 arch systemd-coredump[65145]: [🡕] Process 63235 (gnome-shell) of user 1000 dumped core.

Friz64 avatar Feb 23 '24 13:02 Friz64

Oh wow, the libdecor demo application also shows this laggy resize behavior. And indeed, disabling libdecor makes this problem completely disappear. https://gitlab.freedesktop.org/libdecor/libdecor/-/issues/37

I wonder though, Blender also uses libdecor and does not have this problem. Why?

Friz64 avatar Feb 25 '24 01:02 Friz64

Whenever glfwWindowShouldClose(window) is called, it causes the entire DE to crash.

Example Source Code
#include <stdio.h>
#include <GLFW/glfw3.h>

int main() {
    if (!glfwInit()) {
        return -1;
    }

    glfwWindowHint(GLFW_CONTEXT_VERSION_MAJOR, 3);
    glfwWindowHint(GLFW_CONTEXT_VERSION_MINOR, 2);
    glfwWindowHint(GLFW_OPENGL_PROFILE, GLFW_OPENGL_COMPAT_PROFILE);

    GLFWwindow* window = glfwCreateWindow(800, 600, "Hello, World!", NULL, NULL);

    if (!window) {
        printf("Failed to create a window\n");
        glfwTerminate();
        return -1;
    }

    glfwMakeContextCurrent(window);

    while (!glfwWindowShouldClose(window)) {
        glfwSwapBuffers(window);
        glfwPollEvents();
    }

    glfwTerminate();

    return 0;
}

Then build with

gcc examples/test_glfw_window.c -o examples/test_glfw_window -lglfw

Then run the binary

./examples/test_glfw_window

All I do is run the binary, then close the window, and the entire environment crashes. This happens every time I click the "x" button on the top right of the window without fail.

Checking journalctl reveals the following around the time of the crash.

journalctl crash output
Apr 07 21:03:51 spectra kernel: gnome-shell[18682]: segfault at 18 ip 00007162a4abd84c sp 00007ffc376b14d0 error 4 in libmutter-14.so.0.0.0[7162a4a3b000+19c000] likely on CPU 14 (core 6, socket 0)
Apr 07 21:03:51 spectra kernel: Code: 13 00 ff 15 66 13 1c 00 e9 93 dd ff ff 49 8b 44 24 28 48 89 85 f0 fe ff ff 48 85 c0 0f 84 e1 f9 ff ff 48 89 c7 e8 e4 2c 0b 00 <48> 8b 78 18 49 89 c4 e8 d8 ab 10 00 48 8b b5 38 ff ff ff 48 89 c7
Apr 07 21:03:51 spectra systemd[1]: Started Process Core Dump (PID 42452/UID 0).

The remaining stack trace depends on what I'm doing at the time, even though the crash is rarely related to or directly caused by that activity. The end of the stack trace depends on how it began.

journalctl end of stack trace
Stack trace of thread 19637:
#0  0x00007162a49190bf __poll (libc.so.6 + 0xfb0bf)
#1  0x000071620dba49b7 n/a (libpulse.so.0 + 0x339b7)
#2  0x000071620db8e45c pa_mainloop_poll (libpulse.so.0 + 0x1d45c)
#3  0x000071620db9861c pa_mainloop_iterate (libpulse.so.0 + 0x2761c)
#4  0x000071620db986d1 pa_mainloop_run (libpulse.so.0 + 0x276d1)
#5  0x000071620dba8bf2 n/a (libpulse.so.0 + 0x37bf2)
#6  0x000071620db462b7 n/a (libpulsecommon-17.0.so + 0x5c2b7)
#7  0x00007162a48a955a n/a (libc.so.6 + 0x8b55a)
#8  0x00007162a4926a3c n/a (libc.so.6 + 0x108a3c)
ELF object binary architecture: AMD x86-64

The last line is always the same.

ELF object binary architecture: AMD x86-64

Resizing the window causes artifacts and occasionally affects performance. I also get libdecor-related complaints.

libdecor warning
21:56:58 | ~/Local/learn-opengl
 git:(main | Δ) λ ./examples/test_glfw_window
libdecor-gtk-WARNING: Failed to initialize GTK
Failed to load plugin 'libdecor-gtk.so': failed to init
^C

The only time I can exit the application safely is when I use ^C to signal an interrupt.

I think this issue may be related to the GPU driver. I'm not sure how any of this works, as it is completely outside my experience.

Hardware Specs
21:50:01 | ~
  λ neofetch --color_blocks off --backend off
austin@spectra 
-------------- 
OS: Arch Linux x86_64 
Host: B650M AORUS ELITE AX 
Kernel: 6.6.25-1-lts 
Uptime: 3 hours, 49 mins 
Packages: 1566 (pacman) 
Shell: zsh 5.9 
Resolution: 1920x1080 
DE: GNOME 46.0 
WM: Mutter 
WM Theme: Adwaita 
Theme: Adwaita [GTK2/3] 
Icons: Adwaita [GTK2/3] 
Terminal: gnome-terminal 
CPU: AMD Ryzen 7 7700X (16) @ 5.573GHz 
GPU: AMD ATI 10:00.0 Raphael 
GPU: AMD ATI Radeon RX 470/480/570/570X/580/580X/590 
Memory: 57877MiB / 127957MiB

Let me know if any other information might help. I'm also willing to try other steps to isolate and diagnose the issue.

glxinfo -B output
22:15:19 | ~
  λ glxinfo -B                                                         
name of display: :0
display: :0  screen: 0
direct rendering: Yes
Extended renderer info (GLX_MESA_query_renderer):
    Vendor: AMD (0x1002)
    Device: AMD Radeon RX 580 Series (radeonsi, polaris10, LLVM 17.0.6, DRM 3.54, 6.6.25-1-lts) (0x67df)
    Version: 24.0.4
    Accelerated: yes
    Video memory: 8192MB
    Unified memory: no
    Preferred profile: core (0x1)
    Max core profile version: 4.6
    Max compat profile version: 4.6
    Max GLES1 profile version: 1.1
    Max GLES[23] profile version: 3.2
Memory info (GL_ATI_meminfo):
    VBO free memory - total: 7499 MB, largest block: 7499 MB
    VBO free aux. memory - total: 63875 MB, largest block: 63875 MB
    Texture free memory - total: 7499 MB, largest block: 7499 MB
    Texture free aux. memory - total: 63875 MB, largest block: 63875 MB
    Renderbuffer free memory - total: 7499 MB, largest block: 7499 MB
    Renderbuffer free aux. memory - total: 63875 MB, largest block: 63875 MB
Memory info (GL_NVX_gpu_memory_info):
    Dedicated video memory: 8192 MB
    Total available memory: 72170 MB
    Currently available dedicated video memory: 7499 MB
OpenGL vendor string: AMD
OpenGL renderer string: AMD Radeon RX 580 Series (radeonsi, polaris10, LLVM 17.0.6, DRM 3.54, 6.6.25-1-lts)
OpenGL core profile version string: 4.6 (Core Profile) Mesa 24.0.4-arch1.2
OpenGL core profile shading language version string: 4.60
OpenGL core profile context flags: (none)
OpenGL core profile profile mask: core profile

OpenGL version string: 4.6 (Compatibility Profile) Mesa 24.0.4-arch1.2
OpenGL shading language version string: 4.60
OpenGL context flags: (none)
OpenGL profile mask: compatibility profile

OpenGL ES profile version string: OpenGL ES 3.2 Mesa 24.0.4-arch1.2
OpenGL ES profile shading language version string: OpenGL ES GLSL ES 3.20

teleprint-me avatar Apr 08 '24 02:04 teleprint-me

@Friz64 @teleprint-me Can you test your issues with mutter-git from the AUR? This will be equivalent to upgrading to Mutter 46.1 which comes out at the end of this week. I think I ran into similar issues with GLFW and updating mutter fixed it for me.

Geo25rey avatar Apr 14 '24 21:04 Geo25rey

It still behaves the same way, but instead of my desktop crashing, the GLFW window now just disappears and the application freezes.

Friz64 avatar Apr 14 '24 21:04 Friz64

@Geo25rey

Sorry for the delay. I've been busy.

  λ pacman -Ss mutter
# ...omitting packages for brevity
extra/mutter 46.1-1 [installed]
    Window manager and compositor for GNOME

I will test when I have some time.

teleprint-me avatar May 08 '24 22:05 teleprint-me

After upgrading to GNOME and mutter 46.1, it still seems to be broken for me.

Geo25rey avatar May 10 '24 14:05 Geo25rey

I am experiencing the same issue now. Arch Linux, GNOME (Wayland) 46.1-2, glfw 3.4-2. Dual-graphics laptop with Intel and Nvidia GPUs. When running on the Intel card it just stutters and loads the CPU to 100% (+50% from gnome-shell). But when I run it on Nvidia (prime-run or __NV_PRIME_RENDER_OFFLOAD=1 __GLX_VENDOR_LIBRARY_NAME=nvidia) and try to resize the window, it freezes and performs the resize only after a considerable delay.

I use a two-screen setup. With only one screen things go much better, but the error is still present.

Test program:

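#include <stdio.h>
#include <GLFW/glfw3.h>

int main()
{
	/* Bail out early if GLFW cannot be initialized */
	if (!glfwInit())
		return 1;

	GLFWwindow *window = glfwCreateWindow(800, 600, "OpenGL", NULL, NULL);
	if (!window)
	{
		fprintf(stderr, "Failed to create a window\n");
		glfwTerminate();
		return 1;
	}
	glfwMakeContextCurrent(window);

	/* Minimal loop: no drawing, just event polling and buffer swaps */
	while (!glfwWindowShouldClose(window))
	{
		glfwPollEvents();
		glfwSwapBuffers(window);
	}

	glfwTerminate();

	return 0;
}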

Video reproduction (2 screens): Screencast from 2024-05-23 21-28-42.webm

And with 1 screen: Screencast from 2024-05-23 21-33-45.webm

With 1 screen and Intel graphics everything is smooth, but CPU load is 55% (+55% from gnome-shell).

My screen is 144 Hz; glfwSwapInterval(0) does not fix the problem.

If I call glfwSwapBuffers(window) inside framebufferSizeCallback the window resizes more smoothly, but I can't control it. Screencast from 2024-05-23 22-09-10.webm

Waujito avatar May 23 '24 18:05 Waujito

I have this problem too (on an Intel+Nvidia laptop). When I resize a window that is rendered on the discrete GPU, it is very laggy. On Intel everything works well.

maksmakuta avatar May 31 '24 09:05 maksmakuta

Same on Intel UHD, and libdecor has nothing to do with it: I am able to reproduce it on triangle-vulkan with libdecor disabled. I suspect GLFW may be using Wayland protocols incorrectly or suboptimally somewhere.

On a different note, when I look at the perf dump while doing a bunch of resizing, it shows that most time is spent inside the i915 kernel driver, doing some kind of memory management while trying to submit a Vulkan command buffer from inside the GLFW refresh event. Note that the same kind of heavy memory management does not occur when drawing normally. Hence, this might also be a kernel driver bug. Or maybe yet another consequence of the lack of explicit sync on Linux?

Would love to see how perf flamegraphs look for other GPU vendors though.

Finally, thinking about it some more, this might be an entire set of completely different bugs. Libdecor is very slow, and when I was looking at flamegraphs with it enabled, it was eating up half the redraw time. With it disabled, the i915 driver eats up most of the time. Maybe every single report here is basically caused by the Wayland-characteristic client-side redraw being slow, but the reason for it being slow is different for every one of the reports?

P.S. I also observed triangle-vulkan crash once, so GLFW might be a little bit at fault here after all.

Mrkol avatar Aug 07 '24 00:08 Mrkol