Editor Hangs when changing "content" script (on Linux).

Open ricejasonf opened this issue 1 year ago • 11 comments

Hi, I am not certain that this is specific to Linux, but when I load different "content" scripts in the editor, the application sometimes hangs, and sometimes it won't even respond to signals (i.e. I have to kill -9 the process). I tried it in debug mode and found the problem point.

7238     // Initiate stalling CPU when GPU is not yet finished with next frame:
7239     if (FRAMECOUNT >= BUFFERCOUNT)
7240     {
7241       const uint32_t bufferindex = GetBufferIndex();
7242       for (int queue = 0; queue < QUEUE_COUNT; ++queue)
7243       {
7244         if (frame_fence[bufferindex][queue] == VK_NULL_HANDLE)
7245           continue;
7246 
7247         res = vkWaitForFences(device, 1, &frame_fence[bufferindex][queue], VK_TRUE, 0xFFFFFFFFFFFFFFFF);
7248         assert(res == VK_SUCCESS);
7249 
7250         res = vkResetFences(device, 1, &frame_fence[bufferindex][queue]);
7251         assert(res == VK_SUCCESS);
7252       }
7253     }

The call to vkWaitForFences hangs. I am new to this API (and modern graphics in general), but I see that the timeout is very large. Is this the right way to handle "CPU stalling"? From what I have been googling, I think this could at least loop on VK_TIMEOUT with a reasonably small timeout. Also, here is the call stack from when I was able to stop the process:

* thread #1, name = 'WickedEngineEdi', stop reason = signal SIGSTOP
  * frame #0: 0x00007ffff791d9ed libc.so.6`__poll + 77
    frame #1: 0x00007fffda007cc3 libnvidia-glcore.so.550.78`___lldb_unnamed_symbol36082 + 147
    frame #2: 0x00007fffda422f59 libnvidia-glcore.so.550.78`___lldb_unnamed_symbol44349 + 73
    frame #3: 0x00007fffda407950 libnvidia-glcore.so.550.78`___lldb_unnamed_symbol44160 + 672
    frame #4: 0x00007fffda3239ae libnvidia-glcore.so.550.78`___lldb_unnamed_symbol42754 + 30
    frame #5: 0x0000555555c8668b WickedEngineEditor`wi::graphics::GraphicsDevice_Vulkan::SubmitCommandLists(this=0x000055555705a380) at wiGraphicsDevice_Vulkan.cpp:7247:26
    frame #6: 0x0000555555babf01 WickedEngineEditor`wi::Application::Run(this=0x00007fffff8d4990) at wiApplication.cpp:252:37
    frame #7: 0x00005555555b4661 WickedEngineEditor`sdl_loop(editor=0x00007fffff8d4990) at main_SDL2.cpp:16:19
    frame #8: 0x00005555555b4ce0 WickedEngineEditor`main(argc=1, argv=0x00007fffffffe818) at main_SDL2.cpp:162:23
    frame #9: 0x00007ffff7841d4a libc.so.6`___lldb_unnamed_symbol3264 + 122
    frame #10: 0x00007ffff7841e0c libc.so.6`__libc_start_main + 140
    frame #11: 0x00005555555b4285 WickedEngineEditor`_start + 37

I will play with this more next week, but I thought I would wait for some feedback on the intent with the large timeout.

Thanks.

EDIT: It occurred to me that maybe it is stuck in some loop and it just happens to always break while the process is waiting on that line (7247).

ricejasonf avatar Jun 02 '24 01:06 ricejasonf

Hi, the "infinite" timeout is there for a purpose: it would be invalid to go further while the GPU is not finished with the frame we are waiting on. Could you make sure that you have updated graphics drivers?

turanszkij avatar Jun 02 '24 03:06 turanszkij

I did a full update and verified I have the latest driver, and I was able to get it to freeze again immediately (loading scripts under "Content").

local/nvidia 550.78-7
    NVIDIA drivers for linux

https://archlinux.org/packages/extra/x86_64/nvidia/

ricejasonf avatar Jun 02 '24 04:06 ricejasonf

@ricejasonf Wicked recently updated the dxcompiler to the May version, and that seems to be broken on Linux (#856) and caused all kinds of weird issues on various graphics drivers. It has been reverted to the previous version, can you update to master and give it another try?

brakhane avatar Jun 03 '24 11:06 brakhane

Sorry, but the problem still persists. It does not happen every time, but it still definitely freezes when loading a script.

ricejasonf avatar Jun 03 '24 17:06 ricejasonf

Did you delete the shaders/spirv directory just to make sure no compiled shaders from the dxcompiler remain?

brakhane avatar Jun 03 '24 17:06 brakhane

I deleted the entire build directory. If that is where they are located, then yes. (I am on the Discord if that is easier for back and forth stuff.)

ricejasonf avatar Jun 03 '24 18:06 ricejasonf

I can confirm that it is in fact getting stuck in that vkWaitForFences call. Consider the following small alteration to the point of interest:


7247         while (true) {
7248           res = vkWaitForFences(device, 1, &frame_fence[bufferindex][queue],
7249                                 VK_TRUE, uint64_t{10000000000});
7250           if (res == VK_SUCCESS) break;
7251           assert(res == VK_SUCCESS);
7252         }

Attempting to reproduce the error results in hitting the assert after 10 seconds of blank screen.

WickedEngineEditor: /home/jason/Projects/WickedEngine/WickedEngine/wiGraphicsDevice_Vulkan.cpp:7251: virtual void wi::graphics::GraphicsDevice_Vulkan::SubmitCommandLists(): Assertion `res == VK_SUCCESS' failed.
Aborted (core dumped)

It would be nice to find the bug, but I think there is also an opportunity for graceful error handling here.
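
A minimal sketch of what that graceful handling could look like, under stated assumptions and not WickedEngine's actual code: the fence wait is done in short slices, and when a total budget is exceeded the caller gets back which queue stalled instead of hitting an assert. `WaitResult`, `WaitFn`, `StallReport`, and `wait_all_queues` are hypothetical names invented for illustration; in the engine the callback would wrap `vkWaitForFences` with a finite timeout. A report is meant for logging and a clean shutdown, since continuing with the frame while the GPU is still busy would be invalid.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <optional>

// Stand-in for the VkResult values of interest (hypothetical, not the Vulkan enum).
enum class WaitResult { Success, Timeout, DeviceLost };

// Hypothetical wrapper: wait on one queue's fence for at most timeout_ns.
// In the engine this would call vkWaitForFences(device, 1, &fence, VK_TRUE, timeout_ns).
using WaitFn = std::function<WaitResult(int queue, uint64_t timeout_ns)>;

struct StallReport {
    int queue;           // queue index whose fence did not signal
    WaitResult last;     // last result seen (Timeout or DeviceLost)
    uint64_t waited_ns;  // time budget consumed on that queue
};

// Waits on every queue in short slices. Returns std::nullopt when all fences
// signalled, or a report naming the first queue that failed or ran out of budget.
std::optional<StallReport> wait_all_queues(const WaitFn& wait, int queue_count,
                                           uint64_t slice_ns, uint64_t budget_ns)
{
    assert(slice_ns > 0 && "a zero slice would spin without waiting");
    for (int queue = 0; queue < queue_count; ++queue) {
        uint64_t waited = 0;
        for (;;) {
            const WaitResult res = wait(queue, slice_ns);
            if (res == WaitResult::Success) break;   // this queue is done
            if (res == WaitResult::Timeout) {
                waited += slice_ns;
                if (waited < budget_ns) continue;    // still within budget, wait again
            }
            // Budget exhausted or a hard error: report instead of asserting.
            return StallReport{queue, res, waited};
        }
    }
    return std::nullopt;
}
```

With a callback that never signals queue 3, a call such as `wait_all_queues(cb, 4, 1000, 5000)` returns a report with `queue == 3`, which the caller could log before shutting down cleanly.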

ricejasonf avatar Jun 03 '24 18:06 ricejasonf

I realized that this is a duplicate of #804.

ricejasonf avatar Jun 04 '24 18:06 ricejasonf

Can you confirm that the hang always happens when queue is 3 (QUEUE_VIDEO_DECODE)? And never with any other value?

brakhane avatar Jun 04 '24 18:06 brakhane

I tried it several times and the value for queue was consistently 3. So, yes, that looks like the enum value for QUEUE_VIDEO_DECODE as you stated.

ricejasonf avatar Jun 04 '24 19:06 ricejasonf

When resizing the widget window for the entity component system, I can reproduce this very quickly just by wagging it back and forth. It is still always queue == 3.

ricejasonf avatar Jun 05 '24 01:06 ricejasonf

~~Duplicate of #804~~

Edit: I decided to mark #804 as a duplicate; even though it's the older one, most information is in this issue.

brakhane avatar Jan 13 '25 22:01 brakhane

This should be fixed now.

turanszkij avatar Mar 19 '25 16:03 turanszkij