LumixEngine
jobs::trigger crash
Hi nem0, I'm getting some random crashes in the editor. Note that I'm compiling with VS2022 (x64 Debug) and I've modified Lumix a bit, but this looks like a low-level issue and I don't think I have messed it up that much.
Stack traces of 2 cases:
studio.exe!Lumix::jobs::trigger<0>(Lumix::jobs::Signal * signal) Line 139 C++
studio.exe!Lumix::jobs::manage(void * data) Line 351 C++

studio.exe!Lumix::jobs::trigger<0>(Lumix::jobs::Signal * signal) Line 153 C++
studio.exe!Lumix::jobs::manage(void * data) Line 351 C++
The problem is that signal is entirely overwritten with 0xCC. It happens here:
LUMIX_FORCE_INLINE static bool trigger(Signal* signal)
{
    Waitor* waitor;
    {
        Lumix::MutexGuard lock(g_system->m_sync);
        if constexpr (ZERO) {
            signal->counter = 0;
        }
        else {
            --signal->counter;
            ASSERT(signal->counter >= 0);
            if (signal->counter > 0) return false;
        }
        waitor = signal->waitor;
        signal->waitor = nullptr;
    }
    if (!waitor) return false;
After counter goes from 1 to 0, at some random point in the following instructions the memory of waitor/signal is freed and overwritten with 0xCC. On entering the trigger function I made a copy of signal, and that copy is always fine:
[0] 0 '\0' unsigned char <- waitor*
[1] 0 '\0' unsigned char
[2] 0 '\0' unsigned char
[3] 0 '\0' unsigned char
[4] 0 '\0' unsigned char
[5] 0 '\0' unsigned char
[6] 0 '\0' unsigned char
[7] 0 '\0' unsigned char
[8] 1 '\x1' unsigned char <- counter LSB
[9] 0 '\0' unsigned char
[10] 0 '\0' unsigned char
[11] 0 '\0' unsigned char
[12] 245 'õ' unsigned char <- generation LSB
[13] 99 'c' unsigned char
[14] 69 'E' unsigned char
[15] 0 '\0' unsigned char
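To make the dump easier to read, here is a small sketch that decodes those 16 bytes. It assumes the layout implied by the annotations above (an 8-byte waitor pointer, then a 4-byte counter, then a 4-byte generation, all little-endian). The names DecodedSignal, readLE and decode are made up for illustration and are not part of Lumix, and treating generation as unsigned is also an assumption.

```cpp
#include <array>
#include <cstdint>

// Hypothetical mirror of the dumped bytes; not the real Lumix::jobs::Signal.
struct DecodedSignal {
    std::uint64_t waitor;      // bytes [0..7], pointer value
    std::int32_t counter;      // bytes [8..11]
    std::uint32_t generation;  // bytes [12..15], signedness is an assumption
};

// Read n bytes starting at p as a little-endian unsigned integer.
static std::uint64_t readLE(const std::uint8_t* p, int n) {
    std::uint64_t v = 0;
    for (int i = n - 1; i >= 0; --i) v = (v << 8) | p[i];
    return v;
}

DecodedSignal decode(const std::array<std::uint8_t, 16>& b) {
    DecodedSignal d{};
    d.waitor = readLE(b.data(), 8);
    d.counter = static_cast<std::int32_t>(static_cast<std::uint32_t>(readLE(b.data() + 8, 4)));
    d.generation = static_cast<std::uint32_t>(readLE(b.data() + 12, 4));
    return d;
}
```

With the bytes above this gives waitor = 0 (nobody waiting yet), counter = 1 and generation = 0x004563F5, so the signal is valid on entry. The same decode on a block filled with 0xCC gives a non-null waitor of 0xCCCCCCCCCCCCCCCC, which is why the 'if (!waitor) return false;' check does not catch it.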
Going back up the stack to the manage function, I can see the job that was just completed; the job's task is the lambda from jobs::runOnWorkers. When the problem arises, only two runOnWorkers calls are running: "culling" and "create keys". The job's data is in the same memory area as the signal, which again is all 0xCC.
It looks as if the wait in runOnWorkers exits prematurely, runOnWorkers returns, and its local signal is destroyed. A possible cause is the first unprotected 'if (signal->counter == 0)' in waitEx. For example: one thread executes trigger, takes the lock and decrements the counter to zero, and right after that the thread is suspended (for whatever reason). Then a different thread, the one that launched runOnWorkers, executes wait, finds the counter at zero, and returns, destroying the signal while the first thread is still going to read signal->waitor. Does that make sense?
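Here is a minimal sketch of that interleaving, forced deterministically with two flags. It is not Lumix code: Signal, RaceReport and simulateRace are hypothetical names, std::mutex stands in for g_system->m_sync, and the counter is a std::atomic so that the unlocked read in the demo is itself well defined.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical stand-in for the job system's signal; not Lumix's real type.
struct Signal {
    std::atomic<int> counter{1};
};

struct RaceReport {
    bool fast_path_saw_zero;    // unlocked check passed while trigger held the lock
    bool locked_check_blocked;  // a locked check could not proceed at that moment
};

RaceReport simulateRace() {
    std::mutex sync;  // plays the role of g_system->m_sync
    Signal signal;
    std::atomic<bool> decremented{false};
    std::atomic<bool> resume{false};

    // "Trigger" thread: decrements under the lock, then stalls inside the
    // critical section, imitating a thread suspended right after --counter.
    std::thread worker([&] {
        std::lock_guard<std::mutex> lock(sync);
        --signal.counter;
        decremented = true;
        while (!resume) std::this_thread::yield();
        // The real trigger would still read and clear signal->waitor here.
    });

    // "Waiting" thread: runs once the counter has reached zero.
    while (!decremented) std::this_thread::yield();

    RaceReport report{};
    // Unprotected fast path: sees zero although trigger has not finished,
    // so a real waiter would return and destroy its stack-allocated signal.
    report.fast_path_saw_zero = (signal.counter == 0);
    // A check made under the lock cannot pass yet: the mutex is still held.
    if (sync.try_lock()) {
        sync.unlock();
        report.locked_check_blocked = false;
    } else {
        report.locked_check_blocked = true;
    }

    resume = true;  // let the trigger thread leave the critical section
    worker.join();
    assert(signal.counter == 0);
    return report;
}
```

In this sketch both fields of the report come back true: the unlocked check lets the waiter go while trigger is still inside the critical section, whereas a check under the same mutex would have to wait until trigger has finished touching the signal.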
Hi divinon,
you can try to remove the first if in waitEx: 'if (signal->counter == 0) return;' and see if the issue is still there
Yes, so far no more crashes, but it is difficult to say for sure, since the crash was rare and completely random. However, I've also tried the opposite, putting that 'if' in a big for loop, and that way I got a crash within a minute, so for me it is the likely culprit.
After #1450 the crash seems to be gone