Neko thread usage causes seg faults during global free
Ever since haxelib was updated to use threads on Neko, it has been segfaulting randomly in GitHub Actions, e.g.:
Command: haxelib [git,utest,https://github.com/haxe-utest/utest,master,--always]
Installing utest from https://github.com/haxe-utest/utest branch: master
Library utest current version is now git
Command exited with 139 in 1s: haxelib [git,utest,https://github.com/haxe-utest/utest,master,--always]
Segmentation fault (core dumped)
I haven't been able to reproduce this on any local system, but after some troubleshooting I found that the segfault occurs after the main function has completed: at some point after this call, but before the program exits: https://github.com/HaxeFoundation/neko/blob/master/vm/main.c#L342.
I managed to download and load the core dump, and it shows that the segfault comes from line 46 here: https://github.com/HaxeFoundation/neko/blob/9076cfa9dfd517da128a54fcabee5abe4129790b/vm/callback.c#L44-L48
I later added a printf there and confirmed that, when the segfault happens, vm is a null pointer. Perhaps a finaliser is being called after the main function has already finished?
Full backtrace
Core was generated by `haxelib git utest https://github.com/haxe-utest/utest master --always'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00007f30d8f14ef6 in neko_val_callEx (vthis=0x7f30d782a000, f=0x7f30d8b4d8a0, args=0x7f30d5e3d7f8, nargs=1, exc=0x0)
at /src/vm/callback.c:46
46 /src/vm/callback.c: Bad file descriptor.
[Current thread is 1 (LWP 2473)]
(gdb) bt
#0 0x00007f30d8f14ef6 in neko_val_callEx (vthis=0x7f30d782a000, f=0x7f30d8b4d8a0, args=0x7f30d5e3d7f8, nargs=1, exc=0x0)
at /src/vm/callback.c:46
#1 0x00007f30d8f17818 in neko_interp_loop (vm=0x7f30d77e61c0, m=0x7f30d8b4cea0, _acc=139847740806880, _pc=0x7f30d77109b8)
at /src/vm/interp.c:708
#2 0x00007f30d8f20e24 in neko_interp (vm=0x7f30d77e61c0, _m=0x7f30d8b4cea0, acc=139847740806880, pc=0x7f30d77109b8)
at /src/vm/interp.c:1214
#3 0x00007f30d8f15511 in neko_val_callEx (vthis=0x7f30d914f870 <t_null>, f=0x7f30d6e9b360, args=0x7f30d8b490f8, nargs=1,
exc=0x7f30d5e3dd20) at /src/vm/callback.c:117
#4 0x00007f30d7909af1 in thread_loop (_p=0x7f30d8b490f0) at /src/libs/std/thread.c:237
#5 0x00007f30d8f26456 in ThreadMain (_p=0x7ffd92492990) at /src/vm/threads.c:122
#6 0x00007f30d8f41678 in GC_inner_start_routine () from fs/usr/local/lib/libneko.so.2
#7 0x00007f30d8f3558a in GC_call_with_stack_base () from fs/usr/local/lib/libneko.so.2
#8 0x00007f30d8f3b144 in GC_start_routine () from fs/usr/local/lib/libneko.so.2
#9 0x00007f30d8ed2609 in pwd_traced_file () from fs/lib/x86_64-linux-gnu/libc.so.6
#10 0x0000000000000000 in ?? ()
(gdb) bt full
#0 0x00007f30d8f14ef6 in neko_val_callEx (vthis=0x7f30d782a000, f=0x7f30d8b4d8a0, args=0x7f30d5e3d7f8, nargs=1, exc=0x0)
at /src/vm/callback.c:46
vm = 0x0
old_this = 0x0
old_env = 0x0
ret = 0x0
oldjmp = {{__jmpbuf = {0, 0, 0, 0, 139845314828357, 139847775009936, 7883446016, 16}, __mask_was_saved = -706488560,
__saved_mask = {__val = {1, 139847723636592, 139847774906397, 1, 139847770864720, 17450007603122798595, 139847750572680,
139847723636592, 38654705672, 17450007603122798600, 139847750525952, 139847723636592, 139847774911661,
17450007606711277424, 139847750819840, 139847750572672}}}}
#1 0x00007f30d8f17818 in neko_interp_loop (vm=0x7f30d77e61c0, m=0x7f30d8b4cea0, _acc=139847740806880, _pc=0x7f30d77109b8)
at /src/vm/interp.c:708
_o = 0x7f30d782a000
_arg = 0x1
_f = 0x7f30d8b4d8a0
acc = 1
pc = 0x7f30d76efe28
instructions = {0x7f30d8f170c2 <neko_interp_loop+130>, 0x7f30d8f170dc <neko_interp_loop+156>,
0x7f30d8f170f5 <neko_interp_loop+181>, 0x7f30d8f1710e <neko_interp_loop+206>, 0x7f30d8f17128 <neko_interp_loop+232>,
0x7f30d8f17188 <neko_interp_loop+328>, 0x7f30d8f171ab <neko_interp_loop+363>, 0x7f30d8f171c7 <neko_interp_loop+391>,
0x7f30d8f172d0 <neko_interp_loop+656>, 0x7f30d8f175b4 <neko_interp_loop+1396>, 0x7f30d8f18081 <neko_interp_loop+4161>,
0x7f30d8f18417 <neko_interp_loop+5079>, 0x7f30d8f18430 <neko_interp_loop+5104>, 0x7f30d8f18453 <neko_interp_loop+5139>,
0x7f30d8f1846f <neko_interp_loop+5167>, 0x7f30d8f18578 <neko_interp_loop+5432>, 0x7f30d8f18791 <neko_interp_loop+5969>,
0x7f30d8f18b88 <neko_interp_loop+6984>, 0x7f30d8f18f21 <neko_interp_loop+7905>, 0x7f30d8f18f3e <neko_interp_loop+7934>,
0x7f30d8f18f9e <neko_interp_loop+8030>, 0x7f30d8f19dc2 <neko_interp_loop+11650>, 0x7f30d8f1a804 <neko_interp_loop+14276>,
0x7f30d8f1b24f <neko_interp_loop+16911>, 0x7f30d8f1b264 <neko_interp_loop+16932>, 0x7f30d8f1b28e <neko_interp_loop+16974>,
0x7f30d8f1b2b8 <neko_interp_loop+17016>, 0x7f30d8f1b3c7 <neko_interp_loop+17287>, 0x7f30d8f1b4f6 <neko_interp_loop+17590>,
0x7f30d8f1b5a2 <neko_interp_loop+17762>, 0x7f30d8f1b716 <neko_interp_loop+18134>, 0x7f30d8f1b847 <neko_interp_loop+18439>,
0x7f30d8f1b8df <neko_interp_loop+18591>, 0x7f30d8f1b916 <neko_interp_loop+18646>, 0x7f30d8f1b94d <neko_interp_loop+18701>,
0x7f30d8f1c72d <neko_interp_loop+22253>, 0x7f30d8f1d4d2 <neko_interp_loop+25746>, 0x7f30d8f1e269 <neko_interp_loop+29225>,
0x7f30d8f1e822 <neko_interp_loop+30690>, 0x7f30d8f1f6d2 <neko_interp_loop+34450>, 0x7f30d8f1f910 <neko_interp_loop+35024>,
0x7f30d8f1fb4e <neko_interp_loop+35598>, 0x7f30d8f1fd92 <neko_interp_loop+36178>, 0x7f30d8f1ffb8 <neko_interp_loop+36728>,
0x7f30d8f201de <neko_interp_loop+37278>, 0x7f30d8f20404 <neko_interp_loop+37828>, 0x7f30d8f20487 <neko_interp_loop+37959>,
0x7f30d8f20603 <neko_interp_loop+38339>, 0x7f30d8f20686 <neko_interp_loop+38470>, 0x7f30d8f204fd <neko_interp_loop+38077>,
0x7f30d8f20580 <neko_interp_loop+38208>, 0x7f30d8f1b893 <neko_interp_loop+18515>, 0x7f30d8f20709 <neko_interp_loop+38601>,
0x7f30d8f20743 <neko_interp_loop+38659>, 0x7f30d8f20808 <neko_interp_loop+38856>, 0x7f30d8f20911 <neko_interp_loop+39121>,
0x7f30d8f20943 <neko_interp_loop+39171>, 0x7f30d8f18fe0 <neko_interp_loop+8096>, 0x7f30d8f17161 <neko_interp_loop+289>,
0x7f30d8f17174 <neko_interp_loop+308>, 0x7f30d8f179a7 <neko_interp_loop+2407>, 0x7f30d8f17d10 <neko_interp_loop+3280>,
0x7f30d8f207c1 <neko_interp_loop+38785>, 0x7f30d8f1929a <neko_interp_loop+8794>, 0x7f30d8f20980 <neko_interp_loop+39232>,
0x7f30d8f1b7a2 <neko_interp_loop+18274>, 0x7f30d8f1713e <neko_interp_loop+254>, 0x7f30d8f2098f <neko_interp_loop+39247>}
sp = 0x7f30d6eab7a8
csp = 0x7f30d6eab058
#2 0x00007f30d8f20e24 in neko_interp (vm=0x7f30d77e61c0, _m=0x7f30d8b4cea0, acc=139847740806880, pc=0x7f30d77109b8)
at /src/vm/interp.c:1214
sp = 0x7f30d6eab768
csp = 0x7f30d6eab078
trap = 0x7f30d6eab738
init_sp = 7
m = 0x7f30d8b4cea0
old = {{__jmpbuf = {0, 4064061087093578727, 140727057721422, 140727057721423, 140727057721680, 139847723638720,
4064061087267642343, 4064050217118686183}, __mask_was_saved = 0, __saved_mask = {__val = {0 <repeats 16 times>}}}}
#3 0x00007f30d8f15511 in neko_val_callEx (vthis=0x7f30d914f870 <t_null>, f=0x7f30d6e9b360, args=0x7f30d8b490f8, nargs=1,
exc=0x7f30d5e3dd20) at /src/vm/callback.c:117
n = 1
vm = 0x7f30d77e61c0
old_this = 0x7f30d914f870 <t_null>
old_env = 0x7f30d914eee0 <empty_array>
ret = 0x7f30d914f870 <t_null>
oldjmp = {{__jmpbuf = {0, 0, 0, 0, 0, 0, 0, 0}, __mask_was_saved = 0, __saved_mask = {__val = {0 <repeats 16 times>}}}}
#4 0x00007f30d7909af1 in thread_loop (_p=0x7f30d8b490f0) at /src/libs/std/thread.c:237
p = 0x7f30d8b490f0
exc = 0x0
#5 0x00007f30d8f26456 in ThreadMain (_p=0x7ffd92492990) at /src/vm/threads.c:122
lp = 0x7ffd92492990
p = {init = 0x7f30d7909a1b <thread_init>, main = 0x7f30d7909a99 <thread_loop>, param = 0x7f30d8b490f0, lock = {__data = {
__lock = 2, __count = 0, __owner = 2429, __nusers = 1, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0,
__next = 0x0}}, __size = "\002\000\000\000\000\000\000\000}\t\000\000\001", '\000' <repeats 26 times>, __align = 2}}
#6 0x00007f30d8f41678 in GC_inner_start_routine () from fs/usr/local/lib/libneko.so.2
No symbol table info available.
#7 0x00007f30d8f3558a in GC_call_with_stack_base () from fs/usr/local/lib/libneko.so.2
No symbol table info available.
#8 0x00007f30d8f3b144 in GC_start_routine () from fs/usr/local/lib/libneko.so.2
No symbol table info available.
#9 0x00007f30d8ed2609 in pwd_traced_file () from fs/lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#10 0x0000000000000000 in ?? ()
No symbol table info available.
Here is the code in haxelib that uses threads: https://github.com/HaxeFoundation/haxelib/blob/4.1.x/src/haxelib/client/Vcs.hx#L162-L177
We just had a similar crash on Windows, so it looks like it's not specific to Linux:
Command: haxelib [git,utest,https://github.com/haxe-utest/utest,master,--always]
Installing utest from https://github.com/haxe-utest/utest branch: master
Library utest current version is now git
Command exited with -1073741819 in 3s: haxelib [git,utest,https://github.com/haxe-utest/utest,master,--always]
Interpreted as an unsigned 32-bit value, -1073741819 is 0xC0000005, which is STATUS_ACCESS_VIOLATION: https://learn.microsoft.com/en-us/openspecs/windows_protocols/ms-erref/596a1078-e883-4972-9bbc-49e60bebca55
This sample seems to reproduce the segfault some of the time, at least on my Windows machine:
function main() {
	final streamsLock = new sys.thread.Lock();
	sys.thread.Thread.create(function() {
		Sys.sleep(0.2);
		streamsLock.release();
	});
	sys.thread.Thread.create(function() {
		Sys.sleep(0.2);
		streamsLock.release();
	});
	streamsLock.wait();
	streamsLock.wait();
}
On Windows, the above sample also sometimes causes this popup:
Here is a Haxe sample that reproduces the segfault more reliably:
function main() {
	sys.thread.Thread.create(function() {
		while (true) {
			trace("Hello 1");
		}
	});
	sys.thread.Thread.create(function() {
		while (true) {
			trace("Hello 2");
		}
	});
}
On Windows, the above sample also sometimes causes this popup:
It looks like this happens because the thread is deleted by DllMain: https://github.com/ivmai/bdwgc/blob/2558568aceaf7fc5cc64cf87e244cbcfd7f9bd53/win32_threads.c#L3009
Somehow this happens at the same time as the GC_gcollect call within neko_gc_major() while Neko is shutting down, which also tries to access the same thread in order to suspend it.
See separate issue: #303