
Neko thread usage causes seg faults during global free

Open tobil4sk opened this issue 2 years ago • 2 comments

Ever since haxelib was updated to use threads on neko, it has been segfaulting randomly in GitHub Actions, e.g.:

Command: haxelib [git,utest,https://github.com/haxe-utest/utest,master,--always]
Installing utest from https://github.com/haxe-utest/utest branch: master
Library utest current version is now git
Command exited with 139 in 1s: haxelib [git,utest,https://github.com/haxe-utest/utest,master,--always]
Segmentation fault (core dumped)
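As a side note on reading the log above: the exit status 139 is consistent with the common shell convention of reporting a signal death as 128 plus the signal number, which gives 11 (SIGSEGV on Linux). A minimal sketch of that decoding follows. The helper names are hypothetical, and it assumes the CI runner reports statuses using that convention.

```c
#include <signal.h>

/* Shell convention (assumed here): a process killed by signal N is
   reported with status 128 + N. Returns the signal number, or 0 when
   the status looks like an ordinary exit code. */
int signal_from_shell_status(int status) {
    return (status > 128 && status < 256) ? status - 128 : 0;
}

/* Returns 1 if the status corresponds to death by SIGSEGV. */
int is_segfault_status(int status) {
    return signal_from_shell_status(status) == SIGSEGV;
}
```

Note that a program can also exit normally with a code above 128, so this is a heuristic; here it agrees with the "Segmentation fault (core dumped)" message.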

I haven't been able to reproduce this on any local system, but some troubleshooting showed that the seg fault occurs after the main function has completed, at some point after this call but before the program exits: https://github.com/HaxeFoundation/neko/blob/master/vm/main.c#L342.

I managed to download the core dump and load it, and it says that the seg fault comes from line 46 here: https://github.com/HaxeFoundation/neko/blob/9076cfa9dfd517da128a54fcabee5abe4129790b/vm/callback.c#L44-L48

I later added a printf here and confirmed that vm is a null pointer at the time of the segfault. Perhaps a finaliser is being called after the main function has already finished?
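To illustrate the symptom (not neko's actual code), here is a minimal sketch of a per-thread VM pointer that a call path reads before use. The type fake_vm and the functions bind_vm and guarded_call are hypothetical stand-ins, and the idea that the slot gets cleared during shutdown is an assumption that the issue does not confirm. The sketch only shows that a thread-local lookup returns NULL once the slot is empty, and that dereferencing it without a check would crash.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for interpreter state; not neko's real neko_vm. */
typedef struct { int run_count; } fake_vm;

static pthread_key_t vm_key;
static pthread_once_t vm_key_once = PTHREAD_ONCE_INIT;

static void make_vm_key(void) {
    pthread_key_create(&vm_key, NULL);
}

/* Bind a VM to the calling thread, or unbind it by passing NULL. */
void bind_vm(fake_vm *vm) {
    pthread_once(&vm_key_once, make_vm_key);
    pthread_setspecific(vm_key, vm);
}

/* Returns 1 if the call ran, 0 if the calling thread has no VM bound.
   Without the NULL check, vm->run_count would dereference a null pointer. */
int guarded_call(void) {
    pthread_once(&vm_key_once, make_vm_key);
    fake_vm *vm = pthread_getspecific(vm_key);
    if (vm == NULL)
        return 0;
    vm->run_count++;
    return 1;
}
```

Silently skipping the call is only a way to make the condition observable; whether that would be an acceptable fix in the VM is a separate question.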

Full backtrace
Core was generated by `haxelib git utest https://github.com/haxe-utest/utest master --always'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f30d8f14ef6 in neko_val_callEx (vthis=0x7f30d782a000, f=0x7f30d8b4d8a0, args=0x7f30d5e3d7f8, nargs=1, exc=0x0)
    at /src/vm/callback.c:46
46	/src/vm/callback.c: Bad file descriptor.
[Current thread is 1 (LWP 2473)]
(gdb) bt
#0  0x00007f30d8f14ef6 in neko_val_callEx (vthis=0x7f30d782a000, f=0x7f30d8b4d8a0, args=0x7f30d5e3d7f8, nargs=1, exc=0x0)
    at /src/vm/callback.c:46
#1  0x00007f30d8f17818 in neko_interp_loop (vm=0x7f30d77e61c0, m=0x7f30d8b4cea0, _acc=139847740806880, _pc=0x7f30d77109b8)
    at /src/vm/interp.c:708
#2  0x00007f30d8f20e24 in neko_interp (vm=0x7f30d77e61c0, _m=0x7f30d8b4cea0, acc=139847740806880, pc=0x7f30d77109b8)
    at /src/vm/interp.c:1214
#3  0x00007f30d8f15511 in neko_val_callEx (vthis=0x7f30d914f870 <t_null>, f=0x7f30d6e9b360, args=0x7f30d8b490f8, nargs=1,
    exc=0x7f30d5e3dd20) at /src/vm/callback.c:117
#4  0x00007f30d7909af1 in thread_loop (_p=0x7f30d8b490f0) at /src/libs/std/thread.c:237
#5  0x00007f30d8f26456 in ThreadMain (_p=0x7ffd92492990) at /src/vm/threads.c:122
#6  0x00007f30d8f41678 in GC_inner_start_routine () from fs/usr/local/lib/libneko.so.2
#7  0x00007f30d8f3558a in GC_call_with_stack_base () from fs/usr/local/lib/libneko.so.2
#8  0x00007f30d8f3b144 in GC_start_routine () from fs/usr/local/lib/libneko.so.2
#9  0x00007f30d8ed2609 in pwd_traced_file () from fs/lib/x86_64-linux-gnu/libc.so.6
#10 0x0000000000000000 in ?? ()
(gdb) bt full
#0  0x00007f30d8f14ef6 in neko_val_callEx (vthis=0x7f30d782a000, f=0x7f30d8b4d8a0, args=0x7f30d5e3d7f8, nargs=1, exc=0x0)
    at /src/vm/callback.c:46
        vm = 0x0
        old_this = 0x0
        old_env = 0x0
        ret = 0x0
        oldjmp = {{__jmpbuf = {0, 0, 0, 0, 139845314828357, 139847775009936, 7883446016, 16}, __mask_was_saved = -706488560,
            __saved_mask = {__val = {1, 139847723636592, 139847774906397, 1, 139847770864720, 17450007603122798595, 139847750572680,
                139847723636592, 38654705672, 17450007603122798600, 139847750525952, 139847723636592, 139847774911661,
                17450007606711277424, 139847750819840, 139847750572672}}}}
#1  0x00007f30d8f17818 in neko_interp_loop (vm=0x7f30d77e61c0, m=0x7f30d8b4cea0, _acc=139847740806880, _pc=0x7f30d77109b8)
    at /src/vm/interp.c:708
        _o = 0x7f30d782a000
        _arg = 0x1
        _f = 0x7f30d8b4d8a0
        acc = 1
        pc = 0x7f30d76efe28
        instructions = {0x7f30d8f170c2 <neko_interp_loop+130>, 0x7f30d8f170dc <neko_interp_loop+156>,
          0x7f30d8f170f5 <neko_interp_loop+181>, 0x7f30d8f1710e <neko_interp_loop+206>, 0x7f30d8f17128 <neko_interp_loop+232>,
          0x7f30d8f17188 <neko_interp_loop+328>, 0x7f30d8f171ab <neko_interp_loop+363>, 0x7f30d8f171c7 <neko_interp_loop+391>,
          0x7f30d8f172d0 <neko_interp_loop+656>, 0x7f30d8f175b4 <neko_interp_loop+1396>, 0x7f30d8f18081 <neko_interp_loop+4161>,
          0x7f30d8f18417 <neko_interp_loop+5079>, 0x7f30d8f18430 <neko_interp_loop+5104>, 0x7f30d8f18453 <neko_interp_loop+5139>,
          0x7f30d8f1846f <neko_interp_loop+5167>, 0x7f30d8f18578 <neko_interp_loop+5432>, 0x7f30d8f18791 <neko_interp_loop+5969>,
          0x7f30d8f18b88 <neko_interp_loop+6984>, 0x7f30d8f18f21 <neko_interp_loop+7905>, 0x7f30d8f18f3e <neko_interp_loop+7934>,
          0x7f30d8f18f9e <neko_interp_loop+8030>, 0x7f30d8f19dc2 <neko_interp_loop+11650>, 0x7f30d8f1a804 <neko_interp_loop+14276>,
          0x7f30d8f1b24f <neko_interp_loop+16911>, 0x7f30d8f1b264 <neko_interp_loop+16932>, 0x7f30d8f1b28e <neko_interp_loop+16974>,
          0x7f30d8f1b2b8 <neko_interp_loop+17016>, 0x7f30d8f1b3c7 <neko_interp_loop+17287>, 0x7f30d8f1b4f6 <neko_interp_loop+17590>,
          0x7f30d8f1b5a2 <neko_interp_loop+17762>, 0x7f30d8f1b716 <neko_interp_loop+18134>, 0x7f30d8f1b847 <neko_interp_loop+18439>,
          0x7f30d8f1b8df <neko_interp_loop+18591>, 0x7f30d8f1b916 <neko_interp_loop+18646>, 0x7f30d8f1b94d <neko_interp_loop+18701>,
          0x7f30d8f1c72d <neko_interp_loop+22253>, 0x7f30d8f1d4d2 <neko_interp_loop+25746>, 0x7f30d8f1e269 <neko_interp_loop+29225>,
          0x7f30d8f1e822 <neko_interp_loop+30690>, 0x7f30d8f1f6d2 <neko_interp_loop+34450>, 0x7f30d8f1f910 <neko_interp_loop+35024>,
          0x7f30d8f1fb4e <neko_interp_loop+35598>, 0x7f30d8f1fd92 <neko_interp_loop+36178>, 0x7f30d8f1ffb8 <neko_interp_loop+36728>,
          0x7f30d8f201de <neko_interp_loop+37278>, 0x7f30d8f20404 <neko_interp_loop+37828>, 0x7f30d8f20487 <neko_interp_loop+37959>,
          0x7f30d8f20603 <neko_interp_loop+38339>, 0x7f30d8f20686 <neko_interp_loop+38470>, 0x7f30d8f204fd <neko_interp_loop+38077>,
          0x7f30d8f20580 <neko_interp_loop+38208>, 0x7f30d8f1b893 <neko_interp_loop+18515>, 0x7f30d8f20709 <neko_interp_loop+38601>,
          0x7f30d8f20743 <neko_interp_loop+38659>, 0x7f30d8f20808 <neko_interp_loop+38856>, 0x7f30d8f20911 <neko_interp_loop+39121>,
          0x7f30d8f20943 <neko_interp_loop+39171>, 0x7f30d8f18fe0 <neko_interp_loop+8096>, 0x7f30d8f17161 <neko_interp_loop+289>,
          0x7f30d8f17174 <neko_interp_loop+308>, 0x7f30d8f179a7 <neko_interp_loop+2407>, 0x7f30d8f17d10 <neko_interp_loop+3280>,
          0x7f30d8f207c1 <neko_interp_loop+38785>, 0x7f30d8f1929a <neko_interp_loop+8794>, 0x7f30d8f20980 <neko_interp_loop+39232>,
          0x7f30d8f1b7a2 <neko_interp_loop+18274>, 0x7f30d8f1713e <neko_interp_loop+254>, 0x7f30d8f2098f <neko_interp_loop+39247>}
        sp = 0x7f30d6eab7a8
        csp = 0x7f30d6eab058
#2  0x00007f30d8f20e24 in neko_interp (vm=0x7f30d77e61c0, _m=0x7f30d8b4cea0, acc=139847740806880, pc=0x7f30d77109b8)
    at /src/vm/interp.c:1214
        sp = 0x7f30d6eab768
        csp = 0x7f30d6eab078
        trap = 0x7f30d6eab738
        init_sp = 7
        m = 0x7f30d8b4cea0
        old = {{__jmpbuf = {0, 4064061087093578727, 140727057721422, 140727057721423, 140727057721680, 139847723638720,
              4064061087267642343, 4064050217118686183}, __mask_was_saved = 0, __saved_mask = {__val = {0 <repeats 16 times>}}}}
#3  0x00007f30d8f15511 in neko_val_callEx (vthis=0x7f30d914f870 <t_null>, f=0x7f30d6e9b360, args=0x7f30d8b490f8, nargs=1,
    exc=0x7f30d5e3dd20) at /src/vm/callback.c:117
        n = 1
        vm = 0x7f30d77e61c0
        old_this = 0x7f30d914f870 <t_null>
        old_env = 0x7f30d914eee0 <empty_array>
        ret = 0x7f30d914f870 <t_null>
        oldjmp = {{__jmpbuf = {0, 0, 0, 0, 0, 0, 0, 0}, __mask_was_saved = 0, __saved_mask = {__val = {0 <repeats 16 times>}}}}
#4  0x00007f30d7909af1 in thread_loop (_p=0x7f30d8b490f0) at /src/libs/std/thread.c:237
        p = 0x7f30d8b490f0
        exc = 0x0
#5  0x00007f30d8f26456 in ThreadMain (_p=0x7ffd92492990) at /src/vm/threads.c:122
        lp = 0x7ffd92492990
        p = {init = 0x7f30d7909a1b <thread_init>, main = 0x7f30d7909a99 <thread_loop>, param = 0x7f30d8b490f0, lock = {__data = {
              __lock = 2, __count = 0, __owner = 2429, __nusers = 1, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0,
                __next = 0x0}}, __size = "\002\000\000\000\000\000\000\000}\t\000\000\001", '\000' <repeats 26 times>, __align = 2}}
#6  0x00007f30d8f41678 in GC_inner_start_routine () from fs/usr/local/lib/libneko.so.2
No symbol table info available.
#7  0x00007f30d8f3558a in GC_call_with_stack_base () from fs/usr/local/lib/libneko.so.2
No symbol table info available.
#8  0x00007f30d8f3b144 in GC_start_routine () from fs/usr/local/lib/libneko.so.2
No symbol table info available.
#9  0x00007f30d8ed2609 in pwd_traced_file () from fs/lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#10 0x0000000000000000 in ?? ()
No symbol table info available.

Here is the code in haxelib that uses threads: https://github.com/HaxeFoundation/haxelib/blob/4.1.x/src/haxelib/client/Vcs.hx#L162-L177

tobil4sk avatar Apr 04 '23 12:04 tobil4sk

We just had a similar crash on Windows, so it looks like this is not specific to Linux:

Command: haxelib [git,utest,https://github.com/haxe-utest/utest,master,--always]
Installing utest from https://github.com/haxe-utest/utest branch: master
Library utest current version is now git
Command exited with -1073741819 in 3s: haxelib [git,utest,https://github.com/haxe-utest/utest,master,--always]

-1073741819 is equivalent to 0xC0000005, which is STATUS_ACCESS_VIOLATION: https://learn.microsoft.com/en-us/openspecs/windows_protocols/ms-erref/596a1078-e883-4972-9bbc-49e60bebca55
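The conversion above can be checked with a small sketch: reinterpreting the signed 32-bit exit code as unsigned yields the NTSTATUS value. The function and macro names below are hypothetical, chosen so that they do not clash with the Windows headers.

```c
#include <stdint.h>

/* Value of STATUS_ACCESS_VIOLATION; the macro name is hypothetical. */
#define ACCESS_VIOLATION_NTSTATUS 0xC0000005u

/* Reinterpret a signed 32-bit process exit code as an unsigned NTSTATUS.
   Signed-to-unsigned conversion in C is defined modulo 2^32. */
uint32_t ntstatus_from_exit_code(int32_t exit_code) {
    return (uint32_t)exit_code;
}

/* Returns 1 if the exit code corresponds to an access violation. */
int is_access_violation(int32_t exit_code) {
    return ntstatus_from_exit_code(exit_code) == ACCESS_VIOLATION_NTSTATUS;
}
```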

tobil4sk avatar Apr 21 '23 13:04 tobil4sk

This sample seems to reproduce the seg fault some of the time, at least on my Windows machine:

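function main() {
	final streamsLock = new sys.thread.Lock();

	// Each worker sleeps briefly, then signals the main thread.
	sys.thread.Thread.create(function() {
		Sys.sleep(0.2);
		streamsLock.release();
	});

	sys.thread.Thread.create(function() {
		Sys.sleep(0.2);
		streamsLock.release();
	});

	// Wait for both signals; main then returns while the workers may still be exiting.
	streamsLock.wait();
	streamsLock.wait();
}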

tobil4sk avatar Aug 26 '24 13:08 tobil4sk

On Windows, the above sample also sometimes causes this popup:

Image

tobil4sk avatar Jan 16 '25 13:01 tobil4sk

Here is a Haxe sample that reproduces the seg fault more reliably:

function main() {
	// Both threads loop forever, so they are still running when main returns.
	sys.thread.Thread.create(function() {
		while (true) {
			trace("Hello 1");
		}
	});
	sys.thread.Thread.create(function() {
		while (true) {
			trace("Hello 2");
		}
	});
}

tobil4sk avatar Jan 17 '25 20:01 tobil4sk

On Windows, the above sample also sometimes causes this popup:

It looks like this happens because the thread is deleted by DllMain: https://github.com/ivmai/bdwgc/blob/2558568aceaf7fc5cc64cf87e244cbcfd7f9bd53/win32_threads.c#L3009

Somehow this coincides with the GC_gcollect call within neko_gc_major() while neko is shutting down, which also tries to access the same thread in order to suspend it.
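For context, a generic way to avoid this kind of overlap between thread exit and global teardown is to stop and join worker threads before teardown starts. The sketch below shows that ordering with pthreads. It is not neko's or bdwgc's code and not a proposed patch; all names are hypothetical. It also would not apply directly to the second sample above, whose threads never terminate.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

#define MAX_WORKERS 8

static atomic_int stop_requested;
static atomic_int workers_exited;

/* Worker: spins until asked to stop, then records that it has exited. */
static void *worker_main(void *arg) {
    (void)arg;
    while (!atomic_load(&stop_requested))
        sched_yield();
    atomic_fetch_add(&workers_exited, 1);
    return NULL;
}

/* Starts up to n workers, asks them to stop, joins them, and only then
   reaches the point where global teardown would run.
   Returns how many workers had exited when teardown would begin. */
int run_then_shutdown(int n) {
    pthread_t threads[MAX_WORKERS];
    int started = 0;

    if (n > MAX_WORKERS)
        n = MAX_WORKERS;
    atomic_store(&stop_requested, 0);
    atomic_store(&workers_exited, 0);

    for (int i = 0; i < n; i++) {
        if (pthread_create(&threads[started], NULL, worker_main, NULL) != 0)
            break;
        started++;
    }

    atomic_store(&stop_requested, 1);
    for (int i = 0; i < started; i++)
        pthread_join(threads[i], NULL);

    /* Global teardown (for example a final collection) would go here;
       no worker can still be running or exiting at this point. */
    return atomic_load(&workers_exited);
}
```

Because every started worker is joined before the teardown point, the returned count always equals the number of workers that were started.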

See separate issue: #303

tobil4sk avatar Feb 14 '25 22:02 tobil4sk