jsource
jsource copied to clipboard
Segfaults on test g420stch.ijs
It might be something weird on my machine, since I haven't seen these show up in the CI runs. My build consistently fail on this one test, regardless whether I build with GCC or Clang, regardless whether it's a debug or release build. It has been like this since my first checkout, so recent refactors are not likely to be the culprit.
$ build/jsrc/Debug/jconsole build/test/Debug/g420stch.ijs
Hello YouTube Viewers, all 22 of you!
see: tsu_notes, tsu_usage, tsu_pacman, and tsu_jd
RUN ddall NB. report scripts that fail
RECHO ddall NB. echo script names as run and final count of failures
Segmentation fault
Removing this single line from the test stops the segfault
'length error' -: ,.&.>/ etx 1 2;3 4 5
Replacing the expected error message with something else causes the test to fail, but it does prevent the segfault.
I've managed to reproduce in the console with the following code:
join=: ,.&.>/
join 1 2;3 4 5
That join is copied from jlibrary/system/util/pm.ijs, using the operation directly does not result in a segfault.
GDB shows the location of the error
0x00007ffff73c3cab in jtgaf (jt=0x446000, blockx=6) at ../jsrc/m.c:1066
1066 jt->mfree[-PMINL + 1 + blockx].pool = AFCHAIN(z); // remove & use the head of the free chain
(gdb) info locals
pushp = 0x6c8db0
z = 0x40
mfreeb = 259200
n = 128
Looking at jt->mfree, there is one very suspicious looking entry, that contains the 0x40 value seen in the z variable:
(gdb) print jt->mfree
$4 = {{ballo = 614592, pool = 0x921cc0}, {ballo = 259200, pool = 0x40}, {ballo = 985344, pool = 0x6c9440}, {ballo = 922112, pool = 0x5f6980}, {ballo = 977920, pool = 0x6795c0}}
PMINL is defined as 6 (jsrc/m.h), the argument blockx equals 6, so the calculation would indeed get the second entry. We've gotten this block a few lines earlier in the m.c file:
z = jt->mfree[-PMINL + 1 + blockx].pool; // tentatively use head of free list as result - normal case, and even if
// blockx is out of bounds will not segfault
Yes, that comment made me grin.
The segfault itself is caused by the AFCHAIN macro, also in m.c:
#define AFCHAIN(a) ((a)->kchain.chain) // the chain field, when the block is not allocated
So I guess something returns a nullptr, the first 8 bytes of that something are interpreted as some data, the remainder is memory that is supposed to be usable.
Since this is happening to me with multiple compilers, and nobody else appears to be having this problem, it might be a problem with my libc or something like that.
I'll try to look into it deeper in the next few days
I also investigated this a little, it's been known issue on osx for some time: https://github.com/codereport/jsource/pull/23#issue-561300714
I bisected one day on the failing case and (at least on osx) the issue was likely introduced in 46c45895c82f4f8b3cbcf256c5d3e028fd768244
I tried compiling the parent of that commit, but that didn't solve the issue, guess it's not completely the same issue.
Somewhere between commit ea03e122 and commit 88eb2d83 it has been changed from a segfault into an illegal exception. That first commit is the last one of the original source.
Might be undefined behavior somewhere, there are at least some "sequence point" warnings in the build. Also got some uninitialized and garbage value warnings with clang analyzer.
Some more experiments:
- Adding some unused fields to the
jtstructure before themfreestruct, or as a first element in themfreestruct does not change the behaviour, the value ofmfree[1].poolis still0x40. It is unlikely that an assignment outside themfreestructure gets out of bounds and accidentally overwrites this. - Similar results are seen by swapping the locations of
poolandballo. - Increasing the size of
mfreeby 1 (mfree[-PMINL + PLIML + 1]=>mfree[-PMINL + PLIML + 2]) does show an extra empty field inmfreewhich is empty. We're not hitting weird boundaries there - Increasing the value of
PLIMLby 1 does cause the same error, but this time the extra value ofmfreeis actually in use - Decreasing the value of
PMINLby 1 causes errors on other parts of the code. It is documented as "It must be big enough to hold at least one I."Iis a 64 bit int, so I'm not sure how that's supposed to fit in the default value of 6 (bytes?). - Increasing the value of
PMINLby 1 moves the invalid value forpoolfrom the second to the first element ofmfree. This looked promising, until I realised the negative value ofPMINLis used as an index. - Increasing it again to get to 8 to store an
Idoes not change anything compared to 7. - The changes of
PMINLto7and8had a second effect, causing the values of the validpoolitems increase to values in the range of0x7ffff0000000. This is rather unexpected, I guess these values should be the memory locations returned bymalloc, but that should not be influenced by this.
So at least we've got one thing that looks kind of suspicious.
As a side note: I'm pretty much using this ticket as a whiteboard, subscribing to changes of it might get a bit spammy.
Added a few extra lines to m.c
if (z) { // allocate from a chain of free blocks
jt->mfree[-PMINL + 1 + blockx].pool = AFCHAIN(z); // remove & use the head of the free chain
if (jt->mfree[-PMINL + 1 + blockx].pool == (A)0x40) {
(void)1;
}
This opens op the possibility for a breakpoint on the (void)1; line. This is indeed the location where the assignment happens (but of course not where the problem starts. A few lines from the debugger:
(gdb) p *z
$1 = {kchain = {k = 64, chain = 0x40, globalst = 0x40}, flag = 0, mback = {m = 7114136, back = 0x6c8d98, zaploc = 0x6c8d98}, tproxy = {t = 2, proxychain = 0x2}, c = 1, n = 1, r = 1, h = 1222, fill = 0, s = {1}}
(gdb) p jt->mfree
$2 = {{ballo = 614592, pool = 0x921cc0}, {ballo = 259072, pool = 0x40}, {ballo = 985344, pool = 0x6c9440}, {ballo = 922112, pool = 0x5f6980}, {ballo = 977920, pool = 0x6795c0}}
(gdb) p jt->mfree[0].pool
$3 = (A) 0x921cc0
(gdb) p *jt->mfree[0].pool
$4 = {kchain = {k = 6664000, chain = 0x65af40, globalst = 0x65af40}, flag = 131104, mback = {m = 4521984, back = 0x450000, zaploc = 0x450000}, tproxy = {t = 32, proxychain = 0x20}, c = 1, n = 36, r = 1, h = 1545, fill = 0, s = {36}}
(gdb) p *jt->mfree[1].pool
Cannot access memory at address 0x40
(gdb) p *jt->mfree[2].pool
$5 = {kchain = {k = 7117376, chain = 0x6c9a40, globalst = 0x6c9a40}, flag = 0, mback = {m = 4515848, back = 0x44e808, zaploc = 0x44e808}, tproxy = {t = 4, proxychain = 0x4}, c = 1, n = 10, r = 2, h = 148, fill = 0, s = {5}}
(gdb) p *jt->mfree[3].pool
$6 = {kchain = {k = 6254464, chain = 0x5f6f80, globalst = 0x5f6f80}, flag = 0, mback = {m = 4515856, back = 0x44e810, zaploc = 0x44e810}, tproxy = {t = 1, proxychain = 0x1}, c = 1, n = 416, r = 2, h = 136, fill = 0, s = {104}}
(gdb) p *jt->mfree[4].pool
$7 = {kchain = {k = 6787520, chain = 0x6791c0, globalst = 0x6791c0}, flag = 0, mback = {m = 4515912, back = 0x44e848, zaploc = 0x44e848}, tproxy = {t = 16, proxychain = 0x10}, c = 0, n = 45, r = 2, h = 80, fill = 0, s = {5}}
(gdb) info locals
pushp = 0x6c8da8
z = 0x721c80
mfreeb = 259072
n = 128
The definition of A (which is a pointer to struct AD) shows a few unions, it looks like we're using a single data type for both memory allocation information and values. The memory has a linked list, and somehow we've ended up with a memory node that is actually the array [1]. We're trying to read the value of chain, but we actually read the value of k that is unioned in the same location. And 1 dimensional arrays have k = 64 = 0x40 (source).
Played around a bit with different sanitizer flags:
-fsanitize=address-> stack buffer overflow detected in the jconsole start up (might be false positive)-fsanitize=undefined-> screen full of warnings about UB, but the previously crashing code,. each / 1 2;3 4 5works™- didn't test memory/leak sanitizers, as those are not supported on osx