
Segfaults on test g420stch.ijs

Open herwinw opened this issue 4 years ago • 8 comments

It might be something weird on my machine, since I haven't seen these show up in the CI runs. My build consistently fails on this one test, regardless of whether I build with GCC or Clang, and regardless of whether it's a debug or release build. It has been like this since my first checkout, so recent refactors are not likely to be the culprit.

$ build/jsrc/Debug/jconsole build/test/Debug/g420stch.ijs
Hello YouTube Viewers, all 22 of you!
see: tsu_notes, tsu_usage, tsu_pacman, and tsu_jd

   RUN  ddall  NB. report scripts that fail
   RECHO ddall NB. echo script names as run and final count of failures

Segmentation fault

Removing this single line from the test stops the segfault:

'length error' -: ,.&.>/ etx 1 2;3 4 5

Replacing the expected error message with something else causes the test to fail, but it does prevent the segfault.

I've managed to reproduce in the console with the following code:

join=: ,.&.>/
join 1 2;3 4 5

That join is copied from jlibrary/system/util/pm.ijs; using the operation directly, without the named definition, does not result in a segfault.

GDB shows the location of the error:

0x00007ffff73c3cab in jtgaf (jt=0x446000, blockx=6) at ../jsrc/m.c:1066
1066	                jt->mfree[-PMINL + 1 + blockx].pool = AFCHAIN(z);  // remove & use the head of the free chain

(gdb) info locals
pushp = 0x6c8db0
z = 0x40
mfreeb = 259200
n = 128

Looking at jt->mfree, there is one very suspicious-looking entry that contains the 0x40 value seen in the z variable:

(gdb) print jt->mfree
$4 = {{ballo = 614592, pool = 0x921cc0}, {ballo = 259200, pool = 0x40}, {ballo = 985344, pool = 0x6c9440}, {ballo = 922112, pool = 0x5f6980}, {ballo = 977920, pool = 0x6795c0}}

PMINL is defined as 6 (jsrc/m.h) and the argument blockx equals 6, so the calculation (-6 + 1 + 6 = 1) does indeed land on the second entry. We got hold of this block a few lines earlier in m.c:

z = jt->mfree[-PMINL + 1 + blockx].pool;  // tentatively use head of free list as result - normal case, and even if
                                          // blockx is out of bounds will not segfault

Yes, that comment made me grin.

The segfault itself is caused by the AFCHAIN macro, also in m.c:

#define AFCHAIN(a) ((a)->kchain.chain)            // the chain field, when the block is not allocated

So I guess something returns a null pointer; the first 8 bytes of that something are interpreted as some data, and the remainder is memory that is supposed to be usable.
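
Putting the fetch from the free list and the AFCHAIN dereference together, here is a minimal standalone sketch of the failure mode. The struct is a simplified stand-in for jsource's AD (only the leading union is modelled), and the 0x40 value and field names are taken from the debugger output above:

#include <stdio.h>

/* Simplified stand-in for jsource's A / struct AD; only the union at the
   head of the block matters for this sketch. */
typedef struct AD *A;
struct AD { union { long long k; A chain; } kchain; };

#define AFCHAIN(a) ((a)->kchain.chain)   /* as quoted from m.c above */

int main(void) {
    A pool_head = (A)0x40;   /* the suspicious head seen in jt->mfree[1].pool */
    A z = pool_head;         /* "tentatively use head of free list as result" */
    if (z) {                 /* 0x40 is not NULL, so the fast path is taken   */
        printf("about to dereference %p\n", (void *)z);
        A next = AFCHAIN(z); /* dereferences address 0x40 -> segmentation fault */
        printf("never reached: %p\n", (void *)next);
    }
    return 0;
}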

Since this is happening to me with multiple compilers, and nobody else appears to be having this problem, it might be a problem with my libc or something like that.

I'll try to look into it more deeply in the next few days.

herwinw avatar Feb 20 '21 17:02 herwinw

I also investigated this a little; it's been a known issue on osx for some time: https://github.com/codereport/jsource/pull/23#issue-561300714

I bisected the failing case one day, and (at least on osx) the issue was likely introduced in 46c45895c82f4f8b3cbcf256c5d3e028fd768244.

juntuu avatar Feb 20 '21 18:02 juntuu

I tried compiling the parent of that commit, but that didn't solve the issue; I guess it's not quite the same issue.

herwinw avatar Feb 20 '21 19:02 herwinw

Somewhere between commit ea03e122 and commit 88eb2d83 the failure changed from a segfault into an illegal exception. The first of those commits is the last one of the original source.

herwinw avatar Feb 20 '21 19:02 herwinw

It might be undefined behavior somewhere; there are at least some "sequence point" warnings in the build. I also got some uninitialized and garbage-value warnings with the clang analyzer.
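
For reference, "sequence point" warnings refer to expressions of the following kind; this is a generic illustration, not code taken from jsource:

#include <stdio.h>

int main(void) {
    int i = 0;
    /* i is modified twice without an intervening sequence point (once by
       i++ and once by the assignment), so the resulting value of i is
       undefined; compilers flag exactly this kind of expression. */
    i = i++ + 1;
    printf("%d\n", i);   /* may legitimately print 1, 2, or anything else */
    return 0;
}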

juntuu avatar Feb 20 '21 20:02 juntuu

Some more experiments:

  • Adding some unused fields to the jt structure before the mfree struct, or as a first element in the mfree struct, does not change the behaviour: the value of mfree[1].pool is still 0x40. So it is unlikely that an assignment outside the mfree structure goes out of bounds and accidentally overwrites it.
  • Similar results are seen by swapping the locations of pool and ballo.
  • Increasing the size of mfree by 1 (mfree[-PMINL + PLIML + 1] => mfree[-PMINL + PLIML + 2]) does show an extra field in mfree, which stays empty. So we're not hitting weird boundaries there.
  • Increasing the value of PLIML by 1 does cause the same error, but this time the extra entry of mfree is actually in use.
  • Decreasing the value of PMINL by 1 causes errors in other parts of the code. It is documented as "It must be big enough to hold at least one I." I is a 64-bit int, so I'm not sure how that's supposed to fit in the default value of 6 (bytes?).
  • Increasing the value of PMINL by 1 moves the invalid value for pool from the second to the first element of mfree. This looked promising, until I realised the negative value of PMINL is used as an index.
  • Increasing it again to 8, so that an I would fit, does not change anything compared to 7.
  • The changes of PMINL to 7 and 8 had a second effect: the values of the valid pool items increased to the range of 0x7ffff0000000. This is rather unexpected; I guess these values should be the memory locations returned by malloc, but those should not be influenced by this change.

So at least we've got one thing that looks kind of suspicious.

herwinw avatar Feb 21 '21 16:02 herwinw

As a side note: I'm pretty much using this ticket as a whiteboard, so subscribing to its changes might get a bit spammy.

herwinw avatar Feb 21 '21 16:02 herwinw

Added a few extra lines to m.c:

if (z) {                                               // allocate from a chain of free blocks
    jt->mfree[-PMINL + 1 + blockx].pool = AFCHAIN(z);  // remove & use the head of the free chain
    if (jt->mfree[-PMINL + 1 + blockx].pool == (A)0x40) {
        (void)1;
    }
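    // ... the rest of the original if (z) body in m.c continues unchanged after this point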

This opens up the possibility of setting a breakpoint on the (void)1; line. This is indeed the location where the assignment happens (but of course not where the problem starts). A few lines from the debugger:

(gdb) p *z
$1 = {kchain = {k = 64, chain = 0x40, globalst = 0x40}, flag = 0, mback = {m = 7114136, back = 0x6c8d98, zaploc = 0x6c8d98}, tproxy = {t = 2, proxychain = 0x2}, c = 1, n = 1, r = 1, h = 1222, fill = 0, s = {1}}
(gdb) p jt->mfree
$2 = {{ballo = 614592, pool = 0x921cc0}, {ballo = 259072, pool = 0x40}, {ballo = 985344, pool = 0x6c9440}, {ballo = 922112, pool = 0x5f6980}, {ballo = 977920, pool = 0x6795c0}}
(gdb) p jt->mfree[0].pool
$3 = (A) 0x921cc0
(gdb) p *jt->mfree[0].pool
$4 = {kchain = {k = 6664000, chain = 0x65af40, globalst = 0x65af40}, flag = 131104, mback = {m = 4521984, back = 0x450000, zaploc = 0x450000}, tproxy = {t = 32, proxychain = 0x20}, c = 1, n = 36, r = 1, h = 1545, fill = 0, s = {36}}
(gdb) p *jt->mfree[1].pool
Cannot access memory at address 0x40
(gdb) p *jt->mfree[2].pool
$5 = {kchain = {k = 7117376, chain = 0x6c9a40, globalst = 0x6c9a40}, flag = 0, mback = {m = 4515848, back = 0x44e808, zaploc = 0x44e808}, tproxy = {t = 4, proxychain = 0x4}, c = 1, n = 10, r = 2, h = 148, fill = 0, s = {5}}
(gdb) p *jt->mfree[3].pool
$6 = {kchain = {k = 6254464, chain = 0x5f6f80, globalst = 0x5f6f80}, flag = 0, mback = {m = 4515856, back = 0x44e810, zaploc = 0x44e810}, tproxy = {t = 1, proxychain = 0x1}, c = 1, n = 416, r = 2, h = 136, fill = 0, s = {104}}
(gdb) p *jt->mfree[4].pool
$7 = {kchain = {k = 6787520, chain = 0x6791c0, globalst = 0x6791c0}, flag = 0, mback = {m = 4515912, back = 0x44e848, zaploc = 0x44e848}, tproxy = {t = 16, proxychain = 0x10}, c = 0, n = 45, r = 2, h = 80, fill = 0, s = {5}}
(gdb) info locals
pushp = 0x6c8da8
z = 0x721c80
mfreeb = 259072
n = 128

The definition of A (which is a pointer to struct AD) shows a few unions; it looks like a single data type is used for both memory-allocation bookkeeping and values. The free memory is kept in a linked list, and somehow we've ended up with a node in that list that is actually the array [1]. We're trying to read the value of chain, but we actually read the value of k that is unioned in the same location. And 1-dimensional arrays have k = 64 = 0x40 (source).
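
To see how a value like 0x40 can end up where a chain pointer is expected, here is a minimal sketch of that union; the field names follow the gdb dumps above, while the field widths and the omitted remainder of the header are simplifications:

#include <stdio.h>

typedef struct AD *A;
struct AD {
    union {
        long long k;   /* header word of an allocated array (simplified width) */
        A chain;       /* free-chain link while the block is on the free list  */
    } kchain;
    /* ... the rest of the real header (flag, mback, tproxy, ...) omitted ... */
};

#define AFCHAIN(a) ((a)->kchain.chain)

int main(void) {
    struct AD blk = {{0}};
    blk.kchain.k = 64;   /* "1 dimensional arrays have k = 64 = 0x40" */
    /* If this live array header is still reachable from the free chain,
       reading its chain member just reinterprets the k value: */
    printf("chain read back as %p\n", (void *)AFCHAIN(&blk));   /* prints 0x40 */
    return 0;
}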

herwinw avatar Feb 21 '21 16:02 herwinw

Played around a bit with different sanitizer flags:

  • -fsanitize=address -> stack buffer overflow detected during jconsole startup (might be a false positive)
  • -fsanitize=undefined -> screen full of warnings about UB, but the previously crashing code ,. each / 1 2;3 4 5 works™
  • didn't test memory/leak sanitizers, as those are not supported on osx

juntuu avatar Feb 24 '21 22:02 juntuu