M2
segmentation fault in nets.d: VerticalJoin
Finally managed to get Boost to pretty-print stack traces despite ASLR, and got some output from the crashes on Ubuntu. Here's the relevant part of the stack trace:
FAILED: usr-dist/common/share/Macaulay2/Core/tvalues.m2
cd /home/runner/work/M2/M2/M2/BUILD/cicd/usr-dist/common/share/Macaulay2/Core && /home/runner/work/M2/M2/M2/BUILD/cicd/usr-dist/x86_64-Linux-Ubuntu-18.04/bin/M2-binary -q --silent --stop -e errorDepth=0 --no-preload --no-tvalues /home/runner/work/M2/M2/M2/Macaulay2/m2/tvalues-make.m2 -e "make \"/home/runner/work/M2/M2/M2/Macaulay2/d/\"; exit 0"
-- SIGSEGV
-* stack trace, pid: 119184
0# stack_trace(std::ostream&, bool) at ../../Macaulay2/bin/main.cpp:124
1# segv_handler at ../../Macaulay2/bin/main.cpp:241
2# 0x00007FAFE6274F20 in /lib/x86_64-linux-gnu/libc.so.6
3# nets_VerticalJoin at /home/runner/work/M2/M2/M2/Macaulay2/d/nets.d:132
4# evaluate_evalraw at /home/runner/work/M2/M2/M2/Macaulay2/d/evaluate.d:1293
...
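For context on how a trace like this is produced, here is a minimal sketch of a SIGSEGV handler that prints a backtrace. It is not the code in main.cpp. It uses glibc's backtrace() and backtrace_symbols_fd() from <execinfo.h> instead of Boost.Stacktrace, to keep the example in plain C, and the function names are hypothetical. On x86_64 Linux the frame just below the handler is typically libc's signal-return trampoline. That would fit the libc entry at frame 2 in the trace above, with the interrupted function (nets_VerticalJoin) right after it.

```c
#define _GNU_SOURCE
#include <execinfo.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical handler, named after the frame in the trace but not its code.
   Frame 0 of the printed backtrace is this handler; the next frame is usually
   the libc signal-return trampoline, followed by the interrupted code. */
static void segv_handler(int sig) {
    void *frames[64];
    int n = backtrace(frames, 64);
    backtrace_symbols_fd(frames, n, STDERR_FILENO);  /* writes straight to fd 2 */
    _exit(128 + sig);                                /* shell-style exit status */
}

static int install_segv_handler(void) {
    void *warmup[1];
    /* backtrace() is not formally async-signal-safe: its first call may load
       libgcc, so call it once here rather than for the first time in the handler. */
    backtrace(warmup, 1);
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = segv_handler;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGSEGV, &sa, NULL);
}

/* Runs a child that receives SIGSEGV and returns the child's exit status,
   which the handler sets to 128 + SIGSEGV, or -1 on failure. */
int crash_child_and_report(void) {
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        if (install_segv_handler() != 0) _exit(1);
        raise(SIGSEGV);  /* simulate the fault; the handler prints and exits */
        _exit(0);        /* not reached */
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0) return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```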
@DanGrayson Any ideas why this might be happening?
Line 132 in nets.d is
leng = leng + length(n.body);
which translates to the following C code:
leng_1 = (leng_1 + tmp__79->array[tmp__80]->body->len);
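The C expression above reads memory three times. The hypothetical sketch below does the same arithmetic one load at a time, which makes it easier to see which read could have gone through a bad pointer. The struct layouts are assumptions loosely modeled on scc1 output (a length followed by a flexible array member). Only the field names array, body and len come from the generated line. All type and helper names are made up.

```c
#include <stdlib.h>

/* Hypothetical layouts: an M2 array modeled as a length plus a flexible
   array member. Only the field names array, body and len are from the
   generated C; everything else is illustrative. */
typedef struct { int len; char *array[]; } StringArray;
typedef struct { int height; int width; StringArray *body; } Net;
typedef struct { int len; Net *array[]; } NetArray;

/* Same arithmetic as
     leng_1 = (leng_1 + tmp__79->array[tmp__80]->body->len);
   with each of the three loads as its own statement. If the fault really
   is on this line, one of these reads used a bad pointer. */
int add_body_length(int leng, const NetArray *nets, int i) {
    Net *n = nets->array[i];      /* load 1: element pointer stored in the array */
    StringArray *body = n->body;  /* load 2: the net's body pointer */
    int len = body->len;          /* load 3: length of the body */
    return leng + len;
}

/* Helpers for trying the function out: zero-filled arrays of a given length. */
NetArray *make_net_array(int len) {
    NetArray *a = calloc(1, sizeof *a + (size_t)len * sizeof a->array[0]);
    if (a) a->len = len;
    return a;
}

StringArray *make_string_array(int len) {
    StringArray *a = calloc(1, sizeof *a + (size_t)len * sizeof a->array[0]);
    if (a) a->len = len;
    return a;
}
```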
There is no call to a libc function on line 132, so perhaps frame 2 of the stack trace (the libc entry) is also part of the signal-handling machinery, like frames 0 and 1. In that case, one of the three memory accesses on that line must be out of bounds. If so, the most likely explanation is that we have a systematic screw-up in the handling of libgc memory, and some corruption has occurred. Tracking it down will call for a lengthy session with a debugger. Such a screw-up is all the more likely because the eigen branch was merged not long ago. One example would be storing a pointer to libgc memory in malloc memory, which the collector does not scan, and then using it after the garbage collector has collected the object. That could be anywhere else in the code, because after collection the memory can be reallocated and scribbled on.
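The failure mode described above can be shown with a toy model. The sketch below is not libgc. It is a hypothetical four-slot mark-and-sweep heap that resembles libgc in only one way: it marks objects reachable from the roots it knows about and never looks inside memory obtained from malloc. A pointer cached in a malloc'd holder therefore does not keep its target alive, and after a collection an unrelated allocation reuses the slot. In real libgc code the usual remedies are to allocate the holder from the collected heap (GC_MALLOC) or as uncollectable (GC_MALLOC_UNCOLLECTABLE), so that it gets scanned. Here, registering the holder's field as a root plays that role.

```c
#include <stdlib.h>
#include <string.h>

/* Toy mark-and-sweep heap (NOT libgc): only registered roots are scanned,
   and memory from malloc is never examined. */
#define HEAP_SLOTS 4
#define MAX_ROOTS 8

typedef struct { int in_use; int marked; int payload; } Obj;

static Obj heap[HEAP_SLOTS];
static Obj **roots[MAX_ROOTS];
static int nroots;

static void toy_reset(void) {
    memset(heap, 0, sizeof heap);
    nroots = 0;
}

static Obj *toy_alloc(int payload) {
    for (int i = 0; i < HEAP_SLOTS; i++)
        if (!heap[i].in_use) {   /* first free slot, so freed slots get reused */
            heap[i].in_use = 1;
            heap[i].payload = payload;
            return &heap[i];
        }
    return NULL;                 /* toy heap exhausted */
}

static void toy_add_root(Obj **slot) {
    if (nroots < MAX_ROOTS) roots[nroots++] = slot;
}

static void toy_collect(void) {
    for (int i = 0; i < HEAP_SLOTS; i++) heap[i].marked = 0;
    for (int r = 0; r < nroots; r++)
        if (*roots[r]) (*roots[r])->marked = 1;
    for (int i = 0; i < HEAP_SLOTS; i++)
        if (heap[i].in_use && !heap[i].marked) heap[i].in_use = 0;  /* swept */
}

/* Hypothetical holder living in malloc memory, caching a GC pointer. */
typedef struct { Obj *cached; } Holder;

/* Returns the payload seen through the cached pointer after a collection
   followed by one unrelated allocation. Without a root, the cached pointer
   dangles and the slot now holds the newer object's payload. */
int payload_after_gc(int register_root) {
    toy_reset();
    Holder *h = malloc(sizeof *h);
    if (!h) return -1;
    h->cached = toy_alloc(42);
    if (register_root) toy_add_root(&h->cached);
    toy_collect();
    toy_alloc(99);
    int seen = h->cached->payload;
    toy_reset();                 /* drop the root before freeing its holder */
    free(h);
    return seen;
}
```

With these values, payload_after_gc(0) returns 99 (the reused slot's new contents) and payload_after_gc(1) returns 42.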
Here's a different segfault from the same step:
2020-07-09T07:32:52.9528468Z [327/533] Generating Macaulay2/Core/tvalues.m2
2020-07-09T07:32:52.9529470Z FAILED: usr-dist/common/share/Macaulay2/Core/tvalues.m2
2020-07-09T07:32:52.9530371Z cd /home/runner/work/M2/M2/M2/BUILD/build/usr-dist/common/share/Macaulay2/Core && /home/runner/work/M2/M2/M2/BUILD/build/usr-dist/x86_64-Linux-Ubuntu-18.04/bin/M2-binary -q --silent --stop -e errorDepth=0 --no-preload --no-tvalues /home/runner/work/M2/M2/M2/Macaulay2/m2/tvalues-make.m2 -e "make \"/home/runner/work/M2/M2/M2/Macaulay2/d/\"; exit 0"
2020-07-09T07:32:52.9530884Z -- SIGSEGV
2020-07-09T07:32:52.9531524Z -* stack trace, pid: 78155
2020-07-09T07:32:52.9531994Z 0# stack_trace(std::ostream&, bool) at ../../Macaulay2/bin/main.cpp:124
2020-07-09T07:32:52.9532288Z 1# segv_handler at ../../Macaulay2/bin/main.cpp:241
2020-07-09T07:32:52.9532744Z 2# 0x00007F062E28EF20 in /lib/x86_64-linux-gnu/libc.so.6
2020-07-09T07:32:52.9533040Z 3# binding_lookup_1 at /home/runner/work/M2/M2/M2/Macaulay2/d/binding.d:424
2020-07-09T07:32:52.9533321Z 4# lookup at /home/runner/work/M2/M2/M2/Macaulay2/d/binding.d:426
2020-07-09T07:32:52.9533607Z 5# binding_bind at /home/runner/work/M2/M2/M2/Macaulay2/d/binding.d:684
2020-07-09T07:32:52.9533890Z 6# binding_bind at /home/runner/work/M2/M2/M2/Macaulay2/d/binding.d:716
2020-07-09T07:32:52.9534165Z 7# binding_bind at /home/runner/work/M2/M2/M2/Macaulay2/d/binding.d:716
2020-07-09T07:32:52.9534435Z 8# binding_bind at /home/runner/work/M2/M2/M2/Macaulay2/d/binding.d:670
2020-07-09T07:32:52.9534709Z 9# binding_localBind at /home/runner/work/M2/M2/M2/Macaulay2/d/binding.d:807
2020-07-09T07:32:52.9534997Z 10# readeval3 at /home/runner/work/M2/M2/M2/Macaulay2/d/interp.dd:272
2020-07-09T07:32:52.9535274Z 11# readeval at /home/runner/work/M2/M2/M2/Macaulay2/d/interp.dd:285
2020-07-09T07:32:52.9535554Z 12# interp_process at /home/runner/work/M2/M2/M2/Macaulay2/d/interp.dd:600
The line in question, binding.d:424, is
return binding_globalLookup(w);
There's nothing on that line that could cause a segmentation fault, so something must be missing from the stack trace.
The stack trace just uses libbacktrace. If the line numbers are wrong, perhaps the line numbers generated by scc1 are wrong?
No, because I looked in the corresponding C files, too.
Well, one far-fetched possibility is that something scribbled over the return address on the stack, so when the function returned, it went into outer space.
I still don't understand why this only happens in GitHub Actions. Perhaps we need to try it with the same hardware limits?
7 GB of RAM is more than enough. I'm at 2.9 GB on my Ubuntu 18 virtual machine.
Could you point me to an action where you see that error?
I didn't keep the log for the last stack trace, but see the top comment for a link to the other stack trace.
It happened again: https://github.com/mahrud/M2/runs/857198730?check_suite_focus=true#step:11:10359
So, at the same place as before. Here is all the C code on that "line":
# line 424 "/home/dan/src/M2/M2/Macaulay2/d/binding.d"
return binding_globalLookup(w);
# line 424 "/home/dan/src/M2/M2/Macaulay2/d/binding.d"
}
# line 424 "/home/dan/src/M2/M2/Macaulay2/d/binding.d"
static M2_string str__108;
# line 424 "/home/dan/src/M2/M2/Macaulay2/d/binding.d"
static M2_string str__109;
# line 424 "/home/dan/src/M2/M2/Macaulay2/d/binding.d"
static M2_string str__110;
A "return" statement can cause a segmentation fault only if someone has scribbled on the stack so that the return address is bad, I think.
Here's the corresponding D code:
export lookup(w:Word,d:Dictionary):(null or Symbol) := (
while (
when lookup(w,d.symboltable) is null do nothing is e:Symbol do return e;
d != d.outerDictionary ) do d = d.outerDictionary;
globalLookup(w));
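The lookup loop above can be restated in C to make its control flow explicit. This is a hypothetical model. Symbol, Dictionary, the small fixed symbol table, table_lookup and global_lookup are illustrative stand-ins, not Macaulay2's generated types or functions. The loop searches each dictionary's symbol table and walks outward through outerDictionary until it reaches the outermost dictionary, which points to itself. Only then does it fall back to the global lookup, which is the call that lands on binding.d:424 in the generated C.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the D types; not Macaulay2's real layouts. */
typedef struct { const char *name; } Symbol;

typedef struct Dictionary {
    Symbol *symbols[4];                  /* stand-in for d.symboltable */
    int nsymbols;
    struct Dictionary *outerDictionary;  /* the outermost one points to itself */
} Dictionary;

static Symbol *table_lookup(const char *w, const Dictionary *d) {
    for (int i = 0; i < d->nsymbols; i++)
        if (strcmp(d->symbols[i]->name, w) == 0) return d->symbols[i];
    return NULL;
}

/* Stand-in for globalLookup(w); always misses in this sketch. */
static Symbol *global_lookup(const char *w) { (void)w; return NULL; }

Symbol *lookup(const char *w, Dictionary *d) {
    for (;;) {
        Symbol *e = table_lookup(w, d);
        if (e != NULL) return e;             /* found in this scope */
        if (d == d->outerDictionary) break;  /* reached the outermost dictionary */
        d = d->outerDictionary;              /* walk outward */
    }
    return global_lookup(w);                 /* corresponds to binding.d:424 */
}
```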
Does it happen just with gcc-9?
I've seen it happen with gcc-6 as well, but always only on Ubuntu.