segmentation fault in nets.d: VerticalJoin

Open mahrud opened this issue 5 years ago • 12 comments

Finally managed to get Boost to pretty-print stack traces despite ASLR and get some output from the crashes on Ubuntu. Here's the relevant part of the stack trace:

FAILED: usr-dist/common/share/Macaulay2/Core/tvalues.m2 
cd /home/runner/work/M2/M2/M2/BUILD/cicd/usr-dist/common/share/Macaulay2/Core && /home/runner/work/M2/M2/M2/BUILD/cicd/usr-dist/x86_64-Linux-Ubuntu-18.04/bin/M2-binary -q --silent --stop -e errorDepth=0 --no-preload --no-tvalues /home/runner/work/M2/M2/M2/Macaulay2/m2/tvalues-make.m2 -e "make \"/home/runner/work/M2/M2/M2/Macaulay2/d/\"; exit 0"
-- SIGSEGV
-* stack trace, pid: 119184
 0# stack_trace(std::ostream&, bool) at ../../Macaulay2/bin/main.cpp:124
 1# segv_handler at ../../Macaulay2/bin/main.cpp:241
 2# 0x00007FAFE6274F20 in /lib/x86_64-linux-gnu/libc.so.6
 3# nets_VerticalJoin at /home/runner/work/M2/M2/M2/Macaulay2/d/nets.d:132
 4# evaluate_evalraw at /home/runner/work/M2/M2/M2/Macaulay2/d/evaluate.d:1293
...

@DanGrayson Any ideas why this might be happening?

mahrud avatar Jul 04 '20 02:07 mahrud

Line 132 in nets.d is

	  leng = leng + length(n.body);

, which translates to the following C code:

  leng_1 = (leng_1 + tmp__79->array[tmp__80]->body->len);

There is no call to a libc function on that line, so maybe frame 2 of the stack trace is a signal handler routine, too. In that case, one of the three memory accesses must be out of bounds. If so, the most likely explanation is a systematic screw-up in the handling of libgc memory that has led to corruption, and a lengthy session with a debugger is called for. Such a screw-up is more likely since the eigen branch was merged not so long ago. An example would be storing a pointer to libgc memory in malloc memory and then using it after the garbage collector has collected it. The faulty code could be anywhere, because after collection the memory can be re-allocated and scribbled on.
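To make the "three memory accesses" concrete, here is a minimal sketch in C of the chain in `tmp__79->array[tmp__80]->body->len`. The struct names and layouts are hypothetical stand-ins, not the types that scc1 actually generates, and the helper `checked_body_len` is invented for illustration. It only catches null pointers and out-of-range indices; a dangling pointer into collected libgc memory would pass these checks and could still fault.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the generated types; real layouts may differ. */
struct body     { int len; };
struct net      { struct body *body; };
struct net_list { int len; struct net *array[4]; };

/* Walks the same chain as nets->array[i]->body->len, one access at a time.
   Returns 0 and stores the length in *out on success; otherwise returns the
   1-based number of the access that could not be performed safely. */
int checked_body_len(const struct net_list *nets, int i, int *out) {
    assert(out != NULL);

    /* Access 1: load nets->array[i] */
    if (nets == NULL || i < 0 || i >= nets->len) return 1;
    const struct net *n = nets->array[i];

    /* Access 2: load n->body */
    if (n == NULL) return 2;
    const struct body *b = n->body;

    /* Access 3: load b->len */
    if (b == NULL) return 3;
    *out = b->len;
    return 0;
}
```

A nonzero return value says which of the three loads was the problem, which is roughly the information a debugger session would have to recover from the faulting address.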

DanGrayson avatar Jul 04 '20 17:07 DanGrayson

Here's a different segfault from the same step:

2020-07-09T07:32:52.9528468Z [327/533] Generating Macaulay2/Core/tvalues.m2
2020-07-09T07:32:52.9529470Z FAILED: usr-dist/common/share/Macaulay2/Core/tvalues.m2 
2020-07-09T07:32:52.9530371Z cd /home/runner/work/M2/M2/M2/BUILD/build/usr-dist/common/share/Macaulay2/Core && /home/runner/work/M2/M2/M2/BUILD/build/usr-dist/x86_64-Linux-Ubuntu-18.04/bin/M2-binary -q --silent --stop -e errorDepth=0 --no-preload --no-tvalues /home/runner/work/M2/M2/M2/Macaulay2/m2/tvalues-make.m2 -e "make \"/home/runner/work/M2/M2/M2/Macaulay2/d/\"; exit 0"
2020-07-09T07:32:52.9530884Z -- SIGSEGV
2020-07-09T07:32:52.9531524Z -* stack trace, pid: 78155
2020-07-09T07:32:52.9531994Z  0# stack_trace(std::ostream&, bool) at ../../Macaulay2/bin/main.cpp:124
2020-07-09T07:32:52.9532288Z  1# segv_handler at ../../Macaulay2/bin/main.cpp:241
2020-07-09T07:32:52.9532744Z  2# 0x00007F062E28EF20 in /lib/x86_64-linux-gnu/libc.so.6
2020-07-09T07:32:52.9533040Z  3# binding_lookup_1 at /home/runner/work/M2/M2/M2/Macaulay2/d/binding.d:424
2020-07-09T07:32:52.9533321Z  4# lookup at /home/runner/work/M2/M2/M2/Macaulay2/d/binding.d:426
2020-07-09T07:32:52.9533607Z  5# binding_bind at /home/runner/work/M2/M2/M2/Macaulay2/d/binding.d:684
2020-07-09T07:32:52.9533890Z  6# binding_bind at /home/runner/work/M2/M2/M2/Macaulay2/d/binding.d:716
2020-07-09T07:32:52.9534165Z  7# binding_bind at /home/runner/work/M2/M2/M2/Macaulay2/d/binding.d:716
2020-07-09T07:32:52.9534435Z  8# binding_bind at /home/runner/work/M2/M2/M2/Macaulay2/d/binding.d:670
2020-07-09T07:32:52.9534709Z  9# binding_localBind at /home/runner/work/M2/M2/M2/Macaulay2/d/binding.d:807
2020-07-09T07:32:52.9534997Z 10# readeval3 at /home/runner/work/M2/M2/M2/Macaulay2/d/interp.dd:272
2020-07-09T07:32:52.9535274Z 11# readeval at /home/runner/work/M2/M2/M2/Macaulay2/d/interp.dd:285
2020-07-09T07:32:52.9535554Z 12# interp_process at /home/runner/work/M2/M2/M2/Macaulay2/d/interp.dd:600

mahrud avatar Jul 09 '20 08:07 mahrud

That line is just return binding_globalLookup(w); and nothing there could cause a segmentation fault, so something must be missing from the stack trace.

DanGrayson avatar Jul 09 '20 12:07 DanGrayson

The stack trace is just using libbacktrace. If the line numbers are wrong, perhaps the scc1 generated line numbers are wrong?

mahrud avatar Jul 09 '20 20:07 mahrud

No, because I looked in the corresponding C files, too.

DanGrayson avatar Jul 09 '20 20:07 DanGrayson

Well, one far-fetched possibility is that something scribbled over the return address on the stack, so when the function returned, it went into outer space.
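As a rough illustration of that possibility, the sketch below models a stack frame as a struct holding a local buffer followed by a saved return address. This is a toy model only: `toy_frame`, `unchecked_write`, and `return_address_intact` are hypothetical names, and real frame layouts depend on the ABI and compiler. The write is kept inside the struct so the behavior is well defined, unlike a real stack overflow.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy model of a stack frame: a local buffer followed by the saved
   return address. Real frames are laid out by the compiler and ABI. */
struct toy_frame {
    char      local_buf[8];
    uintptr_t return_address;
};

/* Copies n bytes into the frame starting at local_buf (offset 0), the way
   a write with no bounds check on local_buf would. n is limited to the
   size of the whole struct so the copy never leaves the object. */
void unchecked_write(struct toy_frame *f, const void *src, size_t n) {
    assert(n <= sizeof *f);
    memcpy((unsigned char *)f, src, n);
}

/* Returns 1 if the saved return address still has the expected value. */
int return_address_intact(const struct toy_frame *f, uintptr_t expected) {
    return f->return_address == expected;
}
```

Writing at most `sizeof f.local_buf` bytes leaves `return_address` alone; writing `sizeof(struct toy_frame)` bytes replaces it, and in a real frame the function would then "return" to whatever address those bytes happen to encode.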

DanGrayson avatar Jul 09 '20 20:07 DanGrayson

I still don't understand why this only happens in GitHub Actions. Perhaps we need to try it with the same hardware limits?

mahrud avatar Jul 09 '20 21:07 mahrud

7GB of RAM is more than enough. I'm at 2.9GB for my Ubuntu 18 virtual machine.

Could you point me to an action where you see that error?

DanGrayson avatar Jul 09 '20 21:07 DanGrayson

I didn't keep the log for the last stack trace, but see the top comment for a link to the other stack trace.

mahrud avatar Jul 09 '20 22:07 mahrud

It happened again: https://github.com/mahrud/M2/runs/857198730?check_suite_focus=true#step:11:10359

mahrud avatar Jul 10 '20 08:07 mahrud

So, at the same place as before. Here is all the C code on that "line":

  # line 424 "/home/dan/src/M2/M2/Macaulay2/d/binding.d"
  return binding_globalLookup(w);
  # line 424 "/home/dan/src/M2/M2/Macaulay2/d/binding.d"
  }
# line 424 "/home/dan/src/M2/M2/Macaulay2/d/binding.d"
static M2_string str__108;
# line 424 "/home/dan/src/M2/M2/Macaulay2/d/binding.d"
static M2_string str__109;
# line 424 "/home/dan/src/M2/M2/Macaulay2/d/binding.d"
static M2_string str__110;

A "return" statement can cause a segmentation fault only if someone has scribbled on the stack so the return address is bad. I think.

Here's the corresponding D code:

export lookup(w:Word,d:Dictionary):(null or Symbol) := (
     while (
	  when lookup(w,d.symboltable) is null do nothing is e:Symbol do return e;
	  d != d.outerDictionary ) do d = d.outerDictionary;
     globalLookup(w));

Does it happen just with gcc-9?

DanGrayson avatar Jul 10 '20 12:07 DanGrayson

I've seen it happen with gcc-6 as well, but always only on Ubuntu.

mahrud avatar Jul 10 '20 13:07 mahrud