hackcode1 segfaults intermittently

Open teuben opened this issue 3 years ago • 16 comments

hackcode1 now intermittently segfaults. This was already the case on Ubuntu 20 and persists on Ubuntu 22. Slight correction: it's actually a bus error.

teuben avatar May 14 '22 01:05 teuben

It seems that on those machines where it crashes, the default build gives a core-dumping program, but the re-compile

         mknemo -t -T hackcode1

gives a working code.

teuben avatar May 17 '22 23:05 teuben

The branch issue98 has a script to simplify triggering the bug. Making some progress there, but this sure is a hard nut to crack.

teuben avatar Oct 13 '22 13:10 teuben

With the latest NEMO master version from GitHub, on the macosx platform with clang, hackcode1 segfaults, which makes io_nemo_test (make check) fail.

jcldc avatar Oct 13 '22 13:10 jcldc

So far I was not able to crash it on an AMD, but I agree I could crash it on Mac as well. A compiler bug on Intel? I was able to run gdb and see the structure with members pointing to random 64-bit values, where it then segfaults. I need a rainy day.

teuben avatar Oct 13 '22 14:10 teuben

I too need a rainy day to dig into it.

jcldc avatar Oct 13 '22 14:10 jcldc

For the record: the script crash100 in src/nbody/evolve/hackcode/hackcode1 is what I've been using to trigger a crash.

I also just realized zeno's treecode is essentially the same code as hackcode1. Need a snowy day for that.

teuben avatar Oct 13 '22 15:10 teuben

"snowy day " :D

jcldc avatar Oct 13 '22 16:10 jcldc

On an amusing note: this past weekend it rained a lot, and I installed macOS in a virtual (QEMU) box via the sosumi tool. It also fails in this environment. No surprise, since it also died on native Mac.

On the other hand, I also tried the zeno 'treecode', and it never crashed. Also ran another 100 compilations of NEMO on an AMD. It did not crash.

teuben avatar Oct 14 '22 02:10 teuben

Ran into a case where the bug was also triggered in hackforce; replacing it with hackforce_qp solved it.

Note added: the crash100 script will also make hackcode1_qp fail eventually.

teuben avatar Mar 06 '23 02:03 teuben

Using typedef long atype; instead of a short did not resolve the bug.

teuben avatar Mar 14 '23 00:03 teuben

Actually, after a fresh install of NEMO, I again got the segmentation fault (core dumped) from hackforce (during the io_nemo test suite).

I was able to find the faulty line: line 125 in src/nbody/evolve/hackcode/hackcode1/grav.c

121 
122 local bool subdivp(nodeptr p,      /* body/cell to be tested */
123                    real dsq)       /* size of cell squared */
124 {
125     if (Type(p) == BODY)                        /* at tip of tree?          */
126         return (FALSE);                         /*   then cant subdivide    */

The debugger says that the pointer p is not null, but it is probably not pointing to an allocated part of memory, which is why it crashes.

Then I recompiled hackforce with the "-O2" option turned off in $NEMOLIB/makedefs, and there were no errors (no core dump) when running hackforce.

Finally I put the "-O2" option back in the makedefs file, and the error/core dump vanished!!!! No more core dump when running hackforce.

That's really really weird.

jcldc avatar Mar 14 '23 12:03 jcldc

Was yours a segfault or a bus error? A bus error points to an alignment error, and the body/node has a "short type", which made me suspicious. I made it a long; this didn't fix it. I also tried single-precision NEMO, which didn't solve it either. It has something to do with casting between body and node, and overlaying those structs (see defs.h).

I've documented some more cases I tested in the crash100 script. Everything is just bizarre. As I said, the mother of all bugs.

And on single-precision NEMO the error was a segfault, not a bus error as in the default double precision.

teuben avatar Mar 14 '23 12:03 teuben

In my case it was Bus error (core dumped), but it vanished once I recompiled the code....

jcldc avatar Mar 14 '23 12:03 jcldc

Robert Zhang noted that flipping the quad and subp[] in the cell typedef made it work. This hinted that for hackcode1 it was not including the right .o file, which pointed to a Makefile that was not strict enough.

Thus, consider this bug fixed; a pull request will follow.

teuben avatar Mar 30 '23 03:03 teuben

Robert Zhang noted that flipping the quad and subp[] in the cell typedef made it work. This hinted that for hackcode1 it was not including the right .o file, which pointed to a Makefile that was not strict enough.

Thus, consider this bug fixed; a pull request will follow.

weird...

jcldc avatar Mar 30 '23 07:03 jcldc