McCode icon indicating copy to clipboard operation
McCode copied to clipboard

Limit on total number of McStas monitors (segfault)

Open mads-bertelsen opened this issue 3 years ago • 15 comments

There seems to be an upper limit on the number of monitors allowed in a single instrument, which is much lower than the total component limit. On McStas 2.5 I found it to be a maximum of 377, with anything above producing a segfault at runtime. I have attached an instrument that use 378 PSD_monitors which fails in this manner (txt ending as GitHub didn't want instr).

n_component_test.txt

I have tried combinations of different monitors too, as long as the total reaches more than 377 the segfault occurs. The number of histogram bins in each monitor does not impact the total allowed number of monitors.

mads-bertelsen avatar Jan 14 '21 14:01 mads-bertelsen

Thanks for the bug report @mads-bertelsen. As discussed elsewhere already, I believe this may be system dependent. I have successfully fun the instrument on macOS 11 Big Sur with McStas 2.5, 2.6.1, 2.7 and 3.0.

My compiler is

Apple clang version 12.0.0 (clang-1200.0.32.28)
Target: x86_64-apple-darwin20.2.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

I will be back with more observations later :-)

willend avatar Jan 14 '21 14:01 willend

Also works OK on Catalina with 2.5-generated code and

Apple clang version 12.0.0 (clang-1200.0.32.28)
Target: x86_64-apple-darwin19.6.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin

willend avatar Jan 14 '21 15:01 willend

I have installed myself a fresh macOS 11.13 High Sierra VM, and loaded up a McStas 2.5. On this system I have Screen Shot 2021-01-14 at 20 50 19

I can also not reproduce the issue here??

willend avatar Jan 14 '21 19:01 willend

It does however seem that my compiler is slightly different still from what @mads-bertelsen initially reported:

Apple LLVM version 10.0.0 (clang-1000.11.45.5)
Target: x86_64-apple-darwin17.7.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

which seems to be the "full" Xcode rather than the slim "command line tools". Will now try loading a full Xcode next

willend avatar Jan 14 '21 19:01 willend

Ah! Now reproduced without updating Xcode - with:

mcstas n_component.txt
clang -lm n_component.txt.c
./a.out

The default McStas CFLAGS include compiler optimisation to level -O2 which is what I tested with before - and where the segfault did NOT occur above... Revisiting Catalina and Big Sur...

willend avatar Jan 14 '21 20:01 willend

OK, now also reproduced on Big Sur for the 2.x releases, 3.0 does not behave this way: Screen Shot 2021-01-14 at 21 11 24

  • and the moment one uses the -O2 option used by default in McStas the issue goes away...

willend avatar Jan 14 '21 20:01 willend

Interesting, sorry I didn't mention I was running mcstas without the -O2 flag, was purely because of the long compile times for testing. Surprising that -O2 somehow fixes the problem, would think -O2 introduced risk of issues rather than fixing them.

mads-bertelsen avatar Jan 15 '21 06:01 mads-bertelsen

@mads-bertelsen agreed, this is a bit surprising. But maybe (by sheer luck) -O2 clears out e.g. some unused vars/arrays that are part of the equation?

I tried running on Linux with gcc, no issue there - so looks quite platform dependent. @ebknudsen is quite good at debugging with gdb, but I believe this is not available on Mac, so to really figure out what is going on I guess we really ought to learn ourselves some https://lldb.llvm.org...

willend avatar Jan 15 '21 09:01 willend

Or maybe some Valgrind could hint in the direction of some issues? On the other hand, that is mostly for dynamic arrays etc., right? (Which maybe are not really the be the issue here: McStas 2.5 has static arrays in the PSD's, 2.6.1/2.7 uses dynamic ones...)

willend avatar Jan 15 '21 09:01 willend

Since it only happens in monitors, it is clear DETECTOR_OUT macros are involved. Removing these from the SAVE section fixes the problem. Digging into these in mccode-r.c I found that specifically it is the call to

detector = mcdetector_statistics(detector);

Commenting this line out of the mcdetector_import function fixes the issue. The strange thing is that if I leave it in, but return from that function immediately using

return(detector);

The segfault still happens. It should bypass the function completely. I can't even have that function print a single line, so the error somehow happens in the call of the function itself rather than in the function.

EDIT: It seems earlier printf statements are also not printed when a segfault occurs, so maybe its not directly in the call. The failure is connected to mcdetector_statistics though, since it works without it.

mads-bertelsen avatar Jan 15 '21 10:01 mads-bertelsen

on a side note: you can't always trust printf's to be echoed to the screen in case of segfaults, as they are buffered, and the program may exit before the buffer is flushed.

ebknudsen avatar Jan 15 '21 11:01 ebknudsen

I have found a surprising fix by modifying mccode-r.c

EDIT: This fix only extend the limit with a few extra monitors, not enough for our use-case. It may serve as a indicator of what the problem is.

The function mcdetector_statistics takes a MCDETECTOR and returns an MCDETECTOR, but its the call with an MCDETECTOR input at that stage which causes the crash. Behind the senses a copy operation happens. For some reason that fails. I changed the function to take a pointer, and declare a new MCDETECTOR by starting the function like this:

MCDETECTOR mcdetector_statistics(
  MCDETECTOR *p_detector)
{
  MCDETECTOR detector;
  detector = *p_detector;

And changed the call to

detector = mcdetector_statistics(&detector);

Now everything seems to work normally. Have attached the new mccode-r.c. I am unsure if this is the appropriate way to fix this issue, so haven't started a branch with a fix. mccode-r.txt

It may be bad practice to write all these mcdetector functions to take this large MCDETECTOR struct and return it, each time making a big unnecessary copy operation instead of just providing pointers to the existing MCDETECTOR struct.

mads-bertelsen avatar Jan 16 '21 10:01 mads-bertelsen

Continued debugging of the mcdetector_import makes it clear something strange is going on. At high monitor counts, these innocent looking lines with compatible types (long in struct, long given to function) is causing the crash:

detector.n    = n; 
detector.p    = p;
detector.m    = m;

By printing values for detector.n, detector.p and detector.m, it became clear that before the above code that sets them for this detector, they are somehow initialized to the values of the previous detector instance. I don't understand how that can be. A new struct is declared with static variables, and some of these already have values that were set by a previous function call that declared another version of that struct.

EDIT: Here I comment all parts of mcdetector_import after the above lines but excluding the last return statement to understand exactly where the crash occurs.

mads-bertelsen avatar Jan 18 '21 08:01 mads-bertelsen

Just to be clear - is this before or after the above fix to mccode-r.c was applied?

ebknudsen avatar Jan 18 '21 10:01 ebknudsen

In the above I didn't even run the mcdetector_statistics function, at high enough monitor counts the crash happens without it. Added an edit showing that I commented out the remainder of the code in the function, including that call.

mads-bertelsen avatar Jan 18 '21 10:01 mads-bertelsen