Ceedling Improve segfault handling, fix unit tests on Windows

The Ceedling rspec tests were not passing on Windows (MinGW-W64), primarily because segfaults were not detected correctly. I have improved the segfault detection logic and test suite output as well as fixing a few other small issues I encountered along the way.

Mar 04 '24 05:03 tbates-redarc

@tbates-redarc Thank you for this! There's some good changes and fixes in here. You also helped us see a couple further fixes needed. And thank you for doing the work on the tests.

Your changes aren't all that many, but they touch some sensitive and recently changed areas. We know some further changes we want to make as well. So, rather than directly merge, we're going to try peeling this PR off into a new branch, working on it with you, manually testing it thoroughly, and then merging, if that's okay?

We're in the final stretch for releasing 0.32. These are important fixes you submitted. We need to finish up a handful of other things before returning to your PR.

Mar 06 '24 15:03 mkarlesky

Thanks, happy to support whatever approach works for you. I would love it if you can get #818 into 0.32 as well.

Mar 06 '24 23:03 tbates-redarc

Sorry for the slow response on #818, @tbates-redarc. It's very much on the radar and in the backlog for the final release. You pointed out some very real issues and made some great suggestions. Ceedling has just sorta grown and changed organically over time, and the set up order has not been properly revisited. Since you've suggested a pretty significant change, we need to really spend some time digesting and exercising it. But, rest assured, it will get attention before the release. We'll definitely reach out in a bit.

Mar 08 '24 14:03 mkarlesky

@tbates-redarc Finally working on incorporating your work into the nearly complete next release of Ceedling…

Can you help me understand the segfault logic you wrote in generator.rb? Specifically, I want to understand the check for exitstatus == 3. I am not aware of any specific exit handling in Unity that exits with 3, nor am I aware of a common platform convention for exit code 3. The logic for termsig == 11 makes lots of sense for Unix-like platforms, of course.

Thank you.

May 12 '24 18:05 mkarlesky

@mkarlesky Segmentation faults / access violations in Windows cause an SEH exception. These are used to implement C++ exceptions on Windows. The default behaviour for an unhandled C++ exception is to call the abort() function. On Windows, the abort() function returns an exit code of 3.

The only part of this I'm uncertain of is that SEH exceptions are not identical to C++ exceptions, and some references suggest that an unhandled exception will use the exception code as the exit code. For a STATUS_ACCESS_VIOLATION this would be 0xC0000005. However in my testing using MinGW-W64 and GCC, I got exit code 3. Perhaps on different implementations of the C compiler and/or runtime the outcome might be different.

Also it's worth noting that any call to abort() would cause the process to exit with code 3, so it is not ideal for detecting segfaults in this application. Perhaps a better solution would be for the Unity test runner to register its own signal/exception handler that e.g. prints the same 'Segmentation fault.' text that we're used to from Unix.

May 13 '24 08:05 tbates-redarc

@tbates-redarc Thanks for the prompt reply!

I can confirm that using MSYS on Windows, the exit code for an access violating is 0xC0000005.

More importantly, Unity’s behavior is to emit an exit code with the number of failed tests (0 being success, of course) up to the platform limit on exit codes or any other platform manipulations. So, checking the exit code is problematic no matter what way we go. There's actually a not insignificant method in Ceedling that hardly ever gets used that does some basic validity checks on all this. It's complicated enough that it's never been enabled by default.

We're looking at simply expanding the concept of segfault checking to crash detection generally. We can check for certain conditions on some platforms for definitive evidence of a crash. Otherwise, we'll just look to see if there's no test summary. With high likelihood, the lack of a test summary means something went boom for any of the handful of reasons such things can happen.

May 13 '24 14:05 mkarlesky

@tbates-redarc I have new and improved crash detection working on both Linux and Windows now. I also expanded our Github Action CI setup to run the entire build on both Linux and Windows. So far, it looks pretty good.

I was surprised to find that all the backtrace / gdb features worked in the Windows build given the many Windows-specific changes you made for those features. My suspicion is that the build works because Github Actions default to MSYS on Windows. You were probably running into problems in a less Unix-y Windows setup? Can you briefly outline the changes you made related to the backtrace feature and why you made those changes? It looks like some good thought and some beads of sweat went into it!

May 14 '24 20:05 mkarlesky

@mkarlesky mostly it was about getting nicer and cleaner output from gdb.

As you've already seen, I created a genuine null pointer dereference in the test suite instead of calling raise because access violations on Windows don't cause signals natively (they cause exceptions which MinGW and friends translate into signals).
I found the flattening and unflattening of the gdb output (to avoid the backtrace being matched in GeneratorTestResults#process_and_write_results # process unity output) to be unreliable so I replaced that with a redirection to a file (which also worked differently on Linux and Windows as described in the comments in defaults.rb), so that I could be sure that test output and gdb (backtrace) output would be properly separated. This made for much nicer output as the failing test would show up in the test summary with the file and line number of the crash, without a mangled version of same also showing up in the test output.
The --eval-command run --eval-command backtrace that was on the default gdb command line caused gdb to either a) complain about being asked to exit while the child process was still running after printing the backtrace, or b) complain about being asked for a backtrace for a child process that had already terminated. To suppress these messages I added the backtrace.gdb script because I couldn't make gdb accept the if $_isvoid ($_exitcode) conditional on the command line.
I added a call to fflush to make sure all the child process's output was printed before the backtrace to avoid confusion where a print statement just before a crash appears not to have been run (but it's actually just buffered).
Also I spent way too long being completely baffled by strings showing up frozen in the middle of tests, where there was nothing that could possibly be freezing them, only to finally figure out that Ruby itself freezes the strings in $LOAD_PATH inside rb_construct_expanded_load_path() in load.c, but only when something tries to search the load path (e.g. a call to pp). I put the solution (dup the string before adding it to $LOAD_PATH) in a separate commit.

May 17 '24 07:05 tbates-redarc

@tbates-redarc This is so very helpful.

I have already imported your more general solution to triggering a segfault as well as your shell handling improvements. But, I have not delved into any significant changes to backtrace itself yet.

Backtrace is all a great new feature, but the community submitted code is a bit difficult to follow and work with — especially for layering in some basic improvements we have in mind. Your hard work and these explanations are so helpful to that end!

May 18 '24 14:05 mkarlesky

I think we got it, @tbates-redarc. Your PR was such a helpful reference. Other refactoring and some other features mean rejecting your PR, but its spirit and pieces of its code are now in the pre-release code.

I found a way to “flatten” / ”unflatten” multiline text that is quite a bit simpler and much more reliable than the original. If we can avoid writing to a file only to turn around and use its contents, we should. File I/O slows down a build. More importantly, writing to a file and then reading it back as a type of information exchange between parts of the application can cause trouble in a multi-threaded build (especially on Windows). Sometimes files don't get flushed to the filesystem before other code expects their existence. This approach also eliminates double tool declarations depending on Windows or Linux.

Jun 13 '24 01:06 mkarlesky