Tests Segfault During Finalization With Thread Sanitizer Enabled
The following tests fail during finalization with thread sanitizer enabled (clang 17, nemesis scheduler, no topology detection, x86): aligned_writeFF_basic, reinitialization, qthread_stackleft, and sinc_workers. Interestingly, these tests do not fail when run by themselves or with a debugger. They also intermittently succeed. There appears to be some kind of segfault during finalization. At least with aligned_writeFF_basic, it successfully executes the test and then crashes instead of exiting normally. I'm investigating the behavior of the others to get more information.
Yah, in the reinitialization test it only gets to the qthread_finalize call, so something's wrong in there.
Got a backtrace! https://github.com/sandialabs/qthreads/blob/55ad590801b7188aa5f0e34a8946260a2726046c/src/qthread.c#L2011 (yes, the debugger is pointing to a bracket, so some other nearby line is likely the actual culprit) called by https://github.com/sandialabs/qthreads/blob/55ad590801b7188aa5f0e34a8946260a2726046c/src/qthread.c#L704.
My current working theory is that thread sanitizer needs some kind of instrumentation to cope with our context swapping. See, for example, https://github.com/boostorg/context/issues/124 (as well as the linked LLVM patch there: https://reviews.llvm.org/D54889). Given that, I'm less certain about how to proceed here. I'll look at some other bugs and circle back to this one.
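For reference, here is a minimal sketch of what that kind of instrumentation could look like, using the fiber annotations from `<sanitizer/tsan_interface.h>` (`__tsan_get_current_fiber`, `__tsan_create_fiber`, `__tsan_switch_to_fiber`, `__tsan_destroy_fiber`) around a plain `ucontext` switch. This is not qthreads code: the `demo_*` names are hypothetical, and it assumes a `swapcontext`-style switch rather than our own swapctxt implementation. Whether annotations like these would actually address the crash here is untested.

```c
#include <assert.h>
#include <stdlib.h>
#include <ucontext.h>

/* Detect ThreadSanitizer: clang exposes __has_feature, GCC defines
 * __SANITIZE_THREAD__. DEMO_TSAN is a hypothetical macro for this sketch. */
#if defined(__has_feature)
#  if __has_feature(thread_sanitizer)
#    define DEMO_TSAN 1
#  endif
#endif
#if !defined(DEMO_TSAN) && defined(__SANITIZE_THREAD__)
#  define DEMO_TSAN 1
#endif

#ifdef DEMO_TSAN
#  include <sanitizer/tsan_interface.h>
static void *demo_main_fiber; /* TSan handle for the calling context */
static void *demo_task_fiber; /* TSan handle for the task context */
#endif

static ucontext_t demo_main_ctx, demo_task_ctx;
static int demo_result;

/* Runs on the task stack, then switches back to the caller. */
static void demo_task(void) {
    demo_result = 42;
#ifdef DEMO_TSAN
    /* Announce the switch to TSan immediately before performing it. */
    __tsan_switch_to_fiber(demo_main_fiber, 0);
#endif
    swapcontext(&demo_task_ctx, &demo_main_ctx);
}

/* Runs demo_task once on a freshly allocated stack and returns its result. */
int demo_run_task_once(size_t stack_size) {
    char *stack = malloc(stack_size);
    assert(stack != NULL);

    demo_result = 0;
    int rc = getcontext(&demo_task_ctx);
    assert(rc == 0);
    (void)rc;
    demo_task_ctx.uc_stack.ss_sp = stack;
    demo_task_ctx.uc_stack.ss_size = stack_size;
    demo_task_ctx.uc_link = NULL; /* demo_task never returns normally */
    makecontext(&demo_task_ctx, demo_task, 0);

#ifdef DEMO_TSAN
    demo_main_fiber = __tsan_get_current_fiber();
    demo_task_fiber = __tsan_create_fiber(0);
    __tsan_switch_to_fiber(demo_task_fiber, 0);
#endif
    swapcontext(&demo_main_ctx, &demo_task_ctx);

#ifdef DEMO_TSAN
    /* Back on the main fiber; the task fiber is finished. */
    __tsan_destroy_fiber(demo_task_fiber);
#endif
    free(stack);
    return demo_result;
}
```

Calling `demo_run_task_once(64 * 1024)` runs the task on a 64 KiB stack. Without `-fsanitize=thread` the annotations compile away, so the same source builds either way.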
New guess on this one: this is just because of the pretty heavy stack usage from thread sanitizer. Increasing the stack size seems to mitigate it. Currently exploring to see what kind of stack size (if any) can eliminate it entirely.
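As an illustration of the kind of adjustment being tried, here is a minimal sketch that scales a base stack size when the build is instrumented with thread sanitizer. The helper name and the factor of 4 are assumptions for illustration only. No stack size has been settled on yet, and this is not how qthreads picks its stack size.

```c
#include <assert.h>
#include <stddef.h>

/* Detect ThreadSanitizer: clang exposes __has_feature, GCC defines
 * __SANITIZE_THREAD__. DEMO_TSAN_BUILD is a hypothetical macro. */
#if defined(__has_feature)
#  if __has_feature(thread_sanitizer)
#    define DEMO_TSAN_BUILD 1
#  endif
#endif
#if !defined(DEMO_TSAN_BUILD) && defined(__SANITIZE_THREAD__)
#  define DEMO_TSAN_BUILD 1
#endif

/* Hypothetical multiplier; the right value (if any) is still unknown. */
#define DEMO_TSAN_STACK_FACTOR 4

/* Returns the stack size to use for a task, given the normal size. */
size_t demo_pick_stack_size(size_t base_bytes) {
    assert(base_bytes > 0);
#ifdef DEMO_TSAN_BUILD
    /* Instrumented frames are larger, so leave extra headroom. */
    return base_bytes * DEMO_TSAN_STACK_FACTOR;
#else
    return base_bytes;
#endif
}
```

In a normal build `demo_pick_stack_size(8192)` returns the base size unchanged; under `-fsanitize=thread` it returns the scaled value.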
Okay, adjusting the stack size and some of the problem sizes was enough to "fix" the issue on x86, but it doesn't seem to be enough on ARM. Current best guess is there's still something else wrong, but adjusting the stack/problem sizes can mask the issue on some architectures.
Tracked the segfault in the main thread to here: https://github.com/sandialabs/qthreads/blob/55ad590801b7188aa5f0e34a8946260a2726046c/src/qthread.c#L1474. Specifically, it segfaults inside the call to pthread_join itself, not during any of the memory accesses for the function parameters. Currently investigating what the other shepherd thread is doing before this happens.
Okay, at least for the reinitialization test, the segfault is happening on return from our swapctxt in the non-main thread. Specifically, the call at https://github.com/sandialabs/qthreads/blob/55ad590801b7188aa5f0e34a8946260a2726046c/src/qthread.c#L2285 runs and executes the desired function, but never returns.
Note that even with #231 there are still intermittent crashes in qthread_stackleft and qthread_disable_shepherd. Given that the crashes observed here are consistent, I'm fairly sure those are separate bugs. #144 is probably one of them. The qthread_disable_shepherd failure is new and unique to thread sanitizer, so I'll open a separate issue for it.