sync_workers Hangs Intermittently on Clang/Linux
Noticed this in CI. sync_workers occasionally hangs when building with Clang on Linux. This may be related to https://github.com/sandialabs/qthreads/issues/145, but I'm creating a separate issue for this until that can be confirmed.
Same with syncvar_prodcons.
Same with sync_null
Same with aligned_prodcons.
Actually these failures and those documented in https://github.com/sandialabs/qthreads/issues/151 and https://github.com/sandialabs/qthreads/issues/153 all only show up with clang/Linux/sherwood scheduler when using 'hwloc' or 'binders' topology. Given that the circumstances are so specific it seems like these are probably all related. I'll close those issues in favor of this one.
Interestingly enough, https://github.com/sandialabs/qthreads/issues/145 is not isolated to these circumstances so I'll keep that one separate.
This means that the tests affected (that I've found thus far) are:
hello_world
qthread_incr
qthread_dincr
arbitrary_blocking_operation
sync_workers
syncvar_prodcons
sync_null
aligned_prodcons
This list is likely not exhaustive since many of these failures are very intermittent, but it's a good starting point.
Interestingly, I am seeing the sync_workers hang with the nemesis scheduler when the "paranoia" checks are enabled, so maybe there's some more info there too.
Note: enabling assertions makes these bugs disappear for some reason.
Just saw qthread_incr hang on OSX/M1. This is with the Nemesis scheduler. Probably related, but it might still be a separate bug.
I think I just fixed the sync_workers failure while debugging an undefined sanitizer error (that or it just became dramatically more infrequent). As far as I can tell it was just caused by some bad synchronization in the test itself that was allowing garbage values into the computation and allowing it to sometimes hang instead of terminating correctly. This would mean that the other intermittent hangs in other tests may not be caused by the same code internal to the library and instead be caused by similar API misuse in those tests too.
The consistent failure on ARM documented in https://github.com/sandialabs/qthreads/issues/164 is still present.
Likely related: the aligned_prodcons test crashes with the thread sanitizer enabled (clang 17, nemesis scheduler, no topology detection, x86-64). Oddly, it does so some time during the cleanup stage, so the test succeeds but then something during the program finalization crashes. Removing the call to qthread_readFF from within the main thread eliminates the crash. Unfortunately the crash only happens when the test is run within our test suite and notably not when it's run in a debugger. As best I can tell it's not crashing within any of the functions we're registering via atexit which is bizarre.
Actually I'm seeing the weird at exit failure with thread sanitizer with a handful of other tests and they're distinct from the ones that hang, so my best guess is that these are distinct bugs. I'll put up a separate issue for that.
Haven't seen this one show up in a while. I'm assuming some other fix resolved it along the way.