qthreads icon indicating copy to clipboard operation
qthreads copied to clipboard

Race Conditions in `qutil`

Open insertinterestingnamehere opened this issue 1 year ago • 2 comments

I'm seeing a handful of thread sanitizer errors about race conditions in the qutil code.

One example: apparently the macro invoked at https://github.com/sandialabs/qthreads/blob/a2d1dbadb80a161a4a287c1056348808b022ba04/src/qutil.c#L152, https://github.com/sandialabs/qthreads/blob/a2d1dbadb80a161a4a287c1056348808b022ba04/src/qutil.c#L156, and https://github.com/sandialabs/qthreads/blob/a2d1dbadb80a161a4a287c1056348808b022ba04/src/qutil.c#L160 includes both a read and a corresponding atomic write from another thread.

Another example: non-atomic read: https://github.com/sandialabs/qthreads/blob/a2d1dbadb80a161a4a287c1056348808b022ba04/src/qutil.c#L897 atomic write to the same variable: https://github.com/sandialabs/qthreads/blob/a2d1dbadb80a161a4a287c1056348808b022ba04/src/qutil.c#L903

Similar: https://github.com/sandialabs/qthreads/blob/a2d1dbadb80a161a4a287c1056348808b022ba04/src/qutil.c#L909 https://github.com/sandialabs/qthreads/blob/a2d1dbadb80a161a4a287c1056348808b022ba04/src/qutil.c#L915

Another more unusual one: The macro invoked at https://github.com/sandialabs/qthreads/blob/a2d1dbadb80a161a4a287c1056348808b022ba04/src/qutil.c#L170 has a non-atomic write as well as an atomic write to the same variable from a different thread via qthread_syncvar_fill at https://github.com/sandialabs/qthreads/blob/a2d1dbadb80a161a4a287c1056348808b022ba04/src/syncvar.c#L609.

Poking at some of the memory fences for syncvars seems like a partial fix. Some additional atomics in the nemesis scheduler may be necessary to get the sequencing right, but at this point I'm still not sure. The weird thing though is that some of the thread sanitizer tracebacks here seem to be getting garbled and it's not at all clear why/how that's happening. It may be that a better-debugged https://github.com/sandialabs/qthreads/pull/249 is needed to fix these warnings.

Update on this one, I'm fairly sure a lot (all?) of these errors are caused by thread sanitizer not being able to properly follow dependencies through the atomic logic in our various threadqueues. It's fixable, but it'll require an extension beyond the interop in #249.