Is task_group::wait() thread-safe?
Is task_group::wait() thread-safe? Suppose we have a shared tbb::task_group that is used to schedule some work but then multiple threads call wait() on the same task_group.
I can't find anything in the docs that mention this but in the implementation, I see comments that suggest that this is not the case.
task_group_status wait() {
bool cancellation_status = false;
try_call([&] {
d1::wait(m_wait_ctx, context());
}).on_completion([&] {
// TODO: the reset method is not thread-safe. Ensure the correct behavior.
cancellation_status = m_context.is_group_execution_cancelled();
context().reset();
});
return cancellation_status ? canceled : complete;
}
So the follow up question (assuming the TODO comment above is correct), what is the recommended workaround?
@e4lam, you are right that we need to clarify in the documentation and also improve the implementation.
As currently implemented:
- Any task that is run by a thread T in a
task_groupG before a call toG.wait()by the same thread T will be complete before the call by T towaitreturns. Stated more simply, a thread is guaranteed to at least wait for the completion of all tasks it has submitted prior to its call to wait. - There can be races between
runandwaitin different threads, since a call torunon a thread T may not happen before the call towaiton a thread S. This is not lack of thread safety, simply the ordering of calls in different threads. - However, as you have noticed,
waithas possible thread safety issues if thetask grouphas been explicitly canceled or there is an exception. Calls towaiton different threads can get differenttask_group_statusvalues returned, sinceresetmay be called by thread S before the status value is read by a thread T. Inreset, which is called bywait, there is also the potential for a double free of the exception.
Since you have raised this issue, we plan to improve our documentation to more accurately reflect the current implementation and also look at ways to improve thread safety in the case of explicit cancelation or exceptions.
Thank you for the thorough answer!
For completeness, I should mention this came about as I was trying to decipher what/why/how OpenUSD was hanging for me here: https://github.com/PixarAnimationStudios/OpenUSD/blob/ceab7e6541c02c9376a03de84bb1174e95e7264e/pxr/base/work/dispatcher.cpp#L55
I haven't managed to make any head way on a proper fix as it's taking me a bit to be able to get a debug build up and running. But I'll mention the reference anyways because it's interesting to see what callers have resorted to. :)
See PR #1780
I note that PR #1780 merely says "more" thread-safe. But it doesn't mean that it's now actually thread-safe? If that is the case, is more work planned here?