oneTBB icon indicating copy to clipboard operation
oneTBB copied to clipboard

Is task_group::wait() thread-safe?

Open e4lam opened this issue 6 months ago • 4 comments

Is task_group::wait() thread-safe? Suppose we have a shared tbb::task_group that is used to schedule some work but then multiple threads call wait() on the same task_group.

I can't find anything in the docs that mention this but in the implementation, I see comments that suggest that this is not the case.

    task_group_status wait() {
        bool cancellation_status = false;
        try_call([&] {
            d1::wait(m_wait_ctx, context());
        }).on_completion([&] {
            // TODO: the reset method is not thread-safe. Ensure the correct behavior.
            cancellation_status = m_context.is_group_execution_cancelled();
            context().reset();
        });
        return cancellation_status ? canceled : complete;
    }

So the follow up question (assuming the TODO comment above is correct), what is the recommended workaround?

e4lam avatar Jun 20 '25 19:06 e4lam

@e4lam, you are right that we need to clarify in the documentation and also improve the implementation.

As currently implemented:

  • Any task that is run by a thread T in a task_group G before a call to G.wait() by the same thread T will be complete before the call by T to wait returns. Stated more simply, a thread is guaranteed to at least wait for the completion of all tasks it has submitted prior to its call to wait.
  • There can be races between run and wait in different threads, since a call to run on a thread T may not happen before the call to wait on a thread S. This is not lack of thread safety, simply the ordering of calls in different threads.
  • However, as you have noticed, wait has possible thread safety issues if the task group has been explicitly canceled or there is an exception. Calls to wait on different threads can get different task_group_status values returned, since reset may be called by thread S before the status value is read by a thread T. In reset, which is called by wait, there is also the potential for a double free of the exception.

Since you have raised this issue, we plan to improve our documentation to more accurately reflect the current implementation and also look at ways to improve thread safety in the case of explicit cancelation or exceptions.

vossmjp avatar Jun 24 '25 18:06 vossmjp

Thank you for the thorough answer!

For completeness, I should mention this came about as I was trying to decipher what/why/how OpenUSD was hanging for me here: https://github.com/PixarAnimationStudios/OpenUSD/blob/ceab7e6541c02c9376a03de84bb1174e95e7264e/pxr/base/work/dispatcher.cpp#L55

I haven't managed to make any head way on a proper fix as it's taking me a bit to be able to get a debug build up and running. But I'll mention the reference anyways because it's interesting to see what callers have resorted to. :)

e4lam avatar Jun 25 '25 01:06 e4lam

See PR #1780

vossmjp avatar Aug 25 '25 20:08 vossmjp

I note that PR #1780 merely says "more" thread-safe. But it doesn't mean that it's now actually thread-safe? If that is the case, is more work planned here?

e4lam avatar Aug 26 '25 19:08 e4lam