
task_group: seems like wait() may return prematurely

Open solbjorn opened this issue 10 months ago • 10 comments

Summary

task_group::wait() may return earlier than the tasks queued with ::run() are finished.

Version

2022.0.0 and up to the latest master: affected
2021.13.0: works

Might be related to commit 1f52f5093ec7?

Environment

Intel Alderlake CPU (16 performance threads, 8 energy-efficient threads)
Windows 11 (24H2 currently, although it happens on older builds as well)
ClangCL 19.1.1 from Visual Studio 2022 17.14 (earlier versions suffer, too)
C++20 application (game engine); all tasking and threading is done only via oneTBB

Observed Behavior

Consider the following simple example:

// serial code

parallel_invoke(
    []{ render_task(); },
    []{ service_task(); }
);

// serial code that depends on what render_task() and service_task() do

This works as expected on any version. When parallel_invoke() returns, all the required data is processed and available.

Now we change this to:

// serial code

task_group tg;
tg.run([]{ render_task(); });
tg.run([]{ service_task(); });
tg.wait();

// serial code that depends on what render_task() and service_task() do

According to the documentation, these should behave roughly the same way. But from time to time, on versions newer than 2021.13.0, I get crashes in the code after tg.wait(), with the dumps showing that the data render_task() and service_task() should have produced is not ready yet.

However, a workaround like this:

// serial code

std::barrier wa(3);
task_group tg;

tg.run([&wa]{ render_task(); wa.arrive(); });
tg.run([&wa]{ service_task(); wa.arrive(); });
tg.wait();
wa.arrive_and_wait();

// serial code that depends on what render_task() and service_task() do

works just fine. This made me think tg.wait() exits prematurely.

Where possible, I use parallel_invoke(), parallel_for(), etc., but sometimes I just need to fire an async task and then wait for the results in a different part of the code, so I have to use task_groups anyway (at least until task_arena implements per-task waiters).
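For reference, a minimal sketch of that fire-now, wait-elsewhere pattern (the names g_io_group, start_async_load and wait_for_async_load are illustrative, not from the engine):

#include <oneapi/tbb/task_group.h>

tbb::task_group g_io_group;   // hypothetical group owned by some subsystem

void start_async_load() {
    g_io_group.run([] { /* load assets in the background */ });
}

void wait_for_async_load() {
    g_io_group.wait();        // block until the task queued above has finished
}

int main() {
    start_async_load();
    // ... unrelated work in between ...
    wait_for_async_load();
}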

Expected Behavior

The code after tg.wait() must run only after the task group has finished processing all of its tasks.

Steps To Reproduce

The examples above should be enough, given that the tasks run via tg.run() take some time.

solbjorn avatar Feb 17 '25 20:02 solbjorn

Oh, another interesting observation that may help:

It turned out I forgot to copy tbb12.dll to the engine's folder after rebuilding oneTBB; I only copied the engine binary after rebuilding it with the headers from 2021.13.0. Headers from 2021.13.0 + dll from 2022.1.0 also work. So it looks like the issue is somewhere in the headers...
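One way to double-check which headers and which dll a build actually picked up is to print both the compile-time and the run-time version. A sketch, assuming the macros and functions from oneapi/tbb/version.h are available in both versions being compared:

#include <oneapi/tbb/version.h>
#include <iostream>

int main() {
    // Compile-time values come from the headers the application was built with.
    std::cout << "Headers: " << TBB_VERSION_STRING
              << " (interface " << TBB_INTERFACE_VERSION << ")\n";
    // Run-time values come from whichever tbb12.dll is actually loaded.
    std::cout << "Runtime: " << TBB_runtime_version()
              << " (interface " << TBB_runtime_interface_version() << ")\n";
}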

solbjorn avatar Feb 18 '25 18:02 solbjorn

Hi @solbjorn, thanks for reporting the issue. I have tried to reproduce it, but unfortunately, I was not able to do that. Here is the code I used based on the scenarios you provided:

#include <oneapi/tbb/task_group.h>
#include <chrono>
#include <thread>
#include <iostream>

int main() {
    int* render_data = nullptr;
    int* service_data = nullptr;

    auto render_task = [&] {
        std::this_thread::sleep_for(std::chrono::seconds(20));
        render_data = new int(100);
    };
    auto service_task = [&] {
        std::this_thread::sleep_for(std::chrono::seconds(20));
        service_data = new int(200);
    };

    tbb::task_group tg;

    tg.run(render_task);
    tg.run(service_task);
    tg.wait();

    std::cout << "Render data = " << *render_data << std::endl;
    std::cout << "Service data = " << *service_data << std::endl;

    delete render_data;
    delete service_data;
}

I have used clang-cl 18.1.8 from Visual Studio 2022 17.12.0. It does not perfectly match the compiler you reported, but you mentioned that earlier versions are affected as well. Could you please share more details about your setup to help us reproduce this on our side? As far as I understand, you build TBB from source. Can you please provide the exact CMake settings you use? Thank you for your help.

kboyarinov avatar Feb 19 '25 13:02 kboyarinov

@solbjorn what combination of headers + dll shows the failure? I assume that the 2022.x dll is used at runtime. Did you do a partial recompile of your application/engine, so that some components could be compiled against 2022.x TBB headers but others against TBB 2021.13.0 or earlier headers?

vossmjp avatar Feb 19 '25 15:02 vossmjp

@vossmjp, the issue happens with the headers from any oneTBB version newer than 2021.13, dll doesn't matter.

@kboyarinov, CMake flags are default, just without building tests. I manually add -march=alderlake, but without it the issue is still present.

Since you currently can't reproduce it, let me collect more details within a couple hours.

solbjorn avatar Feb 19 '25 18:02 solbjorn

Ok, I modified the example above and compiled it separately to make sure the issue can be reproduced outside the game engine.

Output when everything works:

Rendering took 2.008901 seconds.
Service took 5.009844 seconds.
tg.wait() took 5.010622 seconds.
Render data = 100
Service data = 200

"Rendering" is a lambda which sleeps for 2 seconds and then does new int(100), "service" sleeps for 5 and does the same for 200.

After I revisited the compilation flags, it turned out that if an app that uses oneTBB 2022+ is compiled with -fwhole-program-vtables (which requires IPO/LTO), then tg.wait() may fail:

tg.wait() took 0.000208 seconds.
Rendering took 2.946946 seconds.
<end of output>

If the app uses oneTBB 2021.13 or an older version, it is safe to compile it even with -fwhole-program-vtables. The flags used when compiling oneTBB itself don't matter here, only those of the target program.

I realize it's the developer's responsibility to pick optimal and safe compilation flags for their program, so closing the issue would be valid. On the other hand, the engine uses around 15 external libs and has about 1.5 million lines of its own code, and among all of this only tg.wait() from 2022+ fails with this flag; everything else (including the rest of oneTBB) works as expected. So maybe something in the task_group code is wrong there somewhere...

solbjorn avatar Feb 19 '25 20:02 solbjorn

I was finally able to reproduce the issue with /O2 -fwhole-program-vtables -flto -fuse-ld=lld. It seems the issue is caused by a devirtualization optimization applied to the class wait_context_vertex that is used for reference counting in the task_group. As it is stated here, I have tried to explicitly mark this class with [[clang::lto_visibility_public]], and it fixes my local test. I have created a draft PR https://github.com/uxlfoundation/oneTBB/pull/1644 with these changes. Could you please try this version on your side?
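For readers unfamiliar with the attribute, a sketch of what such an annotation looks like on a class (the class and macro names here are illustrative, not the actual oneTBB code; the real change is in the linked PR):

// Sketch only: marking a class whose vtable may be referenced across the
// LTO unit boundary, so that -fwhole-program-vtables does not devirtualize
// (and then mis-optimize) virtual calls made through it.
#if defined(__clang__)
#define EXAMPLE_LTO_PUBLIC [[clang::lto_visibility_public]]
#else
#define EXAMPLE_LTO_PUBLIC
#endif

class EXAMPLE_LTO_PUBLIC example_wait_context {
public:
    virtual ~example_wait_context() = default;
    virtual void add_reference() = 0;
    virtual void release_reference() = 0;
};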

kboyarinov avatar Feb 20 '25 13:02 kboyarinov

Yes, I confirm this fix works, thanks a lot!

Interesting, though, why such annotations are sometimes needed in order to work properly.

solbjorn avatar Feb 20 '25 20:02 solbjorn

Tried Clang 20.1.0, released 2 days ago: unfortunately, without the changes from the linked PR, the result is the same (tg.wait() exits without waiting). So I think the visibility hint you added is still needed and expected from the compiler's point of view.

solbjorn avatar Mar 08 '25 16:03 solbjorn

Looking at the code at https://github.com/uxlfoundation/oneTBB/blob/e70215cceff846b2cbabc3f687a757dea96f9fa2/include/oneapi/tbb/detail/_task.h#L115: if multiple threads are writing to m_ref_count, then the value of r at line 119 might not be correct, i.e. if another thread calls wait_context.reserve() between lines 115 and 119 when r == 0, we may also exit early.

I think the devirtualization might be a red herring here and the optimization just makes that ordering issue happen more often.

I'm looking into this class of issues, and it looks suspiciously like https://github.com/uxlfoundation/oneTBB/pull/417. It also seems like this would affect 2021.x versions, because this ordering of reads/writes happens there as well.
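As an illustration of the concern being raised (this is not the oneTBB source, just the general check-then-act shape described above): reading the counter first and acting on that value later leaves a window in which another thread can change it.

#include <atomic>
#include <cstdint>

// Illustrative only: a reference count that is read first and acted on later.
std::atomic<std::uint64_t> ref_count{1};

bool may_exit_wait_sketch() {
    std::uint64_t r = ref_count.load(std::memory_order_acquire); // read
    // <-- another thread could increment ref_count right here (a "reserve")
    return r <= 1; // decision based on a possibly stale snapshot
}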

kwasimensah avatar May 14 '25 14:05 kwasimensah

@kwasimensah,

Thanks for highlighting this. The only scenario in which new tasks can be added to the task_group between lines 115 and 119 is external (to the task_group) threads submitting work to the task_group. In this case, yes, it can create a logical race between wait in one thread and run in the other. The only way to avoid this logical race is to wait for task_group completion on each thread that is external to the task_group:

tbb::task_group tg;

void thread2_function() {
    tg.run(thread2_task);
    // tg.wait() is required to ensure thread2_task is completed
}

void thread1_function() {
    tg.run(thread1_task);
    tg.wait();
}

If thread2_task has not been run yet when thread1_task completes, thread1_task decreases the reference counter to 0, and tg.wait() in thread1_function can be considered complete. In this case, thread2_task is considered to have been added after the wait exited, so a separate wait in thread2_function is required.

If thread2_function() is executed from inside a task_group task, a separate wait is not needed, since the currently executing task guarantees that wait cannot complete while new tasks are being added. The example above only describes an external (to the group) thread in the same arena.
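A sketch of that last case, where the second submission happens from inside a task that already belongs to the group (so no separate wait is needed):

#include <oneapi/tbb/task_group.h>
#include <iostream>

tbb::task_group tg;

int main() {
    tg.run([] {
        std::cout << "outer task\n";
        // Submitting from inside a group task: the enclosing task keeps the
        // group's reference count above zero until this run() has registered
        // the nested task, so the wait() below cannot return before it runs.
        tg.run([] { std::cout << "nested task\n"; });
    });
    tg.wait(); // waits for both the outer and the nested task
}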

kboyarinov avatar Jun 03 '25 10:06 kboyarinov