oneTBB
TBB on wasm always executes on the main thread.
On the wasm platform, both tbb::task_group and tbb::parallel_for always execute on the main thread, while std::thread runs on a separate thread. What causes this?
Also note that oneapi::tbb::info::default_concurrency() reports more than 10.
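For reference, here is a minimal sketch (my own illustration, not the reporter's code; it assumes only the public oneTBB headers) that exposes the symptom: the reported concurrency is high, yet every parallel_for chunk runs with the main thread's id.
#include <cstddef>
#include <iostream>
#include <mutex>
#include <thread>
#include "oneapi/tbb/blocked_range.h"
#include "oneapi/tbb/info.h"
#include "oneapi/tbb/parallel_for.h"

int main() {
    std::cout << "default_concurrency: "
              << oneapi::tbb::info::default_concurrency() << "\n"
              << "hardware_concurrency: "
              << std::thread::hardware_concurrency() << "\n"
              << "main thread id: " << std::this_thread::get_id() << "\n";

    std::mutex io_mutex;
    oneapi::tbb::parallel_for(
        oneapi::tbb::blocked_range<std::size_t>(0, 16),
        [&](const oneapi::tbb::blocked_range<std::size_t>&) {
            std::lock_guard<std::mutex> lock(io_mutex);
            // On the affected wasm builds this always prints the main thread id.
            std::cout << "chunk ran on thread: "
                      << std::this_thread::get_id() << "\n";
        });
    return 0;
}
On a native build the chunks typically print several different thread ids; on the affected wasm builds only the main thread id appears.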
================== This is my situation: I have ported OpenVDB to the web platform; OpenVDB relies on TBB for its multi-threaded implementation. The porting itself went smoothly, but the performance test results were unexpectedly poor.
After several comparisons, I noticed some patterns. My code performs voxel processing through VDB, and I encapsulated that work in a single function.
==================
- When I call this function the first time, CPU usage does not exceed 100%, so TBB is clearly not using multiple cores.
- On the second call, CPU usage is approximately 200%, i.e. about two of my machine's 8 cores, and the run is roughly twice as fast as the first one.
- On the third and fourth calls, CPU usage reaches around 780% and execution is approximately 7.5 times faster, so at that point TBB effectively uses all the cores.
================== To summarize: with the same code and execution environment, the only difference is the order of execution, yet TBB exhibits different multicore utilization on wasm.
It seems that TBB needs a warm-up. So I made a change: I compiled the code with Emscripten and added -sPTHREAD_POOL_SIZE=(navigator.hardwareConcurrency),
but there doesn't seem to be any difference in performance. Have you seen anything similar?
================== I also conducted an experiment using std::thread; the code is roughly as follows.
static std::vector<std::thread> threads;
for (int i = 0; i < 8; ++i) {
    auto a = std::thread([]() {
        for (;;) {
            ;
        }
    });
    threads.emplace_back(std::move(a));
}
In this code snippet, the threads make use of multiple cores immediately, without the warm-up that TBB seems to need. Is there any way to bypass this issue or adjust some mechanism in TBB?
===============
I have done some research: I modified the TBB source code by adding logging for thread creation in the rml_thread_monitor.h file. Analyzing the logs, I discovered that only a small number of threads (around 2) were created during the first phase of execution. Therefore, this is not an inherent issue with wasm.
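As an alternative to patching rml_thread_monitor.h, here is a sketch of my own (not code from this thread) that uses the public tbb::task_scheduler_observer API to log which threads actually join an arena:
#include <iostream>
#include <mutex>
#include <thread>
#include "oneapi/tbb/blocked_range.h"
#include "oneapi/tbb/parallel_for.h"
#include "oneapi/tbb/task_arena.h"
#include "oneapi/tbb/task_scheduler_observer.h"

// Logs every thread that enters the observed arena.
class logging_observer : public oneapi::tbb::task_scheduler_observer {
public:
    explicit logging_observer(oneapi::tbb::task_arena& a)
        : oneapi::tbb::task_scheduler_observer(a) { observe(true); }
    void on_scheduler_entry(bool is_worker) override {
        static std::mutex io_mutex;
        std::lock_guard<std::mutex> lock(io_mutex);
        std::cout << (is_worker ? "worker " : "main ")
                  << std::this_thread::get_id() << " entered the arena\n";
    }
};

int main() {
    oneapi::tbb::task_arena arena;
    logging_observer obs(arena);
    arena.execute([] {
        oneapi::tbb::parallel_for(
            oneapi::tbb::blocked_range<int>(0, 1000),
            [](const oneapi::tbb::blocked_range<int>&) { /* some work */ });
    });
    return 0;
}
Each distinct thread id printed corresponds to one thread joining the arena, so the count can be compared against default_concurrency().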
=============== Due to the complexity of TBB's mechanisms, I haven't investigated it in depth yet. The cause could be some difference in the semaphore or synchronization mechanisms on the web platform. However, I can broadly confirm that the problem lies in TBB itself.
I have found a possible workaround: executing the following code segment right after the program starts, as warm-up code for TBB.
===============
{
#pragma optimize("", off)
    auto concurrency = std::thread::hardware_concurrency();
    if (concurrency > 1) {
        tbb::task_arena arena;
        arena.initialize(concurrency, 1, tbb::task_arena::priority::high);
        int start = 0, len = concurrency * 5;
        for (int i = 0; i < concurrency; ++i) {
            tbb::parallel_for(start, len, [](size_t i) {
                // printf("thread id %d\n", std::this_thread::get_id());
            });
        }
    }
#pragma optimize("", on)
}
I found that executing this essentially no-op code ahead of time enables the subsequent OpenVDB work to efficiently utilize multi-core computing.
Hi, did you face this issue with TBB prior to your porting to WASM? As you have said, it doesn't seem to be a WASM issue but an inherent TBB issue. I will investigate this further and keep you updated.
I have been using it in non-web scenarios, mainly on macOS, and it works well there.
Hi @JhaShweta1,
=============================
I conducted the same experiment on OpenSubdiv, a geometry library designed for mesh subdivision, and noticed something strange:
using TBB (Threading Building Blocks) for the computation is much slower than using a single thread, approximately three times slower.
============================= The rough process is as follows.
Just like the previous method, warm up TBB by using the code snippet below.
{
#pragma optimize("", off)
    auto concurrency = std::thread::hardware_concurrency();
    if (concurrency > 1) {
        tbb::task_arena arena;
        arena.initialize(concurrency, 1, tbb::task_arena::priority::high);
        int start = 0, len = concurrency * 5;
        for (int i = 0; i < concurrency; ++i) {
            tbb::parallel_for(start, len, [](size_t i) {
                // printf("thread id %d\n", std::this_thread::get_id());
            });
        }
    }
#pragma optimize("", on)
}
OpenSubdiv makes extensive use of tbb::parallel_for to run its kernel functions in parallel. To ensure TBB's multithreading was actually engaged, I invoked tbb::parallel_for externally beforehand and verified that its callbacks really do run on different threads.
Even so, CPU utilization during the subsequent work never exceeds 100%, which is quite peculiar, and performance drops significantly compared to the single-threaded version without TBB.
============================= I ran the same code repeatedly on a Mac system, and there TBB effectively utilizes the multi-core capabilities: the code is at least 3 to 5 times faster than the single-threaded version. I am using an 8-core device.
============================= Maybe these phenomena can help you make better judgments. As far as my results are concerned, the overall effect is unsatisfactory, possibly due to the instability of the Wasm platform itself.
Hi, I am also encountering similar issues, but only for nodejs and not in the browser. For nodejs 18 with the --experimental-wasm-threads flag set, it occasionally works (the same file behaves differently across runs). For nodejs 20/21, I cannot set --experimental-wasm-threads and it cannot utilize multiple threads.
https://github.com/elalish/manifold/pull/653#issuecomment-1894948279
There seems to be no way around it so far.
I wonder if this is related to the scheduler in TBB; I'm not familiar with the internals, so I cannot say much. I can try to create an MRE and gather detailed environment information (emscripten, browser, nodejs version) if that helps.
Hi, yes, please share the reproducer and environment details. I tried a couple of things suggested by Emscripten previously, but they didn't seem to work.
Sure, but this will take some time as I am busy with other things right now. Debugging this wasm weirdness takes quite a lot of time... Hopefully I will have more time next week to do this.
Consider the following code:
#include <chrono>
#include <iostream>
#include <thread>

#include "oneapi/tbb/parallel_for.h"

using namespace std::chrono_literals;

int main() {
    auto start = std::chrono::high_resolution_clock::now();
    oneapi::tbb::parallel_for( //
        oneapi::tbb::blocked_range<std::size_t>(0, 10), [&](const auto &r) {
            std::this_thread::sleep_for(1s);
            auto end = std::chrono::high_resolution_clock::now();
            std::cout << "worker: "
                      << std::chrono::duration_cast<std::chrono::milliseconds>(
                             end - start)
                             .count()
                      << std::endl;
        });
    return 0;
}
Example results:
worker: 1001
worker: worker: 1005
1005
worker: worker: 1006
1006
worker: 1040
worker: 1058
worker: 1066
worker: 1068
worker: 1069
The results are all close to 1000, indicating that this is indeed running in multiple threads. However, CPU utilization never exceeds 100% for a compute-heavy workload:
#include <chrono>
#include <iostream>
#include <thread>

#include "oneapi/tbb/parallel_for.h"

using namespace std::chrono_literals;

int main() {
    auto start = std::chrono::high_resolution_clock::now();
    oneapi::tbb::parallel_for( //
        oneapi::tbb::blocked_range<std::size_t>(0, 10), [](const auto &r) {
            long long steps = 0;
            for (long long i = 2; i < 1000000000000; i++) {
                long long n = i;
                while (n != 1) {
                    if (n % 2)
                        n = (3 * n + 1) / 2;
                    else
                        n /= 2;
                    steps++;
                }
            }
            std::cout << "good " << steps << std::endl;
        });
    return 0;
}
time node a.js
node a.js 6.22s user 0.03s system 101% cpu 6.147 total
# emcmake cmake -DCMAKE_BUILD_TYPE=Release -DEMSCRIPTEN_SYSTEM_PROCESSOR=web ..
cmake_minimum_required(VERSION 3.11)
project(test)
include(FetchContent)
set(TBB_TEST OFF CACHE INTERNAL "" FORCE)
set(TBB_STRICT OFF CACHE INTERNAL "" FORCE)
FetchContent_Declare(TBB
  GIT_REPOSITORY https://github.com/oneapi-src/oneTBB.git
  GIT_TAG        v2021.11.0
)
FetchContent_MakeAvailable(TBB)
set(CMAKE_CXX_FLAGS "-pthread")
set(CMAKE_EXE_LINKER_FLAGS "-pthread -sPTHREAD_POOL_SIZE=4 -sINITIAL_MEMORY=1gb")
add_executable(a a.cpp)
target_link_libraries(a PUBLIC TBB::tbb)
target_link_options(a PUBLIC -pthread)
- Emscripten version: 3.1.47
- node version: v21.6.2
I also ran into the same issue in my project. Since we are constrained to web JS, we also made a reproducible Docker environment for this case:
git clone git@github.com:josephholten/em-multi.git
cd em-multi
docker build -t em-multi .
docker run -d -p 8080:8080 em-multi
firefox -new-tab localhost:8080
# any update on the website (F5) will reproduce the results in the console (Ctrl + Shift + C)
Note: this starts Docker in detached mode, so you need to stop it manually. If you aren't running any other Docker containers, just stop the most recent one with docker stop $(docker ps -lq).
Output:
filling random vectors...
calculating sequential scalarproduct...
using thread: 131060
seq scalarprod: 16779532.297833
seq time: 170ms
calculating cpp_threads scalarproduct...
cpp_threads concurrency: 4
using thread: 1074151888
using thread: 1073948128
using thread: 1074016056
using thread: 1074083968
cpp_threads scalarprod: 16779532.297837
cpp_threads time: 64ms
calculating tbb_threads scalarproduct...
tbb_threads concurrency: 8
using thread: 131060
tbb_threads scalarprod: 32815.633438
tbb_threads time: 103ms
As you can see, multiple threads are possible in the same C++ program, but the TBB scheduler still binds its tasks to the main thread.
Clearly, TBB on WebAssembly (WASM) is very unstable, yet some open-source projects depend on it. It seems the official team doesn't pay much attention to the bugs discussed here. I wonder if we should consider abandoning this library in the future.
Hi all, sorry to hear you are having such problems. Our team is not yet expert in WASM; we are new to this technology, so it takes us longer to react to such problems.
Talking about the issue: at first glance I thought that there was not enough time for TBB to wake up all the threads, and that the main thread finished the parallel region before the workers could join (the wake-up mechanism is not serial: the main thread wakes at most 2 threads, and each of them in turn wakes at most 2 more).
But from the provided log information it seems that the threads did join the parallel region (the most accurate way to check is via the thread id or a thread_local variable).
So the TBB scheduler utilizes the available concurrency, but for some reason the system or WASM scheduler doesn't allocate CPU time to these threads, so they execute serially.
That in turn is really bizarre, because you saw that sometimes system utilization is higher.
@jellychen could you please confirm that with std::thread CPU utilization always covers all the cores? Maybe thread creation in TBB lacks some flag, and that prevents the threads from executing in parallel.
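To illustrate the check suggested above (counting participants via thread ids), here is a small sketch of my own; it is not code from this thread and assumes only standard C++ and the public oneTBB headers:
#include <cstddef>
#include <iostream>
#include <mutex>
#include <set>
#include <thread>
#include "oneapi/tbb/blocked_range.h"
#include "oneapi/tbb/parallel_for.h"

int main() {
    std::mutex m;
    std::set<std::thread::id> ids;
    oneapi::tbb::parallel_for(
        oneapi::tbb::blocked_range<std::size_t>(0, 1000),
        [&](const oneapi::tbb::blocked_range<std::size_t>& r) {
            // Burn some CPU so worker threads have a chance to steal chunks.
            volatile double x = 0;
            for (std::size_t i = r.begin(); i != r.end(); ++i)
                for (int k = 0; k < 100000; ++k) x += k * 0.5;
            std::lock_guard<std::mutex> lock(m);
            ids.insert(std::this_thread::get_id());
        });
    std::cout << "distinct threads in parallel region: " << ids.size() << "\n";
    return 0;
}
If this prints a count greater than 1 while CPU utilization stays around 100%, the threads joined the region but were executed serially, which matches the behaviour described above.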
@pavelkumbrasev I think this may be related to issue #1341. I tried to add an observer to at least log the entry points of the threads and ran into the error stated in #1341. Once that was solved, I found that the observer hooks are only called after all the parallel loops have been invoked (see https://github.com/oneapi-src/oneTBB/issues/1341#issuecomment-2054095330 for more details). I think there must be a bug in the thread initialization related to my comment in that issue.
@pavelkumbrasev After testing, I found that std::thread can utilize all the cores in almost all scenarios.
@jellychen, I'm not really familiar with the WASM execution model. Is there a chance you can print the thread stacks during a parallel section where CPU utilization corresponds to only one running thread, so we can see whether the threads are sleeping in the thread pool for some reason or whether their stacks are also involved in the computation?
Same behavior; I recompiled in debug mode and got this:
Assertion node(val).my_prev_node == &node(val) && node(val).my_next_node == &node(val) failed (located in the push_front function, line in file: 135)
Detailed description: Object with intrusive list node can be part of only one intrusive list simultaneously
...
$tbb::detail::r1::assertion_failure_impl(char const*, int, char const*, char const*) @ a.out.wasm:0x5e516
$tbb::detail::r1::assertion_failure(char const*, int, char const*, char const*)::$_0::operator()() const @ a.out.wasm:0x5e443
$void tbb::detail::d0::run_initializer<tbb::detail::r1::assertion_failure(char const*, int, char const*, char const*)::$_0>(tbb::detail::r1::assertion_failure(char const*, int, char const*, char const*)::$_0 const&, std::__2::atomic<tbb::detail::d0::do_once_state>&) @ a.out.wasm:0x5e00b
$void tbb::detail::d0::atomic_do_once<tbb::detail::r1::assertion_failure(char const*, int, char const*, char const*)::$_0>(tbb::detail::r1::assertion_failure(char const*, int, char const*, char const*)::$_0 const&, std::__2::atomic<tbb::detail::d0::do_once_state>&) @ a.out.wasm:0x5df97
$tbb::detail::r1::assertion_failure(char const*, int, char const*, char const*) @ a.out.wasm:0x5de7c
$tbb::detail::r1::intrusive_list_base<tbb::detail::r1::intrusive_list<tbb::detail::r1::thread_dispatcher_client>, tbb::detail::r1::thread_dispatcher_client>::push_front(tbb::detail::r1::thread_dispatcher_client&) @ a.out.wasm:0x85922
$tbb::detail::r1::thread_dispatcher::insert_client(tbb::detail::r1::thread_dispatcher_client&) @ a.out.wasm:0x85505
invoke_vii @ a.out.js:4760
$tbb::detail::r1::thread_dispatcher::register_client(tbb::detail::r1::thread_dispatcher_client*) @ a.out.wasm:0x852b5
$tbb::detail::r1::threading_control_impl::publish_client(tbb::detail::r1::threading_control_client, tbb::detail::d1::constraints&) @ a.out.wasm:0x94d5f
$tbb::detail::r1::threading_control::publish_client(tbb::detail::r1::threading_control_client, tbb::detail::d1::constraints&) @ a.out.wasm:0x97e32
$tbb::detail::r1::arena::create(tbb::detail::r1::threading_control*, unsigned int, unsigned int, unsigned int, tbb::detail::d1::constraints) @ a.out.wasm:0x1dd0d
$tbb::detail::r1::governor::init_external_thread() @ a.out.wasm:0x3d192
$tbb::detail::r1::governor::get_thread_data() @ a.out.wasm:0x1e4a6
$tbb::detail::r1::allocate(tbb::detail::d1::small_object_pool*&, unsigned long) @ a.out.wasm:0x69e32
$tbb::detail::d1::start_for<tbb::detail::d1::blocked_range<unsigned long>, main::$_0, tbb::detail::d1::auto_partitioner const>* tbb::detail::d1::small_object_allocator::new_object<tbb::detail::d1::start_for<tbb::detail::d1::blocked_range<unsigned long>, main::$_0, tbb::detail::d1::auto_partitioner const>, tbb::detail::d1::blocked_range<unsigned long> const&, main::$_0 const&, tbb::detail::d1::auto_partitioner const&, tbb::detail::d1::small_object_allocator&>(tbb::detail::d1::blocked_range<unsigned long> const&, main::$_0 const&, tbb::detail::d1::auto_partitioner const&, tbb::detail::d1::small_object_allocator&) @ a.out.wasm:0x7d5d
$tbb::detail::d1::start_for<tbb::detail::d1::blocked_range<unsigned long>, main::$_0, tbb::detail::d1::auto_partitioner const>::run(tbb::detail::d1::blocked_range<unsigned long> const&, main::$_0 const&, tbb::detail::d1::auto_partitioner const&, tbb::detail::d1::task_group_context&) @ a.out.wasm:0x78da
$tbb::detail::d1::start_for<tbb::detail::d1::blocked_range<unsigned long>, main::$_0, tbb::detail::d1::auto_partitioner const>::run(tbb::detail::d1::blocked_range<unsigned long> const&, main::$_0 const&, tbb::detail::d1::auto_partitioner const&) @ a.out.wasm:0x46d3
$void tbb::detail::d1::parallel_for<tbb::detail::d1::blocked_range<unsigned long>, main::$_0>(tbb::detail::d1::blocked_range<unsigned long> const&, main::$_0 const&) @ a.out.wasm:0x3fa3
$__original_main @ a.out.wasm:0x3bb3
$main @ a.out.wasm:0xcbbb
@b-qp I believe we saw this problem before with the static version of TBB (and only with the static version). Is there a chance you can try running your reproducer with the static version of TBB to see if the problem persists?
@pavelkumbrasev
I'm sorry for the late response; I've been on vacation recently. I'm not quite sure how to print the call stack. Could you tell me the exact steps?
@jellychen, this will just be a guess because I'm not familiar with the technology either. Is there a chance you can attach gdb to the process and run thread apply all bt? If you place a breakpoint inside the parallel region, I would expect all of the worker threads to be participating.
Wasm might not support gdb debugging.
Could you please provide steps to reproduce the issue? (If you can do it with the debug version of the library, that would also be helpful.)
Almost nothing special is required: compile even the simplest parallel task to wasm and the issue occurs 100% of the time.
@pavelkumbrasev see my comment above (https://github.com/oneapi-src/oneTBB/issues/1287#issuecomment-1999679412).
@pavelkumbrasev
I suspect that TBB's multithreading mechanism does not work effectively under Emscripten's web worker model. It might not be an issue with TBB itself; perhaps it's a problem with the web platform. In any case, I haven't isolated the cause.
However, I have found a workaround: implementing a set of interfaces similar to TBB's, though not all of them. Many programs only use parts of the TBB interface, mainly task_group, parallel_sort, parallel_for, and parallel_reduce.
My approach is to initialize a std::thread pool at startup and bridge these implementations to std::thread (a rough sketch follows at the end of this comment).
So far this solution has performed better than TBB in some software experiments. Currently, the multithreaded performance of TBB in some wasm software, such as OpenVDB, is even worse than its single-threaded performance.
I hope this can help developers working on wasm.
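For illustration only, here is a minimal sketch of such a bridge; it is my own simplification, not the poster's actual implementation, and for brevity it spawns the std::thread workers per call instead of reusing a pool created at startup.
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical helper: run body(i) for every i in [0, n) across std::thread workers.
inline void simple_parallel_for(std::size_t n,
                                const std::function<void(std::size_t)>& body) {
    unsigned workers = std::max(1u, std::thread::hardware_concurrency());
    std::atomic<std::size_t> next{0};
    auto run = [&]() {
        // Dynamic scheduling: each thread grabs the next unprocessed index.
        for (std::size_t i = next.fetch_add(1); i < n; i = next.fetch_add(1))
            body(i);
    };
    std::vector<std::thread> pool;
    for (unsigned w = 1; w < workers; ++w)  // the calling thread also participates
        pool.emplace_back(run);
    run();
    for (auto& t : pool) t.join();
}
simple_parallel_for is a hypothetical name; a real replacement would keep the workers alive in a startup pool and add range-based overloads to mirror tbb::parallel_for.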
@jellychen, I'm not sure the problem is Emscripten. I was able to reproduce the described behavior, and from my perspective something is odd. I will continue investigating the problem.
@jellychen I have summarized the analysis so far into a set of questions in an Emscripten discussion: https://github.com/emscripten-core/emscripten/discussions/21963