What does `task_group_context` do?
While modifying the TBB sources, I tried really hard to understand what is happening in task_group_context and task_group_context_impl, but I failed miserably.
I'm trying to construct a task_group_context and then use it to execute a task in multiple separate threads inside the local_wait_for_all loop via the execute_and_wait function. However, no matter what I do, it ends up something like this (gdb output):
```
tbb::detail::r1::task_group_context_impl::copy_fp_settings (ctx=..., src=...) at onetbb-src/src/tbb/task_group_context.cpp:278
278 new (&ctx.my_cpu_ctl_env) d1::cpu_ctl_env(*src_ctl);
(gdb) bt
#0 tbb::detail::r1::task_group_context_impl::copy_fp_settings (ctx=..., src=...) at onetbb-src/src/tbb/task_group_context.cpp:278
#1 0x0000aaaaaaac24c0 in tbb::detail::r1::task_group_context_impl::bind_to_impl (ctx=..., td=td@entry=0xffff98000900)
at onetbb-src/src/tbb/task_group_context.cpp:123
#2 0x0000aaaaaaac2628 in tbb::detail::r1::task_group_context_impl::bind_to (ctx=..., td=td@entry=0xffff98000900) at onetbb-src/src/tbb/task_group_context.cpp:188
#3 0x0000aaaaaaabeb88 in tbb::detail::r1::task_dispatcher::execute_and_wait (t=t@entry=0xffff98000e00, wait_ctx=..., w_ctx=...)
at onetbb-src/src/tbb/task_dispatcher.cpp:161
#4 0x0000aaaaaaabf34c in tbb::detail::r1::execute_and_wait (t=..., t_ctx=..., wait_ctx=..., w_ctx=...) at onetbb-src/src/tbb/task_dispatcher.cpp:121
#5 0x0000aaaaaaaaad70 in tbb::detail::d1::execute_and_wait (w_ctx=..., wait_ctx=..., t_ctx=..., t=...)
at onetbb-src/include/oneapi/tbb/detail/_task.h:191
```
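For reference, the dispatch I do has roughly the same shape as the call inside start_for::run() in parallel_for.h; a simplified sketch (for_task stands in for my actual task, this is not my literal code):

```cpp
// Sketch of the dispatch shape only -- mirrors the call in start_for::run()
// (include/oneapi/tbb/parallel_for.h); for_task is a placeholder for my task.
tbb::task_group_context context(tbb::task_group_context::bound,
                                tbb::task_group_context::default_traits);
wait_node wn;                 // wait_node: TBB-internal type holding the wait_context
for_task.my_parent = &wn;     // for_task: the task I want executed
d1::execute_and_wait(for_task, context, wn.m_wait, context);
```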
I tried:

- sharing one context between all `execute_and_wait` calls
- creating a separate context for each call

I also tried constructing the task_group_context with 2 different argument sets (total 2 * 2 = 4 configurations; see the sketch below):

- `PARALLEL_FOR`
- `tbb::task_group_context::bound, tbb::task_group_context::default_traits | tbb::task_group_context::concurrent_wait`
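For concreteness, the two argument sets look roughly like this (a sketch; the first one is the internal constructor that parallel_for.h itself uses, so it is only visible when building inside the TBB sources):

```cpp
#include <oneapi/tbb/task_group.h>

void make_contexts() {
    // (a) what parallel_for.h does internally in start_for::run();
    //     PARALLEL_FOR is an internal string-resource index, so this line only
    //     compiles inside the TBB sources:
    // task_group_context ctx_a(PARALLEL_FOR);

    // (b) the public kind/traits constructor:
    tbb::task_group_context ctx_b(tbb::task_group_context::bound,
                                  tbb::task_group_context::default_traits
                                  | tbb::task_group_context::concurrent_wait);
}
```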
Yet I still get the same error :(
Thank you very much for all your previous answers btw.
Hi @blonded04, sorry, I didn't understand your question. What is the problem you highlighted in gdb?
I get segfaults on that line :(
`new (&ctx.my_cpu_ctl_env) d1::cpu_ctl_env(*src_ctl);`
Probably something is wrong with ctx.my_cpu_ctl_env (potentially null?) or with src_ctl.
It's hard to say what went wrong. Could you provide a reproducer?
Sorry, it took a while, but the smallest reproducer of the task_group_context problems I could get is a 65-line diff to TBB (Ubuntu, Ryzen 5500U).
My hypothesis is that TBB does not like a new context being created inside local_wait_for_all while another context is already active. If I'm right, how can I safely change the context?
main.cpp:
```cpp
#include <atomic>
#include <iostream>

// shared state between tbb and main.cpp
#include <oneapi/tbb/problems.h>
#include <tbb/parallel_for.h>

constexpr unsigned nthread = 12u;

int main() {
    // force nthread threads to join the arena (they won't leave, because the
    // wait limit in the task dispatcher is increased)
    std::atomic<unsigned> spin_barrier(nthread);
    tbb::parallel_for(tbb::blocked_range<int>(0, nthread), [&spin_barrier](tbb::blocked_range<int>) {
        spin_barrier.fetch_sub(1, std::memory_order_release);
        while (spin_barrier.load(std::memory_order_acquire)) {
            asm volatile ("pause\npause\npause\npause");
        }
    });
    std::cout << "parallel_for_finished" << std::endl;

    // enable the problematic behaviour
    tbb::set_flag();
    // some time later we will hit the SEGFAULT
    while (tbb::get_counter() < 1000000u) {
        std::cout << "\t" << tbb::get_counter() << std::endl;
    }
    // we won't even get to this line
    tbb::set_flag(false);
}
```
GDB backtrace for segfault:
```
Thread 2 "main" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff799c700 (LWP 12481)]
tbb::detail::r1::context_guard_helper<false>::set_ctx (ctx=0x7ffff799bcc0, this=0x7ffff799bb80) at /home/valery/repos/tbb_test/cmake_deps/onetbb-src/src/tbb/scheduler_common.h:155
155 curr_cpu_ctl_env.set_env();
(gdb) bt
#0 tbb::detail::r1::context_guard_helper<false>::set_ctx (ctx=0x7ffff799bcc0, this=0x7ffff799bb80) at /home/valery/repos/tbb_test/cmake_deps/onetbb-src/src/tbb/scheduler_common.h:155
#1 tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::external_waiter> (waiter=..., t=0x7ffff0001e80, this=0x46a480) at /home/valery/repos/tbb_test/cmake_deps/onetbb-src/src/tbb/task_dispatcher.h:311
#2 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::external_waiter> (waiter=..., t=<optimized out>, this=0x46a480) at /home/valery/repos/tbb_test/cmake_deps/onetbb-src/src/tbb/task_dispatcher.h:472
#3 tbb::detail::r1::task_dispatcher::execute_and_wait (t=<optimized out>, wait_ctx=..., w_ctx=...) at /home/valery/repos/tbb_test/cmake_deps/onetbb-src/src/tbb/task_dispatcher.cpp:168
#4 0x000000000040c0c7 in tbb::detail::d1::execute_and_wait (w_ctx=..., wait_ctx=..., t_ctx=..., t=warning: RTTI symbol not found for class 'tbb::detail::d1::start_for<tbb::detail::d1::blocked_range<unsigned int>, tbb::detail::r1::task_dispatcher::receive_or_steal_task<false, tbb::detail::r1::outermost_worker_waiter>(tbb::detail::r1::thread_data&, tbb::detail::r1::execution_data_ext&, tbb::detail::r1::outermost_worker_waiter&, long, bool, bool)::{lambda(tbb::detail::d1::blocked_range<unsigned int>)#1}, tbb::detail::d1::auto_partitioner const>'
...) at /home/valery/repos/tbb_test/cmake_deps/onetbb-src/src/tbb/../../include/oneapi/tbb/detail/_task.h:191
#5 tbb::detail::d1::start_for<tbb::detail::d1::blocked_range<unsigned int>, tbb::detail::r1::task_dispatcher::receive_or_steal_task<false, tbb::detail::r1::outermost_worker_waiter>(tbb::detail::r1::thread_data&, tbb::detail::r1::execution_data_ext&, tbb::detail::r1::outermost_worker_waiter&, long, bool, bool)::{lambda(tbb::detail::d1::blocked_range<unsigned int>)#1}, tbb::detail::d1::auto_partitioner const>::run(tbb::detail::d1::blocked_range<unsigned int> const&, {lambda(tbb::detail::d1::blocked_range<unsigned int>)#1} const&, tbb::detail::d1::auto_partitioner&, tbb::detail::d1::task_group_context&) (context=..., partitioner=..., body=..., range=...) at /home/valery/repos/tbb_test/cmake_deps/onetbb-src/src/tbb/../../include/oneapi/tbb/parallel_for.h:113
#6 tbb::detail::d1::start_for<tbb::detail::d1::blocked_range<unsigned int>, tbb::detail::r1::task_dispatcher::receive_or_steal_task<false, tbb::detail::r1::outermost_worker_waiter>(tbb::detail::r1::thread_data&, tbb::detail::r1::execution_data_ext&, tbb::detail::r1::outermost_worker_waiter&, long, bool, bool)::{lambda(tbb::detail::d1::blocked_range<unsigned int>)#1}, tbb::detail::d1::auto_partitioner const>::run(tbb::detail::d1::blocked_range<unsigned int> const&, {lambda(tbb::detail::d1::blocked_range<unsigned int>)#1} const&, tbb::detail::d1::auto_partitioner&, tbb::detail::d1::task_group_context&) (context=..., partitioner=..., body=..., range=...) at /home/valery/repos/tbb_test/cmake_deps/onetbb-src/src/tbb/../../include/oneapi/tbb/parallel_for.h:105
#7 tbb::detail::d1::start_for<tbb::detail::d1::blocked_range<unsigned int>, tbb::detail::r1::task_dispatcher::receive_or_steal_task<false, tbb::detail::r1::outermost_worker_waiter>(tbb::detail::r1::thread_data&, tbb::detail::r1::execution_data_ext&, tbb::detail::r1::outermost_worker_waiter&, long, bool, bool)::{lambda(tbb::detail::d1::blocked_range<unsigned int>)#1}, tbb::detail::d1::auto_partitioner const>::run(tbb::detail::d1::blocked_range<unsigned int> const&, {lambda(tbb::detail::d1::blocked_range<unsigned int>)#1} const&, tbb::detail::d1::auto_partitioner&) (partitioner=..., body=..., range=...) at /home/valery/repos/tbb_test/cmake_deps/onetbb-src/src/tbb/../../include/oneapi/tbb/parallel_for.h:102
#8 tbb::detail::d1::parallel_for<tbb::detail::d1::blocked_range<unsigned int>, tbb::detail::r1::task_dispatcher::receive_or_steal_task<false, tbb::detail::r1::outermost_worker_waiter>(tbb::detail::r1::thread_data&, tbb::detail::r1::execution_data_ext&, tbb::detail::r1::outermost_worker_waiter&, long, bool, bool)::{lambda(tbb::detail::d1::blocked_range<unsigned int>)#1}>(tbb::detail::d1::blocked_range<unsigned int> const&, tbb::detail::r1::task_dispatcher::receive_or_steal_task<false, tbb::detail::r1::outermost_worker_waiter>(tbb::detail::r1::thread_data&, tbb::detail::r1::execution_data_ext&, tbb::detail::r1::outermost_worker_waiter&, long, bool, bool)::{lambda(tbb::detail::d1::blocked_range<unsigned int>)#1} const&) (body=..., range=...) at /home/valery/repos/tbb_test/cmake_deps/onetbb-src/src/tbb/../../include/oneapi/tbb/parallel_for.h:230
#9 tbb::detail::r1::task_dispatcher::receive_or_steal_task<false, tbb::detail::r1::outermost_worker_waiter> (this=this@entry=0x46a480, tls=..., ed=..., waiter=..., isolation=<optimized out>, fifo_allowed=<optimized out>, critical_allowed=<optimized out>) at /home/valery/repos/tbb_test/cmake_deps/onetbb-src/src/tbb/task_dispatcher.h:238
#10 0x0000000000407c68 in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::outermost_worker_waiter> (waiter=..., t=0x0, this=0x46a480) at /usr/include/c++/11/bits/atomic_base.h:818
#11 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::outermost_worker_waiter> (waiter=..., t=0x0, this=0x46a480) at /home/valery/repos/tbb_test/cmake_deps/onetbb-src/src/tbb/task_dispatcher.h:472
#12 tbb::detail::r1::arena::process (this=<optimized out>, tls=...) at /home/valery/repos/tbb_test/cmake_deps/onetbb-src/src/tbb/arena.cpp:137
#13 0x0000000000418b0b in tbb::detail::r1::market::process (this=0x460400, j=...) at /home/valery/repos/tbb_test/cmake_deps/onetbb-src/src/tbb/market.cpp:599
#14 0x000000000041be19 in tbb::detail::r1::rml::private_worker::run (this=0x468e00) at /home/valery/repos/tbb_test/cmake_deps/onetbb-src/src/tbb/private_server.cpp:271
#15 0x000000000041bfad in tbb::detail::r1::rml::private_worker::thread_routine (arg=<optimized out>) at /home/valery/repos/tbb_test/cmake_deps/onetbb-src/src/tbb/private_server.cpp:221
#16 0x00007ffff7b9c609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#17 0x00007ffff7ac1133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
```
Diff in TBB:
```diff
diff --git a/CMakeLists.txt b/CMakeLists.txt
index 47872941..eaa81b12 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -66,7 +66,7 @@ include(CMakeDependentOption)
# Handle C++ standard version.
if (NOT MSVC) # no need to cover MSVC as it uses C++14 by default.
if (NOT CMAKE_CXX_STANDARD)
- set(CMAKE_CXX_STANDARD 11)
+ set(CMAKE_CXX_STANDARD 20)
endif()
if (CMAKE_CXX${CMAKE_CXX_STANDARD}_STANDARD_COMPILE_OPTION) # if standard option was detected by CMake
@@ -108,7 +108,7 @@ option(TBB_DISABLE_HWLOC_AUTOMATIC_SEARCH "Disable HWLOC automatic search by pkg
option(TBB_ENABLE_IPO "Enable Interprocedural Optimization (IPO) during the compilation" ON)
if (NOT DEFINED BUILD_SHARED_LIBS)
- set(BUILD_SHARED_LIBS ON)
+ set(BUILD_SHARED_LIBS OFF)
endif()
if (NOT BUILD_SHARED_LIBS)
diff --git a/include/oneapi/tbb/parallel_for.h b/include/oneapi/tbb/parallel_for.h
index 91c7c44c..fd3f6aee 100644
--- a/include/oneapi/tbb/parallel_for.h
+++ b/include/oneapi/tbb/parallel_for.h
@@ -29,6 +29,7 @@
#include "task_group.h"
#include <cstddef>
+#include <functional>
#include <new>
namespace tbb {
@@ -109,12 +110,12 @@ struct start_for : public task {
// defer creation of the wait node until task allocation succeeds
wait_node wn;
for_task.my_parent = &wn;
- execute_and_wait(for_task, context, wn.m_wait, context);
+ d1::execute_and_wait(for_task, context, wn.m_wait, context);
}
}
//! Run body for range, serves as callback for partitioner
void run_body( Range &r ) {
- tbb::detail::invoke(my_body, r);
+ my_body(r);
}
//! spawn right task, serves as callback for partitioner
diff --git a/include/oneapi/tbb/problems.h b/include/oneapi/tbb/problems.h
new file mode 100644
index 00000000..fa380b53
--- /dev/null
+++ b/include/oneapi/tbb/problems.h
@@ -0,0 +1,37 @@
+#pragma once
+
+#include <atomic>
+
+namespace tbb {
+
+namespace internal {
+
+inline std::atomic<bool>& get_flag_impl() {
+ static std::atomic<bool> flag(false);
+ return flag;
+}
+
+inline std::atomic<unsigned>& get_counter_impl() {
+ static std::atomic<unsigned> counter(0u);
+ return counter;
+}
+
+} // namespace internal
+
+inline void set_flag(bool value=true) {
+ internal::get_flag_impl().store(value, std::memory_order_release);
+}
+
+inline bool get_flag() {
+ return internal::get_flag_impl().load(std::memory_order_acquire);
+}
+
+inline unsigned get_counter() {
+ return internal::get_counter_impl().load(std::memory_order_acquire);
+}
+
+inline void increment_counter() {
+ internal::get_counter_impl().fetch_add(1, std::memory_order_release);
+}
+
+} // namespace tbb
\ No newline at end of file
diff --git a/src/tbb/scheduler_common.h b/src/tbb/scheduler_common.h
index 9e103657..ddf082aa 100644
--- a/src/tbb/scheduler_common.h
+++ b/src/tbb/scheduler_common.h
@@ -254,7 +254,7 @@ public:
// threshold value tuned separately for macOS due to high cost of sched_yield there
, my_yield_threshold{10 * yields_multiplier}
#else
- , my_yield_threshold{100 * yields_multiplier}
+ , my_yield_threshold{10000000 * yields_multiplier}
#endif
, my_pause_count{}
, my_yield_count{}
diff --git a/src/tbb/task_dispatcher.h b/src/tbb/task_dispatcher.h
index f6ff3f17..aa16bf57 100644
--- a/src/tbb/task_dispatcher.h
+++ b/src/tbb/task_dispatcher.h
@@ -30,6 +30,10 @@
#include "itt_notify.h"
#include "concurrent_monitor.h"
+#include "oneapi/tbb/task_group.h"
+#include "oneapi/tbb/parallel_for.h"
+#include "oneapi/tbb/problems.h"
+
#include <atomic>
#if !__TBB_CPU_CTL_ENV_PRESENT
@@ -229,6 +233,16 @@ d1::task* task_dispatcher::receive_or_steal_task(
}
// Nothing to do, pause a little.
waiter.pause(slot);
+
+ if (get_flag()) {
+ tbb::parallel_for(
+ tbb::blocked_range<unsigned>{0u, arena_index + 1u},
+ [] (tbb::blocked_range<unsigned> range) {
+ for (unsigned idx = range.begin(); idx < range.end(); idx++) {
+ increment_counter();
+ }
+ });
+ }
} // end of nonlocal task retrieval loop
__TBB_ASSERT(is_alive(a.my_guard), nullptr);
```
What happens internally is that I end up calling local_wait_for_all inside local_wait_for_all, but not inside any other task.
Are there any assumptions that forbid that kind of behavior, and are there any workarounds that respect the existing invariants?
Hi @blonded04
It seems you are trying to run a parallel_for inside the stealing loop, which is something we have never tried. Looking at your gdb output, it is missing debug information (possibly because optimizations are turned on), so I would ask you to run it in debug mode so that the assertions can provide some context on what is going on.
But I would like to clarify that running parallel_for inside the stealing loop is something we never recommend doing.
Hi @blonded04, can we close this issue?
Yes! Thank you very much for the help.