oneTBB
Deadlock issue in OpenBLAS with TBB
Brief description: I am trying out this OpenBLAS PR [] with TBB. I first register a callback in my code to change the threading backend dynamically. Instead of creating its own threads, OpenBLAS passes the work to the registered callback. I use TBB to run gemm, and I want to use TBB again to execute the callback.
Issue: I am hitting a deadlock in OpenBLAS (multiple threads get stuck in the inner_threads function in OpenBLAS). OpenBLAS appears to deadlock when it is used with fewer threads than the number of available threads.
Below is my test code and steps to reproduce it.
#include <iostream>
#include <cblas.h>
#include <vector>
#include <tbb/tbb.h>
#include <chrono>
const int MATRIX_DIMENSION = 1000; // Adjust as needed
bool delay_threading = 1;
class MatrixMultiplicationTask {
    const std::vector<double>& A;
    const std::vector<double>& B;
    std::vector<double>& C;

public:
    MatrixMultiplicationTask(const std::vector<double>& A,
                             const std::vector<double>& B,
                             std::vector<double>& C)
        : A(A), B(B), C(C) {}

    // Outer (Level 1) loop body: one dgemm call per iteration.
    void operator()(const tbb::blocked_range<int>& range) const {
        for (int i = range.begin(); i != range.end(); ++i) {
            // The remaining arguments are cut off in the original post.
            cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, /* ... */);
        }
    }
};
class InnerLoopTask {
    openblas_dojob_callback dojob;
    int numjobs;
    size_t jobdata_elsize;
    void* jobdata;
    int dojob_data;

public:
    InnerLoopTask(openblas_dojob_callback dojob, int numjobs, size_t jobdata_elsize, void* jobdata, int dojob_data)
        : dojob(dojob), numjobs(numjobs), jobdata_elsize(jobdata_elsize), jobdata(jobdata), dojob_data(dojob_data) {}

    // Inner (Level 2) loop body: run the OpenBLAS job for each index in the range.
    void operator()(const tbb::blocked_range<int>& range) const {
        for (int i = range.begin(); i != range.end(); ++i) {
            // Address of the i-th element of the jobdata array.
            void* element_addr = static_cast<char*>(jobdata) + static_cast<size_t>(i) * jobdata_elsize;
            dojob(i, element_addr, dojob_data);
        }
    }
};
class MyObserver : public tbb::task_scheduler_observer {
public:
    MyObserver() {
        // Body not preserved in the original post.
    }
    ~MyObserver() {
        // Body not preserved in the original post.
    }
    void on_scheduler_entry(bool is_worker) override {
        std::cout << "Task scheduler entry" << std::endl;
    }
    void on_scheduler_exit(bool is_worker) override {
        std::cout << "Task scheduler exit" << std::endl;
    }
};
// Callback that OpenBLAS invokes instead of creating its own threads.
void myfunction_(int sync, openblas_dojob_callback dojob, int numjobs, size_t jobdata_elsize, void* jobdata, int dojob_data)
{
    //MyObserver observer;
    InnerLoopTask innerLoopTask(dojob, numjobs, jobdata_elsize, jobdata, dojob_data);
    //tbb::global_control gc(tbb::global_control::max_allowed_parallelism, 32);
    tbb::parallel_for(tbb::blocked_range<int>(0, numjobs), innerLoopTask);
}
int main() {
    // Dynamically create matrices using std::vector for easier management
    std::vector<double> A(MATRIX_DIMENSION * MATRIX_DIMENSION, 8.0);
    std::vector<double> B(MATRIX_DIMENSION * MATRIX_DIMENSION, 5.0);
    std::vector<double> C(MATRIX_DIMENSION * MATRIX_DIMENSION, 0.5);

    if (delay_threading) {
        // The statement that registers myfunction_ as the OpenBLAS threading
        // callback is missing from the original post.
    }

    auto start = std::chrono::high_resolution_clock::now();
    tbb::parallel_for(tbb::blocked_range<int>(0, 2), MatrixMultiplicationTask(A,B,C));
    auto stop = std::chrono::high_resolution_clock::now();

    // Output a portion of the result (printing the entire matrix would be too much)
    for (int i = 0; i < 10; ++i) {
        for (int j = 0; j < 10; ++j) {
            std::cout << C[i * MATRIX_DIMENSION + j] << "\t";
        }
        std::cout << std::endl;
    }

    // Compute the duration
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(stop - start);
    std::cout << "Time taken by function: " << duration.count() << " milliseconds\n";
    return 0;
}
Build command: g++ -std=c++11 -o tbb_nested tbb_nested.cpp -ltbb -lpthread -I/home/openblas/include -L/home/openblas/lib -lopenblas -Wl,-rpath,/home/openblas/lib
Help needed: As you can see, I have the following case of nested parallelism:
Level 1 (outer loop): tbb::parallel_for(tbb::blocked_range<int>(0, 2), MatrixMultiplicationTask(A,B,C));
Level 2 (inner loop): tbb::parallel_for(tbb::blocked_range<int>(0, numjobs), innerLoopTask);
In the above code, Level 1 runs for 2 iterations, and each iteration of Level 1 runs numjobs iterations of the inner loop. There is a dependency in my code: innerLoopTask only works when exactly numjobs threads execute it. What is the best nested solution TBB provides for this problem? Kindly advise.
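The dependency described above can be illustrated without TBB or OpenBLAS. The sketch below is a simplified model, not the actual OpenBLAS code: it assumes each job spins until all numjobs jobs have started, which is one way jobs can depend on running at the same time. The name jobs_complete and the timeout parameter are hypothetical and exist only so the demonstration returns instead of hanging.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>
#include <vector>

// Hypothetical model: `numjobs` jobs are handed out to `workers` threads.
// Each job waits until every job has started, so it needs all of its
// siblings to be running at the same time.
// Returns true if all jobs finished, false if the wait hit the timeout.
bool jobs_complete(int numjobs, int workers, std::chrono::milliseconds timeout) {
    assert(numjobs > 0 && workers > 0);
    std::atomic<int> next{0};            // next job index to hand out
    std::atomic<int> arrived{0};         // number of jobs that have started
    std::atomic<bool> timed_out{false};
    const auto deadline = std::chrono::steady_clock::now() + timeout;

    auto worker = [&] {
        for (int job = next.fetch_add(1); job < numjobs; job = next.fetch_add(1)) {
            arrived.fetch_add(1);
            // Rendezvous: spin until every job has started or the deadline passes.
            while (arrived.load() < numjobs) {
                if (std::chrono::steady_clock::now() > deadline) {
                    timed_out.store(true);
                    return;
                }
                std::this_thread::yield();
            }
        }
    };

    std::vector<std::thread> pool;
    for (int w = 0; w < workers; ++w) pool.emplace_back(worker);
    for (auto& t : pool) t.join();
    return !timed_out.load();
}
```

In this model, 4 jobs on 4 workers finish, while 4 jobs on 2 workers never do: the two running jobs wait for siblings that no thread is free to start. A parallel_for over numjobs iterations guarantees that every iteration runs, not that all of them run at once, so it can end up in the second situation.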
Hi @goplanid,
To guarantee parallelism in the inner loop, you could use TBB in the outer loop only. In the inner loop, you could launch numjobs threads (e.g., with std::thread) in myfunction_, with each thread performing an InnerLoopTask. You can prevent oversubscription by throttling down the oneTBB concurrency (e.g., to hardware_concurrency / numjobs).
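A minimal sketch of this suggestion, under stated assumptions: dojob_fn is a hypothetical stand-in for openblas_dojob_callback, mirroring how the test code calls dojob(i, element, dojob_data), and run_jobs_on_threads and outer_parallelism are illustrative names, not part of oneTBB or OpenBLAS. The sketch uses only the standard library so it can be tried on its own.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for openblas_dojob_callback, matching the call
// dojob(i, element_addr, dojob_data) in the test code.
using dojob_fn = void (*)(int, void*, int);

// Runs each of the numjobs jobs on its own std::thread, so all jobs execute
// concurrently, then waits for every thread to finish.
void run_jobs_on_threads(dojob_fn dojob, int numjobs, std::size_t jobdata_elsize,
                         void* jobdata, int dojob_data) {
    assert(numjobs >= 0);
    std::vector<std::thread> threads;
    threads.reserve(static_cast<std::size_t>(numjobs));
    for (int i = 0; i < numjobs; ++i) {
        // Address of the i-th element of the jobdata array.
        void* element = static_cast<char*>(jobdata)
                        + static_cast<std::size_t>(i) * jobdata_elsize;
        threads.emplace_back(dojob, i, element, dojob_data);
    }
    for (auto& t : threads) t.join();
}

// Cap for the outer oneTBB loop: hardware_threads / numjobs, but at least 1.
unsigned outer_parallelism(unsigned hardware_threads, unsigned numjobs) {
    assert(numjobs > 0);
    const unsigned limit = hardware_threads / numjobs;
    return limit > 0 ? limit : 1;
}
```

In myfunction_, a call like run_jobs_on_threads(dojob, numjobs, jobdata_elsize, jobdata, dojob_data) would take the place of the inner tbb::parallel_for. The value from outer_parallelism could then be passed to the tbb::global_control with max_allowed_parallelism that is commented out in the test code, with std::thread::hardware_concurrency() as the first argument. That function may return 0 when the count is unknown, in which case the sketch falls back to 1. Creating threads on every callback has a cost, so this trades some overhead for the concurrency guarantee.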
@goplanid is this issue still relevant?
If anyone encounters this issue in the future, please open a new issue with a link to this one.