segfault on pthread_detach

Open chenlimingcn opened this issue 6 months ago • 4 comments

Summary

  1. Create a thread in the main function.
  2. In this sub-thread's functor, call tbb::parallel_for.
  3. Run this program over several days; it will core dump (but not every time).

Version

2021.10.0. We also built the latest source from GitHub, and the problem still exists.

Environment

Provide any environmental details that you consider significant for reproducing the issue. The following information is important:

  • Hardware: Intel(R) Xeon(R) Silver 4210R CPU @ 2.40GHz / 64GB MEM
  • OS name and version: Linux precisetest 5.13.0-51-generic #58~20.04.1-Ubuntu SMP Tue Jun 14 11:29:12 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
  • Compiler version: gcc version 10.3.0 (GCC)

Observed Behavior

Core dump with the following backtrace (innermost frame first):

/usr/src/glibc/glibc-2.31/nptl/pthread_detach.c:49 pthread_detach
rml_thread_monitor.h:223 void thread_monitor::detach_thread(handle_type handle)
private_server.cpp:233 void private_worker::release_handle(thread_handle handle, bool join)
private_server.cpp:256 void private_worker::start_shutdown()
private_server.cpp:188 void request_close_connection( bool /*exiting*/ ) override
thread_dispatcher.cpp:173 void thread_dispatcher::release(bool blocking_terminate)
threading_control.cpp:99 void threading_control_impl::release(bool blocking_terminate)
threading_control.cpp:259 bool threading_control::release(bool is_public, bool blocking_terminate)
threading_control.cpp:320 bool threading_control::try_destroy_client(threading_control::client_snapshot deleter)
arena.cpp:146 void arena::on_thread_leaving(unsigned ref_param)
arena.cpp:195 void arena::process(thread_data& tls)
thread_dispatcher_client.h:36 void thread_dispatcher_client::process(thread_data& td)
thread_dispatcher.cpp:183 void thread_dispatcher::process(job& j) class thread_dispatcher : no_copy, rml::tbb_client
private_server.cpp:271 void private_worker::run() noexcept
private_server.cpp:219 private_worker::thread_routine( void* arg )
private_server.cpp:305 inline void private_worker::wake_or_launch()

Steps To Reproduce

Code as follows:

#include <chrono>
#include <thread>
#include <vector>

#include "tbb/tbb.h"

const int THREADS = 8;

// Despite its name, this returns the square of val.
// Note: val * val overflows int for val > 46340.
int multiple10(int val) {
    //printf("before task #%d\n", val);
    // std::this_thread::sleep_for(std::chrono::seconds(2));
    // throw std::invalid_argument("test");
    //printf("after task #%d\n", val);
    return val * val;
}

int main(int argc, char* argv[]) {
    const size_t N = 1000000;
    std::vector<int> data(N, 1);
    int count = data.size();
    for (int i = 0; i < count; ++i) {
        data[i] = i + 1;
    }

    std::vector<int> result(count);

    std::chrono::time_point<std::chrono::high_resolution_clock> startTime, endTime;

    startTime = std::chrono::high_resolution_clock::now();
    // Run parallel_for from a sub-thread rather than from main.
    std::thread thd([&]()
    {
        // tbb::global_control c(tbb::global_control::max_allowed_parallelism, THREADS);
        tbb::parallel_for(tbb::blocked_range<int>(0, count), [&](const tbb::blocked_range<int>& r) {
            for (int i = r.begin(); i != r.end(); ++i) {
                result.at(i) = multiple10(data.at(i));
            }
        });
    }
    );
    thd.join();
    endTime = std::chrono::high_resolution_clock::now();

    auto sec = std::chrono::duration_cast<std::chrono::seconds>(endTime.time_since_epoch() -
                                                                startTime.time_since_epoch())
                   .count();

    //std::cout << ":duration " << sec << "s" << std::endl;
    //for (int i = 0; i < count; ++i) {
    //    std::cout << data[i] << " -> " << result[i] << std::endl;
    //}

    return 0;
}

gdb info.txt

chenlimingcn, Jun 18 '25 01:06

I was just about to create an issue for this as well. This is very likely due to a race condition in glibc's pthread_detach code itself. This is the glibc bug report: https://sourceware.org/bugzilla/show_bug.cgi?id=19951. I will add a comment to that report later today.

We actually hit this issue multiple times a week in our use cases, so what I did for us is force a join on shutdown (with a manual TBB patch). I think this is the only real solution as long as the bug persists in glibc.
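For illustration, here is a minimal sketch of the join-on-shutdown idea using plain pthreads. It is not the actual TBB patch and not oneTBB's private_server code; `JoiningPool`, `routine`, and `finished` are hypothetical names made up for this example. The only point it shows is that shutdown calls pthread_join on every worker handle and never calls pthread_detach.

```cpp
#include <pthread.h>

#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical worker pool, for illustration only (not oneTBB code).
class JoiningPool {
public:
    explicit JoiningPool(std::size_t n) {
        handles_.reserve(n);
        for (std::size_t i = 0; i < n; ++i) {
            pthread_t h;
            // Keep only the handles of threads that were actually created.
            if (pthread_create(&h, nullptr, &JoiningPool::routine, this) == 0) {
                handles_.push_back(h);
            }
        }
    }

    JoiningPool(const JoiningPool&) = delete;
    JoiningPool& operator=(const JoiningPool&) = delete;

    ~JoiningPool() { shutdown(); }

    // Join every worker instead of detaching it. pthread_join returns only
    // after the thread has terminated, so no detach can race with thread exit.
    void shutdown() {
        for (pthread_t h : handles_) {
            int rc = pthread_join(h, nullptr);
            assert(rc == 0);
            (void)rc;  // avoids an unused-variable warning when NDEBUG is set
        }
        handles_.clear();  // makes a second shutdown() call a no-op
    }

    // Number of workers that have run their routine so far.
    int finished() const { return finished_.load(); }

private:
    static void* routine(void* arg) {
        // Stand-in for real work: just record that this worker ran.
        static_cast<JoiningPool*>(arg)->finished_.fetch_add(1);
        return nullptr;
    }

    std::vector<pthread_t> handles_;
    std::atomic<int> finished_{0};
};
```

The trade-off is that shutdown now blocks until every worker has returned, which is the price for never touching pthread_detach.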

This is probably the same issue as reported here: https://github.com/uxlfoundation/oneTBB/issues/334

griebelsi, Jun 27 '25 08:06

@chenlimingcn which glibc version are you using? I guess it is probably not newer than 2.35, right? That would be useful information for the glibc bug report.

griebelsi, Jun 27 '25 08:06

I believe we're seeing this glibc issue on our servers with glibc 2.39. We don't use oneTBB, though.

luke-gruber, Nov 11 '25 18:11

If you want, you can post to the glibc Bugzilla thread. I did not have the chance to follow up, because I am now working at a different company and also because we were able to solve it by patching TBB to not use detach anymore. It is written in the Bugzilla thread as well, but if you want to avoid the crash you need to avoid detaching a thread that can concurrently shut down. So the two main options are: a) do not detach but join the threads; b) somehow enforce that the thread to be detached is definitely still running or definitely already shut down at the moment when you call detach. I did not do a thorough search, but I think the most recent discussion of a bug fix was this one: https://sourceware.org/pipermail/libc-alpha/2025-July/168738.html.
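Option b) can be sketched with plain pthreads too. In this sketch the worker parks on a gate until the launcher has returned from pthread_detach, so the thread is definitely still running at the moment of the detach call. `DetachGate`, `gated_worker`, and `launch_and_detach_gated` are hypothetical names for illustration, and nothing here is TBB or glibc code. That this ordering is enough to avoid the glibc race is an assumption based on the description above, not something verified against glibc.

```cpp
#include <pthread.h>

#include <cassert>
#include <condition_variable>
#include <memory>
#include <mutex>

// Hypothetical gate shared between launcher and worker (illustration only).
struct DetachGate {
    std::mutex m;
    std::condition_variable cv;
    bool detached = false;  // set by the launcher after pthread_detach returns
    bool done = false;      // set by the worker once it has passed the gate
};

void* gated_worker(void* arg) {
    assert(arg != nullptr);
    // Take ownership of the heap-allocated shared_ptr so the gate outlives
    // both sides, whichever finishes last.
    std::unique_ptr<std::shared_ptr<DetachGate>> holder(
        static_cast<std::shared_ptr<DetachGate>*>(arg));
    DetachGate& g = **holder;
    // ... real work would go here ...
    {
        std::unique_lock<std::mutex> lock(g.m);
        // Do not proceed towards thread exit until the detach has happened.
        g.cv.wait(lock, [&g] { return g.detached; });
        g.done = true;
    }
    g.cv.notify_all();
    return nullptr;
}

// Returns true if the thread was created and pthread_detach reported success.
bool launch_and_detach_gated() {
    auto gate = std::make_shared<DetachGate>();
    auto* arg = new std::shared_ptr<DetachGate>(gate);
    pthread_t h;
    if (pthread_create(&h, nullptr, gated_worker, arg) != 0) {
        delete arg;  // the worker never started, so it cannot free this
        return false;
    }
    // The worker cannot get past the gate yet, so it is still running here.
    int rc = pthread_detach(h);
    {
        std::lock_guard<std::mutex> lock(gate->m);
        gate->detached = true;  // open the gate only after detach returned
    }
    gate->cv.notify_all();
    // Demonstration only: wait until the worker has passed the gate.
    std::unique_lock<std::mutex> lock(gate->m);
    gate->cv.wait(lock, [&gate] { return gate->done; });
    return rc == 0;
}
```

Compared with option a), this keeps the threads detached but adds a synchronization point to every thread's exit path, so joining is usually the simpler choice when the owner can afford to wait.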

griebelsi, Nov 12 '25 13:11