segfault on pthread_detach
Summary
- Create a thread in the main function
- In this sub-thread's functor, we use tbb::parallel_for
- Run this program for several days; it will dump core (but not every time)
Version
2021.10.0. We also pulled the latest source from GitHub, and the problem still exists.
Environment
- Hardware: Intel(R) Xeon(R) Silver 4210R CPU @ 2.40GHz / 64GB MEM
- OS name and version: Linux precisetest 5.13.0-51-generic #58~20.04.1-Ubuntu SMP Tue Jun 14 11:29:12 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
- Compiler version: gcc version 10.3.0 (GCC)
Observed Behavior
coredump
/usr/src/glibc/glibc-2.31/nptl/pthread_detach.c:49 pthread_detach
rml_thread_monitor.h:223 void thread_monitor::detach_thread(handle_type handle)
private_server.cpp:233 void private_worker::release_handle(thread_handle handle, bool join)
private_server.cpp:256 void private_worker::start_shutdown()
private_server.cpp:188 void request_close_connection( bool /*exiting*/ ) override
thread_dispatcher.cpp:173 void thread_dispatcher::release(bool blocking_terminate)
threading_control.cpp:99 void threading_control_impl::release(bool blocking_terminate)
threading_control.cpp:259 bool threading_control::release(bool is_public, bool blocking_terminate)
threading_control.cpp:320 bool threading_control::try_destroy_client(threading_control::client_snapshot deleter)
arena.cpp:146 void arena::on_thread_leaving(unsigned ref_param)
arena.cpp:195 void arena::process(thread_data& tls)
thread_dispatcher_client.h:36 void thread_dispatcher_client::process(thread_data& td)
thread_dispatcher.cpp:183 void thread_dispatcher::process(job& j) class thread_dispatcher : no_copy, rml::tbb_client
private_server.cpp:271 void private_worker::run() noexcept
private_server.cpp:219 private_worker::thread_routine( void* arg )
private_server.cpp:305 inline void private_worker::wake_or_launch()
Steps To Reproduce
Code as follows:
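#include <chrono>
#include <thread>
#include <vector>
#include "tbb/tbb.h"
const int THREADS = 8;
int multiple10(int val) {
    //printf("before task #%d\n", val);
    // std::this_thread::sleep_for(std::chrono::seconds(2));
    // throw std::invalid_argument("test");
    //printf("after task #%d\n", val);
    return val * val;
}
int main(int argc, char* argv[]) {
    const size_t N = 1000000;
    // Assumption: the declarations of count and data were truncated in the
    // report; they are reconstructed here from how the loop below uses them.
    const int count = static_cast<int>(N);
    std::vector<int> data(count);
    std::vector<int> result(count);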
    std::chrono::time_point<std::chrono::high_resolution_clock> startTime, endTime;
    startTime = std::chrono::high_resolution_clock::now();
    std::thread thd([&]() {
        // tbb::global_control c(tbb::global_control::max_allowed_parallelism, THREADS);
        tbb::parallel_for(tbb::blocked_range<int>(0, count), [&](const tbb::blocked_range<int>& r) {
            for (int i = r.begin(); i != r.end(); ++i) {
                result.at(i) = multiple10(data.at(i));
            }
        });
    });
    thd.join();
    endTime = std::chrono::high_resolution_clock::now();
    auto sec = std::chrono::duration_cast<std::chrono::seconds>(endTime.time_since_epoch() -
                                                                startTime.time_since_epoch())
                   .count();
    //std::cout << ":duration " << sec << "s" << std::endl;
    //for (int i = 0; i < count; ++i) {
    //    std::cout << data[i] << " -> " << result[i] << std::endl;
    //}
    return 0;
}
Just wanted to create an issue for this as well. This is very likely due to a race condition in the pthread_detach code itself. This is the bug ticket: https://sourceware.org/bugzilla/show_bug.cgi?id=19951. I will add a comment to that ticket later today.
We actually see this issue in our use cases multiple times a week, so what I did for us is to force a join on shutdown (with a manual TBB patch). I think this is the only real solution as long as the bug persists in glibc.
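To illustrate the join-on-shutdown idea in isolation, here is a minimal sketch using plain std::thread. It is not the TBB patch mentioned above (that patch is not shown in this thread); JoiningPool and run_and_join are hypothetical names made up for this example. The point is only that every worker is joined, so pthread_detach is never called on a thread that might be exiting at the same moment.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical worker pool (not oneTBB code): every worker is joined on
// shutdown, so no thread is ever detached while it may be exiting.
class JoiningPool {
public:
    explicit JoiningPool(std::size_t n) {
        workers_.reserve(n);
        for (std::size_t i = 0; i < n; ++i) {
            // Each worker just records that it ran, then exits.
            workers_.emplace_back([this] { ++finished_; });
        }
    }

    // Blocks until every worker has exited; safe to call more than once.
    void shutdown() {
        for (std::thread& t : workers_) {
            if (t.joinable()) t.join();
        }
    }

    ~JoiningPool() { shutdown(); }

    std::size_t finished() const { return finished_.load(); }

private:
    std::atomic<std::size_t> finished_{0};  // declared first, so it outlives workers_
    std::vector<std::thread> workers_;
};

// Starts n workers, joins them all, and returns how many ran to completion.
std::size_t run_and_join(std::size_t n) {
    JoiningPool pool(n);
    pool.shutdown();
    const std::size_t done = pool.finished();
    assert(done == n);  // join() returning means every worker has finished
    return done;
}
```

Calling run_and_join(8) starts eight workers and returns only after all of them have been joined. The cost of this approach is that shutdown blocks until the slowest worker exits.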
This is probably the same issue as reported here: https://github.com/uxlfoundation/oneTBB/issues/334
@chenlimingcn which glibc version are you using? I guess it is probably not newer than 2.35, right? It would be interesting for the glibc bug ticket.
I believe we're seeing this glibc issue on our servers with glibc 2.39. We don't use oneTBB, though.
If you want, you can post to the glibc bugzilla thread. I did not have the chance to follow up, because I am now working at a different company and also because we were able to solve it by patching TBB to not use detach anymore. It is written in the bugzilla thread as well, but if you want to avoid the crash you need to avoid detaching a thread that can concurrently shut down. So the two main options are: a) do not detach but join the threads; b) somehow ensure that the thread to be detached is definitely still running or definitely already shut down at the moment where you call detach. I did not do a thorough search, but I think the last discussion of a bug fix was this one: https://sourceware.org/pipermail/libc-alpha/2025-July/168738.html.
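Option b) could look roughly like the sketch below, which keeps the worker parked until pthread_detach has returned, so the detach can never overlap with the thread's exit. This is only an illustration under stated assumptions: Gate, gated_worker and spawn_and_detach_while_running are hypothetical names, not glibc or oneTBB API, and a real program would probably block on a condition variable or semaphore instead of spinning.

```cpp
#include <pthread.h>

#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper state (not glibc or oneTBB API). The worker owns it
// and frees it right before exiting.
struct Gate {
    std::atomic<bool> may_exit{false};
};

void* gated_worker(void* arg) {
    Gate* gate = static_cast<Gate*>(arg);
    assert(gate != nullptr);
    // ... the real work would go here ...
    // Do not return until the owner has finished calling pthread_detach.
    while (!gate->may_exit.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    delete gate;  // the owner never touches the gate after opening it
    return nullptr;
}

// Returns 0 if both pthread_create and pthread_detach succeeded,
// otherwise the error code of the call that failed.
int spawn_and_detach_while_running() {
    Gate* gate = new Gate;
    pthread_t tid;
    int rc = pthread_create(&tid, nullptr, gated_worker, gate);
    if (rc != 0) {
        delete gate;  // no worker exists, so the owner cleans up
        return rc;
    }
    // The worker is parked on the gate, so it is definitely still running
    // while pthread_detach executes.
    rc = pthread_detach(tid);
    // Only now is the worker allowed to exit.
    gate->may_exit.store(true, std::memory_order_release);
    return rc;
}
```

The design choice here is that the thread exit is ordered strictly after the detach call, at the price of one extra synchronization step per thread. Joining, as in option a), avoids that bookkeeping entirely.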