MINGW-packages
MINGW-packages copied to clipboard
gcc thread_local destructor cause use after free
Hello,
I posted this issue on the gcc bug tracker, but I am not entirely sure if this is supposed to be a GCC bug or a mingw bug.
I see it has also been reported here, however at the moment I can not create a sourceforge account to comment on that. I apologize if this is not the proper place.
The issue here, to my best understanding, is the following.
-
At the very first access of a thread-local object in a new program,
emutls_init
is called via__emutls_get_address
. This function allocates some memory for the storage usingmalloc
(seeemutls_alloc
), and registersemutls_destroy
to be executed at thread exit, using__gthread_key_create
. -
Immediately after, the
TLS init function
(generated by GCC) is called. This calls the C++ constructor on the storage provided at the previous point. It also uses__gthread_key_create
to register the C++ destructor for execution at thread exit. -
When a thread terminates,
_pthread_cleanup_dest
in winpthread is called. This function is responsible for calling all the destructors registered withpthread_key_create
. Unfortunately, the execution order of such destructors is unspecified. In this case, it happens thatemutls_destroy
is called before the C++ destructor, which then executes on deallocated memory.
A quick workaround would be to reverse the loop in winpthread/src/thread.c:842. This will cause the destructors to be called in reverse order, and it will probably work correctly in most of the cases. The only other solution would be to merge emutls+TLS init on GCC side, in such a way to have a single destructor callback to both invoke the C++ destructor and cleanup the memory.
CC @lhmouse
Yes I knew this bug a couple of years ago... but who the heck has been caring about this problem?
The emutls thing is utterly broken. I happened to report a similar issue a few hours ago (if you LoadLibrary()
a DLL and the DLL creates some thread_local
objects, then FreeLibrary()
'ing that DLL is very likely to lead to a crash), not to mention that __emutls_get_address()
aborts the process once it runs out of memory.
So here comes the rule of thumb: Don't use thread_local
on GCC for MinGW targets.
Well.. pretty sad. I might post a patch to winpthread to workaround this specific issue.
My understanding is slightly different:
- On
thread_local
variable creation GCC calls__emutls_get_address
, which calls__gthread_key_create
, which presumably callspthread_key_create
fromwinpthreads
by MinGW. It adds corresponding "destructor" (emutls_destroy
which deallocates memory) into an array of destructors. At the moment, logic is as follows: try inserting at_pthread_key_sch
if available, go towards the end of the array, then wrap, then reallocate if the array is filled. However, it looks_pthread_key_sch
is the smallest available place in the array. - After completing the constructor successfully GCC calls
__cxa_thread_atexit
to register destructor. It's implemented inmingw-w64-crt
(notwinpthreads
), it calls__mingw_cxa_thread_atexit
, which adds destructor to a separate TLS no.tls_dtors_slot
. - When
std::thread
finishes, it returns control topthread_create_wrapper
fromwinpthreads
, it's the routine which actually runs the thread and is passed_beginthreadex
. This routine calls_pthread_cleanup_dest
in the thread before terminating with_endthreadex
. This in turn callsemutls_destroy
and memory is deallocated. - Now that
_endthreadex
is called and the system knows for sure the thread is terminating,tls_callback
frommingw-w64-crt
is called withDLL_THREAD_DETACH
and runs destructors from bullet 2 above.
So, the core issue looks like there are two separate mechanisms for calling something at thread exit: pthread_key_create
(which calls destructors before pthread-emulated thread is terminated) and __cxa_thread_atexit
(which calls destructors when thread is actually terminating). Memory is allocated through the former, destructors are called through the latter, hence the use-after-free.
Moreover, it looks like on Linux (at the very least) pthread_key_create
destructors are called strictly after __cxa_thread_atexit
, and on Windows it happens the other way around. Here is a sample code to test it:
#include <stdio.h>
#include <pthread.h>
#include <cxxabi.h>
int a, b, c, d, e;
extern int __dso_handle;
void mydestr_atexit(void *arg) {
printf("mydestr_atexit(%p)\n", arg);
}
void mydestr_pthread_key(void *arg) {
printf("mydestr_pthread_key(%p)\n", arg);
}
void* thread(void *arg) {
printf("thread %p\n", arg);
abi::__cxa_thread_atexit(mydestr_atexit, &b, &__dso_handle);
{
pthread_key_t key;
pthread_key_create(&key, mydestr_pthread_key);
pthread_setspecific(key, &c);
}
abi::__cxa_thread_atexit(mydestr_atexit, &d, &__dso_handle);
{
pthread_key_t key;
pthread_key_create(&key, mydestr_pthread_key);
pthread_setspecific(key, &e);
}
return nullptr;
}
int main() {
printf("a=%p\nb=%p\nc=%p\nd=%p\ne=%p\n", &a, &b, &c, &d, &e);
pthread_t th;
pthread_create(&th, nullptr, thread, &a);
pthread_join(th, nullptr);
}
Moreover, it looks like destructors by pthread_key
can easily be called in a wrong order: from oldest keys to newest. So it feels kinda wrong to use it for TLS emulation.
Relevant stuff: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58142 https://sourceforge.net/p/mingw-w64/bugs/727/ https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80816
I "discovered" this and tried to make some progress, but there's still a case that results in use-after-free: https://sourceforge.net/p/mingw-w64/mailman/message/37186085/
Will it help if _pthread_cleanup_dest
in winpthreads
calls destructors registered in crt by __mingw_cxa_thread_atexit
(i.e. run_dtor_list
from mingw-w64-crt\crt\tls_atexit.c
)?
I believe something similar was suggested in this message, possibly for another issue: https://sourceforge.net/p/mingw-w64/mailman/message/37140300/
- Make crt expose a function like
__mingw_cxa_thread_finalize
which calls callbacks registered by__mingw_cxa_thread_atexit
and removes them from the list. That way we don't have to modify existing logic nearDLL_THREAD_DETACH
.void __mingw_cxa_thread_finalize(void) { assert(inited); // Not sure why it's needed, but sounds logical. dtor_obj * p = (dtor_obj *)TlsGetValue(tls_dtors_slot); run_dtor_list(&p); TlsSetValue(tls_dtors_slot, p); }
- Make winpthreads call
__mingw_cxa_thread_finalize
in_pthread_cleanup_dest
._pthread_cleanup_dest
is called either bypthread_exit
, or a terminating thread created viapthread_create
, or similarDLL_THREAD_DETACH
event (in which case we care less about order).
Such behavior would be a little more consistent with what happens on Linux: "user-defined" thread_atexit
callbacks run first, "low-level" pthread cleanup runs last. It is assumed by at least some consumers, e.g. emutls, maybe others.
Should help with this specific issue. Not sure how it plays with dynamic linking, configuring, etc. The patch would be quite small, though.
Also, I assume it's impossible for crt to be detached from a thread before winpthreads, as the latter depends(?) on the former.
This particular mingw-w64 "crt" code is present in each module, ie always statically linked. So each module has its own copy, and winpthreads calling it would call its copy only. Unless I'm missing something
Would probably be better to discuss such details on mingw-w64-public list though.
Can we turn the mingw's implementation of _Thread_local to be consistence with msvc? It's seems msvc'c __declspec(thread) doesn't depends on winpthreads at all.
Can we turn the mingw's implementation of _Thread_local to be consistence with msvc?
It's seems msvc'c __declspec(thread) doesn't depends on winpthreads at all.
Please read this series of articles first: http://www.nynaeve.net/?p=183
Can we turn the mingw's implementation of _Thread_local to be consistence with msvc? It's seems msvc'c __declspec(thread) doesn't depends on winpthreads at all.
Please read this series of articles first: http://www.nynaeve.net/?p=183
Yeap, I've read it before, so I post it here
Can we turn the mingw's implementation of _Thread_local to be consistence with msvc? It's seems msvc'c __declspec(thread) doesn't depends on winpthreads at all.
Please read this series of articles first: http://www.nynaeve.net/?p=183
Yeap, I've read it before, so I post it here
Then you should have known that there is no such support in GCC. It is consequently impossible at least for now.
I think it's time that we give the posix thread model up. See https://github.com/msys2/MINGW-packages/discussions/13259.
@lhmouse, sorry for the ping. Is this the same bug? Are there any plans to fix it? How does it affect the program?
Thread 10 received signal SIGSEGV, Segmentation fault.
[Switching to Thread 2308.0x36ac]
0x0000000001ef8d0c in std::__detail::_Hash_node<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, unsigned long long>, true>::_M_next (this=0xfeeefeeefeeefeee)
at C:/msys64/mingw64/include/c++/13.2.0/bits/hashtable_policy.h:377
377 { return static_cast<_Hash_node*>(this->_M_nxt); }
(gdb) bt
#0 0x0000000001ef8d0c in std::__detail::_Hash_node<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, unsigned long long>, true>::_M_next (
this=0xfeeefeeefeeefeee) at C:/msys64/mingw64/include/c++/13.2.0/bits/hashtable_policy.h:377
#1 0x00000000023171a6 in std::__detail::_Hashtable_alloc<std::allocator<std::__detail::_Hash_node<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, unsigned long long>, true> > >::_M_deallocate_nodes (this=0x335b49a8, __n=0xfeeefeeefeeefeee) at C:/msys64/mingw64/include/c++/13.2.0/bits/hashtable_policy.h:2041
#2 0x0000000001f3cac7 in std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, unsigned long long>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, unsigned long long> >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::clear (this=0x335b49a8)
at C:/msys64/mingw64/include/c++/13.2.0/bits/hashtable.h:2509
#3 0x0000000001f3cf08 in std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, unsigned long long>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, unsigned long long> >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::~_Hashtable (this=0x335b49a8, __in_chrg=<optimized out>)
at C:/msys64/mingw64/include/c++/13.2.0/bits/hashtable.h:1593
#4 0x0000000002066308 in std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned long long, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, unsigned long long> > >::~unordered_map (this=0x335b49a8, __in_chrg=<optimized out>) at C:/msys64/mingw64/include/c++/13.2.0/bits/unordered_map.h:109
#5 0x000000000173a72d in run_dtor_list (ptr=<optimized out>) at C:/M/B/src/mingw-w64/mingw-w64-crt/crt/tls_atexit.c:62
#6 run_dtor_list (ptr=0x2fd1f3d0) at C:/M/B/src/mingw-w64/mingw-w64-crt/crt/tls_atexit.c:56
#7 tls_callback (hDllHandle=<optimized out>, dwReason=<optimized out>, lpReserved=<optimized out>) at C:/M/B/src/mingw-w64/mingw-w64-crt/crt/tls_atexit.c:165
#8 0x00007ffc020c9a1d in ntdll!RtlActivateActivationContextUnsafeFast () from C:\WINDOWS\SYSTEM32\ntdll.dll
#9 0x00007ffc020c9aff in ntdll!RtlActivateActivationContextUnsafeFast () from C:\WINDOWS\SYSTEM32\ntdll.dll
#10 0x00007ffc020c763b in ntdll!LdrShutdownThread () from C:\WINDOWS\SYSTEM32\ntdll.dll
#11 0x00007ffc0210468e in ntdll!RtlExitUserThread () from C:\WINDOWS\SYSTEM32\ntdll.dll
#12 0x00007ffc005aafa7 in msvcrt!_endthreadex () from C:\WINDOWS\System32\msvcrt.dll
#13 0x00000000006342b2 in thread_base::finalize (_self=7608) at C:/src/rpcs3/Utilities/Thread.cpp:2241
#14 0x000000003003d494 in ?? ()
#15 0x0000000000001db8 in ?? ()
#16 0x0000000000000000 in ?? ()
@oltolm Someone expressed negative thoughts in mingw-w64 so I am afraid there is currently no plan on it.
A proper fix for it requires hacking the CRT to invoke thread-local destructors before calling exit()
^1 ^2. If you have time please try my personal toolchains here: https://gcc-mcf.lhmouse.com/
gcc14 win32 thread model seems to support c++11 thread. gcc14 will be released soon, so maybe consider use win32 thread model instead.
gcc14 win32 thread model seems to support c++11 thread. gcc14 will be released soon, so maybe consider use win32 thread model instead.
I don't think it will solve the issue: Destructors of static objects are called by the VCRT/UCRT exit()
function, and destructors of thread-local objects are called by the TLS callback which happens after exit()
calls ExitProcess()
. The cause of this problem is always there.
And in addition, switching the thread model is an ABI break.
How does Clang solve this problem?
How does Clang solve this problem?
It has nothing to do with Clang; the magic lies in UCRT. We have https://github.com/mingw-w64/mingw-w64/blob/dbfdf802580c0d6b90b91995dab019f2a7787a8e/mingw-w64-crt/crt/tls_atexit.c#L108-L119 which causes UCRT to invoke destructors of thread-local objects before static ones. For MSVCRT we have an emulated (non-conforming) one in 'mingw-w64-crt/misc/register_tls_atexit.c'.
https://github.com/mingw-w64/mingw-w64/blob/dbfdf802580c0d6b90b91995dab019f2a7787a8e/mingw-w64-crt/misc/register_tls_atexit.c
So the advantage of MCF over MSVCRT is correctness and performance and the advantage of MCF over UCRT is performance? Is that correct?
Why not create a minimal environment/subsystem for MCF with development tools only? It is probably source compatible, so building packages would be easy. Sadly I can not use it without pacman. I would contribute to the MCF subsystem.
So the advantage of MCF over MSVCRT is correctness and performance and the advantage of MCF over UCRT is performance? Is that correct?
Yes.
Why not create a minimal environment/subsystem for MCF with development tools only? It is probably source compatible, so building packages would be easy. Sadly I can not use it without pacman. I would contribute to the MCF subsystem.
See https://gcc-mcf.lhmouse.com/ which is largely repackaged MSYS2 environments.
See https://gcc-mcf.lhmouse.com/ which is largely repackaged MSYS2 environments.
I know about it, it's great, but as I said I can't use it without pacman. pacman is too convenient.
I know about it, it's great, but as I said I can't use it without pacman. pacman is too convenient.
How about this? https://github.com/lhmouse/MINGW-packages
One reason that I do not upload pacman packages is that they are all 'drop-in' replacements for MSYS2 ones, and I have to configure my local pacman.conf to ignore them; otherwise they might be upgraded by pacman to MSYS2 ones.
I can upload a snapshot of my local packages. Once you have these packages you can bootstrap them yourself.
@oltolm pacman packages can be found here: https://files.lhmouse.com/pkg-snapshots/ . Source scripts are in https://github.com/lhmouse/MINGW-packages .
Installing these packages locally with pacman -U
will produce environments for building everything else.