mimalloc
mimalloc copied to clipboard
Incorrect default MI_TLS_SLOT value
The current MI_TLS_SLOT in v3 uses offset 0x888 when MI_WIN_USE_FIXED_TLS is enabled:
#define MI_TLS_SLOT (0x888) // Last user-reserved slot <https://en.wikipedia.org/wiki/Win32_Thread_Information_Block>
// #define MI_TLS_SLOT (0x1678) // Last TlsSlot (might clash with other app reserved slot)
static inline void* mi_prim_tls_slot(size_t slot) mi_attr_noexcept {
#if (_M_X64 || _M_AMD64) && !defined(_M_ARM64EC)
return (void*)__readgsqword((unsigned long)slot); // direct load at offset from gs
However I don't see how this makes any sense? That seems to be middle of TEB::Win32ClientInfo or to be more exact afAsyncKeyStateRecentDown field of the tagCLIENTINFO type that Win32ClientInfo represents (at least on older windows versions, because seems like recent win11 touched that specific field). Basically if I'm not mistaken calling GetAsyncKeyState would overwrite the mimalloc TLS "slot"?
You may well be right -- the use of static TLS on Windows is still a bit experimental (but it really improves codegen for mi_malloc so it would be nice to make it default). However, we need a fixed slot that is just for mimalloc and I am not sure how to pick the right one -- I chose 0x888 based on the info at https://en.wikipedia.org/wiki/Win32_Thread_Information_Block but clearly that seems wrong. 0x1678 might be a good choice but it may go wrong if a program reserves all user TLS slots :-(. Would you have a suggestion for a better slot?
Maybe this is a cautionary tale that undocumented OS internals, in fact, do sometimes change 😛 (The Wikipedia articles states the TEB layout was extracted from Wine, and I'm guessing they used an older Windows as a base. The Vergilius table, on the other hand, claims to be for a quite recent version.)
Strictly speaking, there's no good default index; even if you're able to find an "unused" index, there's no guarantee it doesn't get used in the future, either because an OS component now needs it, or maybe because just stuff happens to shift around. Or well, I guess there could be a "good" default, if you manage to convince a kernel person to give you an index and the promise to never ever change it. (And even then there might be an issue if two, otherwise independent, mimalloc copies end up in the same process.)
Maybe there shouldn't be a default, rather a "bring your own TLS slot" scheme; this would at least be viable for OS components or drivers that end up mimalloc. And at least app developers that specify some slot to use should (hopefully) be aware of the risks and limitations.
Speaking about codegen - I'm a bit curious, how do the different variants (TEB use, thread_local variable, use API for slot) compare?
My guess would be that they're, overhead-wise, in that order from least to most, but maybe you previously compared them more scientifically?
I pushed an update where we always use the second user reserved slot but test at runtime if we indeed can allocate that slot. If we are initialized first (as mimalloc should) this should always hold. This seems a quite nice solution for mimalloc's case but we should test more.
Btw. here is the code generated for mi_malloc with a fixed TLS slot by msvc:
mov r10,qword ptr gs:[1488h] ; read the heap pointer at the TLS slot
cmp rcx,400h ; too large?
ja mi_malloc_generic
lea rax,[rcx+7] ; round by 8
shr rax,3
mov rdx,qword ptr [r10+rax*8+0F0h] ; load page from heap small page array
mov rax,qword ptr [rdx+8] ; rax = page->free
test rax,rax ; is it NULL ?
je mi_malloc_generic
mov rcx,qword ptr [rax] ; load next free object
inc word ptr [rdx+18h] ; page->used++
mov qword ptr [rdx+8],rcx ; page->free = next;
ret
while without using a fixed TLS slot, we get much worse code (with msvc on Windows):
mov qword ptr [rsp+8],rbx ; set up a stack frame
push rdi
sub rsp,20h
mov rax,qword ptr gs:[58h] ; rax = tls array
mov rbx,rcx
mov edx,dword ptr [_tls_index (07FF7B574FE30h)] ; get static tls index
mov rdi,qword ptr [rax+rdx*8] ; rdi = thread local variable array
mov eax,18h
cmp byte ptr [rax+rdi],0 ; check tls initialized flag
jne :init
call __dyn_tls_on_demand_init (07FF7B573FFECh) ; initialize on demand
:init
mov ecx,10h ; tls offset
mov rcx,qword ptr [rcx+rdi] ; rcx = thread local heap*
cmp rbx,400h ; too large?
ja mi_malloc_generic
lea rax,[rbx+7] ; round to 8
shr rax,3
mov rdx,qword ptr [rcx+rax*8+0F0h]
mov rax,qword ptr [rdx+8]
test rax,rax
je mi_malloc_generic
mov rcx,qword ptr [rax]
inc word ptr [rdx+18h]
mov qword ptr [rdx+8],rcx
mov rbx,qword ptr [rsp+30h] ; restore stack
add rsp,20h
pop rdi
ret
(almost half the instructions are just to get the thread local and the on-demand initialization)
Maybe this is a cautionary tale that undocumented OS internals, in fact, do sometimes change 😛
Microsoft basically maintains backwards compatibility of TEB data structure. Generally fields don't shift and simply become unused.
(The Wikipedia articles states the TEB layout was extracted from Wine, and I'm guessing they used an older Windows as a base. The Vergilius table, on the other hand, claims to be for a quite recent version.)
The Wine TEB isn't 1:1 to actual Windows TEB and whoever put it on Wikipedia as if it was, is responsible for the misunderstanding. Fact of the matter is that the field in question has been the same since at last Windows 2000, if not NT 4.0.
And even then there might be an issue if two, otherwise independent, mimalloc copies end up in the same process.
This would be my concern as well, although I don't think it's a common thing to have 2 independent versions of a custom allocator in the same process.
Maybe there shouldn't be a default, rather a "bring your own TLS slot" scheme; this would at least be viable for OS components or drivers that end up mimalloc. And at least app developers that specify some slot to use should (hopefully) be aware of the risks and limitations.
A general user has very little understanding of what fields in TEB are actually unused or their lifetime in the process to make a good choice.
You may well be right -- the use of static TLS on Windows is still a bit experimental (but it really improves codegen for
mi_mallocso it would be nice to make it default). However, we need a fixed slot that is just for mimalloc and I am not sure how to pick the right one -- I chose 0x888 based on the info at https://en.wikipedia.org/wiki/Win32_Thread_Information_Block but clearly that seems wrong. 0x1678 might be a good choice but it may go wrong if a program reserves all user TLS slots :-(. Would you have a suggestion for a better slot?
From a quick look for maximum compatibility, I think Vdm pointer at 0x1690 for 64 bit might be a good choice. It is consistent with Wine for Linux. And even on 32 bit, I doubt anyone would use mimalloc and decide that they need to run a 16 bit dos program in the same process on an old version of windows...
while without using a fixed TLS slot, we get much worse code (with msvc on Windows):
I think part of the reason why the codegen is bad is that the initializer for the thread local is non-trivial. Have you tried making the thread local pointers initialize to nullptr and then setting them up in the TLS callback like in the raw TLS slot version?
I pushed an update where we always use the second user reserved slot but test at runtime if we indeed can allocate that slot. If we are initialized first (as mimalloc should) this should always hold. This seems a quite nice solution for mimalloc's case but we should test more.
Neat, clever!
while without using a fixed TLS slot, we get much worse code (with msvc on Windows):
Curious. I tried with a simple thread_local variable on godbolt, and accessing that was about the same as accessing the TEB.
(But quite possibly my experiment was too simple.)
Curious. I tried with a simple
thread_localvariable on godbolt, and accessing that was about the same as accessing the TEB. (But quite possibly my experiment was too simple.)
Hmm... looks like if you have an extern thread_local variable, MSVC plays it safe and emits the dynamic TLS initialization...
but if the definition is not extern, there seems to be a chance MSVC omits that call...
So I tried to avoid extern thread_local:
diff --git a/include/mimalloc/internal.h b/include/mimalloc/internal.h
index 9c08515e..082edffc 100644
--- a/include/mimalloc/internal.h
+++ b/include/mimalloc/internal.h
@@ -26,7 +26,7 @@ terms of the MIT license. A copy of the license can be found in the file
#define mi_decl_noinline __declspec(noinline)
#define mi_decl_thread __declspec(thread)
#define mi_decl_align(a) __declspec(align(a))
-#define mi_decl_weak
+#define mi_decl_weak __declspec(selectany)
#define mi_decl_hidden
#elif (defined(__GNUC__) && (__GNUC__ >= 3)) || defined(__clang__) // includes clang and icc
#define mi_decl_noinline __attribute__((noinline))
diff --git a/include/mimalloc/prim.h b/include/mimalloc/prim.h
index 7fbeef2a..ec4c2298 100644
--- a/include/mimalloc/prim.h
+++ b/include/mimalloc/prim.h
@@ -270,7 +270,11 @@ static inline void mi_prim_tls_slot_set(size_t slot, void* value) mi_attr_noexce
// defined in `init.c`; do not use these directly
+#if defined(_MSC_VER)
+mi_decl_weak mi_decl_thread mi_heap_t* _mi_heap_default = (mi_heap_t*)&_mi_heap_empty;
+#else
extern mi_decl_hidden mi_decl_thread mi_heap_t* _mi_heap_default; // default heap to allocate from
+#endif
extern mi_decl_hidden bool _mi_process_is_initialized; // has mi_process_init been called?
static inline mi_threadid_t __mi_prim_thread_id(void) mi_attr_noexcept;
diff --git a/src/init.c b/src/init.c
index 892f4988..75dbe5c0 100644
--- a/src/init.c
+++ b/src/init.c
@@ -183,8 +183,10 @@ mi_threadid_t _mi_thread_id(void) mi_attr_noexcept {
return _mi_prim_thread_id();
}
+#ifndef _MSC_VER
// the thread-local default heap for allocation
mi_decl_thread mi_heap_t* _mi_heap_default = (mi_heap_t*)&_mi_heap_empty;
+#endif
bool _mi_process_is_initialized = false; // set to `true` in `mi_process_init`.
The generated assembly seems much nicer, closer to the fixed TLS slot case:
; COMDAT mi_mallocn
_TEXT SEGMENT
count$ = 8
size$ = 16
mi_mallocn PROC ; COMDAT
; File D:\sources\mimalloc\include\mimalloc\prim.h
; 425 : return _mi_heap_default;
mov r8d, DWORD PTR _tls_index
mov rax, QWORD PTR gs:88
mov r9d, OFFSET FLAT:?_mi_heap_default@@3PEAUmi_heap_s@@EA ; _mi_heap_default
mov rax, QWORD PTR [rax+r8*8]
; File D:\sources\mimalloc\src\alloc.c
; 243 : return mi_heap_mallocn(mi_prim_get_default_heap(),count,size);
mov r8, rdx
mov rdx, rcx
mov rcx, QWORD PTR [r9+rax]
jmp mi_heap_mallocn
mi_mallocn ENDP
_TEXT ENDS
Another detail: the "initialize on demand" stuff is not emitted by MSVC when compiling in C mode. Makes sense: dynamic variable initializers are not allowed in C, so no need to (possibly) account for them.
Conversely, it's possible that extern thread_local also causes code for "initialize on demand" being emitted on other compilers when compiling in C++ mode. Didn't test that, though.
Hi @res2k: ah I was a bit too enthusiastic -- it looks like a great trick, very clever use of selectany! However, the pipeline failed and after more testing on my local system I now think it does not work; any fresh thread is not getting the right initialization from the COMDAT section. Not sure how that would work and I think the selectany may just not work with thread_local. To make it work, we need to initialize it before any allocations... and if we start doing this I feel the TLS slot technique is actually a better solution. For now, I will revert the selectany trick unfortunately :-(
Hi @res2k: ah I was a bit too enthusiastic -- it looks like a great trick, very clever use of
selectany! However, the pipeline failed and after more testing on my local system I now think it does not work; any fresh thread is not getting the right initialization from the COMDAT section.
Ah, too bad. I guess that what happens if you look at generated assembly one and do too little testing 😅
But, I agree that if it could be made work, it would be a nice default - in spitting distance of the best solution (fixed TLS slot), but quite a bit more robust (especially if the desired TLS slot is not available). So, I'll think I'll try to look into this again sometime in the near future.
in spitting distance of the best solution (fixed TLS slot), but quite a bit more robust (especially if the desired TLS slot is not available). So, I'll think I'll try to look into this again sometime in the near future.
I did make the TLS slot solution a tad less efficient by doing an extra load, but now it can use any slot returned by TlsAlloc which makes it much more robust (commit 52b7569). The one remaining challenge is that we need to be able to allocate this slot before any allocation happened and thus hook real early into the initialization process. However, it seems we can nowadays do this reliably on Windows with both static and dynamic overriding so I think the MI_WIN_USE_FIXED_TLS option is becoming quite safe to use. I added this to the pipeline testing now as well.
I did make the TLS slot solution a tad less efficient by doing an extra load, but now it can use any slot returned by
TlsAllocwhich makes it much more robust (commit 52b7569).
Sounds good.
From that description, the generated code should be about the same as when using thread_local, or?
This would be my concern as well, although I don't think it's a common thing to have 2 independent versions of a custom allocator in the same process.
Just for reference: This does happen for us. We embed cpython which is using mimalloc as allocator. At the same time we are also using mimalloc as our allocator as well.
This would be my concern as well, although I don't think it's a common thing to have 2 independent versions of a custom allocator in the same process.
Just for reference: This does happen for us. We embed cpython which is using mimalloc as allocator. At the same time we are also using mimalloc as our allocator as well.
Great -- the latest changes make the slot allocation dynamic (at the cost of one more load) which accomodates having multiple instances of mimalloc active.
Great -- the latest changes make the slot allocation dynamic (at the cost of one more load) which accomodates having multiple instances of mimalloc active.
Right now we are mostly doing this on unix systems. I assume there the pthread_key_create was always allowing for multiple instances right?
@maxbachmann : yes, on unix etc. there is no issue for multiple instances either;