mimalloc icon indicating copy to clipboard operation
mimalloc copied to clipboard

Incorrect default MI_TLS_SLOT value

Open JustasMasiulis opened this issue 7 months ago • 17 comments

The current MI_TLS_SLOT in v3 uses offset 0x888 when MI_WIN_USE_FIXED_TLS is enabled:

#define MI_TLS_SLOT     (0x888)             // Last user-reserved slot <https://en.wikipedia.org/wiki/Win32_Thread_Information_Block>
// #define MI_TLS_SLOT  (0x1678)            // Last TlsSlot (might clash with other app reserved slot)

static inline void* mi_prim_tls_slot(size_t slot) mi_attr_noexcept {
  #if (_M_X64 || _M_AMD64) && !defined(_M_ARM64EC)
  return (void*)__readgsqword((unsigned long)slot);   // direct load at offset from gs

However I don't see how this makes any sense? That seems to be middle of TEB::Win32ClientInfo or to be more exact afAsyncKeyStateRecentDown field of the tagCLIENTINFO type that Win32ClientInfo represents (at least on older windows versions, because seems like recent win11 touched that specific field). Basically if I'm not mistaken calling GetAsyncKeyState would overwrite the mimalloc TLS "slot"?

JustasMasiulis avatar Apr 18 '25 03:04 JustasMasiulis

You may well be right -- the use of static TLS on Windows is still a bit experimental (but it really improves codegen for mi_malloc so it would be nice to make it default). However, we need a fixed slot that is just for mimalloc and I am not sure how to pick the right one -- I chose 0x888 based on the info at https://en.wikipedia.org/wiki/Win32_Thread_Information_Block but clearly that seems wrong. 0x1678 might be a good choice but it may go wrong if a program reserves all user TLS slots :-(. Would you have a suggestion for a better slot?

daanx avatar May 02 '25 15:05 daanx

Maybe this is a cautionary tale that undocumented OS internals, in fact, do sometimes change 😛 (The Wikipedia articles states the TEB layout was extracted from Wine, and I'm guessing they used an older Windows as a base. The Vergilius table, on the other hand, claims to be for a quite recent version.)

Strictly speaking, there's no good default index; even if you're able to find an "unused" index, there's no guarantee it doesn't get used in the future, either because an OS component now needs it, or maybe because just stuff happens to shift around. Or well, I guess there could be a "good" default, if you manage to convince a kernel person to give you an index and the promise to never ever change it. (And even then there might be an issue if two, otherwise independent, mimalloc copies end up in the same process.)

Maybe there shouldn't be a default, rather a "bring your own TLS slot" scheme; this would at least be viable for OS components or drivers that end up mimalloc. And at least app developers that specify some slot to use should (hopefully) be aware of the risks and limitations.

Speaking about codegen - I'm a bit curious, how do the different variants (TEB use, thread_local variable, use API for slot) compare? My guess would be that they're, overhead-wise, in that order from least to most, but maybe you previously compared them more scientifically?

res2k avatar May 02 '25 16:05 res2k

I pushed an update where we always use the second user reserved slot but test at runtime if we indeed can allocate that slot. If we are initialized first (as mimalloc should) this should always hold. This seems a quite nice solution for mimalloc's case but we should test more.

Btw. here is the code generated for mi_malloc with a fixed TLS slot by msvc:

mov     r10,qword ptr gs:[1488h]         ; read the heap pointer at the TLS slot
cmp     rcx,400h                         ; too large?
ja      mi_malloc_generic 
lea     rax,[rcx+7]                      ; round by 8
shr     rax,3 
mov     rdx,qword ptr [r10+rax*8+0F0h]   ; load page from heap small page array
mov     rax,qword ptr [rdx+8]            ; rax = page->free
test    rax,rax                          ; is it NULL ?
je      mi_malloc_generic 
mov     rcx,qword ptr [rax]              ; load next free object
inc     word ptr [rdx+18h]               ; page->used++
mov     qword ptr [rdx+8],rcx            ; page->free = next;
ret

while without using a fixed TLS slot, we get much worse code (with msvc on Windows):

mov     qword ptr [rsp+8],rbx          ; set up a stack frame
push    rdi 
sub     rsp,20h 
mov     rax,qword ptr gs:[58h]         ; rax = tls array
mov     rbx,rcx 
mov     edx,dword ptr [_tls_index (07FF7B574FE30h)]  ; get static tls index
mov     rdi,qword ptr [rax+rdx*8]      ; rdi = thread local variable array
mov     eax,18h                       
cmp     byte ptr [rax+rdi],0           ; check tls initialized flag
jne     :init                          
call    __dyn_tls_on_demand_init (07FF7B573FFECh) ; initialize on demand
:init
mov     ecx,10h                        ; tls offset
mov     rcx,qword ptr [rcx+rdi]        ; rcx = thread local heap*
cmp     rbx,400h                       ; too large?
ja      mi_malloc_generic
lea     rax,[rbx+7]                    ; round to 8
shr     rax,3 
mov     rdx,qword ptr [rcx+rax*8+0F0h] 
mov     rax,qword ptr [rdx+8] 
test    rax,rax 
je      mi_malloc_generic
mov     rcx,qword ptr [rax] 
inc     word ptr [rdx+18h] 
mov     qword ptr [rdx+8],rcx 
mov     rbx,qword ptr [rsp+30h]        ; restore stack
add     rsp,20h 
pop     rdi 
ret

(almost half the instructions are just to get the thread local and the on-demand initialization)

daanx avatar May 02 '25 23:05 daanx

Maybe this is a cautionary tale that undocumented OS internals, in fact, do sometimes change 😛

Microsoft basically maintains backwards compatibility of TEB data structure. Generally fields don't shift and simply become unused.

(The Wikipedia articles states the TEB layout was extracted from Wine, and I'm guessing they used an older Windows as a base. The Vergilius table, on the other hand, claims to be for a quite recent version.)

The Wine TEB isn't 1:1 to actual Windows TEB and whoever put it on Wikipedia as if it was, is responsible for the misunderstanding. Fact of the matter is that the field in question has been the same since at last Windows 2000, if not NT 4.0.

And even then there might be an issue if two, otherwise independent, mimalloc copies end up in the same process.

This would be my concern as well, although I don't think it's a common thing to have 2 independent versions of a custom allocator in the same process.

Maybe there shouldn't be a default, rather a "bring your own TLS slot" scheme; this would at least be viable for OS components or drivers that end up mimalloc. And at least app developers that specify some slot to use should (hopefully) be aware of the risks and limitations.

A general user has very little understanding of what fields in TEB are actually unused or their lifetime in the process to make a good choice.

JustasMasiulis avatar May 03 '25 00:05 JustasMasiulis

You may well be right -- the use of static TLS on Windows is still a bit experimental (but it really improves codegen for mi_malloc so it would be nice to make it default). However, we need a fixed slot that is just for mimalloc and I am not sure how to pick the right one -- I chose 0x888 based on the info at https://en.wikipedia.org/wiki/Win32_Thread_Information_Block but clearly that seems wrong. 0x1678 might be a good choice but it may go wrong if a program reserves all user TLS slots :-(. Would you have a suggestion for a better slot?

From a quick look for maximum compatibility, I think Vdm pointer at 0x1690 for 64 bit might be a good choice. It is consistent with Wine for Linux. And even on 32 bit, I doubt anyone would use mimalloc and decide that they need to run a 16 bit dos program in the same process on an old version of windows...

while without using a fixed TLS slot, we get much worse code (with msvc on Windows):

I think part of the reason why the codegen is bad is that the initializer for the thread local is non-trivial. Have you tried making the thread local pointers initialize to nullptr and then setting them up in the TLS callback like in the raw TLS slot version?

JustasMasiulis avatar May 03 '25 01:05 JustasMasiulis

I pushed an update where we always use the second user reserved slot but test at runtime if we indeed can allocate that slot. If we are initialized first (as mimalloc should) this should always hold. This seems a quite nice solution for mimalloc's case but we should test more.

Neat, clever!

while without using a fixed TLS slot, we get much worse code (with msvc on Windows):

Curious. I tried with a simple thread_local variable on godbolt, and accessing that was about the same as accessing the TEB. (But quite possibly my experiment was too simple.)

res2k avatar May 03 '25 01:05 res2k

Curious. I tried with a simple thread_local variable on godbolt, and accessing that was about the same as accessing the TEB. (But quite possibly my experiment was too simple.)

Hmm... looks like if you have an extern thread_local variable, MSVC plays it safe and emits the dynamic TLS initialization... but if the definition is not extern, there seems to be a chance MSVC omits that call...

res2k avatar May 04 '25 13:05 res2k

So I tried to avoid extern thread_local:

diff --git a/include/mimalloc/internal.h b/include/mimalloc/internal.h
index 9c08515e..082edffc 100644
--- a/include/mimalloc/internal.h
+++ b/include/mimalloc/internal.h
@@ -26,7 +26,7 @@ terms of the MIT license. A copy of the license can be found in the file
 #define mi_decl_noinline        __declspec(noinline)
 #define mi_decl_thread          __declspec(thread)
 #define mi_decl_align(a)        __declspec(align(a))
-#define mi_decl_weak
+#define mi_decl_weak            __declspec(selectany)
 #define mi_decl_hidden
 #elif (defined(__GNUC__) && (__GNUC__ >= 3)) || defined(__clang__) // includes clang and icc
 #define mi_decl_noinline        __attribute__((noinline))
diff --git a/include/mimalloc/prim.h b/include/mimalloc/prim.h
index 7fbeef2a..ec4c2298 100644
--- a/include/mimalloc/prim.h
+++ b/include/mimalloc/prim.h
@@ -270,7 +270,11 @@ static inline void mi_prim_tls_slot_set(size_t slot, void* value) mi_attr_noexce
 
 
 // defined in `init.c`; do not use these directly
+#if defined(_MSC_VER)
+mi_decl_weak mi_decl_thread mi_heap_t* _mi_heap_default = (mi_heap_t*)&_mi_heap_empty;
+#else
 extern mi_decl_hidden mi_decl_thread mi_heap_t* _mi_heap_default;  // default heap to allocate from
+#endif
 extern mi_decl_hidden bool _mi_process_is_initialized;             // has mi_process_init been called?
 
 static inline mi_threadid_t __mi_prim_thread_id(void) mi_attr_noexcept;
diff --git a/src/init.c b/src/init.c
index 892f4988..75dbe5c0 100644
--- a/src/init.c
+++ b/src/init.c
@@ -183,8 +183,10 @@ mi_threadid_t _mi_thread_id(void) mi_attr_noexcept {
   return _mi_prim_thread_id();
 }

+#ifndef _MSC_VER
 // the thread-local default heap for allocation
 mi_decl_thread mi_heap_t* _mi_heap_default = (mi_heap_t*)&_mi_heap_empty;
+#endif


 bool _mi_process_is_initialized = false;  // set to `true` in `mi_process_init`.

The generated assembly seems much nicer, closer to the fixed TLS slot case:

;	COMDAT mi_mallocn
_TEXT	SEGMENT
count$ = 8
size$ = 16
mi_mallocn PROC						; COMDAT
; File D:\sources\mimalloc\include\mimalloc\prim.h

; 425  :   return _mi_heap_default;

	mov	r8d, DWORD PTR _tls_index
	mov	rax, QWORD PTR gs:88
	mov	r9d, OFFSET FLAT:?_mi_heap_default@@3PEAUmi_heap_s@@EA ; _mi_heap_default
	mov	rax, QWORD PTR [rax+r8*8]
; File D:\sources\mimalloc\src\alloc.c

; 243  :   return mi_heap_mallocn(mi_prim_get_default_heap(),count,size);

	mov	r8, rdx
	mov	rdx, rcx
	mov	rcx, QWORD PTR [r9+rax]
	jmp	mi_heap_mallocn
mi_mallocn ENDP
_TEXT	ENDS

res2k avatar May 04 '25 13:05 res2k

Another detail: the "initialize on demand" stuff is not emitted by MSVC when compiling in C mode. Makes sense: dynamic variable initializers are not allowed in C, so no need to (possibly) account for them.

Conversely, it's possible that extern thread_local also causes code for "initialize on demand" being emitted on other compilers when compiling in C++ mode. Didn't test that, though.

res2k avatar May 04 '25 18:05 res2k

Hi @res2k: ah I was a bit too enthusiastic -- it looks like a great trick, very clever use of selectany! However, the pipeline failed and after more testing on my local system I now think it does not work; any fresh thread is not getting the right initialization from the COMDAT section. Not sure how that would work and I think the selectany may just not work with thread_local. To make it work, we need to initialize it before any allocations... and if we start doing this I feel the TLS slot technique is actually a better solution. For now, I will revert the selectany trick unfortunately :-(

daanx avatar May 05 '25 03:05 daanx

Hi @res2k: ah I was a bit too enthusiastic -- it looks like a great trick, very clever use of selectany! However, the pipeline failed and after more testing on my local system I now think it does not work; any fresh thread is not getting the right initialization from the COMDAT section.

Ah, too bad. I guess that what happens if you look at generated assembly one and do too little testing 😅

But, I agree that if it could be made work, it would be a nice default - in spitting distance of the best solution (fixed TLS slot), but quite a bit more robust (especially if the desired TLS slot is not available). So, I'll think I'll try to look into this again sometime in the near future.

res2k avatar May 05 '25 12:05 res2k

Pipeline failures with selectany, for reference: 258.txt 265.txt

res2k avatar May 05 '25 12:05 res2k

in spitting distance of the best solution (fixed TLS slot), but quite a bit more robust (especially if the desired TLS slot is not available). So, I'll think I'll try to look into this again sometime in the near future.

I did make the TLS slot solution a tad less efficient by doing an extra load, but now it can use any slot returned by TlsAlloc which makes it much more robust (commit 52b7569). The one remaining challenge is that we need to be able to allocate this slot before any allocation happened and thus hook real early into the initialization process. However, it seems we can nowadays do this reliably on Windows with both static and dynamic overriding so I think the MI_WIN_USE_FIXED_TLS option is becoming quite safe to use. I added this to the pipeline testing now as well.

daanx avatar May 05 '25 14:05 daanx

I did make the TLS slot solution a tad less efficient by doing an extra load, but now it can use any slot returned by TlsAlloc which makes it much more robust (commit 52b7569).

Sounds good. From that description, the generated code should be about the same as when using thread_local, or?

res2k avatar May 05 '25 17:05 res2k

This would be my concern as well, although I don't think it's a common thing to have 2 independent versions of a custom allocator in the same process.

Just for reference: This does happen for us. We embed cpython which is using mimalloc as allocator. At the same time we are also using mimalloc as our allocator as well.

maxbachmann avatar May 07 '25 08:05 maxbachmann

This would be my concern as well, although I don't think it's a common thing to have 2 independent versions of a custom allocator in the same process.

Just for reference: This does happen for us. We embed cpython which is using mimalloc as allocator. At the same time we are also using mimalloc as our allocator as well.

Great -- the latest changes make the slot allocation dynamic (at the cost of one more load) which accomodates having multiple instances of mimalloc active.

daanx avatar May 14 '25 01:05 daanx

Great -- the latest changes make the slot allocation dynamic (at the cost of one more load) which accomodates having multiple instances of mimalloc active.

Right now we are mostly doing this on unix systems. I assume there the pthread_key_create was always allowing for multiple instances right?

maxbachmann avatar May 14 '25 07:05 maxbachmann

@maxbachmann : yes, on unix etc. there is no issue for multiple instances either;

daanx avatar Jun 06 '25 23:06 daanx