mimalloc
mimalloc copied to clipboard
Segmentation fault on Apple Silicon / arm64
Building mimalloc fails in the TLS initialization with a segmentation fault:
% lldb ./mimalloc-test-api
(lldb) target create "./mimalloc-test-api"
Current executable set to '/Users/uwe/Development/mimalloc/build/mimalloc-test-api' (arm64).
(lldb) run
Process 17703 launched: '/Users/uwe/Development/mimalloc/build/mimalloc-test-api' (arm64)
mimalloc-test-api was compiled with optimization - stepping may behave oddly; variables may not be available.
Process 17703 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
frame #0: 0x000000010001f264 mimalloc-test-api`_mi_process_init [inlined] mi_tls_slot(slot=0) at mimalloc-internal.h:711:9 [opt]
708 #elif defined(__aarch64__)
709 void** tcb; UNUSED(ofs);
710 __asm__ volatile ("mrs %0, tpidr_el0" : "=r" (tcb));
-> 711 res = tcb[slot];
712 #endif
713 return res;
714 }
Target 0: (mimalloc-test-api) stopped.
Out of curiosity, would the dev branch still fails ?
Fails with exactly the same issue:
Process 18154 launched: '/Users/uwe/Development/mimalloc/build/mimalloc-test-api' (arm64)
mimalloc-test-api was compiled with optimization - stepping may behave oddly; variables may not be available.
Process 18154 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
frame #0: 0x000000010001efa0 mimalloc-test-api`_mi_process_init [inlined] mi_tls_slot(slot=0) at mimalloc-internal.h:712:9 [opt]
709 #elif defined(__aarch64__)
710 void** tcb; UNUSED(ofs);
711 __asm__ volatile ("mrs %0, tpidr_el0" : "=r" (tcb));
-> 712 res = tcb[slot];
713 #endif
714 return res;
715 }
Target 0: (mimalloc-test-api) stopped.
(lldb) p slot
(size_t) $0 = 0
(lldb) p tcb
(void **) $1 = 0x0000000000000000
weird because the assembly is supposed to be different.
weird because the assembly is supposed to be different.
As far as I see in the diff to master, some of the x86 assembly has changed but the aarch64 assembly stayed the same. As the x86_64 code has special casings for __MACH__, I guess we need that for __MACH__ && __aarch64__, too.
Right ... what happens if you use this change :
uintptr_t tcb; UNUSED(ofs);UNUSED(slot); __asm__ volatile ("mrs %0, tpidr_el0" : "=r" (tcb)); res = tcb;
Just theorical, I wish I had a M1 to test for real :-)
Seems to pass and now faults in mi_tls_slot_set where we have the near exact same code path.
Yes it would be similar-ish solution if it works like :
uintptr_t tcb; UNUSED(ofs);UNUSED(slot) __asm__ volatile ("mrs %0, tpidr_el0" : "=r" (tcb)); tcb = (uintptr_t)value;
Thanks for your patience :-)
With MI_MALLOC_OVERRIDE=OFF, ./mimalloc-test-api now passes.
With MI_MALLOC_OVERRIDE=ON, we now get to:
(lldb) run
Process 27086 launched: '/Users/uwe/Development/mimalloc/build/mimalloc-test-api' (arm64)
Process 27086 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0xb0)
frame #0: 0x0000000180cbc6cc libsystem_platform.dylib`_platform_memset + 108
libsystem_platform.dylib`_platform_memset:
-> 0x180cbc6cc <+108>: stp q0, q0, [x0]
0x180cbc6d0 <+112>: stp q0, q0, [x0, #0x20]
0x180cbc6d4 <+116>: add x3, x0, #0x40 ; =0x40
0x180cbc6d8 <+120>: and x3, x3, #0xffffffffffffffc0
Target 0: (mimalloc-test-api) stopped.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0xb0)
* frame #0: 0x0000000180cbc6cc libsystem_platform.dylib`_platform_memset + 108
frame #1: 0x0000000100008038 mimalloc-test-api`mi_stats_reset at stats.c:359:35 [opt]
frame #2: 0x000000010001ec2c mimalloc-test-api`mi_process_init at init.c:476:3 [opt]
frame #3: 0x000000010007377c dyld`ImageLoaderMachO::doModInitFunctions(ImageLoader::LinkContext const&) + 468
frame #4: 0x0000000100073b94 dyld`ImageLoaderMachO::doInitialization(ImageLoader::LinkContext const&) + 56
frame #5: 0x000000010006d84c dyld`ImageLoader::recursiveInitialization(ImageLoader::LinkContext const&, unsigned int, char const*, ImageLoader::InitializerTimingList&, ImageLoader::UninitedUpwards&) + 620
frame #6: 0x000000010006b300 dyld`ImageLoader::processInitializers(ImageLoader::LinkContext const&, unsigned int, ImageLoader::InitializerTimingList&, ImageLoader::UninitedUpwards&) + 192
frame #7: 0x000000010006b3cc dyld`ImageLoader::runInitializers(ImageLoader::LinkContext const&, ImageLoader::InitializerTimingList&) + 96
frame #8: 0x000000010005684c dyld`dyld::initializeMainExecutable() + 220
frame #9: 0x000000010005cb98 dyld`dyld::_main(macho_header const*, unsigned long, int, char const**, char const**, char const**, unsigned long*) + 7388
frame #10: 0x0000000100055258 dyld`dyldbootstrap::start(dyld3::MachOLoaded const*, int, char const**, dyld3::MachOLoaded const*, unsigned long*) + 476
frame #11: 0x0000000100055038 dyld`_dyld_start + 56
Too bad the changes work on Linux arm64 maybe the M1 gear needs a specific code path in this case. Just an idea, what happens if you comment line 298 in include/mimalloc-internal.h
Just wanted to see while mimalloc uses the thread slot maybe that s not such good thing for the M1 case. Maybe I m wrong tough, lot of maybe at the moment :-)
When I comment line 298, the mimalloc-api-test passes but the mimalloc-test-stress still fails which seems to be related to the commenting of line 298.
(lldb) run
Process 30829 launched: '/Users/uwe/Development/mimalloc/build/mimalloc-test-stress' (arm64)
Process 30829 stopped
* thread #1, stop reason = signal SIGABRT
frame #0: 0x0000000180c43cec libsystem_kernel.dylib`__pthread_kill + 8
libsystem_kernel.dylib`__pthread_kill:
-> 0x180c43cec <+8>: b.lo 0x180c43d0c ; <+40>
0x180c43cf0 <+12>: pacibsp
0x180c43cf4 <+16>: stp x29, x30, [sp, #-0x10]!
0x180c43cf8 <+20>: mov x29, sp
Target 0: (mimalloc-test-stress) stopped.
(lldb) bt
* thread #1, stop reason = signal SIGABRT
* frame #0: 0x0000000180c43cec libsystem_kernel.dylib`__pthread_kill + 8
frame #1: 0x0000000180c74c24 libsystem_pthread.dylib`pthread_kill + 292
frame #2: 0x0000000180bbc864 libsystem_c.dylib`abort + 104
frame #3: 0x0000000180c93d58 libdyld.dylib`_tlv_bootstrap + 16
frame #4: 0x000000010012b208 libmimalloc-debug.1.6.dylib`mi_calloc [inlined] mi_get_default_heap at mimalloc-internal.h:346:10 [opt]
frame #5: 0x000000010012b1f8 libmimalloc-debug.1.6.dylib`mi_calloc(count=<unavailable>, size=80) at alloc.c:584 [opt]
frame #6: 0x000000018a2b5a7c libkeymgr.dylib`dwarf2_unwind_dyld_add_image_hook + 40
frame #7: 0x0000000100018d94 dyld`dyld::registerAddCallback(void (*)(mach_header const*, long)) + 400
frame #8: 0x0000000180c7b1c8 libdyld.dylib`_dyld_register_func_for_add_image + 132
frame #9: 0x000000018a2b5798 libkeymgr.dylib`__keymgr_initializer + 32
frame #10: 0x000000018a2c67d0 libSystem.B.dylib`libSystem_initializer + 196
frame #11: 0x000000010003390c dyld`ImageLoaderMachO::doModInitFunctions(ImageLoader::LinkContext const&) + 868
frame #12: 0x0000000100033b94 dyld`ImageLoaderMachO::doInitialization(ImageLoader::LinkContext const&) + 56
frame #13: 0x000000010002d84c dyld`ImageLoader::recursiveInitialization(ImageLoader::LinkContext const&, unsigned int, char const*, ImageLoader::InitializerTimingList&, ImageLoader::UninitedUpwards&) + 620
frame #14: 0x000000010002d794 dyld`ImageLoader::recursiveInitialization(ImageLoader::LinkContext const&, unsigned int, char const*, ImageLoader::InitializerTimingList&, ImageLoader::UninitedUpwards&) + 436
frame #15: 0x000000010002d794 dyld`ImageLoader::recursiveInitialization(ImageLoader::LinkContext const&, unsigned int, char const*, ImageLoader::InitializerTimingList&, ImageLoader::UninitedUpwards&) + 436
frame #16: 0x000000010002b300 dyld`ImageLoader::processInitializers(ImageLoader::LinkContext const&, unsigned int, ImageLoader::InitializerTimingList&, ImageLoader::UninitedUpwards&) + 192
frame #17: 0x000000010002b3cc dyld`ImageLoader::runInitializers(ImageLoader::LinkContext const&, ImageLoader::InitializerTimingList&) + 96
frame #18: 0x000000010001684c dyld`dyld::initializeMainExecutable() + 220
frame #19: 0x000000010001cb98 dyld`dyld::_main(macho_header const*, unsigned long, int, char const**, char const**, char const**, unsigned long*) + 7388
frame #20: 0x0000000100015258 dyld`dyldbootstrap::start(dyld3::MachOLoaded const*, int, char const**, dyld3::MachOLoaded const*, unsigned long*) + 476
frame #21: 0x0000000100015038 dyld`_dyld_start + 56
Right so maybe M1 needs a specific code path (specific register like x86 case ?). I see on other projects issues popping up about M1 here and there seems to be really a particular animal.
Hi @xhochy , lucky you with an M1 :-) Thanks @devnexen for looking into this.
Unfortunately, there have been various issues on macOS and it may be related to that and not directly to the M1; I currently have a mac mini to develop on so I will first go through previous issues and then get back to this. The main trouble on macOS is that the loader itself (dyld_ImageLoader) calls malloc to allocate memory for thread local storage -- causing an infinite loop if any thread-local storage is accessed. Currently, we avoid this by using a dedicated thread slot but that is not guaranteed to work on future OS's and may be part of the problem -- maybe you can try another slot number? a few good ones are in the comments in mimalloc-internal.h.
If you get to it, can you submit the assembly changes together with the correct pre-processor test for the M1?
Hope we can get this working! (I tested mimalloc extensively on aarch64 on Linux, including long TSAN tests and there was recent work with model-checking the atomics so pretty sure that part is working ok)
@xhochy, I just pushed an update for macOS ; can you try again with the latest dev ? (may not make a difference but then we know at least it is not a bug in the malloc zones overriding)
@xhochy, I just pushed an update for macOS ; can you try again with the latest
dev? (may not make a difference but then we know at least it is not a bug in the malloc zones overriding)
Latest dev didn't make a difference. Exactly the same behaviour as before.
Currently, we avoid this by using a dedicated thread slot but that is not guaranteed to work on future OS's and may be part of the problem -- maybe you can try another slot number?
Also didn't make any difference, with or without the suggestions by @devnexen.
In general this looks like the same issue as people are facing on iOS (see https://github.com/microsoft/mimalloc/issues/262 and I tried to use the fixes suggested there.
Thus I defined MI_TLS_RECURSE_GUARD in my C(XX)FLAGS and changed the tcb acess to happen when tcb is not-NULL. Also I commented out MI_TLS_SLOT.
This then leads to a slight different error:
(lldb) run
Process 61782 launched: '/Users/uwe/Development/mimalloc/build/mimalloc-test-stress' (arm64)
objc[61782]: realized class 0x1e8e712c8 has corrupt data pointer 0x40000040090
Process 61782 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
frame #0: 0x0000000180c64130 libsystem_kernel.dylib`__abort_with_payload + 8
libsystem_kernel.dylib`__abort_with_payload:
-> 0x180c64130 <+8>: b.lo 0x180c64150 ; <+40>
0x180c64134 <+12>: pacibsp
0x180c64138 <+16>: stp x29, x30, [sp, #-0x10]!
0x180c6413c <+20>: mov x29, sp
Target 0: (mimalloc-test-stress) stopped.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
* frame #0: 0x0000000180c64130 libsystem_kernel.dylib`__abort_with_payload + 8
frame #1: 0x0000000180c66a20 libsystem_kernel.dylib`abort_with_payload_wrapper_internal + 104
frame #2: 0x0000000180c669b8 libsystem_kernel.dylib`abort_with_reason + 32
frame #3: 0x0000000180b311dc libobjc.A.dylib`_objc_fatalv(unsigned long long, unsigned long long, char const*, char*) + 120
frame #4: 0x0000000180b31164 libobjc.A.dylib`_objc_fatal(char const*, ...) + 44
frame #5: 0x0000000180b20f7c libobjc.A.dylib`realizeClassWithoutSwift(objc_class*, objc_class*) + 112
frame #6: 0x0000000180b21048 libobjc.A.dylib`realizeClassWithoutSwift(objc_class*, objc_class*) + 316
frame #7: 0x0000000180b21060 libobjc.A.dylib`realizeClassWithoutSwift(objc_class*, objc_class*) + 340
frame #8: 0x0000000180b21048 libobjc.A.dylib`realizeClassWithoutSwift(objc_class*, objc_class*) + 316
frame #9: 0x0000000180b0b564 libobjc.A.dylib`_read_images + 2684
frame #10: 0x0000000180b0a44c libobjc.A.dylib`map_images_nolock + 2468
frame #11: 0x0000000180b1b770 libobjc.A.dylib`map_images + 92
frame #12: 0x000000010001a16c dyld`dyld::notifyBatchPartial(dyld_image_states, bool, char const* (*)(dyld_image_states, unsigned int, dyld_image_info const*), bool, bool) + 1672
frame #13: 0x000000010001a358 dyld`dyld::registerObjCNotifiers(void (*)(unsigned int, char const* const*, mach_header const* const*), void (*)(char const*, mach_header const*), void (*)(char const*, mach_header const*)) + 80
frame #14: 0x0000000180c7bd88 libdyld.dylib`_dyld_objc_notify_register + 164
frame #15: 0x0000000180b09a08 libobjc.A.dylib`_objc_init + 1384
frame #16: 0x0000000180ac7768 libdispatch.dylib`_os_object_init + 24
frame #17: 0x0000000180ad632c libdispatch.dylib`libdispatch_init + 476
frame #18: 0x000000018a2c67e8 libSystem.B.dylib`libSystem_initializer + 220
frame #19: 0x000000010003390c dyld`ImageLoaderMachO::doModInitFunctions(ImageLoader::LinkContext const&) + 868
frame #20: 0x0000000100033b94 dyld`ImageLoaderMachO::doInitialization(ImageLoader::LinkContext const&) + 56
frame #21: 0x000000010002d84c dyld`ImageLoader::recursiveInitialization(ImageLoader::LinkContext const&, unsigned int, char const*, ImageLoader::InitializerTimingList&, ImageLoader::UninitedUpwards&) + 620
frame #22: 0x000000010002d794 dyld`ImageLoader::recursiveInitialization(ImageLoader::LinkContext const&, unsigned int, char const*, ImageLoader::InitializerTimingList&, ImageLoader::UninitedUpwards&) + 436
frame #23: 0x000000010002d794 dyld`ImageLoader::recursiveInitialization(ImageLoader::LinkContext const&, unsigned int, char const*, ImageLoader::InitializerTimingList&, ImageLoader::UninitedUpwards&) + 436
frame #24: 0x000000010002b300 dyld`ImageLoader::processInitializers(ImageLoader::LinkContext const&, unsigned int, ImageLoader::InitializerTimingList&, ImageLoader::UninitedUpwards&) + 192
frame #25: 0x000000010002b3cc dyld`ImageLoader::runInitializers(ImageLoader::LinkContext const&, ImageLoader::InitializerTimingList&) + 96
frame #26: 0x000000010001684c dyld`dyld::initializeMainExecutable() + 220
frame #27: 0x000000010001cb98 dyld`dyld::_main(macho_header const*, unsigned long, int, char const**, char const**, char const**, unsigned long*) + 7388
frame #28: 0x0000000100015258 dyld`dyldbootstrap::start(dyld3::MachOLoaded const*, int, char const**, dyld3::MachOLoaded const*, unsigned long*) + 476
frame #29: 0x0000000100015038 dyld`_dyld_start + 56
I guess that the fix for the M1 and iOS will be the exact same as they both use similar execution models and the same chip design ;)
I just did a run with an iPhone 6 :

with the actual dev but with these little changes already mentioned

Yes, that also works me for. Only mimalloc-test-stress is failing for me.
Was wondering if, only in the case of darwin ARM devices, we would use, tpidrr0_el0 TID register instead ? tpidr_el0 always return 0 otherwise.
Hmm, I wish I had an M1 to test with; It is always tricky on the mac as the loader calls malloc to initialize thread local state but with the right tcb slot it should be ok. Perhaps the tpidrr0_el0 register should be used as @devnexen suggests?
This is tricky to debug but it would be great if you can find the culprit :-) Without an M1 I available it is hard for me to guess what is going wrong :-(
tpidrro_el0 seems to work fine on macOS, Linux only seems to set tpidr_el0 so I special cased that.
They're completely unrelated registers that are both available. Linux and FreeBSD leave it entirely up to userspace to manage TLS how it sees fit, and so userspace needs a writable register not a read-only one. On macOS/Mach-O TLS used to be more bolted on the side (and came extremely late) unlike the implementation for ELF, and it appears it still is. But aside from that, my guess is that for dubious "security" reasons they decided to have the kernel play a larger part in controlling userspace's TLS and, whilst userspace can read it quickly, writing it needs to go through the kernel (which isn't exactly a hot path, but also there's not much good reason for it, it's just unnecessary overhead, especially since you still need to context switch tpidr_el0 given it's writable, yet if you have userspace use that one you can avoid needing to context switch tpidrro_el0). The existence of both in the Arm architecture is a bit of a wart that should never have happened (and especially not with so similar names), but here we are.
So, both choose which register to use based on their requirements and design. Neither is right or wrong, nor is either a special case, they're just different. Having said that, macOS is the odd one out for the Unixes.
I believe we figured out a way to make this work -- thanks so much! Really nice to see how this could be patched and merged without me having an M1 available -- the power of the internet :-) Thanks again!
@daanx,
I bought an M1 for testing because I contribute to Open Source projects. You are welcomed to use it. Send over your authorized_keys for SSH access (sorry, public key only). My email is noloader, gmail account.
The GCC Compile Farm also has a M1 for testing. You can get an account at https://cfarm.tetaneutral.net/; see the "Request and Account" link at the top of the page. Through the Compile Farm you get access to a lot of machines including ARM64, POWER8, POWER9 and Sparc. OSes include AIX, Linux, OS X, the BSDs and Solaris.
Hi @xhochy -- this has taken awhile but I think I fixed the issue in the latest releases (v1.7.2, v2.0.2).
The issue was two-fold; we need indeed the tldrro_el0 register but it seems we need to clear the bottom 3 bits; moreover, we need to not use interpose anymore; i think other fixes made the zones overriding now work reliably and on the M1 interpose causes some sort of problems now (not sure what). With interpose off, everything seems to work :-)
I did some quick testing and mimalloc is often twice as fast (up to 10x in xmalloc-test) than the system allocator:
# --------------------------------------------------
# benchmark allocator elapsed rss user sys page-faults page-reclaims
cfrac sys 06.13 2208 6.12 0.00 4 304
cfrac xmi 03.61 2464 3.61 0.00 0 334
cfrac je 04.23 2864 4.22 0.00 0 364
espresso sys 05.26 2208 5.25 0.01 12 296
espresso xmi 03.73 4480 3.71 0.01 0 460
espresso je 04.06 3968 4.05 0.01 0 433
xmalloc-testN sys 7.854 49760 29.27 10.52 3 5632
xmalloc-testN xmi 0.601 116096 39.53 0.15 0 7452
xmalloc-testN je 0.794 70272 32.77 1.78 0 4596
mstressN sys 03.24 1061712 4.39 2.10 0 657980
mstressN xmi 01.65 717056 3.28 0.18 0 68190
mstressN je 04.05 732992 3.37 3.49 0 1327397
I tested the latest dev branch (commit 076f81) with -DMI_OSX_ZONE=ON -DMI_INTERPOSE=OFF, and both Release and Debug build worked well.
Environment: Apple M1 (MacBook Air), macOS v11.4
any progress on it?