Using mimalloc as system allocator
Is mimalloc in a state where I could use it as the system allocator?
Haiku is looking at options to replace the current hoard2 allocator in its libc equivalent (libroot). I've given it a try, but I might be passing the wrong defines/flags to make it work correctly, as I'm running into page faults. Should I be using a 1.x branch instead?
Ah very cool! It should in principle just work out-of-the-box -- if you get faults that usually indicates trouble with getting the right thread id, see mimalloc-internal.h:_mi_get_threadid(). Ensure you are getting into the "generic" path using a thread-local variable (maybe? maybe we should use the right TLS slot for Haiku instead?)
Also, try the latest dev-slice branch first (that is the latest v2). Of course, also try v1 if you can, but that should not make any difference with regard to page faults (or I would be quite surprised :-) ).
If this still does not work, I need more info about the crash -- maybe it is a startup issue as libc may not yet be loaded? or a stack overflow? (this can happen when accessing a thread-local variable for the first time causes the C runtime to execute which on some OS's (macOSX) needs to allocate, entering mimalloc again, which accesses the same thread local, etc etc).
Ah, I see. My assembly-fu is a bit sub-par. From what I can tell, it's the x86_64 Linux case in mi_tls_slot, and the Bionic case in mi_thread_id.
https://github.com/haiku/haiku/blob/master/src/system/libroot/os/arch/x86_64/thread.cpp#L12.
Well, after disabling over-commit, which Haiku doesn't use, and adding the thread local support, I can now boot Haiku, but I run into some weird memory corruption issues in a couple odd places. Would a debug report from Haiku be useful? It happens in OpenSSL tear-down on process exit (__cxa_finalize) in strcasecmp. Although this seems to be OpenSSL specific? Also a weird issue when mounting a partition, it ends up with a corrupt path there. I wonder if using secure mode could help diagnose? Be interesting if these are actual bugs in Haiku...
It's hopefully not a link order issue, as it's linked directly into libroot, Haiku's equivalent of libc+libm+libdl+libpthread all in a single system library.
Great! Yes, use the BIONIC case in thread_id which uses the fs:8 slot like in the code you linked. (we should probably add this case anyways for Haiku)
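To make the two options concrete, here is a rough sketch of a thread id lookup with a Haiku x86_64 branch and a generic fallback. The function name `sketch_thread_id` and the `tls_anchor` variable are made up for illustration and are not mimalloc's actual code; the `%fs:8` read is an assumption based on the linked thread.cpp, and that branch is only compiled on Haiku.

```c
#include <stdint.h>

// Any thread-local object works for the generic path: its address is
// unique among the threads that are alive at the same time.
static _Thread_local char tls_anchor;

// Hypothetical helper, not mimalloc's real _mi_thread_id().
static inline uintptr_t sketch_thread_id(void) {
#if defined(__HAIKU__) && defined(__x86_64__)
    // Assumption: as in the linked thread.cpp, %fs:8 holds the thread id.
    uintptr_t id;
    __asm__("movq %%fs:8, %0" : "=r"(id));
    return id;
#else
    // Generic path: the address of a thread-local variable serves as the id.
    return (uintptr_t)&tls_anchor;
#endif
}
```

The generic path costs one thread-local address computation per call; the slot read avoids even that, which is presumably why a dedicated Haiku case would be worth adding.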
Being able to boot is already great -- the best way to uncover bugs is just compiling mimalloc in debug mode, setting MI_DEBUG=3 (or even using cmake -DMI_DEBUG_FULL=ON, which sets MI_DEBUG=4). The 4 setting is a bit over the top as it also checks all of mimalloc's internal structures, which may be too slow.
Debug 3 checks for memory corruption and heap block overflows, so it may uncover bugs. How will you see the assertion failure when booting, though? :-) (By the way, secure mode works as well, but debug mode does everything secure mode does and more, and may give more information.)
Let me know how it goes.
(Also, if you can run valgrind, that is a great way to uncover bugs, but I guess it won't work for the system allocator.)
A few other questions :)
Haiku has a couple before/after style hooks that mimalloc could hook into (that the current hoard2 allocator makes use of):
- __init_heap
- __heap_terminate_after
- __heap_before_fork
- __heap_after_fork_child
- __heap_after_fork_parent
- __heap_thread_init
- __heap_thread_exit
It seems like using some of these might be a good idea, to avoid what looks like a possibility of reentrant malloc calls within mimalloc?
Also, for TLS, I could easily add a new reserved TLS key at https://github.com/haiku/haiku/blob/master/headers/private/system/tls.h#L15. And possibly directly invoke thread init from the __heap_thread_init hook?
- hooks: that looks useful; you can always call mi_process_init from __init_heap, and mi_thread_init and mi_thread_done from the __heap_thread_init/exit hooks, I think. If those hooks are really called as early as possible, before any allocation can happen in the process or thread, that would be ideal. Moreover, in the init.c file you can see that we usually set up POSIX thread-local variables (pthread_key_t) just to be able to hook into thread exit, but if Haiku supplies this hook already we can avoid this :-).
- TLS: there are two parts to this: 1) we need to get a unique thread id quickly, but this is already possible in Haiku (in TLS slot 1). But 2) we need to get the heap belonging to a thread quickly for allocation: mi_malloc is basically mi_heap_malloc(mi_get_default_heap(), size). On Windows/Linux we use a regular thread-local variable initialized with &_mi_heap_empty -- this way, in the fast path we never have to check whether to initialize the heap, as the empty heap will trigger the generic allocation path (which will initialize the heap if needed). (See mimalloc-internal.h for the definition of mi_get_default_heap.)
Anyway, on other platforms without fast thread-local support, we often use a TLS slot (for example, on macOSX), but in that case we always need to test for the case where it is NULL, which has a small performance impact. Now, if __heap_thread_init is called before any allocation can happen, we can just initialize a reserved TLS slot with &_mi_heap_empty and avoid this. That would be perfect and indeed avoids any re-entrancy as well.
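The empty-heap trick can be sketched roughly as follows. This is a toy bump allocator, not mimalloc's data structures: `sketch_heap_t`, `heap_empty`, `default_heap`, and the `sketch_*` functions are all hypothetical names, a `_Thread_local` pointer stands in for the reserved TLS slot, and `sketch_thread_init_hook` plays the role of __heap_thread_init. It never frees anything.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

// Hypothetical miniature heap: bump allocation out of one block.
typedef struct sketch_heap {
    char*  next;       // next free byte in the current block
    size_t available;  // bytes left in the current block
} sketch_heap_t;

// Shared sentinel with nothing available; it is never written to.
static sketch_heap_t heap_empty = { NULL, 0 };

// Stand-in for the reserved TLS slot.
static _Thread_local sketch_heap_t* default_heap;

// Assumed to run before the thread can allocate (like __heap_thread_init).
void sketch_thread_init_hook(void) {
    default_heap = &heap_empty;
}

// Slow path: creates the real heap on first use, then refills it.
static void* sketch_malloc_generic(size_t size) {
    sketch_heap_t* heap = default_heap;
    if (heap == &heap_empty) {
        heap = malloc(sizeof *heap);
        if (heap == NULL) return NULL;
        heap->next = NULL;
        heap->available = 0;
        default_heap = heap;
    }
    size_t block = size > 4096 ? size : 4096;
    char* p = malloc(block);
    if (p == NULL) return NULL;
    heap->next = p + size;
    heap->available = block - size;
    return p;
}

void* sketch_malloc(size_t size) {
    if (size > SIZE_MAX - 15) return NULL;
    size = (size + 15) & ~(size_t)15;     // keep 16-byte alignment
    sketch_heap_t* heap = default_heap;   // no NULL check needed
    if (size < heap->available) {         // always false for heap_empty
        void* p = heap->next;
        heap->next += size;
        heap->available -= size;
        return p;
    }
    return sketch_malloc_generic(size);
}
```

The point is that the fast path in `sketch_malloc` dereferences the slot unconditionally: the sentinel fails the capacity check for every size, so initialization only ever happens in the slow path.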
This is all tricky -- let me know if you need more help; Thanks!
I had been looking into switching Haiku to using an allocator derived from OpenBSD's malloc, but after more testing it appears that mimalloc actually has lower overall memory usage (and in some cases, by a significant amount). So I'm looking into ways to integrate it more closely with the OS. @X547 did some work here to make use of the hook functions described, and integrating those further seems to be straightforward enough.
The thing I am spending more time investigating is whether we can make mimalloc use Haiku-native memory primitives rather than mmap, as these can be more efficient if used properly. Specifically:
- resize_area: can be used to extend or shrink an area (all mmap calls create such "areas"; they're the basic unit of virtual memory on Haiku). This is much cheaper than doing another mmap call (because that would create a whole separate area internally, with extra bookkeeping structures, etc.). We have to know the area's ID (this can be fetched or cached based on the virtual address, which is not hard to do in glue code).
- reserve_address_range: reserves virtual addresses, but not memory (trying to call mprotect on it won't work; it's purely for reserving virtual addresses). The kernel will then avoid putting anything in it unless MAP_FIXED or equivalent is used (there's another method that's equivalent to MAP_FIXED | MAP_NOREPLACE, which is preferred), until the virtual address space starts getting full, at which point mappings might get placed inside it even without MAP_FIXED. (Not really an issue on 64-bit, but it does happen on 32-bit.)
If a whole "area" has the exact same protections, the kernel can use fast paths on fork(), resize, etc. and not allocate/manipulate a per-page protections/permissions array, so we get a larger speedup there too.
A basic glance at area usage of app_server (Haiku's window manager + graphics server) gives this:
ID name address size alloc. #-cow #-in #-out
28653 libroot.so mmap area 0x6126000000 40000000 17d5000 0 0 0
28654 libroot.so mmap area 0x958fb8f000 1000 1000 0 0 0
28696 libroot.so mmap area 0xa023ad4000 2000 2000 0 0 0
28873 libroot.so mmap area 0xa7406fb000 2000 2000 0 0 0
29372 libroot.so mmap area 0x1472e545000 2000 2000 0 0 0
29771 libroot.so mmap area 0x19da06f6000 2000 2000 0 0 0
29775 libroot.so mmap area 0x19da0ae9000 2000 2000 0 0 0
29777 libroot.so mmap area 0x19da10a4000 2000 2000 0 0 0
29778 libroot.so mmap area 0x19da118f000 2000 2000 0 0 0
29780 libroot.so mmap area 0x19da1453000 2000 2000 0 0 0
29803 libroot.so mmap area 0x1a5a7cbc000 2000 2000 0 0 0
29807 libroot.so mmap area 0x1a5b2015000 2000 2000 0 0 0
29811 libroot.so mmap area 0x1a66968b000 2000 2000 0 0 0
29851 libroot.so mmap area 0x1a693ef6000 2000 2000 0 0 0
29972 libroot.so mmap area 0x1a6adfb3000 2000 2000 0 0 0
30053 libroot.so mmap area 0x1a6ae219000 2000 2000 0 0 0
30061 libroot.so mmap area 0x1a6ae22b000 2000 2000 0 0 0
30376 libroot.so mmap area 0x1a745ac5000 2000 2000 0 0 0
30383 libroot.so mmap area 0x1a7537d2000 2000 2000 0 0 0
30386 libroot.so mmap area 0x1a75f10f000 2000 2000 0 0 0
30389 libroot.so mmap area 0x1a76218f000 2000 2000 0 0 0
31130 libroot.so mmap area 0x1a76dbde000 2000 2000 0 0 0
31133 libroot.so mmap area 0x1a76dd69000 2000 2000 0 0 0
32466 libroot.so mmap area 0x1b14d7a6000 2000 2000 0 0 0
32469 libroot.so mmap area 0x1b668475000 2000 2000 0 0 0
32473 libroot.so mmap area 0x1b700917000 2000 2000 0 0 0
35346 libroot.so mmap area 0x1ce21287000 2000 2000 0 0 0
37774 libroot.so mmap area 0x1ce254a5000 2000 2000 0 0 0
38155 libroot.so mmap area 0x1ce25590000 2000 2000 0 0 0
38769 libroot.so mmap area 0x1d0f3691000 2000 2000 0 0 0
39644 libroot.so mmap area 0x1d229767000 2000 2000 0 0 0
39671 libroot.so mmap area 0x1d229f77000 2000 2000 0 0 0
47121 libroot.so mmap area 0x614c3fbd000 2000 2000 0 0 0
Nothing else in libroot.so calls mmap here except mimalloc, so all these are areas mimalloc requested.
The first area is the most interesting: I guess that's most of the heap, with 1GB reserved and ~24MB of pages actually mapped in. I did not yet check how contiguous that is; if it's all mostly contiguous, then writing a wrapper that uses resize_area to commit should be easy. If it's not, then that may prove more difficult. (The other two-page areas seem easy enough to coalesce.)
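For anyone checking the numbers, here is a small sketch that converts the listing's values, assuming the "size" and "alloc." columns are hexadecimal byte counts and that pages are 4 KiB. The helper names are made up for this example.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define SKETCH_PAGE_SIZE UINT64_C(4096)  // assumption: 4 KiB pages

// Number of whole pages covered by a byte count.
static uint64_t sketch_pages(uint64_t bytes) {
    return bytes / SKETCH_PAGE_SIZE;
}

// Byte count expressed in MiB.
static double sketch_mib(uint64_t bytes) {
    return (double)bytes / (1024.0 * 1024.0);
}

// Prints one value from the listing in bytes, pages, and MiB.
void sketch_print_area(const char* label, uint64_t bytes) {
    printf("%-10s %12" PRIu64 " bytes  %7" PRIu64 " pages  %8.2f MiB\n",
           label, bytes, sketch_pages(bytes), sketch_mib(bytes));
}
```

Under those assumptions, 0x40000000 is 262144 pages (1024 MiB), 0x17d5000 is 6101 pages (about 23.8 MiB), and each 0x2000 area is 2 pages.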
For the OpenBSD malloc, I came up with a process-global caching strategy (since OpenBSD malloc doesn't actually have one) that manages virtual address reservations, committing, decommitting, etc. which boils down to three methods:
- __allocate_pages(size)
- __allocate_pages_at(address, size) (is likely to fail unless the address is just beyond the end of an existing area/mapping)
- __free_pages(address, size)
The sizes must be page-aligned, but otherwise the API has no restrictions: you can free a block of pages that is smaller (or larger) than what came from a single allocate call, etc.
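To illustrate the shape of that interface, here is a minimal sketch built on plain POSIX mmap/munmap. It is not the Haiku implementation and has no caching or address reservation; the `sketch_` names are stand-ins for the three methods, and the "at" variant simply treats the address as a hint and fails if the kernel places the mapping elsewhere.

```c
#define _DEFAULT_SOURCE  // for MAP_ANONYMOUS under strict C modes on glibc
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

// True if size is a multiple of the system page size.
static int sketch_is_page_aligned(size_t size) {
    return size % (size_t)sysconf(_SC_PAGESIZE) == 0;
}

// Maps `size` bytes anywhere; returns NULL on failure.
void* sketch_allocate_pages(size_t size) {
    assert(sketch_is_page_aligned(size));
    void* p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

// Tries to map `size` bytes exactly at `address`; returns 0 on success.
// Without MAP_FIXED the address is only a hint, so an occupied range
// yields a mapping somewhere else, which is undone and reported as failure.
int sketch_allocate_pages_at(void* address, size_t size) {
    assert(sketch_is_page_aligned(size));
    void* p = mmap(address, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return -1;
    if (p != address) {
        munmap(p, size);
        return -1;
    }
    return 0;
}

// Unmaps any page-aligned range, independent of how it was allocated.
int sketch_free_pages(void* address, size_t size) {
    assert(sketch_is_page_aligned(size));
    return munmap(address, size);
}
```

Because munmap works on arbitrary page ranges, freeing the middle two pages of a four-page allocation leaves the outer pages usable, which matches the "no restrictions" property described above.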
If I tried to use an API like this as the backing to "prim.c", how well would that go? It looks like "commit" methods always take a specific address, but how often does that address actually lie just beyond the allocated region, or within a previously-allocated-but-later-decommitted block (and not somewhere random further up?)
Aha, I just noticed the WASI prim.c which uses an allocation strategy more or less like this. So I suppose this should indeed work, at least in theory.
I did some more benchmarking, and it appears that while mimalloc is indeed faster than OpenBSD malloc + my caching strategy and does have lower memory use, in most cases with the basic set of applications (including web browsers) the memory difference is a few MB at most. The performance differences are a bit more significant (10% seems to be not uncommon), though.
I think in the end I'm going to move forward with OpenBSD malloc for now and put mimalloc adoption on the back burner. It might be good for us to implement it anyway and give ourselves more options (especially for larger applications that are more sensitive to malloc performance).
Hi @waddlesplash -- ha, you are moving fast ;-) The idea is that the new prim.h interface should allow this well -- thanks for the overview of the primitives on Haiku! If you have issues with this, we can also meet over Skype or something to discuss it a bit, if you happen to restart working on this. Also, did you try the latest dev3? This version generally has a smaller memory footprint than before, with fine-grained purging and commit.
Best, Daan
I didn't try with dev3, no. We've been burned a bit before in our experiment with using "rpmalloc" as the default allocator (too much "waste" of virtual/physical memory caused OOMs on lower-end systems), so even if dev3 does improve things even more here, that's not really a help to me at the moment as I'd want to wait until it was "stable" and already in production use before we moved to adopt it. But I will surely keep an eye out to see how things progress there.
Thanks for all your work on mimalloc!