mimalloc Using mimalloc as system allocator

Is mimalloc in a state where I could use it as the system allocator?

Haiku is looking at options to replace the current hoard2 allocator in its libc equivalent (libroot). I've given it a try, but I might be passing the wrong defines/flags to make it work correctly, as I'm running into page faults. Should I be using a 1.x branch instead?

Nov 12 '22 23:11 jessicah

Ah very cool! It should in principle just work out-of-the-box -- if you get faults that usually indicates trouble with getting the right thread id, see mimalloc-internal.h:_mi_get_threadid(). Ensure you are getting into the "generic" path using a thread-local variable (maybe? maybe we should use the right TLS slot for Haiku instead?)

Also, try the latest dev-slice branch first (that is the latest v2). Of course, also try v1 if you can, but that should not make any difference with regard to page faults (or I would be quite surprised :-) ).

If this still does not work, I need more info about the crash -- maybe it is a startup issue as libc may not yet be loaded? or a stack overflow? (this can happen when accessing a thread-local variable for the first time causes the C runtime to execute which on some OS's (macOSX) needs to allocate, entering mimalloc again, which accesses the same thread local, etc etc).

Nov 22 '22 18:11 daanx

Ah, I see. My assembly-fu is a bit sub-par. From what I can tell, it's the x86_64 Linux case in mi_tls_slot, and the Bionic case in mi_thread_id.

https://github.com/haiku/haiku/blob/master/src/system/libroot/os/arch/x86_64/thread.cpp#L12

Nov 24 '22 00:11 jessicah

Well, after disabling over-commit, which Haiku doesn't use, and adding the thread local support, I can now boot Haiku, but I run into some weird memory corruption issues in a couple odd places. Would a debug report from Haiku be useful? It happens in OpenSSL tear-down on process exit (__cxa_finalize) in strcasecmp. Although this seems to be OpenSSL specific? Also a weird issue when mounting a partition, it ends up with a corrupt path there. I wonder if using secure mode could help diagnose? Be interesting if these are actual bugs in Haiku...

It's hopefully not a link order issue, as it's linked directly into libroot, Haiku's equivalent of libc+libm+libdl+libpthread all in a single system library.

Nov 24 '22 00:11 jessicah

> Ah, I see. My assembly-fu is a bit sub-par. From what I can tell, it's the x86_64 Linux case in mi_tls_slot, and the Bionic case in mi_thread_id.
>
> https://github.com/haiku/haiku/blob/master/src/system/libroot/os/arch/x86_64/thread.cpp#L12

Great! Yes, use the BIONIC case in thread_id which uses the fs:8 slot like in the code you linked. (we should probably add this case anyways for Haiku)
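
To make the slot discussion concrete, here is a minimal sketch of reading a per-thread id straight from a TLS slot. It is not mimalloc's actual code: `sketch_thread_id` is a hypothetical name, the Haiku branch assumes the thread id sits in slot 1 (`%fs:8`) as described above, the Linux branch assumes slot 0 (`%fs:0`, the thread control block's self pointer), and everything else falls back to the address of a thread-local variable.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not a mimalloc function): return a cheap identifier
 * that is unique among the currently running threads. */
static inline uintptr_t sketch_thread_id(void) {
#if defined(__x86_64__) && defined(__HAIKU__)
    /* Assumption from the thread: Haiku keeps the thread id in TLS slot 1,
     * which is at offset 8 from the %fs base on x86_64. */
    uintptr_t id;
    __asm__("movq %%fs:8, %0" : "=r"(id));
    return id;
#elif defined(__x86_64__) && defined(__linux__)
    /* On x86_64 Linux, slot 0 (%fs:0) holds the thread control block's
     * pointer to itself, which is different for every live thread. */
    uintptr_t id;
    __asm__("movq %%fs:0, %0" : "=r"(id));
    return id;
#else
    /* Generic path: the address of a thread-local variable is unique
     * per live thread. */
    static _Thread_local int tls_anchor;
    return (uintptr_t)&tls_anchor;
#endif
}
```

Whichever branch is taken, the value only has to be stable for the lifetime of a thread and distinct between live threads; it does not need to match the OS-level thread id.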

Being able to boot is already great -- the best way to uncover bugs is just compiling mimalloc in debug mode: set MI_DEBUG=3, or even use cmake -DMI_DEBUG_FULL=ON (which sets MI_DEBUG=4). The 4 setting is a bit over the top, as it also checks all of mimalloc's internal structures, which may be too slow.

Debug 3 checks for memory corruption and heap block overflows, so it may uncover bugs. How will you see the assertion failure when booting, though? :-) (Btw, secure mode works as well, but debug mode does everything secure mode does and more, and may give more information.)

Let me know how it goes.

Nov 24 '22 01:11 daanx

(also, if you can run valgrind, that is a great way to uncover bugs, but I guess it won't work for the system allocator)

Nov 24 '22 01:11 daanx

A few other questions :)

Haiku has a couple before/after style hooks that mimalloc could hook into (that the current hoard2 allocator makes use of):

__init_heap
__heap_terminate_after
__heap_before_fork
__heap_after_fork_child
__heap_after_fork_parent
__heap_thread_init
__heap_thread_exit

It seems like using some of these might be a good idea, to avoid what looks like possible reentrant malloc calls within mimalloc?

Also, for TLS, I could easily add a new reserved TLS key at https://github.com/haiku/haiku/blob/master/headers/private/system/tls.h#L15. And possibly directly invoke thread init from the __heap_thread_init hook?

Nov 24 '22 21:11 jessicah

hooks: that looks useful; you can always call mi_process_init from __init_heap, and mi_thread_init and mi_thread_done from the __heap_thread_init/exit hooks, I think. If those hooks are really called as early as possible, before any allocation can happen in the process or thread, that would be ideal. Moreover, in the init.c file you can see that we usually set up POSIX thread-local variables (pthread_key_t) just to be able to hook into thread exit, but if Haiku supplies this hook already we can avoid this :-).
TLS: there are two parts to this: 1) we need to get a unique thread id quickly, and this is already possible in Haiku (in TLS slot 1); 2) we need to get the heap belonging to a thread quickly for allocation: mi_malloc is basically mi_heap_malloc(mi_get_default_heap(), size). On Windows/Linux we use a regular thread-local variable initialized with &_mi_heap_empty -- this way, in the fast path we never have to check whether to initialize the heap, as the empty heap will trigger the generic allocation path (which will initialize the heap if needed). (See mimalloc-internal.h for the definition of mi_get_default_heap.)

Anyway, on other platforms without fast thread-local support, we often use a TLS slot (for example, on macOSX), but in that case we always need to test for the case where it is NULL, which has a small performance impact. Now, if __heap_thread_init is called before any allocation can happen, we can just initialize a reserved TLS slot with &_mi_heap_empty and avoid this. That would be perfect, and indeed avoids any re-entrancy as well.
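
The empty-heap trick is easier to see in a toy version. The sketch below is not mimalloc's code: `heap_t`, `heap_empty`, `heap_slot` and all `sketch_*` functions are hypothetical names, a `_Thread_local` pointer stands in for the reserved TLS slot, `sketch_heap_thread_init` plays the role of the `__heap_thread_init` hook, and libc `malloc` backs the toy heap just to keep it short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define BLOCK_SIZE   64  /* toy allocator: a single fixed block size */
#define REFILL_COUNT 8   /* blocks carved out per refill */

typedef struct block_s { struct block_s* next; } block_t;
typedef struct heap_s  { block_t* free; } heap_t;

/* Shared sentinel playing the role of _mi_heap_empty: its free list is
 * always empty, so every allocation from it lands in the generic path. */
static heap_t heap_empty = { NULL };

static _Thread_local heap_t  thread_heap; /* this thread's real heap */
static _Thread_local heap_t* heap_slot;   /* stand-in for the reserved TLS slot */

/* Stand-in for the thread-init hook: it must run before the thread's first
 * allocation, and only points the slot at the sentinel. */
void sketch_heap_thread_init(void) { heap_slot = &heap_empty; }

int sketch_heap_is_initialized(void) { return heap_slot != &heap_empty; }

/* Slow path: lazily set up the real heap, refill the free list, allocate. */
static void* sketch_malloc_generic(void) {
    heap_t* heap = heap_slot;
    if (heap == &heap_empty) {        /* first allocation on this thread */
        heap = &thread_heap;
        heap_slot = heap;
    }
    if (heap->free == NULL) {
        char* chunk = malloc((size_t)BLOCK_SIZE * REFILL_COUNT);
        if (chunk == NULL) return NULL;
        for (int i = 0; i < REFILL_COUNT; i++) {
            block_t* b = (block_t*)(chunk + (size_t)i * BLOCK_SIZE);
            b->next = heap->free;
            heap->free = b;
        }
    }
    block_t* b = heap->free;
    heap->free = b->next;
    return b;
}

/* Fast path: no NULL check on the slot and no "is it initialized?" check.
 * The single free-list test covers both an exhausted heap and a heap that
 * was never initialized. */
void* sketch_malloc(void) {
    heap_t* heap = heap_slot;
    block_t* b = heap->free;
    if (b == NULL) return sketch_malloc_generic();
    heap->free = b->next;
    return b;
}

void sketch_free(void* p) {
    assert(heap_slot != &heap_empty); /* the sentinel is never modified */
    block_t* b = p;
    b->next = heap_slot->free;
    heap_slot->free = b;
}
```

If the hook could not be trusted to run before the first allocation, `heap_slot` would start out NULL and the fast path would need the extra NULL test mentioned above.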

This is all tricky -- let me know if you need more help; Thanks!

Nov 25 '22 21:11 daanx

I had been looking into switching Haiku to using an allocator derived from OpenBSD's malloc, but after more testing it appears that mimalloc actually has lower overall memory usage (and in some cases, by a significant amount). So I'm looking into ways to integrate it more closely with the OS. @X547 did some work here to make use of the hook functions described, and integrating those further seems to be straightforward enough.

The thing I am spending more time investigating is whether we can make mimalloc use Haiku-native memory primitives rather than mmap, as these can be more efficient if used properly. Specifically:

resize_area: can be used to extend or shrink an area (and all mmap calls create such "areas", they're the basic unit of virtual memory on Haiku), much cheaper than doing another mmap call (because that would create a whole separate area internally, extra bookkeeping structures, etc.) We have to know the area's ID (this can be fetched or cached based on the virtual address, not hard to do in glue code)
reserve_address_range: Reserves virtual addresses, but not memory (trying to call mprotect on it won't work, it's purely for reserving virtual addresses.) The kernel will then avoid putting anything in it unless MAP_FIXED or equivalent is used (there's another method that's equivalent to MAP_FIXED | MAP_NOREPLACE which is preferred), until the virtual address space starts getting full, at which point mappings might get placed inside it even without MAP_FIXED. (Not really an issue on 64-bit, but does happen on 32-bit.)

If a whole "area" has the exact same protections, the kernel can use fast paths on fork(), resize, etc. and not allocate/manipulate a per-page protections/permissions array, so we get a larger speedup there too.

A basic glance at area usage of app_server (Haiku's window manager + graphics server) gives this:

   ID                             name   address         size   alloc. #-cow  #-in #-out
28653             libroot.so mmap area  0x6126000000 40000000  17d5000     0     0     0
28654             libroot.so mmap area  0x958fb8f000     1000     1000     0     0     0
28696             libroot.so mmap area  0xa023ad4000     2000     2000     0     0     0
28873             libroot.so mmap area  0xa7406fb000     2000     2000     0     0     0
29372             libroot.so mmap area  0x1472e545000     2000     2000     0     0     0
29771             libroot.so mmap area  0x19da06f6000     2000     2000     0     0     0
29775             libroot.so mmap area  0x19da0ae9000     2000     2000     0     0     0
29777             libroot.so mmap area  0x19da10a4000     2000     2000     0     0     0
29778             libroot.so mmap area  0x19da118f000     2000     2000     0     0     0
29780             libroot.so mmap area  0x19da1453000     2000     2000     0     0     0
29803             libroot.so mmap area  0x1a5a7cbc000     2000     2000     0     0     0
29807             libroot.so mmap area  0x1a5b2015000     2000     2000     0     0     0
29811             libroot.so mmap area  0x1a66968b000     2000     2000     0     0     0
29851             libroot.so mmap area  0x1a693ef6000     2000     2000     0     0     0
29972             libroot.so mmap area  0x1a6adfb3000     2000     2000     0     0     0
30053             libroot.so mmap area  0x1a6ae219000     2000     2000     0     0     0
30061             libroot.so mmap area  0x1a6ae22b000     2000     2000     0     0     0
30376             libroot.so mmap area  0x1a745ac5000     2000     2000     0     0     0
30383             libroot.so mmap area  0x1a7537d2000     2000     2000     0     0     0
30386             libroot.so mmap area  0x1a75f10f000     2000     2000     0     0     0
30389             libroot.so mmap area  0x1a76218f000     2000     2000     0     0     0
31130             libroot.so mmap area  0x1a76dbde000     2000     2000     0     0     0
31133             libroot.so mmap area  0x1a76dd69000     2000     2000     0     0     0
32466             libroot.so mmap area  0x1b14d7a6000     2000     2000     0     0     0
32469             libroot.so mmap area  0x1b668475000     2000     2000     0     0     0
32473             libroot.so mmap area  0x1b700917000     2000     2000     0     0     0
35346             libroot.so mmap area  0x1ce21287000     2000     2000     0     0     0
37774             libroot.so mmap area  0x1ce254a5000     2000     2000     0     0     0
38155             libroot.so mmap area  0x1ce25590000     2000     2000     0     0     0
38769             libroot.so mmap area  0x1d0f3691000     2000     2000     0     0     0
39644             libroot.so mmap area  0x1d229767000     2000     2000     0     0     0
39671             libroot.so mmap area  0x1d229f77000     2000     2000     0     0     0
47121             libroot.so mmap area  0x614c3fbd000     2000     2000     0     0     0

Nothing else in libroot.so calls mmap here except mimalloc, so all these are areas mimalloc requested.

The first area is the most interesting, I guess that's most of the heap with 1GB reserved and ~24MB of pages actually mapped in. I did not yet check how contiguous that is; if it's all mostly contiguous then writing a wrapper that uses resize_area to commit should be easy. If it's not then that may prove more difficult. (The other two-page areas seem easy enough to coalesce.)
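
For reference, the arithmetic behind those figures can be sketched as below, assuming the size and alloc. columns in the listing are hexadecimal byte counts and that pages are 4 KiB (the "two-page" areas are 0x2000 bytes). The helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u /* 0x1000, i.e. half of a "two-page" 0x2000 area */

/* Number of pages in a page-aligned byte count. */
static uint64_t sketch_pages_in(uint64_t bytes) {
    assert(bytes % SKETCH_PAGE_SIZE == 0);
    return bytes / SKETCH_PAGE_SIZE;
}

/* Size in whole MiB, rounded down. */
static uint64_t sketch_mib_floor(uint64_t bytes) {
    return bytes >> 20;
}
```

With these, the reserved size 0x40000000 is 262144 pages (exactly 1 GiB), and the allocated size 0x17d5000 is 6101 pages, or roughly 23.8 MiB, which is about 2.3% of the reservation.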

For the OpenBSD malloc, I came up with a process-global caching strategy (since OpenBSD malloc doesn't actually have one) that manages virtual address reservations, committing, decommitting, etc. which boils down to three methods:

__allocate_pages(size)
__allocate_pages_at(address, size) (is likely to fail unless the address is just beyond the end of an existing area/mapping)
__free_pages(address, size)

The sizes must be page-aligned, but otherwise the API has no restrictions: you can free a block of pages that is smaller (or larger) than what came from a single allocate call, etc.
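
To illustrate the shape of that API (not its implementation), here is a rough stand-in written against plain POSIX mmap/munmap. It does none of the caching or address-reservation work described above and does not use Haiku areas; the `sketch_` names, return types, and error conventions are assumptions made for this example.

```c
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS under strict -std= modes */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

static size_t sketch_page_size(void) {
    return (size_t)sysconf(_SC_PAGESIZE);
}

/* Map `size` bytes wherever the kernel likes; returns NULL on failure. */
void* sketch_allocate_pages(size_t size) {
    assert(size > 0 && size % sketch_page_size() == 0);
    void* p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Try to map `size` bytes at exactly `address`. The address is passed only
 * as a hint (no MAP_FIXED), so existing mappings are never replaced.
 * Returns 0 on success, -1 if the kernel could not place it there. */
int sketch_allocate_pages_at(void* address, size_t size) {
    assert((uintptr_t)address % sketch_page_size() == 0);
    assert(size > 0 && size % sketch_page_size() == 0);
    void* p = mmap(address, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return -1;
    if (p != address) {      /* landed somewhere else: undo and report failure */
        munmap(p, size);
        return -1;
    }
    return 0;
}

/* Unmap any page-aligned range; it does not have to match one allocate call. */
int sketch_free_pages(void* address, size_t size) {
    assert(size > 0 && size % sketch_page_size() == 0);
    return munmap(address, size);
}
```

For example, a caller could allocate four pages, later free only the last two, and then try `sketch_allocate_pages_at` on that same tail range; whether the retry succeeds is up to the kernel, which is why callers have to treat it as likely to fail.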

If I tried to use an API like this as the backing to "prim.c", how well would that go? It looks like "commit" methods always take a specific address, but how often does that address actually lie just beyond the allocated region, or within a previously-allocated-but-later-decommitted block (and not somewhere random further up?)

Feb 12 '25 23:02 waddlesplash

Aha, I just noticed the WASI prim.c which uses an allocation strategy more or less like this. So I suppose this should indeed work, at least in theory.

Feb 13 '25 00:02 waddlesplash

I did some more benchmarking, and while mimalloc is indeed faster than OpenBSD malloc + my caching strategy and does have lower memory use, in most cases with the basic set of applications (including web browsers) the memory savings are a few MB at most. The performance differences are a bit more significant (10% seems to be not uncommon), though.

I think in the end I'm going to move forward with OpenBSD malloc for now and put mimalloc adoption on the back burner. It might be good for us to implement it anyway and give ourselves more options (especially for larger applications that are more sensitive to malloc performance).

Feb 13 '25 00:02 waddlesplash

Hi @waddlesplash -- ha, you are moving fast ;-) The idea is that the new prim.h interface should support this well -- thanks for the overview of the primitives on Haiku! If you have issues with this, we can also meet over Skype or something to discuss it a bit, if you happen to restart working on this. Also, did you try the latest dev3? This version generally has a smaller memory footprint than before, with fine-grained purging and commit.
Best, Daan

Feb 13 '25 01:02 daanx

I didn't try with dev3, no. We've been burned a bit before in our experiment with using "rpmalloc" as the default allocator (too much "waste" of virtual/physical memory caused OOMs on lower-end systems), so even if dev3 does improve things even more here, that's not really a help to me at the moment as I'd want to wait until it was "stable" and already in production use before we moved to adopt it. But I will surely keep an eye out to see how things progress there.

Thanks for all your work on mimalloc!

Feb 13 '25 01:02 waddlesplash