
Customizing mimalloc to enable store and restore of a memory heap from one process to another for same binary

Open beyonddream opened this issue 4 years ago • 4 comments

Hi,

At work, we are experimenting with mimalloc as a general-purpose memory allocator and a possible replacement for jemalloc (in initial benchmarks it shows promise: we see around 1.5 GB less memory consumption on average for a long-running process). In addition, we are working on a project to improve the startup time of our application. Currently the app is deployed on multiple machines (with the same or similar characteristics), and they all process a bunch of files before they begin taking user requests. This takes a long time on each machine, and we want to speed it up by having just one machine (the producer) process the files and dump its processed memory to a file, and having the other machines (the consumers) load that file-backed memory directly so they can avoid the long startup time. There is a separate effort to write a custom allocator for this use case, but we wondered whether mimalloc could be customized to support it and started working on a PoC.

My current approach:

On the producer machine:

  1. Create a new heap via mi_heap_new() that is backed by a huge chunk of OS memory at a fixed base address; this heap is then used while processing the files. (This also requires modifying the mmap routine to use MAP_FIXED, since we want to ensure the OS returns memory at the given fixed base address.)
  2. Enable this new heap as the current heap.
  3. Any subsequent malloc()/free() will be served from this heap.
  4. When the producer is done, dump the whole heap to a file and ship it to the consumers.

On the consumer machines:

  1. Create a new heap via mi_heap_new() and load the file-backed memory at the same fixed base address as on the producer machine. (Note: this is crucial so that all the internal pointers still point into the region at that base address and don't need to be changed/patched.)
  2. Enable this new heap as the current heap.
  3. From now on, any subsequent malloc()/free() can be served from this heap, and the app can also use the existing memory objects created on the producer machine. (In a way, the idea is to patch up mimalloc's internal state so that it continues to work as if it had created these memory objects itself.)
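
The producer/consumer steps above can be sketched without mimalloc at all, using only POSIX mmap. This is a minimal illustration of the fixed-base idea, not the PoC from this issue: node_t, REGION_SIZE, producer_dump and consumer_sum are hypothetical names, and instead of a hard-coded base address the demo lets the kernel pick one on the "producer" side and reuses it on the "consumer" side.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical record type: nodes link to each other with raw pointers. */
typedef struct node { struct node *next; int value; } node_t;

#define REGION_SIZE ((size_t)1 << 20) /* 1 MiB demo region (assumption) */

/* "Producer": build a linked list inside a fresh region, dump the region
   to `path`, and unmap it. Returns the region's base address, or NULL. */
static void *producer_dump(const char *path, int count) {
    assert(count >= 1 && (size_t)count * sizeof(node_t) <= REGION_SIZE);
    void *base = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) return NULL;
    node_t *nodes = base;
    for (int i = 0; i < count; i++) {
        nodes[i].value = i;
        nodes[i].next = (i + 1 < count) ? &nodes[i + 1] : NULL;
    }
    int fd = open(path, O_CREAT | O_TRUNC | O_WRONLY, 0600);
    /* A short write is treated as a failure to keep the sketch small. */
    int ok = fd >= 0 && write(fd, base, REGION_SIZE) == (ssize_t)REGION_SIZE;
    if (fd >= 0) close(fd);
    munmap(base, REGION_SIZE);
    return ok ? base : NULL;
}

/* "Consumer": map the dump at the same base so the stored `next` pointers
   are valid again, then sum the list values. Returns -1 on failure. */
static long consumer_sum(const char *path, void *base) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    /* MAP_FIXED replaces whatever is already mapped at `base`. */
    void *p = mmap(base, REGION_SIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_FIXED, fd, 0);
    close(fd);
    if (p == MAP_FAILED) return -1;
    long sum = 0;
    for (node_t *n = p; n != NULL; n = n->next) sum += n->value;
    munmap(p, REGION_SIZE);
    return sum;
}
```

Calling producer_dump(path, 4) and then consumer_sum(path, base) with the returned base walks the reloaded list through its original next pointers. Across real machines the base would have to be a constant agreed on in advance, and nothing else in the consumer may already occupy that range.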

First of all, I would like feedback on this approach, and I would also like to know whether there is an alternative approach based on existing mimalloc features/APIs that would make this implementation easier. Also, given that this feature might be useful to others, is this something the upstream maintainers would be interested in merging into mimalloc?

Please let me know if you have any questions!

beyonddream avatar Apr 01 '21 17:04 beyonddream

@daanx, any thoughts or comments on the above, please?

beyonddream avatar Apr 12 '21 19:04 beyonddream

Based on my latest findings, it looks like a heap created by mi_heap_new() doesn't automatically use the new segment that I created (backed by a huge OS allocation of about 2 GB). It still uses the thread-local segments (mi_heap_s->tld->segments), which it shares with other heaps, to service malloc() calls. Every run produces slices starting at different addresses, which makes sense. Currently I don't see any way to override this with a fixed-address segment without affecting the other heaps as well. It also seems to require more involved changes to the internals of mimalloc. It would be great if you could validate this and/or suggest any other options that would make this easier to implement.

beyonddream avatar Apr 12 '21 19:04 beyonddream

Hi @beyonddream, very interesting -- but tricky... recently we added mi_manage_os_memory, which can be used to assign a large memory arena. So, in the producer, one can:

  • allocate memory at a fixed address using direct OS calls
  • use mi_manage_os_memory to add that memory to mimalloc as an arena, and set the option mi_option_limit_os_alloc to stop mimalloc from allocating from the OS itself
  • at the end, write out the memory area to a file

Now all memory for all heaps comes from that area, but there are some parts of the state that you need to save/restore, which may be more tricky:

  • the thread ids: mimalloc uses the thread id to check multi-threaded access in the heaps and segments; these need to be saved/restored differently. Actually, I think you can use the heap and segment walking functions to do this.
  • the same goes for the security keys if using secure mode (as those use randomness)
  • there is a static structure (in init.c) for the initial main heap that points to the rest; this needs to be saved/restored as well, but again, very carefully :-)

Not sure if this covers everything; it will be tricky to do, but it might work.

daanx avatar Apr 28 '21 19:04 daanx

It's not clear to me that the mi_manage_os_memory() feature works. It seems to return only normal mi_malloc() space, not the space I hand to it. The argument flags and the state of mi_option_limit_os_alloc do not seem to matter. I am trying to replace a deprecated feature that reserves space in the kernel with MEMMAP= and then allocates space out of it for use by multiple user-mode middlewares. A concrete example would be helpful.

$ sudo sh -c 'env RESERVE_STALL=1 MIMALLOC_SHOW_STATS=1 LD_LIBRARY_PATH=/usr/local/lib64 ./reserve'
allocations=0x27e4e010000
resPlace=0x100000000 resStart=0x7f09dc94b000 resEnd=0x7f09e094b000
mi_option_limit_os_alloc=0 setting to 1
allocation 0 881664 0x27e4e020000 0xffff8374716d5000                            not in reserve space

off_t resPlace = 0x100000000; /* 4GB boundary MEMMAP=1024M$4096M */
void *resStart = 0;
size_t resSize = 64*1024*1024; /* 64MB size: first 64MB of the 1024M */
void *resEnd = 0;
  resStart = mmap(0, resSize, PROT_READ|PROT_WRITE, MAP_SHARED,
                  fd, resPlace);

  resEnd = (void *) &((char *) resStart)[resSize];

#define ARGS 1, 1, 1, 0 /* is_committed, is_large, is_zero, numa_node */
  if(!mi_manage_os_memory(resStart, resSize, ARGS)) {
            resStart, resSize, ARGS);

  long option = mi_option_get(mi_option_limit_os_alloc);
  fprintf(stderr, "mi_option_limit_os_alloc=%ld setting to 1\n", option);
  mi_option_set(mi_option_limit_os_alloc, 1);

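      void *ptr = allocations[i] = mi_malloc(this);
      /* record where the block landed and its offset from the start of the reserve */
      sprintf((char *) ptr, "allocation %d %ld %p 0x%lx                           ", i, this, ptr,
              (unsigned long) ((char *) ptr - (char *) resStart));
      /* resEnd is one past the last reserved byte, so compare with >= */
      if((char *) ptr < (char *) resStart || (char *) ptr >= (char *) resEnd)
        fprintf(stdout, "%s not in reserve space\n", (char *) ptr);
      else
        fprintf(stdout, "%s\n", (char *) ptr);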

UPDATE: as usual, once I've asked the question I find a solution myself, except I don't quite understand the behavior.

I'm now getting allocations out of the space I handed to mi_manage_os_memory(). The mi_option_limit_os_alloc option did not work quite the way I expected. I had malloc'd an array to hold my allocations, and that array ended up creating a second arena in normal heap space, from which all other allocations then came. I took that first allocation out (the array now lives on the stack), and now it's all happy.

I just want mi_malloc() and not the malloc() override.

BEST AND FINAL: I found cmake -DMI_OVERRIDE=OFF, which turns off the override functions so that I can have my allocator on the MEMMAP= space and still have the regular malloc inside the program.

Once I apply it I might have more kvetching to do...

nufosmatic avatar Jul 14 '23 16:07 nufosmatic

I'm not even sure how I stumbled upon this, but fun! It seems to me that using a custom arena allocator would be the much easier way to accomplish this, but sometimes it can be fun to run uphill! :-)
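
For readers wondering what such a custom arena allocator might look like, here is a minimal bump-allocator sketch. It is not code from this thread: arena_t, arena_init and arena_alloc are hypothetical names, and the sketch assumes the region handed in is 16-byte aligned (an mmap'd region is). The allocator state is kept in a header at the start of the region, so dumping the region to a file would capture the state along with the data.

```c
#include <assert.h>
#include <stddef.h>

/* Header stored at the start of the region itself. */
typedef struct arena {
    size_t capacity; /* total bytes in the region, header included */
    size_t used;     /* bytes consumed so far, header included */
} arena_t;

/* Round n up to a multiple of a (a must be a power of two). */
static size_t align_up(size_t n, size_t a) { return (n + a - 1) & ~(a - 1); }

/* Place the header at the start of `region` and mark the rest as free. */
static arena_t *arena_init(void *region, size_t capacity) {
    size_t header = align_up(sizeof(arena_t), 16);
    assert(region != NULL && capacity >= header);
    arena_t *a = region;
    a->capacity = capacity;
    a->used = header;
    return a;
}

/* Hand out the next 16-byte-aligned block, or NULL when the region is full.
   There is no per-block free; the whole region is released at once. */
static void *arena_alloc(arena_t *a, size_t size) {
    size_t need = align_up(size, 16);
    if (need < size || need > a->capacity - a->used) return NULL;
    void *p = (char *)a + a->used;
    a->used += need;
    return p;
}
```

The trade-off is the one mentioned later in the thread: every allocation site that should land in the region has to call arena_alloc explicitly, which is what makes this approach invasive for an existing codebase.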

Also, have you looked at CRIU (https://github.com/checkpoint-restore/criu) as another alternative approach to fast-startup by cloning an existing process which has done the heavy lifting?

Anyway, this comment is years late, so it is probably irrelevant at this point, and you were probably already on top of this anyway, but the thing that jumps out at me on first read is still ASLR. What you're constructing is explicitly an anti-ASLR heap, but you need to be careful about pointers to anything in the rest of the producer's address space, which will be at different addresses on the consumer machines if ASLR is enabled (or you fully disable ASLR, with the obvious caveats of doing that).
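
One common way to sidestep that problem, sketched here as an assumption rather than anything proposed in this thread, is to never store addresses of code or globals inside the dumped heap, and to store a stable ID instead that each process resolves through its own table. The names handler_fn, handlers, task_t and run_task are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*handler_fn)(int);

static int double_it(int x) { return 2 * x; }
static int negate(int x) { return -x; }

/* The enum order is the stable ID space shared by producer and consumer. */
enum { HANDLER_DOUBLE, HANDLER_NEGATE, HANDLER_COUNT };

/* Built by every process from its own code addresses, so ASLR moving the
   functions around does not matter. */
static const handler_fn handlers[HANDLER_COUNT] = { double_it, negate };

/* Record that would live in the dumped heap: it stores an ID, never a
   function pointer. */
typedef struct task {
    int handler_id;
    int arg;
} task_t;

/* Resolve the ID through the local table and call the handler. */
static int run_task(const task_t *t) {
    assert(t->handler_id >= 0 && t->handler_id < HANDLER_COUNT);
    return handlers[t->handler_id](t->arg);
}
```

The same idea applies to pointers into static data or other heaps: anything outside the fixed-base region needs an indirection that the consumer can rebuild.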

Did you ever go anywhere with this?

missmah avatar Apr 04 '25 00:04 missmah

I noticed this on my feed and thought I would provide some comments as closure before I close this issue :)

TL;DR: We abandoned the mimalloc experimentation effort and moved on!

While working on the original project mentioned in my ticket, different engineers looked into three approaches: a custom bump allocator, CRIU, and mimalloc (by me).

The custom bump allocator had a more polished design, but in order to support it, major effort had to be expended in codebases upstream and downstream of the app in question (I forget the details, but I remember that it was invasive enough that we decided to investigate the other two approaches).

CRIU seemed to show a lot of promise, but the lead engineer who investigated it came back with the suggestion to drop it, since they encountered a few runtime exceptions when running a CRIU-cloned process on another node (I don't remember the details, but I think they were floating-point related). We could have fixed the couple of issues we encountered along the way, but there was no guarantee that more problems wouldn't rear their ugly heads in the future, and we didn't want to take on any such tech debt.

The mimalloc approach seemed the cleanest to me, but alas, after my initial hacking through the source code I came to the conclusion that it was not as straightforward as I had originally thought. By the time daanx provided his suggestions, I had tried them and many other approaches, but it still didn't work (although I didn't try too hard :)). Finally, as life would have it, we had to deal with shifting project priorities, team churn, etc., and we ultimately decided to shelve the project, and I eventually moved on to a different team!

I just want to thank daanx for the mimalloc project (it really got me and a few engineers on my team excited about memory management, and I happily hacked on the source code for a few months and loved the readable, well-written C code). It was fun while it lasted!

beyonddream avatar Apr 12 '25 06:04 beyonddream