llama.cpp
Should use `mmap` for model loading
So it doesn't create an extra copy in RAM and lives in the kernel page cache happily, loading instantly on subsequent runs.
@ggerganov I'm working on putting together the PR. Almost done.
I don't know anything about the order that ggml accesses the weights in. Would you say that it's sequential? If so, there's also madvise().
You probably don't want to use madvise + MADV_SEQUENTIAL: in addition to increasing the amount of readahead, it also causes pages to be evicted after they've been read. The entire model is going to be executed at least once per output token and read all the weights, so MADV_SEQUENTIAL would potentially kick them all out and reread them repeatedly.
What may be more appropriate is to use MADV_WILLNEED on the whole model to kick off paging it all in without needing to wait for it to finish. But mmap can be tricky, and you would probably want to make it an option rather than the default: it may not be a perf improvement on all setups and can wind up being slower than regular I/O due to causing lots of TLB shootdowns. You would want to benchmark it, as it's not unlikely you'd be trading improved time-to-first-token for worse overall throughput.
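For reference, a minimal POSIX sketch of that approach (untested; the model path and the lack of error handling are placeholders):

#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Map the model read-only and ask the kernel to start paging it in right away.
// MADV_WILLNEED is advisory and returns immediately, so work can begin while
// the readahead continues in the background.
static void *map_model_willneed(const char *path, size_t *out_size) {
    int fd = open(path, O_RDONLY);
    struct stat st;
    fstat(fd, &st);
    void *addr = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    madvise(addr, st.st_size, MADV_WILLNEED);
    close(fd);                  // the mapping stays valid after close
    *out_size = st.st_size;
    return addr;
}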
That will definitely happen with posix_fadvise(sequential), which has a very gentle impact on file caches on Linux. What we might end up wanting here is madvise(random). In order to do that though, we would first need to find a way to avoid the loading and deserialization process where C/C++ data structures are constructed in memory, and instead have the runtime data structures be mapped directly from the file. That would ensure a 100% reduction in startup time, which means we can start generating tokens ASAP, and pages get loaded off disk on an as-needed basis. Once we're able to implement that design pattern, madvise(random) vs. madvise(sequential) would be a tool that lets the kernel know how to utilize an under-utilized disk, making predictions that avoid page faults.
I'm still getting up to speed on this codebase, so I'd like to hear everyone's ideas on how best we'd ensure object data structures (or their floating point content at the very least) could be made directly mappable, thus side-stepping the loading process. One dirty hack I've been considering, for example, would be overriding the memory allocators to place all objects at a fixed address, and persisting that to disk. That way, when all the C/C++ objects are loaded back into memory using MAP_FIXED, no relocations would need to be performed. That's obviously a less portable and non-ideal solution, but it'd help us get instant loading happening as quickly as possible, and it would also give us an opportunity to explore precisely how sparse the model's memory usage patterns actually are.
@jart Thanks for stepping in. I will share briefly an idea that might be useful. Just mind I haven't looked into details of the discussion - will do in a few days once things cool off a bit here.
I think ggml_context is extremely well fit for mmap if I understand how it works. The ggml_context uses an externally provided buffer of memory with a pre-determined size:
https://github.com/ggerganov/llama.cpp/blob/721311070e31464ac12bef9a4444093eb3eaebf7/main.cpp#L569-L575
All tensors ~~and model parameters~~ are "emplaced" into this buffer. There are no extra allocations occurring.
Once you load the model, you can simply dump the memory buffer provided to ggml_context and next time, you can simply load this buffer instead of constructing it. Everything should work.
Edit: So above I incorrectly referenced the "eval" ggml_context which has its own buffer. The "model" ggml_context is here:
https://github.com/ggerganov/llama.cpp/blob/721311070e31464ac12bef9a4444093eb3eaebf7/main.cpp#L228-L240
Same stuff. If the pointer is NULL it's allocated inside ggml for convenience.
All tensors and model parameters are "emplaced" into this buffer. There are no extra allocations occurring.
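Roughly, the idea would look something like this (a sketch only; the field names follow the `ggml_init_params` usage linked above, the cache file name is made up, and, as noted later in this thread, the buffer holds absolute pointers, so reloading it is only valid if it lands at the same address):

#include <stdio.h>
#include <stdlib.h>
#include "ggml.h"

// First run: hand ggml a buffer we own, build the model into it as usual, then
// dump the whole buffer to a cache file.
static void save_model_ctx_buffer(size_t ctx_size) {
    struct ggml_init_params params = {
        /*.mem_size   =*/ ctx_size,
        /*.mem_buffer =*/ malloc(ctx_size),
    };
    struct ggml_context *ctx = ggml_init(params);
    // ... emplace tensors and read the weights into ctx as usual ...
    FILE *fout = fopen("ggml-context.cache", "wb");
    fwrite(params.mem_buffer, 1, ctx_size, fout);
    fclose(fout);
    (void) ctx;
}
// Next run: read (or mmap) the cache file and pass it straight back in as
// mem_buffer instead of reconstructing the tensors.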
In other words, there is a set memory cap? Like many other people I have basically no experience in AIs or memory buffers. I'm just a guy pushing buttons until either something explodes or an AI becomes self-aware lol
For loading from a physical/network drive, resharding the larger models to a single file might help, imho, whereas loading multiple files in parallel would be slower.
https://github.com/jankais3r/LLaMA_MPS/blob/main/reshard.py
I'm getting ready to take another swing at it. My idea of what to do so far:
1. Create functions in `utils.h` called `llama_load_buffer()`, `llama_save_buffer()`, and `llama_destroy_buffer()`. These will `mmap()` (or just `malloc()` and read), save, and `munmap()` (or just `free()`) the buffers respectively. So, files saved on one machine can't necessarily be loaded by another. These files would be stored in some folder, and have the names of the original files. Either in `models`, or `/tmp`, or a new folder. This will also hopefully be useful for implementing saving the model state.
2. Add a new command line argument that tells `llama_model_load()` to look in this cache folder first. If it finds the file, `llama_load_buffer()` the file to get your `ggml_init_params`. Then do whatever else needs to be done (initialize the vocab, get hparams, etc.) and exit the function. If the argument is present, call `llama_save_buffer()` first. Also, call `llama_destroy_buffer()` at the appropriate location.
I can do 1, I'll submit a PR for that shortly, but it isn't super clear to me how memory is laid out so that I can do 2. In particular, I'm wondering about the "whatever else needs to be done" part. I'm certain that I'm missing something, and it probably wouldn't be obvious to me what I'm breaking even after hours of monkeying.
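For concreteness, a rough POSIX-only sketch of what those three functions could look like (the signatures and error handling are placeholders, not a proposal for the final API):

#include <fcntl.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Dump a ggml buffer into the cache folder (ignores partial writes for brevity).
bool llama_save_buffer(const char *path, const void *buf, size_t size) {
    int fd = open(path, O_CREAT | O_TRUNC | O_WRONLY, 0644);
    if (fd < 0) return false;
    bool ok = write(fd, buf, size) == (ssize_t) size;
    close(fd);
    return ok;
}

// Map a previously saved buffer; returns NULL if it doesn't exist.
void *llama_load_buffer(const char *path, size_t *size) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;
    struct stat st;
    fstat(fd, &st);
    void *buf = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    close(fd);                  // the mapping stays valid after close
    if (buf == MAP_FAILED) return NULL;
    *size = st.st_size;
    return buf;
}

void llama_destroy_buffer(void *buf, size_t size) {
    munmap(buf, size);
}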
My concern with doing that is, wouldn't it effectively double the disk usage? LLaMA is big enough that folks are already likely to be stretched thin on disk space after creating the second copy needed to quantize the model. I'm still studying the codebase to determine exactly what transformations need to be made at runtime. For example, if it's just a bunch of float16's on disk, and we're using a bunch of float16's in memory, then I don't see why the buffer field of these tensors couldn't just be populated with pointers to the appropriate positions in the file. Unless of course the data needed to be reshaped or shifted to meet things like AVX alignment requirements. In that case, we'd ideally want to modify the quantizer script so that it generates something suitable for our purposes, so that only a single conversion step needs to be performed.
(comment is more a #202 thing)
This is the way I was thinking about it:
After the model loads and the prompt is tokenized, create a hash of the context. If that hash exists in the cache directory (and flag is set), load state.
Saving the state has two modes, imho: post-prompt (only works when state hasn't been loaded) and at end of generation. The post-prompt mode allows jump-starting the model. End-of-generation saving allows picking up where one left off.
As to the memory organization, I'd leave that to existing code. The hash would act as pseudo-verification that it's okay to load a buffer of bytes. The model and prompt would need to be the same, maybe even other options.
Just throwing out some head-space. Haven't started coding anything.
Wait. What problem are we trying to solve here exactly? Are we trying to (1) eliminate the three second startup delay? Or are we trying to (2) store the changes made to memory back to disk? Because if your goal is to solve (2) then the only thing you need to save are the random seed and the prompt, since that would restore the state deterministically. Right now I'm focusing on (1) since having fast mmap() loading would not change llama.cpp's behavior, and would instead simply make it go faster. If you want (2) then this change could be the stepping stone you need. All you'd have to do is change MAP_PRIVATE to be MAP_SHARED instead, and whatever mutations are made to the tensors in memory will be transparently remembered on disk. However that's orthogonal to my intended goals at the moment.
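For concreteness, the difference between the two goals really is just that one flag (sketch; `persist_changes` is an illustrative parameter):

#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>

void *map_weights(int fd, size_t size, bool persist_changes) {
    // MAP_PRIVATE: goal (1), copy-on-write pages, the file on disk never changes.
    // MAP_SHARED:  goal (2), stores into the tensors are written back to the
    //              file (the fd must then be opened O_RDWR).
    int flags = persist_changes ? MAP_SHARED : MAP_PRIVATE;
    return mmap(NULL, size, PROT_READ | PROT_WRITE, flags, fd, 0);
}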
I did not intend to expand the meaning of the thread. (2) should probably be addressed elsewhere. #202
@jart It would double the disk usage, yes. But so does converting the model, and so does quantizing it. I think people are prepared for this.
You're right though in that the scripts that convert the model are probably the best way to do this. I was only thinking about implementing a cache for ggml_init_params as originally suggested. Ideally though, everything should just be one call, for everything from vocab/hparams to model weights.
So I added some logging statements to track the read() operations that are happening. It's 200k+ lines that look like this:
moving 0x640 bytes from offset 0x4a607 to offset 0 (n_dims=2 n_parts=2)
moving 0x640 bytes from offset 0x4ac47 to offset 0xc80 (n_dims=2 n_parts=2)
moving 0x640 bytes from offset 0x4b287 to offset 0x1900 (n_dims=2 n_parts=2)
moving 0x640 bytes from offset 0x4b8c7 to offset 0x2580 (n_dims=2 n_parts=2)
moving 0x640 bytes from offset 0x4bf07 to offset 0x3200 (n_dims=2 n_parts=2)
moving 0x640 bytes from offset 0x4c547 to offset 0x3e80 (n_dims=2 n_parts=2)
moving 0x640 bytes from offset 0x4cb87 to offset 0x4b00 (n_dims=2 n_parts=2)
moving 0x640 bytes from offset 0x4d1c7 to offset 0x5780 (n_dims=2 n_parts=2)
moving 0x640 bytes from offset 0x4d807 to offset 0x6400 (n_dims=2 n_parts=2)
moving 0x640 bytes from offset 0x4de47 to offset 0x7080 (n_dims=2 n_parts=2)
moving 0x640 bytes from offset 0x4e487 to offset 0x7d00 (n_dims=2 n_parts=2)
moving 0x640 bytes from offset 0x4eac7 to offset 0x8980 (n_dims=2 n_parts=2)
moving 0x640 bytes from offset 0x4f107 to offset 0x9600 (n_dims=2 n_parts=2)
moving 0x640 bytes from offset 0x4f747 to offset 0xa280 (n_dims=2 n_parts=2)
moving 0x640 bytes from offset 0x4fd87 to offset 0xaf00 (n_dims=2 n_parts=2)
moving 0x640 bytes from offset 0x503c7 to offset 0xbb80 (n_dims=2 n_parts=2)
All it's doing is (1) reshaping and (2) aligning the data in the file. That's why llama.cpp takes several seconds to start. It wouldn't make sense to cache a bunch of memcpy() operations. The quickest thing we could do is introduce a third conversion step that creates a new file format, where the data is in the appropriate shape and alignment ahead of time. Then we could work our way backwards through the conversion tools, to reduce the number of pipeline chores from 3 to 1.
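To illustrate what "appropriate shape and alignment ahead of time" could mean, here is a sketch of the kind of padding such a conversion step might write; the 32-byte figure is an assumption chosen to satisfy AVX load alignment:

#include <stdio.h>

// When writing the new single-file format, pad the file position to the next
// multiple of the required alignment before emitting each tensor's data. A
// later loader can then point tensor->data at (mapping + offset) with no
// reshaping or memcpy at startup.
static void align_to(FILE *fout, long alignment) {
    long pos = ftell(fout);
    long pad = (alignment - (pos % alignment)) % alignment;
    for (long i = 0; i < pad; i++) {
        fputc(0, fout);
    }
}
// per tensor: write the name/shape/type header, then
//   align_to(fout, 32);                      // 32-byte alignment for AVX loads
//   fwrite(tensor_data, 1, nbytes, fout);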
Here's another reason why this issue is so important. I just ran the 13B model with F16C on my workstation with 32GB of RAM. Once loaded, the model comes very close to hitting the physical memory limit, using maybe ~30GB peak RSS. Bringing memory up to the edge of swapping compounds the tragedy, since the kernel reacts by dropping its file caches. If we were using mmap(), the kernel would know that the loaded pages and the file pages are the same thing. But since we're copying the memory, the file cache goes away, and loading ends up taking a minute each time.
@apaz-cli Have you attempted implementing yet the thing you proposed? It might work if you use MAP_FIXED when reloading it, since GGML appears to allocate objects with pointers too.
@jart I have no idea how to support that in a portable way. I haven't dug too deep into it. I'm halfway through implementing part 1.
The troubling thing is actually the default implementation for opening files with the C/C++ stdlib. There is no portable way in C++11 to check the size of a file or binary stream, not even with fseek()/seekg() and ftell()/tellg(). C++17 resolves this with std::filesystem, but other versions of the standard are out of luck. You have to guess, and resize/copy if you're wrong, which seems unacceptable. The other way to do it is to read all the bytes of the file once just to get the size, and then read them again, which also seems unacceptable, unless the compiler is somehow magically able to see through it. I haven't checked, but it doesn't seem that likely.
See this link to the C standard. The C++ standard says the same about its own streams.
Although it's UB, the fseek()/ftell() dance is a classic, and is supported on almost all platforms. So we could just do it anyway.
The mmap operation itself is going to have its own portability issues; supporting all platforms on a first pass with no #ifdefs is unlikely here. mmap also requires the fd being mapped to actually be a file on a filesystem, which is probably implied if it's seekable, but fstat (or equivalent) is probably the better way to check for that and get the size at the same time.
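Something along these lines (POSIX sketch; the WIN32 path would need GetFileSizeEx or similar):

#include <stdbool.h>
#include <stddef.h>
#include <sys/stat.h>

// One fstat call both confirms the fd refers to a regular file and returns its
// size, avoiding the fseek()/ftell() dance entirely.
static bool regular_file_size(int fd, size_t *out_size) {
    struct stat st;
    if (fstat(fd, &st) != 0 || !S_ISREG(st.st_mode)) {
        return false;           // not a plain file: fall back to buffered reads
    }
    *out_size = (size_t) st.st_size;
    return true;
}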
I've implemented a working prototype for UNIX systems. https://github.com/ggerganov/llama.cpp/commit/5b8023d935401072b73b63ea995aaae040d57b87
Your LLaMA AGI models will now load instantly without any user visible latency.
https://user-images.githubusercontent.com/49262/225831096-fa53d227-84a4-4dd2-8e62-4630c16a481e.mp4
The above video should be all the proof you need.
I did this by writing an append-only malloc() that lets us transactionally capture heap allocations, so they can be restored on subsequent runs. It's a ~200 LOC change that only took me a few hours and it worked the first time. Much easier than the alternative, which likely would have entailed serializing C++ STL objects. This change could be productionized as it stands. I'd need to add the WIN32 mmap() code. I'd also need to store the flags and possibly file timestamps for comparison in the serialized object, since right now the state change can only happen when magic.dat is deleted. We'd also want to put this behind a flag.
However, I still firmly believe this change is not the right thing to do. The file format should be fixed so that it's aligned and doesn't need to be reshaped. Doing that will take 1k+ lines of refactorings, plus a data migration, plus changes to the GGML API. I don't think it's possible to use the ggml_init_params::mem_buffer field, because that memory region contains pointers. That makes it basically the same as my malloc() capturing code, except less generalized. If you wanted to mmap() that field in a portable way, you'd have to do what linkers do and apply fixups to all the pointers. What I thought might make more sense is doing a tensor->data = mmap() from a given file offset for each tensor (since I'm assuming there aren't that many of them?).
I'll also note that the gains here are mostly due to not copying memory anymore, and better cooperation with the kernel's page manager. We unfortunately aren't getting any additional gains from lazy page loading, since this is a dense model. To generate a single token, every single page in the model file needs to be loaded. What this means is that first runs that load from spinning disk are still going to be slow, even though the average case has greatly improved. I don't view that as a problem, since having the better cooperation with the page manager ensures that the kernel file caches are much less likely to be evicted.
On devices with less RAM and no swap (iOS), will this allow the inference to proceed without hitting the memory limit by evicting weights from the page cache during inference?
For the pointers issue, you could make a custom smart pointer type that uses a global variable to do the fixups at runtime (not sure if this would have a perf impact though):
#include <cassert>
#include <cstddef>
#include <cstdio>

void *ctx_base;   // base address of the mapped context, set once per run

template <typename T>
class ctx_ptr {
    ptrdiff_t offset;   // position of the object relative to ctx_base
public:
    explicit inline ctx_ptr(T *value)
        : offset((char *) value - (char *) ctx_base) {
        assert(offset >= 0);
    }
    inline T *operator->() const {
        return (T *) ((char *) ctx_base + offset);
    }
};

// usage:
ctx_base = (void *) 0x1234;                  // base address of the old mapping
ctx_ptr<ggml_whatever> ptr(the_raw_ptr);     // stores only the offset
ctx_base = (void *) 0x5678;                  // base address of the new mapping
printf("%s\n", ptr->some_field);             // resolves against the new base
mmap(2) makes it possible to use files larger than available RAM.
@jart
I don't think it's possible to use the ggml_init_params::mem_buffer field, because that memory region contains pointers.
Sorry, I missed that.
What I thought might make more sense, is doing a tensor->data = mmap() from a given file offset for each tensor (since I'm assuming there aren't that many of them?)
This is possible for non-sharded models. For example, if you take a look at the gpt-2 example, the model loading is a straight read from the file into each tensor:
https://github.com/ggerganov/ggml/blob/4c2f924553312c490e79e6e1739c6f4aa9bbd450/examples/gpt-2/main.cpp#L329-L331
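So for a non-sharded model, the linked read could in principle turn into something like this sketch (it assumes the bytes in the file are already laid out and aligned the way ggml expects, which is exactly the caveat being discussed; `mapping` would be the base address returned by mmap):

#include <cstddef>
#include <fstream>
#include "ggml.h"

// Sketch: instead of fin.read()-ing into freshly allocated tensor memory, point
// the tensor at the mapped file and just skip over the bytes in the stream.
static void map_tensor_data(struct ggml_tensor *tensor, char *mapping, std::ifstream &fin) {
    const size_t offset = (size_t) fin.tellg();   // start of this tensor's data in the file
    tensor->data = mapping + offset;              // assumes layout/alignment already match ggml
    fin.seekg(ggml_nbytes(tensor), std::ios::cur);
}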
However, the larger LLaMA models ( >7B ) are split into parts, and one has to merge them. Each tensor is split across all the parts. Either by rows or by columns. So the reading logic becomes complicated because of that:
https://github.com/ggerganov/llama.cpp/blob/367946c668757532deed929e1d78673c6ac6bcb8/main.cpp#L457-L503
We could create a combined ggml model file as part of the setup process, but this way we go back to the issue of needing double disk space.
Anyway, impressive stuff!
We could create a combined ggml model file as part of the setup process, but this way we go back to the issue of needing double disk space.
Maybe the conversion script could combine all the .pth files into one GGML file?
@ggerganov
However, the larger LLaMA models ( >7B ) are split into parts, and one has to merge them. Each tensor is split across all the parts. Either by rows or by columns. So the reading logic becomes complicated because of that:
Perhaps the merging can be addressed with multiple mmaps stitched together into a contiguous region as described here:
https://stackoverflow.com/a/34560306
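That answer boils down to reserving one region and overlaying the parts with MAP_FIXED, roughly as in this sketch (each part's length must be a multiple of the page size for this to work; `part_fd`, `part_size`, `n_parts`, and `total_size` are placeholders):

#include <stddef.h>
#include <sys/mman.h>

// Reserve one contiguous anonymous region, then overlay each part file onto its
// slice with MAP_FIXED; the parts then appear back to back in memory without
// ever merging them on disk.
static void *map_parts_contiguously(const int *part_fd, const size_t *part_size,
                                    int n_parts, size_t total_size) {
    void *base = mmap(NULL, total_size, PROT_READ, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    size_t off = 0;
    for (int i = 0; i < n_parts; i++) {
        mmap((char *) base + off, part_size[i], PROT_READ,
             MAP_PRIVATE | MAP_FIXED, part_fd[i], 0);
        off += part_size[i];
    }
    return base;
}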
@apaz-cli @apage43 Earlier you both raised concerns about portability. POSIX code is standardized and portable. I see however that our CI system requires that we support MSVC. I don't have access to MSVC right now. But I've tried my best to implement a set of polyfills for both WIN32 and POSIX.1 in the following change: 0b5448a3a485eb962916ce9579e335084a940f9c Could you please take a look?
I did a change like this as well, by swapping std::ifstream for a custom file reader that reads out of mmap. Which, btw, as a consequence showed that one of the .read calls actually fails when reading the data! It only really works for the 7B model because the file format is non-optimal for mmap.
+1 on changing file format to merge all data and allow for mmap
@jart I'm more lamenting the absurdity that there's no portable (C++11) way to find the size of a file. It truly baffles me. On POSIX there's fstat. On Windows there's GetFileSize / GetFileSizeEx. But there are other platforms that are not Windows or POSIX, and they would no longer be supported.
Sadly, I do not have access to a windows machine or MSVC either.
What are those platforms? I would have assumed C++11 would be less portable than POSIX.
This should apply to whisper.cpp as well, right? Since it is for a ggml model...
Another approach to make it run instantly without reloading the weights every time is by using a client/server model, which is what I've done in #278. This is how it works:
The model is loaded during server startup, then each connection is served in a forked child, and using a simple protocol you can pass some of the command-line parameters to customize that session and interact with the model over the socket.
It is as if you are instantly spawning llama.cpp with a preloaded model, only instead of running the program from scratch, you open a TCP connection to interact with it.
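Stripped down to its skeleton, the flow looks roughly like this (sketch; the port and the per-connection protocol are placeholders for what the PR actually implements):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <signal.h>
#include <sys/socket.h>
#include <unistd.h>

int main() {
    signal(SIGCHLD, SIG_IGN);                       // don't leave zombie children
    int srv = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(8080);                    // port is illustrative
    bind(srv, (struct sockaddr *) &addr, sizeof(addr));
    listen(srv, 16);

    // ... load the model here, once, before the accept loop ...

    for (;;) {
        int conn = accept(srv, NULL, NULL);
        if (fork() == 0) {
            close(srv);
            // child: inherits the resident weights copy-on-write; read the
            // prompt/parameters from conn, run inference, stream tokens back
            close(conn);
            _exit(0);
        }
        close(conn);                                // parent just keeps accepting
    }
}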
@tarruda What value does your TCP server offer, other than solving a problem that's already being solved by mmap()?