llama.cpp
Making weights loading faster
Tried to address slow weights loading. 7B is okay, but 13B is really slow (several minutes), making it hard to experiment/prototype with larger models.
Replaced std::ifstream with C-style file reading using fopen. Got a considerable boost in loading performance: 3x to 10x faster on my machine (measurements were somewhat inconsistent, but it is definitely a lot faster than before).
I made sure the weights are loaded correctly: I fixed the seed and gave the same prompts - the model gives the same output, so everything is good.
Also increased the buffer size from 1024*1024 to 128*1024*1024 (see line 102), which gave a slight boost as well. Though I am not sure whether it is optimal for edge devices like Raspberry Pi (if that's of any concern).
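For illustration, here is a minimal sketch of this approach - C stdio with a large user-supplied buffer via setvbuf - assuming a standalone helper (the function name and shape are made up, not the actual PR code):

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Illustrative sketch: read a weights file with C stdio and a large
// user-supplied buffer, so big sequential reads hit fewer syscalls.
bool load_weights(const char * fname, void * dst, size_t nbytes) {
    FILE * f = std::fopen(fname, "rb");
    if (!f) return false;

    // 128 MiB buffer, as in the PR; this may be too large for devices
    // like a Raspberry Pi, so treat the size as tunable.
    std::vector<char> buf(128u * 1024 * 1024);
    std::setvbuf(f, buf.data(), _IOFBF, buf.size());

    const size_t nread = std::fread(dst, 1, nbytes, f);
    std::fclose(f);
    return nread == nbytes;
}
```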
Is this faster than using mmap (see #91 / #150)?
> Is this faster than using mmap (see #91 / #150)?
Just tested out the #150 implementation; it was a lot slower than simple fopen. I believe #150 should've been faster on subsequent runs, but I didn't see any perceptible speedup, maybe just a little.
> Is this faster than using mmap (see #91 / #150)?
By implementing #91 correctly, we can avoid creating any redundant copies of the model. This is crucial because if we duplicate it and there's only enough space in memory for one instance, we risk evicting it from the operating system's file system cache.
@oKatanaaa switching between std::ifstream and FILE* should make no measurable difference. They are both tunable, do conceptually the exact same thing, and support (almost) exactly the same set of operations. The main bottleneck is the time spent inside system calls reading from disk, and they make those calls exactly or almost exactly the same way. The first implementations literally called fopen() directly. I forget if they still do.
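To illustrate the point, here is a minimal side-by-side sketch of the two paths being compared (the helper names are made up for illustration); both ultimately issue the same read() system calls underneath:

```cpp
#include <cstddef>
#include <cstdio>
#include <fstream>

// Both helpers read nbytes into dst; under the hood each ends up calling
// read() on the file descriptor, so raw throughput should be comparable.
bool read_cpp(const char * fname, char * dst, size_t nbytes) {
    std::ifstream f(fname, std::ios::binary);
    return bool(f.read(dst, (std::streamsize) nbytes));
}

bool read_c(const char * fname, char * dst, size_t nbytes) {
    FILE * f = std::fopen(fname, "rb");
    if (!f) return false;
    const size_t n = std::fread(dst, 1, nbytes, f);
    std::fclose(f);
    return n == nbytes;
}
```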
Your testing methodology is probably at fault. Files that are opened once are likely to be opened again soon after. OSes like to cache the contents of files (and shared memory pages too) so that subsequent reads are faster. On Linux you can flush these file and page caches with:
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches >/dev/null
You may have noticed that with the vanilla build on master, loading the model sometimes skips ahead by quite a few dots in the progress bar, usually at the beginning, but sometimes a few times in the middle. This is why.
That said, I recommend that you don't do this. Justine's version of #150 is much better than mine. To use it, you can just clone the repo and switch to the mmap branch. If you load the model once, subsequent loads happen instantaneously, as long as the kernel hasn't had any reason to evict the model to free up space.
As discussed in #91, it's not a perfect solution, but it's pretty great. Properly fixing this problem will require making changes to the ggml file format, and possibly the library. I recommend using Justine's branch in the meantime.
@apaz-cli Thanks for the input. I initially thought the same, because the performance comparisons for fstream vs fopen that I found were not really consistent (and very speculative) and depended on the particular case. Though intuition was telling me fopen should be faster for binaries (as I recall from my experience). So I decided to see if anything changes and implemented model loading with fopen. To my surprise, the performance gains were substantial, which is why I decided to make a pull request.
Speaking of the mmap implementation, as far as I understand (though I don't understand a thing about mmap, really) it simply creates a large binary where model weights are stored contiguously (I guess?) and uses it to make subsequent loads faster. The model loading code is not changed at all, only the fancy memory allocation stuff, meaning that my change and Justine's code are orthogonal and can benefit each other.
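For reference, the core of an mmap-based load looks roughly like this (a POSIX sketch with made-up names, not the actual code from the mmap branch): map the whole weights file and let the kernel page it in on demand, so subsequent loads are near-instant while the pages stay in the page cache.

```cpp
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Illustrative sketch: map a whole file read-only. The kernel faults pages
// in lazily on first access and keeps them cached across process runs.
void * map_weights(const char * fname, size_t * out_size) {
    int fd = open(fname, O_RDONLY);
    if (fd < 0) return nullptr;

    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return nullptr; }

    void * addr = mmap(nullptr, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd); // the mapping stays valid after closing the descriptor

    if (addr == MAP_FAILED) return nullptr;
    *out_size = (size_t) st.st_size;
    return addr;
}
```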
Getting back to the performance comparison of fstream vs fopen (and also mmap), I did some testing and here are the results:
Implementation | First load | Subsequent load | Read speed (MB/s) on subsequent run
---|---|---|---
master | 1m 49s | 1m 50s | 50
mmap | 8m 50s | 1m 10s | 20-30
fopen | 13s | 13s | 200-300
mmap + fopen | 6m 20s | 1m 8s | 20-30
My machine:
- CPU: Ryzen 7 4800H
- RAM: 32GB, 3200MHz (not sure if it's dual channel)
- SSD: ADATA SX6000PNP 1TB
Testing was done using the 13B model. I measured time (both for first and subsequent loads) until the model processed the first token. I also monitored my disk's current read/write speed. I made sure to run `sync && echo 3 | sudo tee /proc/sys/vm/drop_caches >/dev/null` before each 'first load' and did several subsequent loads, picking an average result. Deviations from run to run weren't too large, so the numbers in the table should give a general picture of things.
I also implemented mmap + fopen by combining my code with the mmap implementation, and it did give a slight boost. Though the mmap implementation does a lot of writes to memory (the page file, I guess?) that take up a huge portion of the total loading time, so on the first run we'll still have to wait.
An interesting thing is that subsequent runs with mmap are still a lot slower than fopen. I am not sure what's going on, as I expected an instantaneous load (but intuition tells me that the custom malloc and realloc read the required tensors from disk on request as the model is being evaluated during the first run, please correct me if I'm wrong).
As a summary:
- fopen is indeed faster than fstream in this case, at least on my machine. It would be really good if other people could confirm performance gains on their machines.
- fopen (current PR code) does not conflict with the mmap implementation in any way, and both can be merged, bringing the benefits of both sides.
- The current implementation of mmap may not work as expected (still slow subsequent runs), though there might be something up with my machine. Again, it'd be good to have other people verify these results.
When you say mmap, are you talking about #150? Because that was the first iteration of our mmap() work. The second iteration we did is in the https://github.com/ggerganov/llama.cpp/tree/mmap branch. That code doesn't have a load phase at all in subsequent runs.
It's also important to specify (1) what operating system you're using, and (2) what type of weights you're using. Even with the code in the master branch, the 13B model, with either q4 or f16 weights, your computer should not be taking 1+ minutes on subsequent runs. On my $1000 workstation, subsequent runs with the old method take ~3 seconds to load 13B F16. With the new method, it loads in 0 seconds. The only way I could see a computer taking >1m given your hardware, would be if it were running Windows, or if you had so many tabs open in your browser that there was practically no memory available for caches. For example, my workstation is headless and I use it via SSH, so that things like Chrome don't influence benchmarks.
@jart I was using the mmap branch. I believe I tried #150 earlier and it didn't work well for me.
(1) I am on a Windows 11 machine, but building and running the code inside a Docker container (kitware/cmake:ci-debian10-x86_64-2023-03-08). I just didn't want the headaches of the Windows toolchain (as I've never worked with CMake on Windows), so I opted for the simplest route.
(2) Q4 weights.
Anyway, I tried building for Windows, and to my surprise the fopen and master implementations matched loading speeds (4 seconds, blazing fast). Now it completely agrees with @apaz-cli's argument. It seems the slow loading times are due to I/O problems inside the container, and I am not sure why there is such a dramatic difference.
Sadly, I couldn't build the mmap branch for Windows.
A little bit of googling turned up these issues where people were struggling with slow I/O for mounted volumes (though on Mac):
- https://github.com/docker/for-mac/issues/77
- https://github.com/docker/for-mac/issues/1592
It also seems like the same issue might be present on Linux machines as well: https://forums.docker.com/t/docker-extremely-slow-on-linux-and-windows/129752
Everything seems to work well when building 'natively'. So the benefits of this PR are not clear in native settings, but it might be an option in others (again, it'd be good to have other people try out the code and see if it works better for them).
It's good that you're running Windows, because where we really need your help right now is in making sure our mmap branch is able to build with MSVC. Neither @apaz-cli nor I have access to MSVC at the moment. I wrote some polyfills and pushed them to the branch that I believe will work with WIN32. However, the change was written a priori and needs to be cleaned up and debugged by someone who has MSVC. Otherwise we won't be able to merge the improvements into the master branch.
We could still possibly find ways to improve the experience for people who are using virtual file systems like Docker. However, I would want to see a better theory that explains why the performance is bad and why it's actionable on our part. We can't help you find those answers. For example, I'm not convinced that using C stdio rather than C++ stdio is itself in any way different. An example of something that would possibly convince me would be, "if we use this stride obtained from the statfs() block size then we see optimal I/O behavior" or "if we have a second thread prefaulting pages".
Another thing that would convince me is if you showed me a perf record profiling report showing, "here we see my glibc version of libstdc++ has a bug where it spends all its time chewing up CPU rather than actually doing I/O, therefore we should just use C stdio." But until that happens, I can't merge a change like this.
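For what it's worth, the "second thread prefaulting pages" idea mentioned above could look roughly like this (a hedged sketch with assumed names, not anything from the PR or the mmap branch):

```cpp
#include <cstddef>
#include <thread>

// Illustrative only: touch one byte per page on a background thread so page
// faults overlap with the consumer's work. The 4096-byte page size is an
// assumption; real code should query it (e.g. sysconf(_SC_PAGESIZE)).
void prefault_async(const volatile char * base, size_t size) {
    std::thread([base, size] {
        const size_t page = 4096;
        for (size_t off = 0; off < size; off += page) {
            (void) base[off]; // volatile read forces the page to fault in
        }
    }).detach();
}
```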
@jart I'd be glad to be of any help with debugging mmap for WIN32. What's needed right now is to clone the mmap branch, (try to) compile it using MSVC, and see if there are any bugs in there, is that right? And where do I report my findings? #91?
I will also try to do some investigation with regard to loading performance inside Docker.
@oKatanaaa The branch is already in the repo. Just git pull origin and git checkout mmap.
You can report in #91. You can also send a pull request that targets the mmap branch.
I have created my own library that implements mmap using mingw32, which makes this project maintainable for Windows. It is possible to compile the program using the library from https://github.com/CoderRC/libmingw32_extended, making changes like in https://github.com/ggerganov/llama.cpp/pull/564 and using the specific make command below:
make LDFLAGS='-D_POSIX_MAPPED_FILES -lmingw32_extended'
There is no need to use a separate library to replace C++ fstreams with C file streams, as C file streams work perfectly on Windows too. You are conflating two things in the comments now: using mmap(), which is non-portable and requires an external library or polyfill, while the PR is about using fopen() and fread(), both of which are portable and do not require any additions.
@oKatanaaa do you still want to move forward on this PR? I'm still not convinced that c++ stl vs. c stdio is going to make a measurable difference, unless our goal was to convert the c++ code to c. Our mmap() work on the other hand, makes weights load 10x faster for me on Windows now. And 100x faster on Linux.
@anzz1 I looked at @CoderRC's project and it looks pretty good, and it's unfair to say mmap() isn't portable, because it's standardized by the IEEE. You'll also be happy to hear we now have our own working WIN32 mmap() polyfill in our mmap branch: https://github.com/ggerganov/llama.cpp/blob/mmap/mmap.c @oKatanaaa and I spent the past few weeks building that. As I mentioned in my earlier comment, it's going to help make loading larger models with multi-dimensional tensors (e.g. 13B) about 10x to 100x faster, in terms of wall time. It doesn't require any libraries or dependencies. You're encouraged to take a look at the code and share your feedback!
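For readers who haven't followed the branch, a minimal sketch of what a Win32 file-mapping polyfill along these lines can look like (illustrative only; the real implementation is in mmap.c on the mmap branch and differs from this):

```cpp
#include <cstddef>
#include <windows.h>

// Illustrative sketch: map a whole file read-only with the Win32
// file-mapping API, the rough moral equivalent of POSIX mmap().
void * win32_map_file(const char * fname, size_t * out_size) {
    HANDLE file = CreateFileA(fname, GENERIC_READ, FILE_SHARE_READ, nullptr,
                              OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (file == INVALID_HANDLE_VALUE) return nullptr;

    LARGE_INTEGER size;
    if (!GetFileSizeEx(file, &size)) { CloseHandle(file); return nullptr; }

    HANDLE mapping = CreateFileMappingA(file, nullptr, PAGE_READONLY, 0, 0, nullptr);
    CloseHandle(file); // the mapping holds its own reference to the file
    if (!mapping) return nullptr;

    void * addr = MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0);
    CloseHandle(mapping); // the view keeps the mapping alive
    if (!addr) return nullptr;

    *out_size = (size_t) size.QuadPart;
    return addr;
}
```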
> @oKatanaaa do you still want to move forward on this PR? I'm still not convinced that c++ stl vs. c stdio is going to make a measurable difference, unless our goal was to convert the c++ code to c. Our mmap() work on the other hand, makes weights load 10x faster for me on Windows now. And 100x faster on Linux.
I still want to tinker with the code a bit and find the reasons for the slow loading inside Docker on Windows, but I am focusing on other things at the moment, so we can close the PR. I'll try to do some research on this problem once I have some time. If I find anything, I'll make a new PR with new fixes.