llama.cpp
Making weights loading faster
Tried to address slow weights loading. 7B is okay, but 13B is really slow (several minutes), making it hard to experiment/prototype with larger models.
Replaced std::ifstream with C-style file reading using fopen. Got a considerable boost in loading performance: 3x to 10x faster on my machine (measurements were somewhat inconsistent, but it is definitely a lot faster than before).
I made sure the weights are loaded correctly: I fixed the seed and gave the same prompts - the model gives the same output, so everything is good.
Also increased the buffer size from 1024*1024 to 128*1024*1024 (see line 102), which gave a slight boost as well. Though I am not sure whether it is optimal for edge devices like Raspberry Pi (if that's of any concern).
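For illustration, here is a minimal sketch of this approach - C stdio with a large user-supplied buffer via setvbuf - assuming a standalone helper (the function name and shape are made up, not the actual PR code):

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Illustrative sketch: read a weights file with C stdio and a large
// user-supplied buffer, so big sequential reads hit fewer syscalls.
bool load_weights(const char * fname, void * dst, size_t nbytes) {
    FILE * f = std::fopen(fname, "rb");
    if (!f) return false;

    // 128 MiB buffer, as in the PR; this may be too large for devices
    // like a Raspberry Pi, so treat the size as tunable.
    std::vector<char> buf(128u * 1024 * 1024);
    std::setvbuf(f, buf.data(), _IOFBF, buf.size());

    const size_t nread = std::fread(dst, 1, nbytes, f);
    std::fclose(f);
    return nread == nbytes;
}
```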
Is this faster than using mmap (see #91 / #150)?
> Is this faster than using mmap (see #91 / #150)?
Just tested out the #150 implementation; it was a lot slower than simple fopen. I believe #150 should've been faster on subsequent runs, but I didn't see any perceptible speedup, maybe just a little.
> Is this faster than using mmap (see #91 / #150)?
By implementing #91 correctly, we can avoid creating any redundant copies of the model. This is crucial because if we duplicate it and there's only enough space in memory for one instance, we risk evicting it from the operating system's file system cache.
@oKatanaaa switching between std::ifstream and FILE* should make no measurable difference. They are both tunable, do conceptually the exact same thing, and support (almost) exactly the same set of operations. The main bottleneck is the time spent inside system calls reading from disk, and they make those calls exactly or almost exactly the same way. The first implementations literally called fopen() directly. I forget if they still do.
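To illustrate the point, here is a minimal side-by-side sketch of the two paths being compared (the helper names are made up for illustration); both ultimately issue the same read() system calls underneath:

```cpp
#include <cstddef>
#include <cstdio>
#include <fstream>

// Both helpers read nbytes into dst; under the hood each ends up calling
// read() on the file descriptor, so raw throughput should be comparable.
bool read_cpp(const char * fname, char * dst, size_t nbytes) {
    std::ifstream f(fname, std::ios::binary);
    return bool(f.read(dst, (std::streamsize) nbytes));
}

bool read_c(const char * fname, char * dst, size_t nbytes) {
    FILE * f = std::fopen(fname, "rb");
    if (!f) return false;
    const size_t n = std::fread(dst, 1, nbytes, f);
    std::fclose(f);
    return n == nbytes;
}
```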
Your testing methodology is probably at fault. Files that are opened once are likely to be opened again soon after. OSes like to cache the contents of files (and shared memory pages too) so that subsequent reads are faster. On Linux you can flush these file and page caches with:
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches >/dev/null
You may have noticed that with the vanilla build on master, loading the model sometimes skips ahead by quite a few dots in the progress bar, usually at the beginning, but sometimes a few times in the middle. This is why.
That said, I recommend that you don't do this. Justine's version of #150 is much better than mine. To use it, you can just clone the repo and switch to the mmap branch. If you load the model once, subsequent loads happen instantaneously, as long as the kernel hasn't had any reason to evict the model to free up space.
As discussed in #91, it's not a perfect solution, but it's pretty great. Properly fixing this problem will require making changes to the ggml file format, and possibly the library. I recommend using Justine's branch in the meantime.
@apaz-cli Thanks for the input. I initially thought the same, because the performance comparisons for fstream vs fopen that I found were not really consistent (and very speculative) and depended on the particular case. Though intuition was telling me fopen should be faster for binaries (as I recall from my experience). So I decided to see if anything changes and implemented model loading with fopen. To my surprise, the performance gains were substantial, which is why I decided to make a pull request.
Speaking of the mmap implementation, as far as I understand (though I don't understand a thing about mmap, really) it simply creates a large binary where model weights are stored contiguously (I guess?) and uses it to make subsequent loads faster. The model loading code is not changed at all, only the fancy memory allocation stuff, meaning that my change and Justine's code are orthogonal and can benefit each other.
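For reference, the core of an mmap-based load looks roughly like this (a POSIX sketch with made-up names, not the actual code from the mmap branch): map the whole weights file and let the kernel page it in on demand, so subsequent loads are near-instant while the pages stay in the page cache.

```cpp
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Illustrative sketch: map a whole file read-only. The kernel faults pages
// in lazily on first access and keeps them cached across process runs.
void * map_weights(const char * fname, size_t * out_size) {
    int fd = open(fname, O_RDONLY);
    if (fd < 0) return nullptr;

    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return nullptr; }

    void * addr = mmap(nullptr, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd); // the mapping stays valid after closing the descriptor

    if (addr == MAP_FAILED) return nullptr;
    *out_size = (size_t) st.st_size;
    return addr;
}
```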
Getting back to the performance comparison of fstream vs fopen (and also mmap), I did some testing and here are the results:
Implementation | First load | Subsequent load | Read speed (MB/s) on subsequent run
---|---|---|---
master | 1m 49s | 1m 50s | 50
mmap | 8m 50s | 1m 10s | 20-30
fopen | 13s | 13s | 200-300
mmap + fopen | 6m 20s | 1m 8s | 20-30
My machine:
- CPU: Ryzen 7 4800H
- RAM: 32GB, 3200MHz (not sure if it's dual channel)
- SSD: ADATA SX6000PNP 1TB
Testing was done using the 13B model. I measured time (both for first and subsequent loads) until the model processed the first token. I also monitored my disk's current read/write speed. I made sure to run `sync && echo 3 | sudo tee /proc/sys/vm/drop_caches >/dev/null` before each 'first load' and did several subsequent loads, picking an average result. Deviations from run to run weren't too large, so the numbers in the table should give a general picture of things.
I also implemented mmap + fopen by combining my code with the mmap implementation, and it did give a slight boost. Though the mmap implementation does a lot of writes to memory (the page file, I guess?) that take up a huge portion of the total loading time, so on the first run we'll still have to wait.
An interesting thing is that subsequent runs with mmap are still a lot slower than fopen. I am not sure what's going on, as I expected an instantaneous load (but intuition tells me that the custom malloc and realloc read the required tensors from disk on request as the model is being evaluated during the first run, please correct me if I'm wrong).
As a summary:
- fopen is indeed faster than fstream in this case, at least on my machine. It would be really good if other people could confirm performance gains on their machines.
- fopen (current PR code) does not conflict with the mmap implementation in any way, and both can be merged, bringing the benefits of both sides.
- The current implementation of mmap may not work as expected (still slow subsequent runs), though there might be something up with my machine. Again, it'd be good to have other people verify these results.
When you say mmap, are you talking about #150? Because that was the first iteration of our mmap() work. The second iteration we did is in the https://github.com/ggerganov/llama.cpp/tree/mmap branch. That code doesn't have a load phase at all in subsequent runs.
It's also important to specify (1) what operating system you're using, and (2) what type of weights you're using. Even with the code in the master branch, the 13B model, with either q4 or f16 weights, your computer should not be taking 1+ minutes on subsequent runs. On my $1000 workstation, subsequent runs with the old method take ~3 seconds to load 13B F16. With the new method, it loads in 0 seconds. The only way I could see a computer taking >1m given your hardware, would be if it were running Windows, or if you had so many tabs open in your browser that there was practically no memory available for caches. For example, my workstation is headless and I use it via SSH, so that things like Chrome don't influence benchmarks.
@jart I was using the mmap branch. I believe I tried #150 earlier and it didn't work well for me.
(1) I am on a Windows 11 machine, but building and running the code inside a Docker container (kitware/cmake:ci-debian10-x86_64-2023-03-08). I just didn't want the headaches of the Windows toolchain (as I've never worked with CMake on Windows), so I opted for the simplest route.
(2) Q4 weights.
Anyway, I tried building for Windows, and to my surprise the fopen and master implementations matched loading speeds (4 seconds, blazing fast). Now it completely agrees with @apaz-cli's argument. It seems the slow loading times are due to I/O problems inside the container, and I am not sure why there is such a dramatic difference.
Sadly, I couldn't build the mmap branch for Windows.
A little bit of googling turned up these issues where people were struggling with slow I/O for mounted volumes (though on Mac):
- https://github.com/docker/for-mac/issues/77
- https://github.com/docker/for-mac/issues/1592
It also seems like the same issue might be present on Linux machines as well: https://forums.docker.com/t/docker-extremely-slow-on-linux-and-windows/129752
Everything seems to work well when building 'natively'. So the benefits of this PR are not clear in native settings, but it might be an option in others (again, it'd be good to have other people try out the code and see if it works better for them).
It's good that you're running Windows, because where we really need your help right now is in making sure our mmap branch is able to build with MSVC. Neither @apaz-cli nor I have access to MSVC at the moment. I wrote some polyfills and pushed them to the branch that I believe will work with WIN32. However, the change was written a priori and needs to be cleaned up and debugged by someone who has MSVC. Otherwise we won't be able to merge the improvements into the master branch.
We could still possibly find ways to improve the experience for people who are using virtual file systems like Docker. However, I would want to see a better theory that explains why the performance is bad and why it's actionable on our part. We can't help you find those answers. For example, I'm not convinced that using C stdio rather than C++ stdio is itself in any way different. An example of something that would possibly convince me would be, "if we use this stride obtained from the statfs() block size then we see optimal I/O behavior" or "if we have a second thread prefaulting pages".
Another thing that would convince me is if you showed me a perf record profiling report showing, "here we see my glibc version of libstdc++ has a bug where it spends all its time chewing up CPU rather than actually doing I/O, therefore we should just use C stdio." But until that happens, I can't merge a change like this.
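For what it's worth, the "second thread prefaulting pages" idea mentioned above could look roughly like this (a hedged sketch with assumed names, not anything from the PR or the mmap branch):

```cpp
#include <cstddef>
#include <thread>

// Illustrative only: touch one byte per page on a background thread so page
// faults overlap with the consumer's work. The 4096-byte page size is an
// assumption; real code should query it (e.g. sysconf(_SC_PAGESIZE)).
void prefault_async(const volatile char * base, size_t size) {
    std::thread([base, size] {
        const size_t page = 4096;
        for (size_t off = 0; off < size; off += page) {
            (void) base[off]; // volatile read forces the page to fault in
        }
    }).detach();
}
```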
@jart I'd be glad to be of any help with debugging mmap for WIN32. What's needed right now is to clone the mmap branch, (try to) compile it using MSVC, and see if there are any bugs in there, is that right? And where do I report my findings? #91?
I will also try to do some investigation with regard to loading performance inside Docker.
@oKatanaaa The branch is already in the repo. Just git pull origin and git checkout mmap.
You can report in #91. You can also send a pull request that targets the mmap branch.
I have created my own library that implements mmap using mingw32, which makes this project maintainable for Windows. It is possible to compile the program using the library from https://github.com/CoderRC/libmingw32_extended, making changes like in https://github.com/ggerganov/llama.cpp/pull/564 and using the specific make command below:
make LDFLAGS='-D_POSIX_MAPPED_FILES -lmingw32_extended'
There is no need to use a separate library to replace C++ fstreams with C file streams, as C file streams work perfectly on Windows too. You are conflating two things in the comments now: using mmap(), which is non-portable and requires an external library or polyfill, while the PR is about using fopen() and fread(), both of which are portable and do not require any additions.
@oKatanaaa do you still want to move forward on this PR? I'm still not convinced that c++ stl vs. c stdio is going to make a measurable difference, unless our goal was to convert the c++ code to c. Our mmap() work on the other hand, makes weights load 10x faster for me on Windows now. And 100x faster on Linux.
@anzz1 I looked at @CoderRC's project and it looks pretty good, and it's unfair to say mmap() isn't portable, because it's standardized by the IEEE. You'll also be happy to hear we now have our own working WIN32 mmap() polyfill in our mmap branch: https://github.com/ggerganov/llama.cpp/blob/mmap/mmap.c @oKatanaaa and I spent the past few weeks building that. As I mentioned in my earlier comment, it's going to help make loading larger models with multi-dimensional tensors (e.g. 13B) about 10x to 100x faster, in terms of wall time. It doesn't require any libraries or dependencies. You're encouraged to take a look at the code and share your feedback!
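For readers who haven't followed the branch, a minimal sketch of what a Win32 file-mapping polyfill along these lines can look like (illustrative only; the real implementation is in mmap.c on the mmap branch and differs from this):

```cpp
#include <cstddef>
#include <windows.h>

// Illustrative sketch: map a whole file read-only with the Win32
// file-mapping API, the rough moral equivalent of POSIX mmap().
void * win32_map_file(const char * fname, size_t * out_size) {
    HANDLE file = CreateFileA(fname, GENERIC_READ, FILE_SHARE_READ, nullptr,
                              OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (file == INVALID_HANDLE_VALUE) return nullptr;

    LARGE_INTEGER size;
    if (!GetFileSizeEx(file, &size)) { CloseHandle(file); return nullptr; }

    HANDLE mapping = CreateFileMappingA(file, nullptr, PAGE_READONLY, 0, 0, nullptr);
    CloseHandle(file); // the mapping holds its own reference to the file
    if (!mapping) return nullptr;

    void * addr = MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0);
    CloseHandle(mapping); // the view keeps the mapping alive
    if (!addr) return nullptr;

    *out_size = (size_t) size.QuadPart;
    return addr;
}
```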
> @oKatanaaa do you still want to move forward on this PR? I'm still not convinced that c++ stl vs. c stdio is going to make a measurable difference, unless our goal was to convert the c++ code to c. Our mmap() work on the other hand, makes weights load 10x faster for me on Windows now. And 100x faster on Linux.
I still want to tinker with the code a bit and find the reasons for the slow loading inside Docker on Windows, but I am focusing on other things at the moment, so we can close the PR. I'll try to do some research on this problem once I have some time. If I find anything, I'll make a new PR with new fixes.