llama.cpp
Faster loading of the model
I was playing with the 65B model, and it took a minute to read the files. If you wrap the model loader loop with `#pragma omp parallel for` and add `-fopenmp` to the compiler flags, you can drop it to 18 seconds.
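For illustration, a minimal sketch of that idea, assuming a hypothetical per-tensor file layout (`tensor_entry`, `load_tensors`, and the read logic are placeholders, not the actual llama.cpp loader):

```cpp
// Hypothetical sketch: parallelize independent per-tensor reads with OpenMP.
// Build with: g++ -fopenmp ... (error handling omitted for brevity)
#include <cstdio>
#include <string>
#include <vector>

// Made-up layout; the real llama.cpp loader is organized differently.
struct tensor_entry {
    std::string name;
    long        offset; // byte offset of the tensor data within the file
    long        size;   // size of the tensor data in bytes
    char       *dst;    // preallocated destination buffer
};

void load_tensors(const char *path, std::vector<tensor_entry> &entries) {
    // Each iteration opens its own handle so reads don't race on a shared file position.
    #pragma omp parallel for
    for (long i = 0; i < (long) entries.size(); i++) {
        FILE *f = fopen(path, "rb");
        fseek(f, entries[i].offset, SEEK_SET);
        fread(entries[i].dst, 1, entries[i].size, f);
        fclose(f);
    }
}
```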
Great idea. We prefer to not use `-fopenmp`. The implementation should use `#include <thread>`.
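A rough sketch of what a `std::thread`-based version of the same hypothetical loop might look like (reusing the made-up `tensor_entry` layout from the OpenMP sketch above):

```cpp
// Hypothetical sketch: split the per-tensor reads across std::thread workers.
#include <algorithm>
#include <cstdio>
#include <string>
#include <thread>
#include <vector>

// Same made-up layout as in the OpenMP sketch above.
struct tensor_entry { std::string name; long offset; long size; char *dst; };

void load_tensors_threaded(const char *path, std::vector<tensor_entry> &entries) {
    const unsigned n_threads = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> workers;

    for (unsigned t = 0; t < n_threads; t++) {
        workers.emplace_back([&, t] {
            // Each worker uses its own file handle and reads a strided subset.
            FILE *f = fopen(path, "rb");
            for (size_t i = t; i < entries.size(); i += n_threads) {
                fseek(f, entries[i].offset, SEEK_SET);
                fread(entries[i].dst, 1, entries[i].size, f);
            }
            fclose(f);
        });
    }
    for (auto &w : workers) {
        w.join();
    }
}
```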
What about TBB? https://github.com/oneapi-src/oneTBB (license: Apache-2.0). I remember that the mold linker project also uses it.
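Purely to illustrate the suggestion, the same hypothetical loop with `tbb::parallel_for` might look roughly like this (as the reply below notes, the project did not go this route):

```cpp
// Hypothetical sketch: the same per-tensor loop using tbb::parallel_for.
#include <cstdio>
#include <string>
#include <vector>
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>

// Same made-up layout as in the sketches above.
struct tensor_entry { std::string name; long offset; long size; char *dst; };

void load_tensors_tbb(const char *path, std::vector<tensor_entry> &entries) {
    tbb::parallel_for(tbb::blocked_range<size_t>(0, entries.size()),
        [&](const tbb::blocked_range<size_t> &r) {
            FILE *f = fopen(path, "rb"); // one handle per chunk of work
            for (size_t i = r.begin(); i != r.end(); i++) {
                fseek(f, entries[i].offset, SEEK_SET);
                fread(entries[i].dst, 1, entries[i].size, f);
            }
            fclose(f);
        });
}
```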
Not familiar with TBB, but most likely the answer is no.
I have some experiments with optimizing large-file read I/O in https://gist.github.com/kig/357a4193be54915d142f1db6063bc929 and https://github.com/kig/fast_read_optimizer, if you want to overkill it...
Has this been implemented yet?