llama.cpp
Faster loading of the model
I was playing with the 65B model, and it took a minute to read the files. If you wrap the model loader loop with `#pragma omp parallel for` and add `-fopenmp` to the compiler flags, you can drop it to 18 seconds.
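For illustration, a minimal sketch of that idea, assuming a hypothetical per-tensor file layout (`tensor_entry`, `load_tensors`, and the read logic are placeholders, not the actual llama.cpp loader):

```cpp
// Hypothetical sketch: parallelize independent per-tensor reads with OpenMP.
// Build with: g++ -fopenmp ... (error handling omitted for brevity)
#include <cstdio>
#include <string>
#include <vector>

// Made-up layout; the real llama.cpp loader is organized differently.
struct tensor_entry {
    std::string name;
    long        offset; // byte offset of the tensor data within the file
    long        size;   // size of the tensor data in bytes
    char       *dst;    // preallocated destination buffer
};

void load_tensors(const char *path, std::vector<tensor_entry> &entries) {
    // Each iteration opens its own handle so reads don't race on a shared file position.
    #pragma omp parallel for
    for (long i = 0; i < (long) entries.size(); i++) {
        FILE *f = fopen(path, "rb");
        fseek(f, entries[i].offset, SEEK_SET);
        fread(entries[i].dst, 1, entries[i].size, f);
        fclose(f);
    }
}
```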
Great idea. We prefer to not use `-fopenmp`. The implementation should use `#include <thread>`.
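A rough sketch of what a `std::thread`-based version of the same hypothetical loop might look like (reusing the made-up `tensor_entry` layout from the OpenMP sketch above):

```cpp
// Hypothetical sketch: split the per-tensor reads across std::thread workers.
#include <algorithm>
#include <cstdio>
#include <string>
#include <thread>
#include <vector>

// Same made-up layout as in the OpenMP sketch above.
struct tensor_entry { std::string name; long offset; long size; char *dst; };

void load_tensors_threaded(const char *path, std::vector<tensor_entry> &entries) {
    const unsigned n_threads = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> workers;

    for (unsigned t = 0; t < n_threads; t++) {
        workers.emplace_back([&, t] {
            // Each worker uses its own file handle and reads a strided subset.
            FILE *f = fopen(path, "rb");
            for (size_t i = t; i < entries.size(); i += n_threads) {
                fseek(f, entries[i].offset, SEEK_SET);
                fread(entries[i].dst, 1, entries[i].size, f);
            }
            fclose(f);
        });
    }
    for (auto &w : workers) {
        w.join();
    }
}
```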
What about TBB? https://github.com/oneapi-src/oneTBB (license: Apache-2.0). I remember that the mold linker project also uses it.
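Purely to illustrate the suggestion, the same hypothetical loop with `tbb::parallel_for` might look roughly like this (as the reply below notes, the project did not go this route):

```cpp
// Hypothetical sketch: the same per-tensor loop using tbb::parallel_for.
#include <cstdio>
#include <string>
#include <vector>
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>

// Same made-up layout as in the sketches above.
struct tensor_entry { std::string name; long offset; long size; char *dst; };

void load_tensors_tbb(const char *path, std::vector<tensor_entry> &entries) {
    tbb::parallel_for(tbb::blocked_range<size_t>(0, entries.size()),
        [&](const tbb::blocked_range<size_t> &r) {
            FILE *f = fopen(path, "rb"); // one handle per chunk of work
            for (size_t i = r.begin(); i != r.end(); i++) {
                fseek(f, entries[i].offset, SEEK_SET);
                fread(entries[i].dst, 1, entries[i].size, f);
            }
            fclose(f);
        });
}
```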
Not familiar with TBB, but most likely the answer is no.
I have some experiments with optimizing large-file read I/O in https://gist.github.com/kig/357a4193be54915d142f1db6063bc929 and https://github.com/kig/fast_read_optimizer, if you want to overkill it...
Has this been implemented yet?