llama.cpp
Multi-threaded ggml_cpy
Reduces overall LoRA loading times significantly when using a different base model with --lora-base, from 32s to 24s in my test case.
It also seems to roughly double the general performance of ggml_cpy, but since CPY is an insignificant fraction of the overall eval time, this isn't really noticeable in practice.
I tried to cover all the paths in ggml_cpy, but there are a lot of them and only a few are hit in llama.cpp, so I have not tested every single one.
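For context, here is a minimal sketch of the row partitioning I mean, for the simplest case of a contiguous f32 copy. The function name and signature are mine for illustration only; the real ggml_compute_forward_dup handles many more cases (non-contiguous strides, f16 and quantized destinations, etc.), but each thread picks a contiguous slice of the destination rows from its thread index ith and the thread count nth in the same way.

```c
#include <stdint.h>
#include <string.h>

// Hypothetical illustration (not the actual ggml code): split the rows of the
// destination across nth threads, with thread ith handling rows [ir0, ir1).
static void cpy_rows_f32(const float * src, float * dst,
                         int64_t nrows, int64_t ne0,   // row count and elements per row
                         int ith, int nth) {           // this thread's index and thread count
    const int64_t dr  = (nrows + nth - 1) / nth;       // rows per thread (rounded up)
    const int64_t ir0 = dr * ith;                      // first row for this thread
    const int64_t ir1 = ir0 + dr < nrows ? ir0 + dr : nrows;

    for (int64_t ir = ir0; ir < ir1; ++ir) {
        memcpy(dst + ir * ne0, src + ir * ne0, ne0 * sizeof(float));
    }
}
```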
Perplexity (bs=512):
    MASTER: perf_total_per_op_us[ CPY] = 309.170 ms
    PR:     perf_total_per_op_us[ CPY] = 132.353 ms

LoRA (quantize):
    MASTER: perf_total_per_op_us[ CPY] = 45.780 ms
    PR:     perf_total_per_op_us[ CPY] = 5.255 ms
I'm barely seeing an improvement (AVX2, 4 cores). This is about the run time of llama_apply_lora_from_file_internal, right? Can you show exactly what command line you used?
This path is only used when LoRA is applied with a different base model specified with --lora-base; otherwise the quantization is done in a ggml_add instead. You can use a command line similar to this one:
./main -m models/7B/ggml-model-q4_0.bin --lora lora/baize-lora-7B/ggml-adapter-model.bin --lora-base models/7B/ggml-model-f16.bin
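To make the distinction concrete, here is a rough sketch of the two paths as I understand them. The function and variable names are mine, not the exact code in llama_apply_lora_from_file_internal, but the ggml ops are the ones involved:

```c
#include "ggml.h"

// Hedged sketch: apply a precomputed LoRA delta BA to one weight tensor.
// dest_t is the tensor in the loaded (possibly quantized) model; base_t is
// either the same tensor, or the f16 tensor loaded from --lora-base.
static struct ggml_tensor * apply_lora_delta(
        struct ggml_context * ctx,
        struct ggml_tensor  * dest_t,
        struct ggml_tensor  * base_t,
        struct ggml_tensor  * BA) {
    if (base_t == dest_t) {
        // no separate base model: add the delta directly into the model tensor,
        // so any quantization work happens inside ggml_add
        return ggml_add_inplace(ctx, dest_t, BA);
    }
    // --lora-base: add the delta to the f16 base tensor, then copy the result
    // into the quantized model tensor; the quantization happens inside ggml_cpy,
    // which is the op this PR multi-threads
    struct ggml_tensor * r = ggml_add(ctx, base_t, BA);
    return ggml_cpy(ctx, r, dest_t);
}
```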
Thanks @slaren. I'm seeing 17s on master and 16s with your PR.
Since the SIMD optimizations were up for discussion: with quantize_row_q_reference in ggml_compute_forward_dup_f16, the difference is larger. Master 33s, this PR 20s.
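For anyone following along, this is roughly the per-row work that the swap above affects, as I understand the f16 -> quantized copy path (simplified; the real ggml_compute_forward_dup_f16 covers many more layouts). The helper name and the scratch-buffer argument are mine for illustration:

```c
#include <stdint.h>
#include "ggml.h"

// Hedged sketch: convert one f16 source row to f32 in a scratch buffer, then
// quantize it into the destination row. quantize_row_q is the per-type row
// quantizer; substituting the scalar quantize_row_q_reference here is the
// swap mentioned above.
static void dup_row_f16_to_q(const ggml_fp16_t * src_row, void * dst_row,
                             float * f32_scratch, int64_t ne0,
                             void (*quantize_row_q)(const float * x, void * y, int k)) {
    for (int64_t i = 0; i < ne0; ++i) {
        f32_scratch[i] = ggml_fp16_to_fp32(src_row[i]);
    }
    quantize_row_q(f32_scratch, dst_row, (int) ne0);
}
```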