
Multi-threaded ggml_cpy

Open · slaren opened this pull request on Apr 18, 2023 · 3 comments

Reduces overall LoRA loading times significantly when using a different base model with --lora-base, from 32s to 24s in my test case.

It also seems to improve the general performance of ggml_cpy significantly (about twice as fast), but CPY is such a small fraction of the total eval time that the improvement isn't really noticeable overall.

I tried to cover all the paths in ggml_cpy, but there are a lot of them and only a few are hit in llama.cpp, so I have not tested every single one.

Perplexity (bs=512):

MASTER: perf_total_per_op_us[             CPY] =  309.170 ms
PR:     perf_total_per_op_us[             CPY] =  132.353 ms

LoRA (quantize):

MASTER: perf_total_per_op_us[             CPY] =  45.780 ms
PR:     perf_total_per_op_us[             CPY] =   5.255 ms
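For context on what multi-threading ggml_cpy means mechanically: ggml runs the same compute function on every worker thread, passing each one its thread index (ith) and the thread count (nth) in ggml_compute_params, and the function partitions the rows among the threads itself. Below is a minimal sketch of that pattern applied to the f16-to-quantized dup path. ith/nth, ggml_nrows, ggml_fp16_to_fp32, and quantize_row_q_t are real ggml names from that period, but the function itself is illustrative, not the actual PR diff, and is written as if inside ggml.c (where ggml_compute_params is defined):

```c
// Sketch of ggml's row-splitting threading pattern, applied to the
// f16 -> quantized path of ggml_compute_forward_dup. Every thread runs
// this function and processes only the rows in [ir0, ir1).
static void dup_f16_to_q_sketch(
        const struct ggml_compute_params * params,
        const struct ggml_tensor * src,
        struct ggml_tensor * dst,
        quantize_row_q_t quantize_row_q) {
    const int ith = params->ith;            // index of this thread
    const int nth = params->nth;            // total number of threads

    const int ne00 = (int) src->ne[0];      // elements per row
    const int nr   = (int) ggml_nrows(src); // total number of rows

    const int dr  = (nr + nth - 1) / nth;          // rows per thread (ceil)
    const int ir0 = dr * ith;                      // first row for this thread
    const int ir1 = ir0 + dr < nr ? ir0 + dr : nr; // one past its last row

    float tmp[ne00]; // f32 staging buffer for one row (C99 VLA)

    for (int ir = ir0; ir < ir1; ++ir) {
        const ggml_fp16_t * src_row =
            (const ggml_fp16_t *)((const char *) src->data + ir*src->nb[1]);
        void * dst_row = (char *) dst->data + ir*dst->nb[1];

        // widen f16 -> f32, then quantize the whole row into the dst type
        for (int i = 0; i < ne00; ++i) {
            tmp[i] = ggml_fp16_to_fp32(src_row[i]);
        }
        quantize_row_q(tmp, dst_row, ne00);
    }
}
```

Splitting the work on whole rows is convenient here because each quantized block lies entirely within one row, so threads never write to the same destination bytes and no synchronization is needed beyond ggml's existing per-op barrier.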

slaren · Apr 18 '23 01:04

I'm barely seeing an improvement (AVX2, 4 cores). This is about the run time of llama_apply_lora_from_file_internal, right? Can you show exactly which command line you used?

sw · Apr 18 '23 15:04

This path is only used when the LoRA is applied on top of a different base model specified with --lora-base; otherwise the quantization is done inside ggml_add instead. You can use a command line similar to this one:

./main -m models/7B/ggml-model-q4_0.bin --lora lora/baize-lora-7B/ggml-adapter-model.bin --lora-base models/7B/ggml-model-f16.bin
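To make it concrete why this path goes through ggml_cpy at all: with --lora-base, the low-rank delta is added to the f16 base tensor, and the full-precision result then has to be quantized back into the model's q4_0 tensor, and in ggml that conversion is expressed as a ggml_cpy. Here is a rough sketch of the graph that llama_apply_lora_from_file_internal built at the time, paraphrased from memory, so treat the details as approximate (loraA/loraB are the adapter tensors, base_t the --lora-base tensor, dest_t the model tensor being patched):

```c
// delta = scaling * (loraB @ loraA), the low-rank LoRA update
struct ggml_tensor * BA = ggml_mul_mat(lora_ctx, loraA, loraB);
if (scaling != 1.0f) {
    BA = ggml_scale(lora_ctx, BA, ggml_new_f32(lora_ctx, scaling));
}

struct ggml_tensor * r;
if (base_t == dest_t) {
    // no --lora-base: accumulate directly into the model tensor;
    // a quantized destination is handled inside ggml_add itself
    r = ggml_add_inplace(lora_ctx, dest_t, BA);
} else {
    // --lora-base: add the delta to the f16 base, then ggml_cpy
    // quantizes the result back into the q4_0 destination tensor.
    // This copy is the hot spot that the PR multi-threads.
    r = ggml_add(lora_ctx, base_t, BA);
    r = ggml_cpy(lora_ctx, r, dest_t);
}

// 2023-era graph API: build and run the forward graph
struct ggml_cgraph gf = ggml_build_forward(r);
gf.n_threads = n_threads;
ggml_graph_compute(lora_ctx, &gf);
```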

slaren · Apr 18 '23 15:04

Thanks @slaren. I'm seeing 17s on master and 16s with your PR.

Since the SIMD optimizations were up for discussion: with the scalar quantize_row_q_reference substituted in ggml_compute_forward_dup_f16, the difference is larger: 33s on master, 20s with this PR.
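For reference, a sketch of the swap being described, assuming the quantizer table that ggml exposed at the time. quantize_fns_t and ggml_internal_get_quantize_fn are real internals from that era (they may have changed since); the wrapper function is hypothetical, written only to show the two entry points side by side:

```c
#include <stdbool.h>
#include "ggml.h"

// Hypothetical helper contrasting the two per-row quantizers: the default
// SIMD-optimized quantize_row_q versus the scalar quantize_row_q_reference
// that was substituted for the second measurement above.
static void quantize_one_row(enum ggml_type type,
                             const float * src, void * dst, int k,
                             bool use_reference) {
    quantize_fns_t qfns = ggml_internal_get_quantize_fn((size_t) type);
    if (use_reference) {
        qfns.quantize_row_q_reference(src, dst, k); // scalar reference path
    } else {
        qfns.quantize_row_q(src, dst, k);           // SIMD path (default)
    }
}
```

With the slower scalar quantizer, each row costs more, so spreading the rows across threads buys proportionally more, which would explain the larger 33s-to-20s gap.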

sw · Apr 18 '23 15:04