llama.cpp
Multi-thread the Q8_0 quantization in ggml_compute_forward_mul_mat_q_f32()
This part takes about 10% of the total inference time for the 7B model, and it is currently single-threaded:
https://github.com/ggerganov/llama.cpp/blob/6a9661ea5ad72166b700ae5e87976e4452499dda/ggml.c#L7877-L7884
Try to multi-thread this by splitting the work across rows.
Since `GGML_TASK_INIT` currently runs on only 1 thread, either:
- update `ggml` to support multi-threaded `GGML_TASK_INIT`, or
- move the quantization into `GGML_TASK_COMPUTE` (might be difficult since there is no barrier mechanism)
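A minimal sketch of the row-splitting idea, assuming the usual ggml convention of an `(ith, nth)` thread pair and interleaved row assignment. `quantize_row_stub` is a hypothetical stand-in for `quantize_row_q8_0` (the real routine packs 32-float blocks into int8 values plus a scale); only the partitioning logic is the point here:

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical stand-in for quantize_row_q8_0(): the real routine converts
// blocks of 32 floats to int8 plus a per-block scale. Here we just do a
// trivial per-element conversion so the row partitioning can be exercised.
static void quantize_row_stub(const float *src, signed char *dst, int n) {
    for (int j = 0; j < n; j++) {
        dst[j] = (signed char)src[j]; // placeholder conversion
    }
}

// Interleaved row split: thread `ith` of `nth` quantizes rows
// ith, ith+nth, ith+2*nth, ... — the same pattern ggml uses to divide
// work in the GGML_TASK_COMPUTE phase. No synchronization is needed
// because each row is written by exactly one thread.
static void quantize_rows_mt(const float *src, signed char *dst,
                             int nrows, int row_elems, int ith, int nth) {
    for (int i = ith; i < nrows; i += nth) {
        quantize_row_stub(src + (size_t)i * row_elems,
                          dst + (size_t)i * row_elems, row_elems);
    }
}
```

If this runs in `GGML_TASK_COMPUTE`, the difficulty noted above remains: every thread must finish quantizing before any thread starts the matrix multiplication, which is exactly the barrier `ggml` currently lacks.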