
Multi-thread the Q8_0 quantization in ggml_compute_forward_mul_mat_q_f32()

Open · ggerganov opened this issue · 0 comments

This part takes about 10% of the total inference time for the 7B model and is currently single-threaded:

https://github.com/ggerganov/llama.cpp/blob/6a9661ea5ad72166b700ae5e87976e4452499dda/ggml.c#L7877-L7884

Try to multi-thread this by splitting the work across rows (see the sketch after this list). Since GGML_TASK_INIT currently runs on only a single thread, either:

  • update ggml to support multi-threaded GGML_TASK_INIT, or
  • move the quantization into GGML_TASK_COMPUTE (might be difficult since there is no barrier mechanism).
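For illustration, here is a minimal, self-contained sketch of the row-splitting idea, outside of ggml. It quantizes a float matrix to a simplified Q8_0-style format (one scale per row for brevity, whereas real Q8_0 keeps one scale per 32-value block) and divides the rows evenly across pthreads using the same `ith`/`nth` partitioning convention ggml uses in its compute params. All names here (`quantize_row`, `worker`, `N_ROWS`, ...) are illustrative, not ggml's actual API:

```c
// Sketch: split row-wise Q8_0-style quantization across threads.
// NOTE: simplified format (per-row scale); real Q8_0 scales per 32-value block.
#include <math.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define N_ROWS    1024
#define N_COLS    4096
#define N_THREADS 8

typedef struct {
    float  d;          // per-row scale
    int8_t q[N_COLS];  // quantized values
} row_q8;

static float  src[N_ROWS][N_COLS];
static row_q8 dst[N_ROWS];

// quantize one row: d = max|x|/127, q[i] = round(x[i]/d)
static void quantize_row(const float * x, row_q8 * y, int n) {
    float amax = 0.0f;
    for (int i = 0; i < n; ++i) {
        const float ax = fabsf(x[i]);
        if (ax > amax) amax = ax;
    }
    const float d  = amax/127.0f;
    const float id = d != 0.0f ? 1.0f/d : 0.0f;
    y->d = d;
    for (int i = 0; i < n; ++i) {
        y->q[i] = (int8_t) roundf(x[i]*id);
    }
}

typedef struct { int ith, nth; } task;

// thread ith of nth quantizes rows [ir0, ir1) -- the same partitioning
// scheme ggml applies elsewhere to split mul_mat work across threads
static void * worker(void * arg) {
    const task * t   = arg;
    const int    dr  = (N_ROWS + t->nth - 1)/t->nth; // rows per thread, rounded up
    const int    ir0 = dr*t->ith;
    int          ir1 = ir0 + dr;
    if (ir1 > N_ROWS) ir1 = N_ROWS;
    for (int ir = ir0; ir < ir1; ++ir) {
        quantize_row(src[ir], &dst[ir], N_COLS);
    }
    return NULL;
}

int main(void) {
    // fill the source matrix with something non-trivial
    for (int r = 0; r < N_ROWS; ++r)
        for (int c = 0; c < N_COLS; ++c)
            src[r][c] = sinf(0.01f*(r + c));

    pthread_t th[N_THREADS];
    task      tk[N_THREADS];
    for (int i = 0; i < N_THREADS; ++i) {
        tk[i] = (task){ .ith = i, .nth = N_THREADS };
        pthread_create(&th[i], NULL, worker, &tk[i]);
    }
    for (int i = 0; i < N_THREADS; ++i) {
        pthread_join(th[i], NULL);
    }

    printf("row 0 scale: %f, q[0] = %d\n", dst[0].d, dst[0].q[0]);
    return 0;
}
```

The per-row work is fully independent, so no synchronization is needed inside the loop; the only barrier is the join (or, inside ggml, the implicit barrier between GGML_TASK_INIT and GGML_TASK_COMPUTE), which is why option 1 above is the more natural fit.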

ggerganov · Apr 20 '23 15:04