Quantized Mistral: Prompt processing slower than llama.cpp
Since generation speed almost matches llama.cpp after https://github.com/EricLBuehler/mistral.rs/pull/152, I think it's worth trying to optimize prompt processing now.
Llama.cpp
/home/lucas/oss/llama.cpp/llama-bench -m /home/lucas/.cache/huggingface/hub/models--TheBloke--Mistral-7B-Instruct-v0.2-GGUF/snapshots/3a6fbf4a41a1d52e415a4958cde6856d34b2db93/mistral-7b-instruct-v0.2.Q4_K_M.gguf -n 0 -p 512 -r 1
Mistral.rs
"/home/lucas/oss/mistral.rs/target/profiling/mistralrs-bench" -p 512 -g 0 -r 1 -c 1 gguf -t mistralai/Mistral-7B-Instruct-v0.1 -m TheBloke/Mistral-7B-Instruct-v0.1-GGUF -f mistral-7b-instruct-v0.1.Q4_K_M.gguf
Llama.cpp does the dequantization first, then the matmul. We're doing the dequantization and matmul directly in a single fused kernel.
This PR is useful: https://github.com/ggerganov/llama.cpp/pull/3776, where they enable the current approach.
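Roughly, the two approaches look like this in candle terms (just an illustrative sketch using candle's QMatMul/QTensor API; the function names are made up and batch-dim broadcasting is left out):

use candle_core::quantized::{QMatMul, QTensor};
use candle_core::{Module, Result, Tensor};

// What we do today: feed the quantized weight straight into candle's fused
// dequant+matmul path.
fn quantized_matmul(xs: &Tensor, w: &QMatMul) -> Result<Tensor> {
    w.forward(xs)
}

// What llama.cpp does for prompt processing: materialize the dequantized
// weight first, then run a plain (cuBLAS) matmul against it.
// Assumes xs is [m, k] and the weight dequantizes to [n, k].
fn dequant_then_matmul(xs: &Tensor, w: &QTensor) -> Result<Tensor> {
    let w = w.dequantize(xs.device())?;
    xs.matmul(&w.t()?)
}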
@lucasavila00, do you think we should also dequantize to F16 for large batch sizes? To my understanding, this is beneficial because the BLAS implementation of matrix-matrix product is faster than our MMQ kernel as the batch size increases.
@EricLBuehler I'd like to test it...
I tried running the candle example with a version of candle from before they added the MMQ kernels, and performance was about the same.
I also tried to manually dequantize the QMatMuls of the attention layer and saw no improvements.
If you have a different approach I'd be glad to test it.
https://github.com/huggingface/candle/pull/1706
https://github.com/huggingface/candle-cublaslt
I think we need to dequantize and use these cublaslt kernels? I'll try it.
@lucasavila00, that sounds great. Please let me know the results!
@EricLBuehler candle already uses cublaslt, see MR https://github.com/EricLBuehler/mistral.rs/pull/230
forcing dequantization then matmul
./target/profiling/mistralrs-bench -p 512 -g 0 -r 5 -c 1 gguf -t mistralai/Mistral-7B-Instruct-v0.1 -m TheBloke/Mistral-7B-Instruct-v0.1-GGUF -f mistral-7b-instruct-v0.1.Q4_K_M.gguf
+------------------------------------+---------+--------+---------------+-------------+-------------+--------------+
| model                              | backend | test   | t/s           | ms/t        | concurrency | throughput/s |
+------------------------------------+---------+--------+---------------+-------------+-------------+--------------+
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA    | pp 512 | 286.886±6.405 | 3.487±0.080 | 1           | 286.8858     |
+------------------------------------+---------+--------+---------------+-------------+-------------+--------------+
master
./target/profiling/mistralrs-bench -p 512 -g 0 -r 5 -c 1 gguf -t mistralai/Mistral-7B-Instruct-v0.1 -m TheBloke/Mistral-7B-Instruct-v0.1-GGUF -f mistral-7b-instruct-v0.1.Q4_K_M.gguf
+------------------------------------+---------+--------+----------------+-------------+-------------+--------------+
| model                              | backend | test   | t/s            | ms/t        | concurrency | throughput/s |
+------------------------------------+---------+--------+----------------+-------------+-------------+--------------+
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA    | pp 512 | 547.439±18.785 | 1.829±0.065 | 1           | 547.43933    |
+------------------------------------+---------+--------+----------------+-------------+-------------+--------------+
@lucasavila00, that is very interesting. How did you force the dequantization?
With the lt_mul function of the MR https://github.com/EricLBuehler/mistral.rs/pull/230/files#diff-da1e6f56f0e565985ccaa246f41d45f33271525bb3ae0d3a776cb282ce797676R27
I forced it for the attention weights and MLP only
@lucasavila00, does llama.cpp also get a similar t/s to our 549? It seems like dequantizing reduces performance severely, but perhaps it is better for bigger batch sizes?
llama.cpp is 1700 t/s; I forced it to use bs=512 and pp=512, which should be equal to our pp=512:
$ /home/lucas/oss/llama.cpp/llama-bench -m /home/lucas/.cache/huggingface/hub/models--TheBloke--Mistral-7B-Instruct-v0.2-GGUF/snapshots/3a6fbf4a41a1d52e415a4958cde6856d34b2db93/mistral-7b-instruct-v0.2.Q4_K_M.gguf -n 0 -p 512 -r 1 -b 512
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 2070, compute capability 7.5, VMM: yes
| model | size | params | backend | ngl | n_batch | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------: | ---------- | ---------------: |
| llama 7B Q4_K - Medium | 4.07 GiB | 7.24 B | CUDA | 99 | 512 | pp 512 | 1747.07 ± 0.00 |
build: 7593639c (2679)
$ ./target/profiling/mistralrs-bench -p 512 -g 0 -r 1 -c 1 gguf -t mistralai/Mistral-7B-Instruct-v0.1 -m TheBloke/Mistral-7B-Instruct-v0.1-GGUF -f mistral-7b-instruct-v0.1.Q4_K_M.gguf
2024-04-27T23:38:07.937134Z INFO mistralrs_bench: avx: true, neon: false, simd128: false, f16c: true
2024-04-27T23:38:07.937150Z INFO mistralrs_bench: Sampling method: penalties -> temperature -> topk -> topp -> multinomial
2024-04-27T23:38:07.937153Z INFO mistralrs_bench: Loading model `mistralai/Mistral-7B-Instruct-v0.1` on Cuda(CudaDevice(DeviceId(1)))...
2024-04-27T23:38:07.937168Z INFO mistralrs_bench: Model kind is: quantized from gguf (no adapters)
[mistralrs-core/src/models/quantized_llama.rs:392:9] &layers.len() = 32
2024-04-27T23:38:09.636351Z INFO mistralrs_core::pipeline::chat_template: bos_tok = <s>, eos_tok = ["</s>"], unk_tok = <unk>
2024-04-27T23:38:09.667093Z INFO mistralrs_bench: Model loaded.
+------------------------------------+---------+--------+---------------+-------------+-------------+--------------+
| model                              | backend | test   | t/s           | ms/t        | concurrency | throughput/s |
+------------------------------------+---------+--------+---------------+-------------+-------------+--------------+
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA    | pp 512 | 596.042±0.000 | 1.678±0.000 | 1           | 596.04193    |
+------------------------------------+---------+--------+---------------+-------------+-------------+--------------+
@EricLBuehler when I run llama.cpp and mistral.rs in interactive mode I get close results...
https://gist.github.com/lucasavila00/0155f94fbf13e988384af53af8841b0f
llama_print_timings: prompt eval time = 706,45 ms / 436 tokens ( 1,62 ms per token, 617,17 tokens per second)
2024-04-27T23:46:36.882094Z INFO mistralrs_core::engine: Prompt[445] Completion[] - 765ms
So I guess our pp benchmark is incorrect in its attempt to match llama.cpp? I'm lost now :smile:
Ah, never mind the above. Llama.cpp samples ~700 tok/s on CPU; I forgot the ngl param.
https://gist.github.com/lucasavila00/646b6f6cb9757d1329dc7296b5f16e3e
llama_print_timings: prompt eval time = 279,40 ms / 436 tokens ( 0,64 ms per token, 1560,48 tokens per second)
So llama.cpp is indeed ~3x faster, and both benchmarks are measuring correctly.
When I force de-quantization & matmul, candle uses these volta kernels (and so does forcing cublaslt). But llama.cpp uses some turing kernels.
@lucasavila00, I wonder if it is the volta kernels that are slower than turing? It seems like we spend ~62% of our time in the sgemm function, but llama.cpp spends ~21-27% of their time in h1688gemm.
@EricLBuehler that seems to be the case. I can't find where the turing kernels come from though. I assume they come from an nvidia library, but I can't figure out why llama.cpp uses a different version from candle/cudarc :thinking:
The version differs depending on heuristics
Using this for the matmuls I can trigger the turing kernels, but it takes too long on the f32->f16 conversions :thinking:
use candle_core::quantized::QMatMul;
use candle_core::{DType, Result, Tensor};

fn lt_mul(xs: &Tensor, w: &QMatMul) -> Result<Tensor> {
    // Dequantize the weight to a plain f32 tensor (or reuse it if it already is one).
    let w = match w {
        QMatMul::QTensor(ref qt) => qt.dequantize(xs.device())?,
        QMatMul::Tensor(w) => w.clone(),
    };
    // Broadcast the weight over the batch dims of xs and transpose it for the matmul.
    let w = match *xs.dims() {
        [b1, b2, _, _] => w.broadcast_left((b1, b2))?.t()?,
        [bsize, _, _] => w.broadcast_left(bsize)?.t()?,
        _ => w.t()?,
    };
    // Do the matmul in f16 (this is what hits the h1688gemm/turing path),
    // then cast the result back to f32.
    let xs = xs.to_dtype(DType::F16)?;
    let w = w.to_dtype(DType::F16)?;
    xs.matmul(&w)?.to_dtype(DType::F32)
}
Llama.cpp can dequantize directly to f16, candle cannot... Maybe it's worth raising an issue for direct f16 dequantization?
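For reference, a rough sketch of what lt_mul could look like with that (the dequantize_f16 call below is hypothetical, it's the direct-to-f16 API being asked for; everything else mirrors the snippet above):

use candle_core::quantized::QMatMul;
use candle_core::{DType, Result, Tensor};

// Hypothetical variant of lt_mul: if the quantized weight could be dequantized
// straight to f16, the f32 -> f16 conversion of the big weight tensor would
// disappear and only the activations would still need a dtype cast.
fn lt_mul_f16(xs: &Tensor, w: &QMatMul) -> Result<Tensor> {
    let w = match w {
        // dequantize_f16 does not exist in candle at this point; it is the
        // direct f16 dequantization the issue would ask for.
        QMatMul::QTensor(ref qt) => qt.dequantize_f16(xs.device())?,
        QMatMul::Tensor(w) => w.to_dtype(DType::F16)?,
    };
    let w = match *xs.dims() {
        [b1, b2, _, _] => w.broadcast_left((b1, b2))?.t()?,
        [bsize, _, _] => w.broadcast_left(bsize)?.t()?,
        _ => w.t()?,
    };
    let xs = xs.to_dtype(DType::F16)?;
    xs.matmul(&w)?.to_dtype(DType::F32)
}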
@lucasavila00, I have raised an issue.
The PR https://github.com/EricLBuehler/mistral.rs/pull/238 has the latest iteration of the code.
It uses dequant+matmul only for prompts, and does the matmul in f16.
It also has comparisons of runs between mistralrs-bench and llama-bench, and nvidia profiles of the two projects.
I think the current difference is now due to different kernels?
Even though the names of the kernels are almost the same, it seems the ones used by candle are slower.
I'm trying to figure out why they don't use the exact same kernels.
The kernel distribution between llama.cpp and mistral.rs is almost the same, and the overall time difference matches the discrepancy between those two kernels.
If I am not mistaken, our completion performance should also be improved by 60% (like prompt perf) because of the new F16 dequant support?
For batch sizes > 8, yes.
For batch sizes <=8 I think we'll want to continue to use MMQ (that's what llama.cpp does)
The cublas MR still has these as TODOs though https://github.com/EricLBuehler/mistral.rs/pull/238/files#diff-da1e6f56f0e565985ccaa246f41d45f33271525bb3ae0d3a776cb282ce797676R20-R22
Ah, ok. I'm interested in how our performance compares to llama.cpp in that situation.
That MR currently uses cublas for prompt and MMQ for completion.
It should be something like cublas for prompt if seq_len > 32, otherwise MMQ. And for completion it should use MMQ if bs <=8, otherwise cublas.
These are the llama.cpp heuristics if I understood it correctly
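In code it would be something like this (just a sketch of the heuristic as I understand it, not code from either project):

// Which matmul path to use, per the llama.cpp-style thresholds above.
enum MatmulKind {
    Mmq,    // fused dequant+matmul kernels, best at small batch sizes
    Cublas, // dequantize to f16, then cuBLAS GEMM, wins for larger batches
}

// seq_len = number of prompt tokens processed at once.
fn prompt_kernel(seq_len: usize) -> MatmulKind {
    if seq_len > 32 { MatmulKind::Cublas } else { MatmulKind::Mmq }
}

// bs = completion batch size.
fn completion_kernel(bs: usize) -> MatmulKind {
    if bs <= 8 { MatmulKind::Mmq } else { MatmulKind::Cublas }
}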
Ah, I'm not even benchmarking prompts with batch sizes > 1, because I'm assuming we'll move forwards with https://github.com/EricLBuehler/mistral.rs/pull/234
Yes, I just need to finish the testing and then I'll merge #234. I am looking forward to Candle adding support for calling hgemm, but if that takes a while I can add it.
I think we're not measuring exactly the same timings as llama.cpp. Our prompt timings include a memory transfer and the sampling.
After https://github.com/huggingface/candle/issues/2139#issuecomment-2081740003
If I look at just the nvidia profile of a warmed run, llama.cpp takes ~350ms and mistral.rs takes ~400ms.
That puts llama.cpp at ~1500t/s and mistral.rs at ~1300t/s
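(That is assuming the 512-token prompt: 512 tokens / 0.35 s ≈ 1460 t/s and 512 tokens / 0.40 s ≈ 1280 t/s.)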
@lucasavila00 yes, that is possible. Are they timing the memory transfer and sampling?
No, they're just synchronizing.
I wonder why mistral.rs has this 35ms of DtoH transfer. It happens only at prompt time, so it can't be logits transfer to CPU...
BTW this is the llama.cpp nvidia profile, filtered.
And the mistral.rs one, filtered.