mistral.rs
Quantized: Use cublas for prompt
This PR:
$ ./target/profiling/mistralrs-bench -r 5 -c 1,2,4 gguf -t mistralai/Mistral-7B-Instruct-v0.1 -m TheBloke/Mistral-7B-Instruct-v0.1-GGUF -f mistral-7b-instruct-v0.1.Q4_K_M.gguf
2024-04-28T05:58:00.751771Z INFO mistralrs_bench: avx: true, neon: false, simd128: false, f16c: true
2024-04-28T05:58:00.751790Z INFO mistralrs_bench: Sampling method: penalties -> temperature -> topk -> topp -> multinomial
2024-04-28T05:58:00.751793Z INFO mistralrs_bench: Loading model `mistralai/Mistral-7B-Instruct-v0.1` on Cuda(CudaDevice(DeviceId(1)))...
2024-04-28T05:58:00.751810Z INFO mistralrs_bench: Model kind is: quantized from gguf (no adapters)
2024-04-28T05:58:02.469281Z INFO mistralrs_core::pipeline::chat_template: bos_tok = <s>, eos_tok = ["</s>"], unk_tok = <unk>
2024-04-28T05:58:02.499735Z INFO mistralrs_bench: Model loaded.
| model | backend | test | t/s | ms/t | concurrency | throughput/s |
| --- | --- | --- | ---: | ---: | ---: | ---: |
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA | tg 128 | 57.762±0.583 | 17.314±0.176 | 1 | 57.762444 |
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA | pp 512 | 719.654±10.389 | 1.390±0.020 | 1 | 719.6544 |
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA | tg 128 | 46.236±1.463 | 21.650±0.692 | 2 | 92.4723 |
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA | pp 512 | 458.195±3.664 | 2.183±0.017 | 2 | 916.38995 |
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA | tg 128 | 29.450±0.103 | 33.956±0.118 | 4 | 117.80009 |
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA | pp 512 | 260.459±1.187 | 3.839±0.018 | 4 | 1041.8367 |
Master
$ ./target/profiling/mistralrs-bench -r 5 -c 1,2,4 gguf -t mistralai/Mistral-7B-Instruct-v0.1 -m TheBloke/Mistral-7B-Instruct-v0.1-GGUF -f mistral-7b-instruct-v0.1.Q4_K_M.gguf
2024-04-28T06:00:24.120633Z INFO mistralrs_bench: avx: true, neon: false, simd128: false, f16c: true
2024-04-28T06:00:24.120654Z INFO mistralrs_bench: Sampling method: penalties -> temperature -> topk -> topp -> multinomial
2024-04-28T06:00:24.120657Z INFO mistralrs_bench: Loading model `mistralai/Mistral-7B-Instruct-v0.1` on Cuda(CudaDevice(DeviceId(1)))...
2024-04-28T06:00:24.120671Z INFO mistralrs_bench: Model kind is: quantized from gguf (no adapters)
2024-04-28T06:00:25.850501Z INFO mistralrs_core::pipeline::chat_template: bos_tok = <s>, eos_tok = ["</s>"], unk_tok = <unk>
2024-04-28T06:00:25.882085Z INFO mistralrs_bench: Model loaded.
| model | backend | test | t/s | ms/t | concurrency | throughput/s |
| --- | --- | --- | ---: | ---: | ---: | ---: |
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA | tg 128 | 58.091±0.919 | 17.219±0.274 | 1 | 58.09086 |
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA | pp 512 | 620.491±10.625 | 1.612±0.028 | 1 | 620.4911 |
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA | tg 128 | 47.455±0.311 | 21.073±0.138 | 2 | 94.91029 |
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA | pp 512 | 328.383±1.779 | 3.045±0.017 | 2 | 656.7665 |
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA | tg 128 | 29.232±0.055 | 34.209±0.064 | 4 | 116.927444 |
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA | pp 512 | 163.163±0.807 | 6.129±0.030 | 4 | 652.6506 |
Llama.cpp
$ /home/lucas/oss/llama.cpp/llama-bench -m /home/lucas/.cache/huggingface/hub/models--TheBloke--Mistral-7B-Instruct-v0.2-GGUF/snapshots/3a6fbf4a41a1d52e415a4958cde6856d34b2db93/mistral-7b-instruct-v0.2.Q4_K_M.gguf -n 0 -p 512 -r 1 -b 512
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 2070, compute capability 7.5, VMM: yes
| model | size | params | backend | ngl | n_batch | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------: | ---------- | ---------------: |
| llama 7B Q4_K - Medium | 4.07 GiB | 7.24 B | CUDA | 99 | 512 | pp 512 | 1747.07 ± 0.00 |
build: 7593639c (2679)
Honestly, I don't think it's worth merging this just for such a small win. It breaks usage on GPUs that don't support F16...
But it's an improvement...
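The core idea behind "use cublas for prompt": when prefilling a prompt there are many tokens per sequence, so it pays to dequantize the weight and run a dense F16 GEMM (cuBLAS on CUDA) instead of the fused quantized kernel, which remains the better choice for single-token decoding. Below is a minimal sketch of that dispatch against candle's public QTensor/QMatMul API; the wrapper, its names, and the F32-then-cast dequantization are illustrative assumptions, not the PR's actual code.

```rust
use std::sync::Arc;

use candle_core::quantized::{QMatMul, QTensor};
use candle_core::{DType, Module, Result, Tensor};

/// Illustrative wrapper: keep both the fused quantized kernel and the raw
/// quantized weight, and pick a path per call.
struct CublasPromptMatMul {
    qmatmul: QMatMul,      // fused quantized matmul, used while decoding
    qweight: Arc<QTensor>, // kept around so the prompt path can dequantize
}

impl CublasPromptMatMul {
    fn forward(&self, xs: &Tensor, is_prompt: bool) -> Result<Tensor> {
        if is_prompt {
            // Prompt (prefill): dequantize and run a dense GEMM, which candle
            // dispatches to cuBLAS on CUDA. With hundreds of tokens in flight,
            // the GEMM win outweighs the dequantization cost. Dequantizing to
            // F32 and casting is the stock-candle route; the later "direct
            // f16 dequant" change removes the F32 intermediate.
            let w = self.qweight.dequantize(xs.device())?.to_dtype(DType::F16)?;
            let y = xs.to_dtype(DType::F16)?.broadcast_matmul(&w.t()?)?;
            y.to_dtype(xs.dtype())
        } else {
            // Decode (one new token per sequence): the quantized kernel wins.
            self.qmatmul.forward(xs)
        }
    }
}
```

This matches the numbers above: token generation (tg 128) is essentially unchanged, while prompt processing (pp 512) improves, with the gap widening at higher concurrency where the GEMM is larger.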
Code Metrics Report
| Language | Files | Lines | Code | Comments | Blanks |
| --- | ---: | ---: | ---: | ---: | ---: |
| Dockerfile | 1 | 34 | 25 | 0 | 9 |
| Happy | 1 | 442 | 369 | 0 | 73 |
| JSON | 5 | 9 | 9 | 0 | 0 |
| Python | 21 | 741 | 622 | 21 | 98 |
| TOML | 16 | 419 | 378 | 1 | 40 |
| Jupyter Notebooks | 1 | 0 | 0 | 0 | 0 |
| \|- Markdown | 1 | 60 | 30 | 22 | 8 |
| \|- Python | 1 | 96 | 87 | 1 | 8 |
| (Total) | | 156 | 117 | 23 | 16 |
| Markdown | 16 | 1026 | 0 | 758 | 268 |
| \|- BASH | 6 | 205 | 192 | 0 | 13 |
| \|- Python | 6 | 121 | 110 | 0 | 11 |
| \|- Rust | 3 | 185 | 172 | 9 | 4 |
| (Total) | | 1537 | 474 | 767 | 296 |
| Rust | 81 | 26376 | 24282 | 334 | 1760 |
| \|- Markdown | 38 | 359 | 0 | 354 | 5 |
| (Total) | | 26735 | 24282 | 688 | 1765 |
| Total | 143 | 29047 | 25685 | 1114 | 2248 |
After direct f16 dequant:
This PR:
| model | backend | test | t/s | ms/t | concurrency | throughput/s |
| --- | --- | --- | ---: | ---: | ---: | ---: |
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA | pp 512 | 1016.721±6.399 | 0.984±0.006 | 1 | 1016.7206 |
Master
| model | backend | test | t/s | ms/t | concurrency | throughput/s |
| --- | --- | --- | ---: | ---: | ---: | ---: |
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA | pp 512 | 614.365±2.955 | 1.628±0.008 | 1 | 614.3652 |
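For context, "direct f16 dequant" means the dequantization kernel writes F16 output directly rather than producing F32 and casting afterwards, which saves a full write-and-read of the dequantized weight on every prompt matmul. A hedged sketch of the two paths; `dequantize_f16` is an assumed method name based on the candle changes linked below, not a confirmed candle API:

```rust
use candle_core::quantized::QTensor;
use candle_core::{DType, Device, Result, Tensor};

// Baseline path: dequantize to F32, then cast. The F32 intermediate costs an
// extra pass over the full weight in device memory.
fn dequant_via_f32(w: &QTensor, dev: &Device) -> Result<Tensor> {
    w.dequantize(dev)?.to_dtype(DType::F16)
}

// "Direct f16 dequant": the kernel emits F16 straight away, so the F32
// intermediate never exists. `dequantize_f16` is an assumed name (see the
// candle PR linked below), not a confirmed API.
fn dequant_direct_f16(w: &QTensor, dev: &Device) -> Result<Tensor> {
    w.dequantize_f16(dev)
}
```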
@lucasavila00, that's amazing! I am looking forward to merging this.
Can you please add the special matmul function to the layer.rs file so we can use it in all models? Ideally, we can also implement it in QLinear so ISQ can benefit, too.
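A QLinear integration along those lines could be as thin as delegating to one shared helper in layer.rs. This is a sketch under the same assumptions as above, reusing the hypothetical CublasPromptMatMul wrapper from the earlier sketch rather than the repository's actual types:

```rust
use candle_core::{Result, Tensor};

/// Hypothetical QLinear-style layer built on the shared matmul helper, so
/// GGUF-loaded models and ISQ-quantized layers take the same code path.
struct QLinearSketch {
    matmul: CublasPromptMatMul, // the helper sketched earlier
    bias: Option<Tensor>,
}

impl QLinearSketch {
    fn forward(&self, xs: &Tensor, is_prompt: bool) -> Result<Tensor> {
        let ys = self.matmul.forward(xs, is_prompt)?;
        match &self.bias {
            Some(b) => ys.broadcast_add(b),
            None => Ok(ys),
        }
    }
}
```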
I need to integrate these candle changes (https://github.com/huggingface/candle/pull/2141), and then this PR should be ready. I'll try to do it tonight.
@EricLBuehler would you mind updating your fork? It does not include the precision changes.
@lucasavila00 sure, I just updated it.
I did not implement it for the other models. I can try to do it later in another MR.
I don't have the bandwidth to work with different models and setups today.
Hi @lucasavila00, I'm looking forward to merging this PR!
I left one requested change, and there is a conflict which should be pretty easy to resolve. Please let me know if you need any help.
There were changes to the `is_prompt` bool I was using to decide which version to use. I changed it to a different heuristic, but I'm not sure it's optimal.
I'm sorry, but I don't have time to benchmark it soon.
Can we check if `seq_len > 1`? I think that would be pretty reliable.
@lucasavila00 thank you!
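Checking `seq_len > 1` amounts to deriving the prompt/decode decision from the input shape instead of threading an `is_prompt` flag through the pipeline. A minimal sketch, assuming the usual (batch, seq_len, hidden) activation layout:

```rust
use candle_core::{Result, Tensor};

// Treat anything with more than one position per sequence as prompt
// processing (prefill); single-token inputs take the quantized decode path.
fn use_dense_f16_path(xs: &Tensor) -> Result<bool> {
    let (_batch, seq_len, _hidden) = xs.dims3()?; // assumes a rank-3 input
    Ok(seq_len > 1)
}
```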