mistral.rs
Quantized: Use cublas for prompt
This PR:
$ ./target/profiling/mistralrs-bench -r 5 -c 1,2,4 gguf -t mistralai/Mistral-7B-Instruct-v0.1 -m TheBloke/Mistral-7B-Instruct-v0.1-GGUF -f mistral-7b-instruct-v0.1.Q4_K_M.gguf
2024-04-28T05:58:00.751771Z INFO mistralrs_bench: avx: true, neon: false, simd128: false, f16c: true
2024-04-28T05:58:00.751790Z INFO mistralrs_bench: Sampling method: penalties -> temperature -> topk -> topp -> multinomial
2024-04-28T05:58:00.751793Z INFO mistralrs_bench: Loading model `mistralai/Mistral-7B-Instruct-v0.1` on Cuda(CudaDevice(DeviceId(1)))...
2024-04-28T05:58:00.751810Z INFO mistralrs_bench: Model kind is: quantized from gguf (no adapters)
2024-04-28T05:58:02.469281Z INFO mistralrs_core::pipeline::chat_template: bos_tok = <s>, eos_tok = ["</s>"], unk_tok = <unk>
2024-04-28T05:58:02.499735Z INFO mistralrs_bench: Model loaded.
| model | backend | test | t/s | ms/t | concurrency | throughput/s |
| --- | --- | --- | ---: | ---: | ---: | ---: |
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA | tg 128 | 57.762±0.583 | 17.314±0.176 | 1 | 57.762444 |
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA | pp 512 | 719.654±10.389 | 1.390±0.020 | 1 | 719.6544 |
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA | tg 128 | 46.236±1.463 | 21.650±0.692 | 2 | 92.4723 |
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA | pp 512 | 458.195±3.664 | 2.183±0.017 | 2 | 916.38995 |
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA | tg 128 | 29.450±0.103 | 33.956±0.118 | 4 | 117.80009 |
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA | pp 512 | 260.459±1.187 | 3.839±0.018 | 4 | 1041.8367 |
Master
$ ./target/profiling/mistralrs-bench -r 5 -c 1,2,4 gguf -t mistralai/Mistral-7B-Instruct-v0.1 -m TheBloke/Mistral-7B-Instruct-v0.1-GGUF -f mistral-7b-instruct-v0.1.Q4_K_M.gguf
2024-04-28T06:00:24.120633Z INFO mistralrs_bench: avx: true, neon: false, simd128: false, f16c: true
2024-04-28T06:00:24.120654Z INFO mistralrs_bench: Sampling method: penalties -> temperature -> topk -> topp -> multinomial
2024-04-28T06:00:24.120657Z INFO mistralrs_bench: Loading model `mistralai/Mistral-7B-Instruct-v0.1` on Cuda(CudaDevice(DeviceId(1)))...
2024-04-28T06:00:24.120671Z INFO mistralrs_bench: Model kind is: quantized from gguf (no adapters)
2024-04-28T06:00:25.850501Z INFO mistralrs_core::pipeline::chat_template: bos_tok = <s>, eos_tok = ["</s>"], unk_tok = <unk>
2024-04-28T06:00:25.882085Z INFO mistralrs_bench: Model loaded.
| model | backend | test | t/s | ms/t | concurrency | throughput/s |
| --- | --- | --- | ---: | ---: | ---: | ---: |
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA | tg 128 | 58.091±0.919 | 17.219±0.274 | 1 | 58.09086 |
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA | pp 512 | 620.491±10.625 | 1.612±0.028 | 1 | 620.4911 |
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA | tg 128 | 47.455±0.311 | 21.073±0.138 | 2 | 94.91029 |
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA | pp 512 | 328.383±1.779 | 3.045±0.017 | 2 | 656.7665 |
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA | tg 128 | 29.232±0.055 | 34.209±0.064 | 4 | 116.927444 |
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA | pp 512 | 163.163±0.807 | 6.129±0.030 | 4 | 652.6506 |
Llama.cpp
$ /home/lucas/oss/llama.cpp/llama-bench -m /home/lucas/.cache/huggingface/hub/models--TheBloke--Mistral-7B-Instruct-v0.2-GGUF/snapshots/3a6fbf4a41a1d52e415a4958cde6856d34b2db93/mistral-7b-instruct-v0.2.Q4_K_M.gguf -n 0 -p 512 -r 1 -b 512
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 2070, compute capability 7.5, VMM: yes
| model | size | params | backend | ngl | n_batch | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------: | ---------- | ---------------: |
| llama 7B Q4_K - Medium | 4.07 GiB | 7.24 B | CUDA | 99 | 512 | pp 512 | 1747.07 ± 0.00 |
build: 7593639c (2679)
Honestly, I don't think it's worth merging this just for such a small win. It breaks usage on GPUs that don't support F16...
But it's an improvement...
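The core idea behind "use cublas for prompt": when prefilling a prompt there are many tokens per sequence, so it pays to dequantize the weight and run a dense F16 GEMM (cuBLAS on CUDA) instead of the fused quantized kernel, which remains the better choice for single-token decoding. Below is a minimal sketch of that dispatch against candle's public QTensor/QMatMul API; the wrapper, its names, and the F32-then-cast dequantization are illustrative assumptions, not the PR's actual code.

```rust
use std::sync::Arc;

use candle_core::quantized::{QMatMul, QTensor};
use candle_core::{DType, Module, Result, Tensor};

/// Illustrative wrapper: keep both the fused quantized kernel and the raw
/// quantized weight, and pick a path per call.
struct CublasPromptMatMul {
    qmatmul: QMatMul,      // fused quantized matmul, used while decoding
    qweight: Arc<QTensor>, // kept around so the prompt path can dequantize
}

impl CublasPromptMatMul {
    fn forward(&self, xs: &Tensor, is_prompt: bool) -> Result<Tensor> {
        if is_prompt {
            // Prompt (prefill): dequantize and run a dense GEMM, which candle
            // dispatches to cuBLAS on CUDA. With hundreds of tokens in flight,
            // the GEMM win outweighs the dequantization cost. Dequantizing to
            // F32 and casting is the stock-candle route; the later "direct
            // f16 dequant" change removes the F32 intermediate.
            let w = self.qweight.dequantize(xs.device())?.to_dtype(DType::F16)?;
            let y = xs.to_dtype(DType::F16)?.broadcast_matmul(&w.t()?)?;
            y.to_dtype(xs.dtype())
        } else {
            // Decode (one new token per sequence): the quantized kernel wins.
            self.qmatmul.forward(xs)
        }
    }
}
```

This matches the numbers above: token generation (tg 128) is essentially unchanged, while prompt processing (pp 512) improves, with the gap widening at higher concurrency where the GEMM is larger.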
Code Metrics Report
| Language | Files | Lines | Code | Comments | Blanks |
| --- | ---: | ---: | ---: | ---: | ---: |
| Dockerfile | 1 | 34 | 25 | 0 | 9 |
| Happy | 1 | 442 | 369 | 0 | 73 |
| JSON | 5 | 9 | 9 | 0 | 0 |
| Python | 21 | 741 | 622 | 21 | 98 |
| TOML | 16 | 419 | 378 | 1 | 40 |
| Jupyter Notebooks | 1 | 0 | 0 | 0 | 0 |
| \|- Markdown | 1 | 60 | 30 | 22 | 8 |
| \|- Python | 1 | 96 | 87 | 1 | 8 |
| (Total) | | 156 | 117 | 23 | 16 |
| Markdown | 16 | 1026 | 0 | 758 | 268 |
| \|- BASH | 6 | 205 | 192 | 0 | 13 |
| \|- Python | 6 | 121 | 110 | 0 | 11 |
| \|- Rust | 3 | 185 | 172 | 9 | 4 |
| (Total) | | 1537 | 474 | 767 | 296 |
| Rust | 81 | 26376 | 24282 | 334 | 1760 |
| \|- Markdown | 38 | 359 | 0 | 354 | 5 |
| (Total) | | 26735 | 24282 | 688 | 1765 |
| Total | 143 | 29047 | 25685 | 1114 | 2248 |
After direct f16 dequant:
This PR:
| model | backend | test | t/s | ms/t | concurrency | throughput/s |
| --- | --- | --- | ---: | ---: | ---: | ---: |
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA | pp 512 | 1016.721±6.399 | 0.984±0.006 | 1 | 1016.7206 |
Master
| model | backend | test | t/s | ms/t | concurrency | throughput/s |
| --- | --- | --- | ---: | ---: | ---: | ---: |
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA | pp 512 | 614.365±2.955 | 1.628±0.008 | 1 | 614.3652 |
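For context, "direct f16 dequant" means the dequantization kernel writes F16 output directly rather than producing F32 and casting afterwards, which saves a full write-and-read of the dequantized weight on every prompt matmul. A hedged sketch of the two paths; `dequantize_f16` is an assumed method name based on the candle changes linked below, not a confirmed candle API:

```rust
use candle_core::quantized::QTensor;
use candle_core::{DType, Device, Result, Tensor};

// Baseline path: dequantize to F32, then cast. The F32 intermediate costs an
// extra pass over the full weight in device memory.
fn dequant_via_f32(w: &QTensor, dev: &Device) -> Result<Tensor> {
    w.dequantize(dev)?.to_dtype(DType::F16)
}

// "Direct f16 dequant": the kernel emits F16 straight away, so the F32
// intermediate never exists. `dequantize_f16` is an assumed name (see the
// candle PR linked below), not a confirmed API.
fn dequant_direct_f16(w: &QTensor, dev: &Device) -> Result<Tensor> {
    w.dequantize_f16(dev)
}
```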
@lucasavila00, that's amazing! I am looking forward to merging this.
Can you please add the special matmul function to the layer.rs file so we can use it in all models? Ideally, we can also implement it in QLinear so ISQ can benefit, too.
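A QLinear integration along those lines could be as thin as delegating to one shared helper in layer.rs. This is a sketch under the same assumptions as above, reusing the hypothetical CublasPromptMatMul wrapper from the earlier sketch rather than the repository's actual types:

```rust
use candle_core::{Result, Tensor};

/// Hypothetical QLinear-style layer built on the shared matmul helper, so
/// GGUF-loaded models and ISQ-quantized layers take the same code path.
struct QLinearSketch {
    matmul: CublasPromptMatMul, // the helper sketched earlier
    bias: Option<Tensor>,
}

impl QLinearSketch {
    fn forward(&self, xs: &Tensor, is_prompt: bool) -> Result<Tensor> {
        let ys = self.matmul.forward(xs, is_prompt)?;
        match &self.bias {
            Some(b) => ys.broadcast_add(b),
            None => Ok(ys),
        }
    }
}
```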
I need to integrate these candle changes (https://github.com/huggingface/candle/pull/2141), and then this PR should be ready. I'll try to do it tonight.
@EricLBuehler would you mind updating your fork? It does not include the precision changes.
@lucasavila00 sure, I just updated it.
I did not implement it for the other models. I can try to do it later in another MR.
I don't have the bandwidth to work with different models and setups today.
Hi @lucasavila00, I'm looking forward to merging this PR!
I left one requested change, and there is a conflict which should be pretty easy to resolve. Please let me know if you need any help.
There were changes to the `is_prompt` bool I was using to decide which version to use. I changed it to a different heuristic, but I'm not sure it's optimal.
I'm sorry, but I don't have time to benchmark it soon.
Can we check if `seq_len > 1`? I think that would be pretty reliable.
@lucasavila00 thank you!
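Checking `seq_len > 1` amounts to deriving the prompt/decode decision from the input shape instead of threading an `is_prompt` flag through the pipeline. A minimal sketch, assuming the usual (batch, seq_len, hidden) activation layout:

```rust
use candle_core::{Result, Tensor};

// Treat anything with more than one position per sequence as prompt
// processing (prefill); single-token inputs take the quantized decode path.
fn use_dense_f16_path(xs: &Tensor) -> Result<bool> {
    let (_batch, seq_len, _hidden) = xs.dims3()?; // assumes a rank-3 input
    Ok(seq_len > 1)
}
```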