mistral.rs
Async sampling
./target/profiling/mistralrs-bench -p 0 -g 64 -r 1 -c 8 gguf -t mistralai/Mistral-7B-Instruct-v0.1 -m TheBloke/Mistral-7B-Instruct-v0.1-GGUF -f mistral-7b-instruct-v0.1.Q4_K_M.gguf
[Screenshots: NVIDIA profiler traces for Master and for This PR]
Code Metrics Report
───────────────────────────────────────────────────────────────────────────────
Language                 Files     Lines   Blanks  Comments     Code Complexity
───────────────────────────────────────────────────────────────────────────────
Rust                        70     23237     1543       508    21186       1278
───────────────────────────────────────────────────────────────────────────────
Total                       70     23237     1543       508    21186       1278
───────────────────────────────────────────────────────────────────────────────
Estimated Cost to Develop 66,725
Estimated Schedule Effort 11.790000 months
Estimated People Required 5.023991
───────────────────────────────────────────────────────────────────────────────
Processed 764768 bytes, 0.765 megabytes (SI)
───────────────────────────────────────────────────────────────────────────────
@lucasavila00, thanks for your work here. I think we definitely need to pursue the throughput decrease as the batch size increases. I like the idea of incorporating async; did this not work out?
@EricLBuehler It had mixed results.
On bs=8 it improved from 5ms to 2ms.
But on bs=1 it got worse, from 0.8ms to 1.2ms.
Clippy raised an issue of a non-async mutex being held across an await point.
Would the best fix be to use an async-aware mutex? Or to drop the lock before each await and re-acquire it when needed? (Both options are sketched below.)
I'm still learning async Rust and don't feel confident enough to work on this PR yet, as it requires defining the overall async structure for the engine.
Also, profiling CPU code became harder (e.g. with the samply profiler); only the NVIDIA profiler showed interpretable results.
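To make the trade-off concrete, here is a minimal sketch of the pattern clippy flags and of the two fixes discussed above (illustrative names only, not the engine's actual code):

```rust
use std::sync::{Arc, Mutex};

// The pattern clippy flags: a std::sync::MutexGuard kept alive across .await.
async fn held_across_await(state: Arc<Mutex<Vec<u32>>>) {
    let mut guard = state.lock().unwrap();
    guard.push(1);
    do_async_work().await; // guard still held here -> clippy: await_holding_lock
    guard.push(2);
}

// Fix 1: scope the guard so the lock is released before the await point.
async fn drop_before_await(state: Arc<Mutex<Vec<u32>>>) {
    {
        let mut guard = state.lock().unwrap();
        guard.push(1);
    } // lock dropped here
    do_async_work().await;
    state.lock().unwrap().push(2); // re-lock when needed
}

// Fix 2: an async-aware mutex (tokio::sync::Mutex) may be held across awaits,
// since waiting for it yields to the runtime instead of blocking the thread.
async fn async_aware(state: Arc<tokio::sync::Mutex<Vec<u32>>>) {
    let mut guard = state.lock().await;
    guard.push(1);
    do_async_work().await;
    guard.push(2);
}

async fn do_async_work() {}
```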
> On bs=8 it improved from 5ms to 2ms.
Great! Perhaps we could profile it, see where the performance gains go away, and only use it where it helps.
> Clippy raised an issue of a non-async mutex being held across an await point.
We could use this type.
I think this sort of structure is very interesting, I'll take a look in the next few days. Thanks for working on it.
I'll leave it open so it picks up my commits.
I just fixed the clippy issue by not holding the lock across await points.
@EricLBuehler cargo test fails because it can't download the Mistral tokenizer from HF.
Any chance the CI account does not have access to the model? It was recently gated (like Llama, where one needs to request access).
> Great! Perhaps we could profile it, see where the performance gains go away, and only use it where it helps.
I implemented that in the last commit. It only uses the async pool when there is more than one sequence in the batch; with that, the performance loss is gone.
This means the async code is free when it isn't used. A sketch of the dispatch is below.
I added two NVIDIA profiler screenshots to the main PR body, comparing master to this PR and showing the improvements.
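For readers following along, a minimal sketch of that "only go async when it can help" dispatch, with hypothetical names and a toy argmax sampler standing in for the real one:

```rust
// Hypothetical sketch, not the actual mistral.rs code: pay the async
// overhead only when there is more than one sequence to sample.
async fn sample_batch(logits_per_seq: Vec<Vec<f32>>) -> Vec<u32> {
    if logits_per_seq.len() <= 1 {
        // bs=1: sample inline; no pool overhead, matching master's speed.
        logits_per_seq.iter().map(|l| argmax(l)).collect()
    } else {
        // bs>1: fan sampling out to the blocking pool so sequences are
        // sampled concurrently, then collect the results in order.
        let handles: Vec<_> = logits_per_seq
            .into_iter()
            .map(|l| tokio::task::spawn_blocking(move || argmax(&l)))
            .collect();
        let mut out = Vec::with_capacity(handles.len());
        for h in handles {
            out.push(h.await.expect("sampling task panicked"));
        }
        out
    }
}

// Toy stand-in for the real sampler.
fn argmax(logits: &[f32]) -> u32 {
    logits
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i as u32)
        .unwrap_or(0)
}
```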
@lucasavila00
> cargo test fails because it can't download the Mistral tokenizer from HF.
I set the HF_TOKEN secret during CI:
https://github.com/EricLBuehler/mistral.rs/blob/9ec90a40949efe9c68828c048da136874d1ba2f1/.github/workflows/ci.yml#L49
@lucasavila00, it looks like there are some merge conflicts.
> @lucasavila00, it looks like there are some merge conflicts.

I'm fixing it.
Master
+------------------------------------+---------+--------+--------------+--------------+-------------+--------------+
| model | backend | test | t/s | ms/t | concurrency | throughput/s |
+------------------------------------+---------+--------+--------------+--------------+-------------+--------------+
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA | tg 128 | 58.712±0.644 | 17.034±0.190 | 1 | 58.712006 |
+------------------------------------+---------+--------+--------------+--------------+-------------+--------------+
+------------------------------------+---------+--------+--------------+--------------+-------------+--------------+
| model | backend | test | t/s | ms/t | concurrency | throughput/s |
+------------------------------------+---------+--------+--------------+--------------+-------------+--------------+
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA | tg 128 | 45.956±0.805 | 21.766±0.380 | 2 | 91.9128 |
+------------------------------------+---------+--------+--------------+--------------+-------------+--------------+
+------------------------------------+---------+--------+--------------+--------------+-------------+--------------+
| model | backend | test | t/s | ms/t | concurrency | throughput/s |
+------------------------------------+---------+--------+--------------+--------------+-------------+--------------+
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA | tg 128 | 29.011±0.321 | 34.473±0.388 | 4 | 116.0458 |
+------------------------------------+---------+--------+--------------+--------------+-------------+--------------+
+------------------------------------+---------+--------+--------------+--------------+-------------+--------------+
| model | backend | test | t/s | ms/t | concurrency | throughput/s |
+------------------------------------+---------+--------+--------------+--------------+-------------+--------------+
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA | tg 128 | 15.469±0.388 | 64.685±1.639 | 8 | 123.75505 |
+------------------------------------+---------+--------+--------------+--------------+-------------+--------------+
This PR
+------------------------------------+---------+--------+--------------+--------------+-------------+--------------+
| model | backend | test | t/s | ms/t | concurrency | throughput/s |
+------------------------------------+---------+--------+--------------+--------------+-------------+--------------+
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA | tg 128 | 58.752±0.501 | 17.022±0.147 | 1 | 58.752266 |
+------------------------------------+---------+--------+--------------+--------------+-------------+--------------+
+------------------------------------+---------+--------+--------------+--------------+-------------+--------------+
| model | backend | test | t/s | ms/t | concurrency | throughput/s |
+------------------------------------+---------+--------+--------------+--------------+-------------+--------------+
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA | tg 128 | 48.121±0.124 | 20.781±0.053 | 2 | 96.24125 |
+------------------------------------+---------+--------+--------------+--------------+-------------+--------------+
+------------------------------------+---------+--------+--------------+--------------+-------------+--------------+
| model | backend | test | t/s | ms/t | concurrency | throughput/s |
+------------------------------------+---------+--------+--------------+--------------+-------------+--------------+
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA | tg 128 | 30.030±0.020 | 33.300±0.022 | 4 | 120.12018 |
+------------------------------------+---------+--------+--------------+--------------+-------------+--------------+
+------------------------------------+---------+--------+--------------+--------------+-------------+--------------+
| model | backend | test | t/s | ms/t | concurrency | throughput/s |
+------------------------------------+---------+--------+--------------+--------------+-------------+--------------+
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA | tg 128 | 16.839±0.008 | 59.384±0.027 | 8 | 134.71562 |
+------------------------------------+---------+--------+--------------+--------------+-------------+--------------+
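For reading these tables: throughput/s is just t/s times concurrency (e.g. 16.839 × 8 ≈ 134.72 in the last row), so total throughput still grows with concurrency even though per-stream speed drops. At concurrency 8, this PR raises total throughput from about 123.8 to about 134.7 tokens/s, roughly a 9% improvement over master.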
@lucasavila00, this looks good. However, I think there is one more merge conflict.
@lucasavila00, I think there are unfortunately still some merge conflicts.
@EricLBuehler please consider squashing this PR.
Or please let me know if I should squash it.
@lucasavila00 thank you for adding this!