mistral.rs
Async sampling
./target/profiling/mistralrs-bench -p 0 -g 64 -r 1 -c 8 gguf -t mistralai/Mistral-7B-Instruct-v0.1 -m TheBloke/Mistral-7B-Instruct-v0.1-GGUF -f mistral-7b-instruct-v0.1.Q4_K_M.gguf
[Screenshots: NVIDIA profiler traces for Master and for This PR]
Code Metrics Report
───────────────────────────────────────────────────────────────────────────────
Language                 Files     Lines   Blanks  Comments     Code Complexity
───────────────────────────────────────────────────────────────────────────────
Rust                        70     23237     1543       508    21186       1278
───────────────────────────────────────────────────────────────────────────────
Total                       70     23237     1543       508    21186       1278
───────────────────────────────────────────────────────────────────────────────
Estimated Cost to Develop 66,725
Estimated Schedule Effort 11.790000 months
Estimated People Required 5.023991
───────────────────────────────────────────────────────────────────────────────
Processed 764768 bytes, 0.765 megabytes (SI)
───────────────────────────────────────────────────────────────────────────────
@lucasavila00, thanks for your work here. I think we definitely need to pursue the throughput decrease as the batch size increases. I like the idea of incorporating async; did this not work out?
@EricLBuehler It had mixed results.
On bs=8 it improved from 5ms to 2ms.
But on bs=1 it got worse, from 0.8ms to 1.2ms.
Clippy raised an issue of a non-async mutex being held across an await point.
Would the best fix be to use an async-aware mutex? Or to drop the lock before each await and re-acquire it when needed? (Both options are sketched below.)
I'm still learning async Rust and don't feel confident enough to work on this PR yet, as it requires defining the overall async structure for the engine.
Also, profiling CPU code became harder (e.g. with the samply profiler); only the NVIDIA profiler showed interpretable results.
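To make the trade-off concrete, here is a minimal sketch of the pattern clippy flags and of the two fixes discussed above (illustrative names only, not the engine's actual code):

```rust
use std::sync::{Arc, Mutex};

// The pattern clippy flags: a std::sync::MutexGuard kept alive across .await.
async fn held_across_await(state: Arc<Mutex<Vec<u32>>>) {
    let mut guard = state.lock().unwrap();
    guard.push(1);
    do_async_work().await; // guard still held here -> clippy: await_holding_lock
    guard.push(2);
}

// Fix 1: scope the guard so the lock is released before the await point.
async fn drop_before_await(state: Arc<Mutex<Vec<u32>>>) {
    {
        let mut guard = state.lock().unwrap();
        guard.push(1);
    } // lock dropped here
    do_async_work().await;
    state.lock().unwrap().push(2); // re-lock when needed
}

// Fix 2: an async-aware mutex (tokio::sync::Mutex) may be held across awaits,
// since waiting for it yields to the runtime instead of blocking the thread.
async fn async_aware(state: Arc<tokio::sync::Mutex<Vec<u32>>>) {
    let mut guard = state.lock().await;
    guard.push(1);
    do_async_work().await;
    guard.push(2);
}

async fn do_async_work() {}
```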
> On bs=8 it improved from 5ms to 2ms.
Great! Perhaps we could profile it, see where the performance gains go away, and only use it where it helps.
> Clippy raised an issue of a non-async mutex being held across an await point.
We could use this type.
I think this sort of structure is very interesting, I'll take a look in the next few days. Thanks for working on it.
I'll leave it open so it picks up my commits.
I just fixed the clippy issue by not holding the lock across await points.
@EricLBuehler cargo test fails because it can't download the Mistral tokenizer from HF.
Any chance the CI account does not have access to the model? It was recently gated (like Llama, where one needs to request access).
> Great! Perhaps we could profile it, see where the performance gains go away, and only use it where it helps.
I implemented that in the last commit. It only uses the async pool when there is more than one sequence in the batch; with that, the performance loss is gone.
This means the async code is free when it isn't used. A sketch of the dispatch is below.
I added two NVIDIA profiler screenshots to the main PR body, comparing master to this PR and showing the improvements.
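For readers following along, a minimal sketch of that "only go async when it can help" dispatch, with hypothetical names and a toy argmax sampler standing in for the real one:

```rust
// Hypothetical sketch, not the actual mistral.rs code: pay the async
// overhead only when there is more than one sequence to sample.
async fn sample_batch(logits_per_seq: Vec<Vec<f32>>) -> Vec<u32> {
    if logits_per_seq.len() <= 1 {
        // bs=1: sample inline; no pool overhead, matching master's speed.
        logits_per_seq.iter().map(|l| argmax(l)).collect()
    } else {
        // bs>1: fan sampling out to the blocking pool so sequences are
        // sampled concurrently, then collect the results in order.
        let handles: Vec<_> = logits_per_seq
            .into_iter()
            .map(|l| tokio::task::spawn_blocking(move || argmax(&l)))
            .collect();
        let mut out = Vec::with_capacity(handles.len());
        for h in handles {
            out.push(h.await.expect("sampling task panicked"));
        }
        out
    }
}

// Toy stand-in for the real sampler.
fn argmax(logits: &[f32]) -> u32 {
    logits
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i as u32)
        .unwrap_or(0)
}
```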
@lucasavila00
> cargo test fails because it can't download the Mistral tokenizer from HF.
I set the HF_TOKEN secret during CI:
https://github.com/EricLBuehler/mistral.rs/blob/9ec90a40949efe9c68828c048da136874d1ba2f1/.github/workflows/ci.yml#L49
@lucasavila00, it looks like there are some merge conflicts.
> @lucasavila00, it looks like there are some merge conflicts.

I'm fixing it.
Master
+------------------------------------+---------+--------+--------------+--------------+-------------+--------------+
| model | backend | test | t/s | ms/t | concurrency | throughput/s |
+------------------------------------+---------+--------+--------------+--------------+-------------+--------------+
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA | tg 128 | 58.712±0.644 | 17.034±0.190 | 1 | 58.712006 |
+------------------------------------+---------+--------+--------------+--------------+-------------+--------------+
+------------------------------------+---------+--------+--------------+--------------+-------------+--------------+
| model | backend | test | t/s | ms/t | concurrency | throughput/s |
+------------------------------------+---------+--------+--------------+--------------+-------------+--------------+
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA | tg 128 | 45.956±0.805 | 21.766±0.380 | 2 | 91.9128 |
+------------------------------------+---------+--------+--------------+--------------+-------------+--------------+
+------------------------------------+---------+--------+--------------+--------------+-------------+--------------+
| model | backend | test | t/s | ms/t | concurrency | throughput/s |
+------------------------------------+---------+--------+--------------+--------------+-------------+--------------+
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA | tg 128 | 29.011±0.321 | 34.473±0.388 | 4 | 116.0458 |
+------------------------------------+---------+--------+--------------+--------------+-------------+--------------+
+------------------------------------+---------+--------+--------------+--------------+-------------+--------------+
| model | backend | test | t/s | ms/t | concurrency | throughput/s |
+------------------------------------+---------+--------+--------------+--------------+-------------+--------------+
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA | tg 128 | 15.469±0.388 | 64.685±1.639 | 8 | 123.75505 |
+------------------------------------+---------+--------+--------------+--------------+-------------+--------------+
This PR
+------------------------------------+---------+--------+--------------+--------------+-------------+--------------+
| model | backend | test | t/s | ms/t | concurrency | throughput/s |
+------------------------------------+---------+--------+--------------+--------------+-------------+--------------+
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA | tg 128 | 58.752±0.501 | 17.022±0.147 | 1 | 58.752266 |
+------------------------------------+---------+--------+--------------+--------------+-------------+--------------+
+------------------------------------+---------+--------+--------------+--------------+-------------+--------------+
| model | backend | test | t/s | ms/t | concurrency | throughput/s |
+------------------------------------+---------+--------+--------------+--------------+-------------+--------------+
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA | tg 128 | 48.121±0.124 | 20.781±0.053 | 2 | 96.24125 |
+------------------------------------+---------+--------+--------------+--------------+-------------+--------------+
+------------------------------------+---------+--------+--------------+--------------+-------------+--------------+
| model | backend | test | t/s | ms/t | concurrency | throughput/s |
+------------------------------------+---------+--------+--------------+--------------+-------------+--------------+
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA | tg 128 | 30.030±0.020 | 33.300±0.022 | 4 | 120.12018 |
+------------------------------------+---------+--------+--------------+--------------+-------------+--------------+
+------------------------------------+---------+--------+--------------+--------------+-------------+--------------+
| model | backend | test | t/s | ms/t | concurrency | throughput/s |
+------------------------------------+---------+--------+--------------+--------------+-------------+--------------+
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA | tg 128 | 16.839±0.008 | 59.384±0.027 | 8 | 134.71562 |
+------------------------------------+---------+--------+--------------+--------------+-------------+--------------+
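For reading these tables: throughput/s is just t/s times concurrency (e.g. 16.839 × 8 ≈ 134.72 in the last row), so total throughput still grows with concurrency even though per-stream speed drops. At concurrency 8, this PR raises total throughput from about 123.8 to about 134.7 tokens/s, roughly a 9% improvement over master.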
@lucasavila00, this looks good. However, I think there is one more merge conflict.
@lucasavila00, I think there are unfortunately still some merge conflicts.
@EricLBuehler please consider squashing this PR.
Or please let me know if I should squash it.
@lucasavila00 thank you for adding this!