mistral.rs icon indicating copy to clipboard operation
mistral.rs copied to clipboard

Async sampling

Open lucasavila00 opened this issue 10 months ago • 14 comments

./target/profiling/mistralrs-bench -p 0 -g 64 -r 1 -c 8  gguf -t mistralai/Mistral-7B-Instruct-v0.1 -m TheBloke/Mistral-7B-Instruct-v0.1-GGUF -f mistral-7b-instruct-v0.1.Q4_K_M.gguf

Master

image

This PR

image

lucasavila00 avatar Apr 23 '24 03:04 lucasavila00

Code Metrics Report
  ───────────────────────────────────────────────────────────────────────────────
Language                 Files     Lines   Blanks  Comments     Code Complexity
───────────────────────────────────────────────────────────────────────────────
Rust                        70     23237     1543       508    21186       1278
───────────────────────────────────────────────────────────────────────────────
Total                       70     23237     1543       508    21186       1278
───────────────────────────────────────────────────────────────────────────────
Estimated Cost to Develop 66,725
Estimated Schedule Effort 11.790000 months
Estimated People Required 5.023991
───────────────────────────────────────────────────────────────────────────────
Processed 764768 bytes, 0.765 megabytes (SI)
───────────────────────────────────────────────────────────────────────────────
  

github-actions[bot] avatar Apr 23 '24 03:04 github-actions[bot]

@lucasavila00, thanks for your work here. I think we definitely need to pursue the throughput decrease as the batch increase. I like the idea of async being incorporated, did this not work out?

EricLBuehler avatar Apr 26 '24 00:04 EricLBuehler

@EricLBuehler It had mixed results.

On bs=8 it improved from 5ms to 2ms.

But on bs=1 it got worse, from 0.8ms to 1.2ms.

Clippy raised an issue of a non-async mutex being held across an await point.

The best fix for it would be to use an async aware mutex? Or drop the lock a lot of times and re-lock when needed?

I'm still learning async rust and don't feel confident enough to work on this MR yet, as it requires defining the overall async structure for the engine.

Also, profiling CPU code became harder (eg: samply profiler). Only the nvidia profiler showed interpretable results.

lucasavila00 avatar Apr 26 '24 01:04 lucasavila00

On bs=8 it improved from 5ms to 2ms.

Great! Perhaps we could profile it and see when the performance gains go away, and use this then.

Clippy raised an issue of a non-async mutex being held across an await point.

We could use this type.

I think this sort of structure is very interesting, I'll take a look in the next few days. Thanks for working on it.

EricLBuehler avatar Apr 26 '24 01:04 EricLBuehler

I'll leave it open, so it picks up my commits.

I just fixed the clippy issue, by not holding the lock across awaits.

lucasavila00 avatar Apr 26 '24 03:04 lucasavila00

@EricLBuehler cargo test fails because it can't download the mistral tokenizer from HF.

Any chance the CI account does not have access to the model, which has been recently locked (locked like llama where one needs to request access)

lucasavila00 avatar Apr 26 '24 04:04 lucasavila00

Great! Perhaps we could profile it and see when the performance gains go away, and use this then.

I implemented it in the last commit. It only uses the async pool if there's more than one batch. Then, the performance loss is gone.

It means that the async code is free if not used.

lucasavila00 avatar Apr 26 '24 04:04 lucasavila00

I added 2 nvidia profiler screenshots to the main PR body, comparing master to this PR, showing the improvements.

lucasavila00 avatar Apr 26 '24 06:04 lucasavila00

@lucasavila00

cargo test fails because it can't download the mistral tokenizer from HF.

I set the HF_TOKEN secret during CI:

https://github.com/EricLBuehler/mistral.rs/blob/9ec90a40949efe9c68828c048da136874d1ba2f1/.github/workflows/ci.yml#L49

EricLBuehler avatar Apr 26 '24 08:04 EricLBuehler

@lucasavila00, it looks like there are some merge conflicts.

EricLBuehler avatar Apr 26 '24 19:04 EricLBuehler

@lucasavila00, it looks like there are some merge conflicts.

I'm fixing it

lucasavila00 avatar Apr 26 '24 19:04 lucasavila00

Master

+------------------------------------+---------+--------+--------------+--------------+-------------+--------------+
| model                              | backend | test   | t/s          | ms/t         | concurrency | throughput/s |
+------------------------------------+---------+--------+--------------+--------------+-------------+--------------+
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA    | tg 128 | 58.712±0.644 | 17.034±0.190 |           1 |    58.712006 |
+------------------------------------+---------+--------+--------------+--------------+-------------+--------------+
+------------------------------------+---------+--------+--------------+--------------+-------------+--------------+
| model                              | backend | test   | t/s          | ms/t         | concurrency | throughput/s |
+------------------------------------+---------+--------+--------------+--------------+-------------+--------------+
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA    | tg 128 | 45.956±0.805 | 21.766±0.380 |           2 |      91.9128 |
+------------------------------------+---------+--------+--------------+--------------+-------------+--------------+
+------------------------------------+---------+--------+--------------+--------------+-------------+--------------+
| model                              | backend | test   | t/s          | ms/t         | concurrency | throughput/s |
+------------------------------------+---------+--------+--------------+--------------+-------------+--------------+
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA    | tg 128 | 29.011±0.321 | 34.473±0.388 |           4 |     116.0458 |
+------------------------------------+---------+--------+--------------+--------------+-------------+--------------+
+------------------------------------+---------+--------+--------------+--------------+-------------+--------------+
| model                              | backend | test   | t/s          | ms/t         | concurrency | throughput/s |
+------------------------------------+---------+--------+--------------+--------------+-------------+--------------+
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA    | tg 128 | 15.469±0.388 | 64.685±1.639 |           8 |    123.75505 |
+------------------------------------+---------+--------+--------------+--------------+-------------+--------------+

This

+------------------------------------+---------+--------+--------------+--------------+-------------+--------------+
| model                              | backend | test   | t/s          | ms/t         | concurrency | throughput/s |
+------------------------------------+---------+--------+--------------+--------------+-------------+--------------+
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA    | tg 128 | 58.752±0.501 | 17.022±0.147 |           1 |    58.752266 |
+------------------------------------+---------+--------+--------------+--------------+-------------+--------------+
+------------------------------------+---------+--------+--------------+--------------+-------------+--------------+
| model                              | backend | test   | t/s          | ms/t         | concurrency | throughput/s |
+------------------------------------+---------+--------+--------------+--------------+-------------+--------------+
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA    | tg 128 | 48.121±0.124 | 20.781±0.053 |           2 |     96.24125 |
+------------------------------------+---------+--------+--------------+--------------+-------------+--------------+
+------------------------------------+---------+--------+--------------+--------------+-------------+--------------+
| model                              | backend | test   | t/s          | ms/t         | concurrency | throughput/s |
+------------------------------------+---------+--------+--------------+--------------+-------------+--------------+
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA    | tg 128 | 30.030±0.020 | 33.300±0.022 |           4 |    120.12018 |
+------------------------------------+---------+--------+--------------+--------------+-------------+--------------+
+------------------------------------+---------+--------+--------------+--------------+-------------+--------------+
| model                              | backend | test   | t/s          | ms/t         | concurrency | throughput/s |
+------------------------------------+---------+--------+--------------+--------------+-------------+--------------+
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA    | tg 128 | 16.839±0.008 | 59.384±0.027 |           8 |    134.71562 |
+------------------------------------+---------+--------+--------------+--------------+-------------+--------------+

lucasavila00 avatar Apr 26 '24 20:04 lucasavila00

@lucasavila00, this looks good. However, I think there is one more merge conflict.

EricLBuehler avatar Apr 27 '24 09:04 EricLBuehler

@lucasavila00, I think there are unfortunately still some merge conflicts.

EricLBuehler avatar Apr 28 '24 01:04 EricLBuehler

@EricLBuehler please consider squashing this MR

Or please let me know if I should squash it

lucasavila00 avatar Apr 28 '24 01:04 lucasavila00

@lucasavila00 thank you for adding this!

EricLBuehler avatar Apr 28 '24 01:04 EricLBuehler