Batched Prefill
Code Metrics Report
───────────────────────────────────────────────────────────────────────────────
Language                 Files     Lines    Blanks  Comments      Code  Complexity
───────────────────────────────────────────────────────────────────────────────
Rust                        70     23339      1550       508     21281        1281
───────────────────────────────────────────────────────────────────────────────
Total                       70     23339      1550       508     21281        1281
───────────────────────────────────────────────────────────────────────────────
Estimated Cost to Develop 69,864
Estimated Schedule Effort 11.811066 months
Estimated People Required 5.038645
───────────────────────────────────────────────────────────────────────────────
Processed 768517 bytes, 0.769 megabytes (SI)
───────────────────────────────────────────────────────────────────────────────
Is this necessary to avoid the memory spikes? If so, we should make this feature a new SchedulerMethod.
I ran the benchmark and it is not required. The spike only happens with pp=2048, not with pp=512, c=4.
One improvement that I see is to allow a batch size greater than one.
Is it worth supporting parallel prefill? For me, using GGUF, it only helps when prefilling fewer than 128 tokens. With more than 128 tokens, padding makes it slower than non-parallel prefill.
So it is only worth running parallel prefill on a batch of very small sequences. See the benchmarks below.
+------------------------------------+---------+--------+---------------+-------------+-------------+--------------+
| model | backend | test | t/s | ms/t | concurrency | throughput/s |
+------------------------------------+---------+--------+---------------+-------------+-------------+--------------+
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA | pp 512 | 586.483±0.000 | 1.705±0.000 | 1 | 586.4834 |
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA | pp 512 | 344.665±0.348 | 2.901±0.003 | 2 | 689.3309 |
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA | pp 512 | 228.368±0.166 | 4.379±0.003 | 3 | 685.1029 |
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA | pp 512 | 172.260±0.111 | 5.805±0.004 | 4 | 689.0406 |
+------------------------------------+---------+--------+---------------+-------------+-------------+--------------+
+------------------------------------+---------+-------+---------------+-------------+-------------+--------------+
| model | backend | test | t/s | ms/t | concurrency | throughput/s |
+------------------------------------+---------+-------+---------------+-------------+-------------+--------------+
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA | pp 64 | 292.237±0.000 | 3.422±0.000 | 1 | 292.23743 |
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA | pp 64 | 308.450±2.230 | 3.242±0.023 | 2 | 616.8997 |
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA | pp 64 | 206.233±0.828 | 4.849±0.019 | 3 | 618.6996 |
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA | pp 64 | 172.977±0.872 | 5.781±0.029 | 4 | 691.9095 |
+------------------------------------+---------+-------+---------------+-------------+-------------+--------------+
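To make the padding argument above concrete, here is a rough back-of-the-envelope sketch (plain Rust, not mistral.rs code) of how much of a padded prefill batch is actually useful work when prompt lengths differ:

```rust
// Illustration only: when prompts of different lengths are batched, every
// sequence is padded up to the longest one, so the fraction of positions
// that carry real tokens drops as the lengths diverge.
fn padding_efficiency(prompt_lens: &[usize]) -> f64 {
    let max = prompt_lens.iter().copied().max().unwrap_or(0);
    if max == 0 {
        return 1.0;
    }
    let useful: usize = prompt_lens.iter().sum();
    useful as f64 / (max * prompt_lens.len()) as f64
}

fn main() {
    // e.g. one 512-token prompt batched with three 64-token prompts:
    // only ~34% of the padded batch is real tokens, the rest is wasted compute.
    println!("{:.2}", padding_efficiency(&[512, 64, 64, 64]));
}
```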
Alternatively, we could skip that and just modify get_prompt_input to return some sort of iterator over the chunks, where a given chunk may not include every sequence. A sketch of that idea is below.
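A minimal sketch of what that iterator could look like (the names and signature are hypothetical, not the existing get_prompt_input API): prompts are walked in fixed-size chunks, and a sequence simply stops appearing once its prompt is exhausted.

```rust
// Hypothetical chunked-prefill iterator, not the actual mistral.rs code.
struct PromptChunk<'a> {
    /// (sequence index, token slice for this chunk); only sequences that
    /// still have tokens left at this offset are present.
    entries: Vec<(usize, &'a [u32])>,
}

fn prompt_chunks<'a>(
    prompts: &'a [Vec<u32>],
    chunk_len: usize,
) -> impl Iterator<Item = PromptChunk<'a>> + 'a {
    assert!(chunk_len > 0);
    let max_len = prompts.iter().map(|p| p.len()).max().unwrap_or(0);
    (0..max_len).step_by(chunk_len).map(move |start| PromptChunk {
        entries: prompts
            .iter()
            .enumerate()
            // Drop sequences whose prompt ended before this chunk starts.
            .filter(|(_, p)| p.len() > start)
            .map(|(i, p)| (i, &p[start..p.len().min(start + chunk_len)]))
            .collect(),
    })
}
```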
This could work...
In the end, we need to decide whether to support parallel prefill. It makes batching more complicated because we need to be aware of the cache and set it up per batch, and it is only faster if we're not padding anything.
@EricLBuehler if we remove parallel prefill then I think the approach of https://github.com/EricLBuehler/mistral.rs/pull/219/commits/a165b7d4dc526deb7e823ae2b5b95d95c42ff358 might work?
So my current reasoning is:
- We already get natural parallelism during prefill, since all of a sequence's prompt tokens run in one forward pass
- Dealing with cache and batches for parallel prefill of multiple sequences is complex
- If the prompt size is reasonably big, we're almost at full speed
- If we have any padding, it'd be better to run prompts sequentially with no padding, given the small gains from running multiple sequence prompts in parallel
So I think we should move forward with the current approach. It is only worse if we're running the prompt phase of many tiny requests in parallel.
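For clarity, a sketch of that approach under stated assumptions (the trait and method names are illustrative, not the mistral.rs scheduler/pipeline API): prefill runs one sequence at a time as a single unpadded forward pass, and only the decode phase is batched.

```rust
// Illustrative pseudocode of "sequential prefill, batched decode".
trait Model {
    /// Run one forward pass over the whole prompt for `seq_id`,
    /// filling that sequence's KV cache. No padding is needed.
    fn prefill(&mut self, seq_id: usize, tokens: &[u32]);
    /// Advance every sequence in the batch by one token.
    fn decode_step(&mut self, seq_ids: &[usize]) -> Vec<u32>;
}

fn run<M: Model>(model: &mut M, prompts: &[(usize, Vec<u32>)]) {
    // Prefill: sequential across sequences, but still parallel across the
    // tokens of each prompt, so each pass runs at full width with no padding.
    for (seq_id, tokens) in prompts {
        model.prefill(*seq_id, tokens);
    }
    // Decode: all sequences step together, one token each per iteration
    // (only a single step is shown here).
    let ids: Vec<usize> = prompts.iter().map(|(id, _)| *id).collect();
    let _next_tokens = model.decode_step(&ids);
}
```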
I cherry-picked the proper commits to https://github.com/EricLBuehler/mistral.rs/pull/234 and I'll close this