
Batched Prefill

Open lucasavila00 opened this issue 9 months ago • 4 comments

lucasavila00 avatar Apr 27 '24 01:04 lucasavila00

Code Metrics Report
  ───────────────────────────────────────────────────────────────────────────────
Language                 Files     Lines   Blanks  Comments     Code Complexity
───────────────────────────────────────────────────────────────────────────────
Rust                        70     23339     1550       508    21281       1281
───────────────────────────────────────────────────────────────────────────────
Total                       70     23339     1550       508    21281       1281
───────────────────────────────────────────────────────────────────────────────
Estimated Cost to Develop 69,864
Estimated Schedule Effort 11.811066 months
Estimated People Required 5.038645
───────────────────────────────────────────────────────────────────────────────
Processed 768517 bytes, 0.769 megabytes (SI)
───────────────────────────────────────────────────────────────────────────────
  

github-actions[bot] avatar Apr 27 '24 01:04 github-actions[bot]

Is this necessary to avoid the memory spikes? If so, we should make this feature a new SchedulerMethod.

I ran the benchmark and it is not required. The spike only happens at pp=2048, not at pp=512 with c=4.

One improvement that I see is to allow a batch size greater than one.

Is it worth supporting parallel prefill? For me, using GGUF, it only helps when prefilling fewer than 128 tokens. Beyond 128 tokens, padding makes it slower than non-parallel prefill.

So parallel prefill is only worth running on a batch of very small sequences. See the benchmarks below.

+------------------------------------+---------+--------+---------------+-------------+-------------+--------------+
| model                              | backend | test   | t/s           | ms/t        | concurrency | throughput/s |
+------------------------------------+---------+--------+---------------+-------------+-------------+--------------+
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA    | pp 512 | 586.483±0.000 | 1.705±0.000 |           1 |     586.4834 |
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA    | pp 512 | 344.665±0.348 | 2.901±0.003 |           2 |     689.3309 |
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA    | pp 512 | 228.368±0.166 | 4.379±0.003 |           3 |     685.1029 |
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA    | pp 512 | 172.260±0.111 | 5.805±0.004 |           4 |     689.0406 |
+------------------------------------+---------+--------+---------------+-------------+-------------+--------------+
+------------------------------------+---------+-------+---------------+-------------+-------------+--------------+
| model                              | backend | test  | t/s           | ms/t        | concurrency | throughput/s |
+------------------------------------+---------+-------+---------------+-------------+-------------+--------------+
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA    | pp 64 | 292.237±0.000 | 3.422±0.000 |           1 |    292.23743 |
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA    | pp 64 | 308.450±2.230 | 3.242±0.023 |           2 |     616.8997 |
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA    | pp 64 | 206.233±0.828 | 4.849±0.019 |           3 |     618.6996 |
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA    | pp 64 | 172.977±0.872 | 5.781±0.029 |           4 |     691.9095 |
+------------------------------------+---------+-------+---------------+-------------+-------------+--------------+
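To make the padding point concrete, here is a rough back-of-the-envelope sketch (illustrative only, not mistral.rs code; the helper and numbers are just examples). It estimates how much extra work padding adds to a batched prefill, which is the overhead that eats the modest throughput gain visible in the pp 512 table above.

    // Illustrative only: estimates the wasted work from padding a prefill batch.
    fn padding_overhead(prompt_lens: &[usize]) -> f64 {
        let max_len = prompt_lens.iter().copied().max().unwrap_or(0);
        let padded: usize = max_len * prompt_lens.len(); // tokens actually pushed through the model
        let real: usize = prompt_lens.iter().sum();      // tokens we needed to prefill
        padded as f64 / real as f64
    }

    fn main() {
        // Equal lengths: no padding, batching keeps its full ~1.17x gain (689 vs 586 t/s above).
        println!("{:.2}", padding_overhead(&[512, 512, 512, 512])); // 1.00
        // Mixed lengths: 2048 padded tokens for 1664 real ones -> 1.23x extra work,
        // which already outweighs the ~1.17x gain from batching.
        println!("{:.2}", padding_overhead(&[512, 512, 512, 128])); // 1.23
    }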

Alternatively, we could skip that and just modify get_prompt_input to return some sort of iterator over the chunks, where some sequences would not be present in every chunk.

This could work...
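Roughly what that chunked get_prompt_input could look like (a sketch with made-up types, not the real signature): yield fixed-size chunks in rounds, and sequences whose prompts are already consumed simply drop out of later chunks.

    // Sketch only: hypothetical types, not the actual scheduler/sequence structs.
    struct Seq { id: usize, prompt: Vec<u32> }

    /// Yields batches of (seq id, chunk of prompt tokens) in rounds of `chunk_size`;
    /// shorter sequences stop appearing once their prompt is exhausted.
    fn prompt_chunks<'a>(
        seqs: &'a [Seq],
        chunk_size: usize,
    ) -> impl Iterator<Item = Vec<(usize, &'a [u32])>> + 'a {
        (0..).map(move |round| {
            let start = round * chunk_size;
            seqs.iter()
                .filter(|s| start < s.prompt.len())
                .map(|s| {
                    let end = (start + chunk_size).min(s.prompt.len());
                    (s.id, &s.prompt[start..end])
                })
                .collect::<Vec<_>>()
        })
        .take_while(|batch| !batch.is_empty())
    }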

In the end we need to decide whether we support parallel prefill. It makes batching more complicated, because we need to be aware of the cache and set it up per batch, and it is only faster if we're not padding anything.

lucasavila00 avatar Apr 27 '24 03:04 lucasavila00

@EricLBuehler if we remove parallel prefill then I think the approach of https://github.com/EricLBuehler/mistral.rs/pull/219/commits/a165b7d4dc526deb7e823ae2b5b95d95c42ff358 might work?

lucasavila00 avatar Apr 27 '24 04:04 lucasavila00

So my current reasoning is:

  1. We already get natural parallelization during prefill, since we process many tokens of a single sequence at once
  2. Handling the cache and batches for parallel prefill of multiple sequences is complex
  3. If the prompt is reasonably large, we're already running at close to full speed
  4. If there is any padding, it's better to run the sequences one after another with no padding, because the gains from running multiple sequence prompts in parallel are small

So I think we should move forward with the current approach. It is only worse than parallel prefill when we're running the prompt phase of many tiny requests at once.
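For reference, the benchmark tables above back this up (simple ratios of aggregate throughput against the single-sequence rows):

    pp 512, c=4: 689.0 / 586.5 ≈ 1.17x aggregate speedup  -> easily erased by any padding
    pp 64,  c=4: 691.9 / 292.2 ≈ 2.37x aggregate speedup  -> the only case where batching clearly wins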

lucasavila00 avatar Apr 27 '24 04:04 lucasavila00

I cherry-picked the relevant commits into https://github.com/EricLBuehler/mistral.rs/pull/234, so I'll close this.

lucasavila00 avatar Apr 28 '24 03:04 lucasavila00