Batched Prefill
Code Metrics Report
───────────────────────────────────────────────────────────────────────────────
Language                 Files     Lines    Blanks  Comments      Code  Complexity
───────────────────────────────────────────────────────────────────────────────
Rust                        70     23339      1550       508     21281        1281
───────────────────────────────────────────────────────────────────────────────
Total                       70     23339      1550       508     21281        1281
───────────────────────────────────────────────────────────────────────────────
Estimated Cost to Develop 69,864
Estimated Schedule Effort 11.811066 months
Estimated People Required 5.038645
───────────────────────────────────────────────────────────────────────────────
Processed 768517 bytes, 0.769 megabytes (SI)
───────────────────────────────────────────────────────────────────────────────
Is this necessary to avoid the memory spikes? If so, we should make this feature a new SchedulerMethod.
I ran the benchmark and it is not required. The spike only happens with pp=2048, not with pp=512, c=4.
One improvement that I see is to allow a batch size greater than one.
Is it worth supporting parallel prefill? For me, using GGUF, it only helps when prefilling fewer than 128 tokens. With more than 128 tokens, padding makes it slower than non-parallel prefill.
So it is only worth running parallel prefill on a batch of very small sequences. See the benchmarks below.
+------------------------------------+---------+--------+---------------+-------------+-------------+--------------+
| model | backend | test | t/s | ms/t | concurrency | throughput/s |
+------------------------------------+---------+--------+---------------+-------------+-------------+--------------+
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA | pp 512 | 586.483±0.000 | 1.705±0.000 | 1 | 586.4834 |
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA | pp 512 | 344.665±0.348 | 2.901±0.003 | 2 | 689.3309 |
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA | pp 512 | 228.368±0.166 | 4.379±0.003 | 3 | 685.1029 |
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA | pp 512 | 172.260±0.111 | 5.805±0.004 | 4 | 689.0406 |
+------------------------------------+---------+--------+---------------+-------------+-------------+--------------+
+------------------------------------+---------+-------+---------------+-------------+-------------+--------------+
| model | backend | test | t/s | ms/t | concurrency | throughput/s |
+------------------------------------+---------+-------+---------------+-------------+-------------+--------------+
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA | pp 64 | 292.237±0.000 | 3.422±0.000 | 1 | 292.23743 |
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA | pp 64 | 308.450±2.230 | 3.242±0.023 | 2 | 616.8997 |
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA | pp 64 | 206.233±0.828 | 4.849±0.019 | 3 | 618.6996 |
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA | pp 64 | 172.977±0.872 | 5.781±0.029 | 4 | 691.9095 |
+------------------------------------+---------+-------+---------------+-------------+-------------+--------------+
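To make the padding argument above concrete, here is a rough back-of-the-envelope sketch (plain Rust, not mistral.rs code) of how much of a padded prefill batch is actually useful work when prompt lengths differ:

```rust
// Illustration only: when prompts of different lengths are batched, every
// sequence is padded up to the longest one, so the fraction of positions
// that carry real tokens drops as the lengths diverge.
fn padding_efficiency(prompt_lens: &[usize]) -> f64 {
    let max = prompt_lens.iter().copied().max().unwrap_or(0);
    if max == 0 {
        return 1.0;
    }
    let useful: usize = prompt_lens.iter().sum();
    useful as f64 / (max * prompt_lens.len()) as f64
}

fn main() {
    // e.g. one 512-token prompt batched with three 64-token prompts:
    // only ~34% of the padded batch is real tokens, the rest is wasted compute.
    println!("{:.2}", padding_efficiency(&[512, 64, 64, 64]));
}
```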
Alternatively, we could skip that and just modify get_prompt_input to return some sort of iterator over the chunks, where a given chunk may not include every sequence. A sketch of that idea is below.
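A minimal sketch of what that iterator could look like (the names and signature are hypothetical, not the existing get_prompt_input API): prompts are walked in fixed-size chunks, and a sequence simply stops appearing once its prompt is exhausted.

```rust
// Hypothetical chunked-prefill iterator, not the actual mistral.rs code.
struct PromptChunk<'a> {
    /// (sequence index, token slice for this chunk); only sequences that
    /// still have tokens left at this offset are present.
    entries: Vec<(usize, &'a [u32])>,
}

fn prompt_chunks<'a>(
    prompts: &'a [Vec<u32>],
    chunk_len: usize,
) -> impl Iterator<Item = PromptChunk<'a>> + 'a {
    assert!(chunk_len > 0);
    let max_len = prompts.iter().map(|p| p.len()).max().unwrap_or(0);
    (0..max_len).step_by(chunk_len).map(move |start| PromptChunk {
        entries: prompts
            .iter()
            .enumerate()
            // Drop sequences whose prompt ended before this chunk starts.
            .filter(|(_, p)| p.len() > start)
            .map(|(i, p)| (i, &p[start..p.len().min(start + chunk_len)]))
            .collect(),
    })
}
```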
This could work...
In the end, we need to decide whether to support parallel prefill. It makes batching more complicated because we need to be aware of the cache and set it up per batch, and it is only faster if we're not padding anything.
@EricLBuehler if we remove parallel prefill then I think the approach of https://github.com/EricLBuehler/mistral.rs/pull/219/commits/a165b7d4dc526deb7e823ae2b5b95d95c42ff358 might work?
So my current reasoning is:
- We already get natural parallelism during prefill, since all of a sequence's prompt tokens run in one forward pass
- Dealing with cache and batches for parallel prefill of multiple sequences is complex
- If the prompt size is reasonably big, we're almost at full speed
- If we have any padding, it'd be better to run prompts sequentially with no padding, given the small gains from running multiple sequence prompts in parallel
So I think we should move forward with the current approach. It is only worse if we're running the prompt phase of many tiny requests in parallel.
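For clarity, a sketch of that approach under stated assumptions (the trait and method names are illustrative, not the mistral.rs scheduler/pipeline API): prefill runs one sequence at a time as a single unpadded forward pass, and only the decode phase is batched.

```rust
// Illustrative pseudocode of "sequential prefill, batched decode".
trait Model {
    /// Run one forward pass over the whole prompt for `seq_id`,
    /// filling that sequence's KV cache. No padding is needed.
    fn prefill(&mut self, seq_id: usize, tokens: &[u32]);
    /// Advance every sequence in the batch by one token.
    fn decode_step(&mut self, seq_ids: &[usize]) -> Vec<u32>;
}

fn run<M: Model>(model: &mut M, prompts: &[(usize, Vec<u32>)]) {
    // Prefill: sequential across sequences, but still parallel across the
    // tokens of each prompt, so each pass runs at full width with no padding.
    for (seq_id, tokens) in prompts {
        model.prefill(*seq_id, tokens);
    }
    // Decode: all sequences step together, one token each per iteration
    // (only a single step is shown here).
    let ids: Vec<usize> = prompts.iter().map(|(id, _)| *id).collect();
    let _next_tokens = model.decode_step(&ids);
}
```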
I cherry-picked the proper commits to https://github.com/EricLBuehler/mistral.rs/pull/234 and I'll close this