[V1] Refactor parallel sampling support
The initial implementation in #10980 went to great lengths to add parallel sampling as a wrapper at the highest possible layer of abstraction. This resulted in a lot of tricky code to post-process RequestOutputs and aggregate them where necessary.
Instead, it probably makes more sense to implement parallel sampling at the layer that actually creates RequestOutput objects - i.e. in OutputProcessor.
To do this, we simply need to allow LLMEngine.add_request() to fan out child requests, passing details of the fan-out to OutputProcessor.
This adds some overhead to the n=1 case (see SingularSamplingRequest) in return for significantly less overhead and complication in the parallel sampling case.
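For illustration only, here is a minimal, self-contained sketch of the fan-out idea - not the actual vLLM classes. ParentRequest, ChildOutput, and the method signatures below are hypothetical stand-ins for the real add_request() / OutputProcessor plumbing; the point is just that child requests are created up front and aggregated back into a single result per parent.

```python
# Hypothetical sketch of fan-out + aggregation; not vLLM's real implementation.
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ChildOutput:
    """Output of one child (sample) of a parent request."""
    parent_id: str
    index: int       # which of the n samples this child is
    text: str
    finished: bool


@dataclass
class ParentRequest:
    request_id: str
    n: int
    # finished child texts, keyed by sample index
    completed: dict[int, str] = field(default_factory=dict)


class OutputProcessor:
    """Aggregates child outputs back into one result per parent request."""

    def __init__(self) -> None:
        self._parents: dict[str, ParentRequest] = {}

    def add_request(self, request_id: str, n: int) -> list[str]:
        """Fan out: return the child request ids the engine should schedule."""
        self._parents[request_id] = ParentRequest(request_id, n)
        if n == 1:
            # the singular (n=1) case keeps its original id and stays cheap
            return [request_id]
        return [f"{request_id}_{i}" for i in range(n)]

    def process(self, out: ChildOutput) -> list[str] | None:
        """Return the n aggregated completions once all children finish."""
        parent = self._parents[out.parent_id]
        if out.finished:
            parent.completed[out.index] = out.text
        if len(parent.completed) == parent.n:
            del self._parents[out.parent_id]
            return [parent.completed[i] for i in range(parent.n)]
        return None


if __name__ == "__main__":
    proc = OutputProcessor()
    child_ids = proc.add_request("req-0", n=2)
    print(child_ids)  # ['req-0_0', 'req-0_1']
    proc.process(ChildOutput("req-0", 0, "first sample", finished=True))
    done = proc.process(ChildOutput("req-0", 1, "second sample", finished=True))
    print(done)  # ['first sample', 'second sample']
```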
👋 Hi! Thank you for contributing to the vLLM project.
💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.
Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run fastcheck CI, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.
Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.
🚀
Nice work cleaning this up!
This adds some overhead to the n=1 case (see SingularSamplingRequest) in return for significantly less overhead and complication in the parallel sampling case.
We should verify that this overhead is negligible with a quick benchmark
This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @markmc.
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
Initial benchmarking results with n=1 (averaged over 3 runs) - there does not appear to be a performance regression. I may try a serving benchmark later.
main:
- Throughput: 41.7 rps
- P99 LLMEngine average execution time: 2.50 sec
parallel-sampling-refactor:
- Throughput: 42.0 rps
- P99 LLMEngine average execution time: 2.47 sec
Command to measure LLMEngine throughput:
VLLM_USE_V1=1 VLLM_ENABLE_V1_MULTIPROCESSING=1 python benchmarks/benchmark_throughput.py --model meta-llama/Llama-3.2-3B-Instruct --input-len 128 --output-len 512
Command to measure LLMEngine average execution time:
VLLM_USE_V1=1 VLLM_ENABLE_V1_MULTIPROCESSING=1 python benchmarks/benchmark_latency.py --model meta-llama/Llama-3.2-3B-Instruct --input-len 128 --output-len 512
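For reference, the parallel sampling path this PR refactors is only exercised with n > 1; a minimal offline example of that case (assuming the same model and the VLLM_USE_V1=1 environment used in the commands above):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.2-3B-Instruct")

# n > 1 exercises the fan-out / aggregation path rather than the n=1 fast path
params = SamplingParams(n=4, temperature=0.8, max_tokens=64)
outputs = llm.generate(["The capital of France is"], params)

# one RequestOutput per prompt, with the n completions aggregated inside it
for completion in outputs[0].outputs:
    print(completion.index, completion.text)
```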
Async serving benchmark results with n=1 (averaged over 3 runs) - there does not appear to be a performance regression. Since serving benchmark results can have high variance on beaker, for each metric I also include (loosely speaking) its range, computed as "worst" minus "best" as a percentage of the "best" (a small sketch of this calculation follows the results below).
main:
- Throughput: 79.44 rps (Lowest-Highest %: -2.3%)
- P99 TTFT: 3512.56 ms (Highest-Lowest %: 5.3%)
- P99 TPOT: 110.70 ms (Highest-Lowest %: 21.6%)
- P99 ITL: 114.84 ms (Highest-Lowest %: 3.0%)
parallel-sampling-refactor:
- Throughput: 80.49 rps (Lowest-Highest %: -5.3%), +1% improvement over main
- P99 TTFT: 3289.69 ms (Highest-Lowest %: 6.4%), -6% improvement over main
- P99 TPOT: 98.53 ms (Highest-Lowest %: 1.52%), -11% improvement over main
- P99 ITL: 111.13 ms (Highest-Lowest %: 0.96%), -3% improvement over main
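For clarity, the "Highest-Lowest %" / "Lowest-Highest %" spread quoted above is just (worst - best) / best expressed as a percentage. A tiny sketch of that calculation, using made-up sample values rather than the real run data:

```python
def spread_pct(samples: list[float], higher_is_better: bool = False) -> float:
    """(worst - best) / best * 100, i.e. the range quoted next to each metric."""
    best = max(samples) if higher_is_better else min(samples)
    worst = min(samples) if higher_is_better else max(samples)
    return (worst - best) / best * 100.0


# dummy values from three hypothetical runs (not the real benchmark numbers)
print(spread_pct([100.0, 103.0, 105.0]))                      # 5.0  (latency-style metric)
print(spread_pct([80.0, 81.0, 82.0], higher_is_better=True))  # ~-2.4 (throughput-style metric)
```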
Command to bring up vLLM v1 engine server:
VLLM_USE_V1=1 VLLM_ENABLE_V1_MULTIPROCESSING=1 vllm serve meta-llama/Llama-3.2-3B-Instruct --port 8091 --disable-log-requests --no-enable-prefix-caching
Benchmark command:
python benchmarks/benchmark_serving.py --model meta-llama/Llama-3.2-3B-Instruct --dataset-path ../sharegpt.json --port 8091
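Since the serving benchmark above only covers n=1, the parallel sampling (n > 1) path could also be exercised against the same server through its OpenAI-compatible completions endpoint. A small sketch, assuming the openai Python client and the vllm serve command above running on port 8091:

```python
from openai import OpenAI

# assumes the vllm serve command above is running locally on port 8091
client = OpenAI(base_url="http://localhost:8091/v1", api_key="EMPTY")

resp = client.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    prompt="San Francisco is a",
    n=4,  # request four parallel samples for a single prompt
    max_tokens=64,
    temperature=0.8,
)
for choice in resp.choices:
    print(choice.index, choice.text)
```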
Nice work cleaning this up!
This adds some overhead to the n=1 case (see SingularSamplingRequest) in return for significantly less overhead and complication in the parallel sampling case.
We should verify that this overhead is negligible with a quick benchmark
Based on the benchmarking results above, there does not appear to be a perf regression.
Lint and deploy minio setup failing with:
make_bucket failed: s3://testbucket Could not connect to the endpoint URL: "http://127.0.0.1:9000/testbucket"