
[V1] Refactor parallel sampling support

Open markmc opened this issue 9 months ago • 7 comments

The initial implementation in #10980 went to great lengths to add parallel sampling as a wrapper at the highest possible layer of abstraction. This resulted in a lot of tricky code to post-process RequestOutputs and aggregate them where necessary.

Instead, it probably makes sense to implement parallel sampling at the layer that actually creates RequestOutput objects, i.e. in OutputProcessor.

To do this, we simply need to allow for fanning out child requests in LLMEngine.add_request(), passing details of the fan-out to OutputProcessor.

This adds some overhead to the n=1 case (see SingularSamplingRequest) in return for significantly less overhead and complication in the parallel sampling case.
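To make the intended shape concrete, here is a minimal, self-contained sketch of the fan-out idea. It is not vLLM's actual code: ParentRequest, ChildOutput, OutputProcessorSketch and the child-id scheme are all invented for illustration; the real change would live in LLMEngine.add_request() and OutputProcessor.

from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ChildOutput:
    """One finished sample produced by the engine core for a child request."""
    request_id: str
    text: str


@dataclass
class ParentRequest:
    """Book-keeping for a user request fanned out into n child requests."""
    request_id: str
    n: int
    child_outputs: dict[str, ChildOutput] = field(default_factory=dict)

    def child_ids(self) -> list[str]:
        # Derived ids let the output processor map children back to the parent
        # (this particular id scheme is made up for the sketch).
        return [f"{self.request_id}-{i}" for i in range(self.n)]

    def record(self, out: ChildOutput) -> list[ChildOutput] | None:
        """Collect one child's output; return all n outputs once complete."""
        self.child_outputs[out.request_id] = out
        if len(self.child_outputs) == self.n:
            return [self.child_outputs[cid] for cid in self.child_ids()]
        return None


class OutputProcessorSketch:
    """Stands in for the layer that actually builds RequestOutput objects."""

    def __init__(self) -> None:
        self.parents: dict[str, ParentRequest] = {}  # child id -> parent

    def add_request(self, request_id: str, n: int) -> list[str]:
        """Fan a single user request out into n children (n=1 stays trivial)."""
        parent = ParentRequest(request_id, n)
        for cid in parent.child_ids():
            self.parents[cid] = parent
        return parent.child_ids()

    def process_output(self, out: ChildOutput) -> list[ChildOutput] | None:
        """Return the aggregated outputs when the last child finishes, else None."""
        parent = self.parents.pop(out.request_id)
        return parent.record(out)


proc = OutputProcessorSketch()
for i, cid in enumerate(proc.add_request("req-0", n=3)):
    done = proc.process_output(ChildOutput(cid, f"sample {i}"))
print([o.text for o in done])  # all three samples aggregated under req-0

The point of the sketch is that only the layer building RequestOutputs needs to know about the parent/child mapping; the engine core just sees n ordinary requests.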

markmc avatar Feb 24 '25 16:02 markmc

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run fastcheck CI, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

github-actions[bot] avatar Feb 24 '25 16:02 github-actions[bot]

Nice work cleaning this up!

> This adds some overhead to the n=1 case (see SingularSamplingRequest) in return for significantly less overhead and complication in the parallel sampling case.

We should verify that this overhead is negligible with a quick benchmark.

mgoin avatar Feb 24 '25 17:02 mgoin

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @markmc.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify[bot] avatar Feb 25 '25 10:02 mergify[bot]

Initial benchmarking results with n=1 (averaged over 3 runs): it appears there is no performance regression. I may try a serving benchmark later.

main:

  • Throughput: 41.7 rps
  • P99 LLMEngine average execution time: 2.50 sec

parallel-sampling-refactor:

  • Throughput: 42.0 rps
  • P99 LLMEngine average execution time: 2.47 sec

Command to measure LLMEngine throughput:

VLLM_USE_V1=1 VLLM_ENABLE_V1_MULTIPROCESSING=1 python benchmarks/benchmark_throughput.py --model meta-llama/Llama-3.2-3B-Instruct --input-len 128 --output-len 512

Command to measure LLMEngine average execution time:

VLLM_USE_V1=1 VLLM_ENABLE_V1_MULTIPROCESSING=1 python benchmarks/benchmark_latency.py --model meta-llama/Llama-3.2-3B-Instruct --input-len 128 --output-len 512

afeldman-nm avatar Feb 27 '25 16:02 afeldman-nm

Async serving benchmark results with n=1 (averaged over 3 runs): it appears there is no performance regression. Since serving benchmark results can have high variance on beaker, for each metric I also report (loosely speaking) its spread across runs, computed as "worst" minus "best" as a percentage of the "best" (a small worked example of this arithmetic follows the benchmark commands below).

main:

  • Throughput: 79.44 rps (Lowest-Highest %: -2.3%)
  • P99 TTFT: 3512.56 ms (Highest-Lowest %: 5.3%)
  • P99 TPOT: 110.70 ms (Highest-Lowest %: 21.6%)
  • P99 ITL: 114.84 ms (Highest-Lowest %: 3.0%)

parallel-sampling-refactor:

  • Throughput: 80.49 rps (Lowest-Highest %: -5.3%), +1% improvement over main
  • P99 TTFT: 3289.69 ms (Highest-Lowest %: 6.4%), -6% improvement over main
  • P99 TPOT: 98.53 ms (Highest-Lowest %: 1.52%), -11% improvement over main
  • P99 ITL: 111.13 ms (Highest-Lowest %: 0.96%), -3% improvement over main

Command to bring up vLLM v1 engine server:

VLLM_USE_V1=1 VLLM_ENABLE_V1_MULTIPROCESSING=1 vllm serve meta-llama/Llama-3.2-3B-Instruct --port 8091 --disable-log-requests --no-enable-prefix-caching

Benchmark command:

python benchmarks/benchmark_serving.py --model meta-llama/Llama-3.2-3B-Instruct --dataset-path ../sharegpt.json  --port 8091
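For reference, a small worked example of the "worst minus best as a percentage of best" spread described above; the three run values are made up purely to show the arithmetic, not taken from the actual beaker runs.

# Hypothetical per-run values for one latency metric (e.g. P99 TPOT in ms).
runs_ms = [98.1, 99.0, 100.2]

best, worst = min(runs_ms), max(runs_ms)   # for latency, lower is better
spread_pct = (worst - best) / best * 100   # "worst" minus "best" as % of "best"
mean_ms = sum(runs_ms) / len(runs_ms)

print(f"mean={mean_ms:.2f} ms, spread={spread_pct:.1f}%")
# For throughput the roles flip (best = max, worst = min), so the spread
# comes out negative, as in the "Lowest-Highest %" lines above.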

afeldman-nm avatar Feb 27 '25 19:02 afeldman-nm

> Nice work cleaning this up!
>
> This adds some overhead to the n=1 case (see SingularSamplingRequest) in return for significantly less overhead and complication in the parallel sampling case.
>
> We should verify that this overhead is negligible with a quick benchmark.

Based on the benchmarking results above, there does not appear to be a perf regression.

afeldman-nm avatar Feb 27 '25 19:02 afeldman-nm

Lint and deploy minio setup failing with:

make_bucket failed: s3://testbucket Could not connect to the endpoint URL: "http://127.0.0.1:9000/testbucket"

markmc avatar Mar 03 '25 11:03 markmc