[Usage]: Throughput and quality issue with vllm 0.6.0.

Open Agrawalchitranshu opened this issue 1 year ago • 3 comments

According to the vLLM community, vLLM 0.6.0 is an improved version with 5x throughput. I installed vllm==0.6.0, but the throughput is the same as before. The response quality also seems degraded in this version. Has anyone faced a similar issue with this version?

Agrawalchitranshu avatar Sep 09 '24 05:09 Agrawalchitranshu

I ran some benchmarks with LLMPerf, using 150 requests of 1000 input tokens and 500 output tokens each (CUDA 12.4, NVIDIA driver 550).

| GPU model | LLM model | vLLM version | 1 concurrent request (tokens/s) | 10 concurrent requests (tokens/s) |
| --- | --- | --- | --- | --- |
| RTX4090 | llama3.1 8B fp8 | v0.5.3 | 90 | 714 |
| RTX4090 | llama3.1 8B fp8 | v0.6.0 | 87 | 680 |
| RTX4090 | llama3 8B fp16 | v0.4.1 | / | 484 |
| RTX4090 | llama3 8B fp16 | v0.6.0 | / | 488 |

For llama3.1 8B, I set --max-model-len 80000.

All vLLM versions are from the official images on Docker Hub.
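
To make the version-to-version difference easier to read, here is a minimal sketch that computes the percentage change for each row pair in the table above. The helper name `pct_change` and the dictionary layout are illustrative assumptions, not part of LLMPerf or vLLM; the numbers are copied from the table.

```python
def pct_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent (negative means slower)."""
    return (new - old) / old * 100.0


# (older-version tokens/s, v0.6.0 tokens/s), copied from the table above.
# The labels are illustrative only.
results = {
    "llama3.1 8B fp8, 1 concurrent (v0.5.3 -> v0.6.0)": (90, 87),
    "llama3.1 8B fp8, 10 concurrent (v0.5.3 -> v0.6.0)": (714, 680),
    "llama3 8B fp16, 10 concurrent (v0.4.1 -> v0.6.0)": (484, 488),
}

changes = {label: pct_change(old, new) for label, (old, new) in results.items()}

for label, change in changes.items():
    print(f"{label}: {change:+.2f}%")
```

On these numbers the fp8 runs are a few percent slower on v0.6.0 and the fp16 run is under one percent faster, so none of the rows comes close to a 5x change.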

cpwan avatar Sep 09 '24 12:09 cpwan

Benchmarking is an art form, and the percentage improvement in throughput varies with model size and GPU. vLLM 0.6.0 is optimized for high-throughput scenarios, particularly where CPU load is significant. You can refer to the testing script provided by sglang for throughput testing. Here is the link: https://github.com/sgl-project/sglang/blob/main/python/sglang/bench_serving.py

cherishhh avatar Sep 10 '24 07:09 cherishhh

Hello! For the Llama 3.1 70B AWQ 4-bit model on 1 x A100, version 0.6.0 even became a little worse. I ran a comparative test using benchmark_throughput.py:

Version 0.6.0: {'elapsed_time': 305.8413491959218, 'num_requests': 10, 'requests_per_seconds': 0.03269669070676904}
Version 0.5.5: {'elapsed_time': 287.37649882701226, 'num_requests': 10, 'requests_per_sec': 0.03479755665761504}

That is, there is no improvement for quantized models. The same data was used for both runs. Inputs range from 10 to 27k tokens, with output = 512 and max_model_len=32000.
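
As a sanity check on the figures above, the following sketch recomputes requests per second from the reported elapsed time and request count, then expresses the gap between the two versions as a percentage. The function and variable names are hypothetical and are not taken from the benchmark script; only the numeric values come from the output quoted above.

```python
def requests_per_second(num_requests: int, elapsed_time: float) -> float:
    """Throughput as completed requests divided by wall-clock seconds."""
    return num_requests / elapsed_time


# Values copied from the reported benchmark output.
v060 = {"elapsed_time": 305.8413491959218, "num_requests": 10}
v055 = {"elapsed_time": 287.37649882701226, "num_requests": 10}

rps_060 = requests_per_second(v060["num_requests"], v060["elapsed_time"])
rps_055 = requests_per_second(v055["num_requests"], v055["elapsed_time"])

# Positive value means v0.6.0 is slower than v0.5.5.
slowdown_pct = (rps_055 - rps_060) / rps_055 * 100.0

print(f"v0.6.0: {rps_060:.5f} req/s")
print(f"v0.5.5: {rps_055:.5f} req/s")
print(f"v0.6.0 is {slowdown_pct:.1f}% slower")
```

The recomputed values match the reported ones, and the difference works out to roughly 6% in favor of 0.5.5. With only 10 requests per run, a gap of that size may be within run-to-run noise.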

HelenaSak avatar Sep 12 '24 10:09 HelenaSak

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

github-actions[bot] avatar Dec 12 '24 02:12 github-actions[bot]

This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!

github-actions[bot] avatar Jan 12 '25 02:01 github-actions[bot]