vllm [Usage]: Throughput and quality issue with vllm 0.6.0.

According to the vLLM community, vLLM 0.6.0 is an improved version with 5x throughput. I installed vllm==0.6.0, but the throughput is the same as before. The response quality has also degraded in this version. Has anyone else faced similar issues with this version?

Sep 09 '24 05:09 Agrawalchitranshu

I ran some benchmarks with LLMPerf, using 150 requests of 1000 input tokens and 500 output tokens each (CUDA 12.4, NVIDIA driver 550).

GPU Model	LLM model	vLLM version	1 concurrent request (token/s)	10 concurrent requests (token/s)
RTX4090	llama3.1 8B fp8	v0.5.3	90	714
RTX4090	llama3.1 8B fp8	v0.6.0	87	680
RTX4090	llama3 8B fp16	v0.4.1	/	484
RTX4090	llama3 8B fp16	v0.6.0	/	488

For llama3.1 8B, I set --max-model-len 80000.

All vLLM versions are from the official images on Docker Hub.
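
To put the table above in relative terms, here is a small sketch that computes the percentage change between the paired rows. The figures are copied from the table; the helper name `pct_change` is only for illustration.

```python
# Tokens/s figures copied from the table above (RTX 4090).
# Each entry: (label, older-version result, v0.6.0 result).
rows = [
    ("llama3.1 8B fp8, 1 concurrent (v0.5.3 -> v0.6.0)", 90, 87),
    ("llama3.1 8B fp8, 10 concurrent (v0.5.3 -> v0.6.0)", 714, 680),
    ("llama3 8B fp16, 10 concurrent (v0.4.1 -> v0.6.0)", 484, 488),
]


def pct_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


for label, old, new in rows:
    print(f"{label}: {pct_change(old, new):+.1f}%")
```

All three differences come out within about 5% in either direction, which may well be inside run-to-run noise for a 150-request test.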

Sep 09 '24 12:09 cpwan

Benchmarking is an art form, and the percentage improvement in throughput varies with model size and GPU. vLLM 0.6.0 is optimized for high-throughput scenarios, particularly where CPU load is significant. You can refer to the testing script provided by sglang for throughput testing. Here is the link: https://github.com/sgl-project/sglang/blob/main/python/sglang/bench_serving.py

Sep 10 '24 07:09 cherishhh

Hello! For the Llama 3.1 70B AWQ 4-bit model on 1 x A100, version 0.6.0 is even slightly worse. I ran a comparison using benchmark_throughput.py:

Version 0.6.0: {'elapsed_time': 305.8413491959218, 'num_requests': 10, 'requests_per_seconds': 0.03269669070676904}
Version 0.5.5: {'elapsed_time': 287.37649882701226, 'num_requests': 10, 'requests_per_sec': 0.03479755665761504}

That is, there is no improvement for quantized models. The same data was used for all requests. Inputs range from 10 to 27k tokens, with output = 512 and max_model_len=32000.
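
As a quick sanity check on the figures in the comment above, the sketch below recomputes requests per second from the reported elapsed times and expresses the difference as a percentage. The dictionary layout and the helper name `requests_per_second` are assumptions made for illustration, not the benchmark script's own output format.

```python
# Elapsed times and request counts copied from the comment above
# (Llama 3.1 70B AWQ 4-bit, 1 x A100).
results = {
    "0.5.5": {"elapsed_time": 287.37649882701226, "num_requests": 10},
    "0.6.0": {"elapsed_time": 305.8413491959218, "num_requests": 10},
}


def requests_per_second(result: dict) -> float:
    """Completed requests divided by wall-clock seconds."""
    return result["num_requests"] / result["elapsed_time"]


old = requests_per_second(results["0.5.5"])
new = requests_per_second(results["0.6.0"])
change = (new - old) / old * 100.0

print(f"0.5.5: {old:.5f} req/s")
print(f"0.6.0: {new:.5f} req/s")
print(f"change: {change:+.2f}%")
```

The recomputed rates match the reported ones, and 0.6.0 comes out roughly 6% slower. With only 10 requests per run, a difference of this size should be read with caution.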

Sep 12 '24 10:09 HelenaSak

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

Dec 12 '24 02:12 github-actions[bot]

This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!

Jan 12 '25 02:01 github-actions[bot]