
[Performance]: Adding requests takes too much time, and the model will not run until all the requests are added into the cache

Open CuriousCat-7 opened this issue 10 months ago • 5 comments

Proposal to improve performance

INFO 02-14 11:57:32 engine.py:275] Added request chatcmpl-1af15bd86d5f413683cd727e1028852c.                                                                                                                                                                              
INFO 02-14 11:57:32 engine.py:275] Added request chatcmpl-b4e5eba8d8d144a0813ffb6e378ee784.                                                                                                                                                                              
INFO 02-14 11:57:32 engine.py:275] Added request chatcmpl-1ca0f490ea104efc9884777815e51618.                                                                                                                                                                              
INFO 02-14 11:57:33 engine.py:275] Added request chatcmpl-984040d9c3cf424984a719970de484f5.                                                                                                                                                                              
INFO 02-14 11:57:33 engine.py:275] Added request chatcmpl-532cbdba66794d859a61423270e06baf.                                                                                                                                                                              
INFO 02-14 11:57:33 engine.py:275] Added request chatcmpl-083f1271382f4bd189c35a604b137bc8.                                                                                                                                                                              
INFO 02-14 11:57:33 engine.py:275] Added request chatcmpl-d5c44ff025cc44149c4b64dcd30fa494.                                                                                                                                                                              
INFO 02-14 11:57:34 engine.py:275] Added request chatcmpl-087039221d0a463a9779b4f072b853ee.                                                                                                                                                                              
INFO 02-14 11:57:34 engine.py:275] Added request chatcmpl-22734de905b74010910ea9511d27462c.                                                                                                                                                                              
INFO 02-14 11:57:34 engine.py:275] Added request chatcmpl-3ad72c9c11f84b49ac6ae2437e1064cc.                                                                                                                                                                              
INFO 02-14 11:57:35 engine.py:275] Added request chatcmpl-180206440e054294b53baf79ffbedce7.                                                                                                                                                                              
INFO 02-14 11:57:35 engine.py:275] Added request chatcmpl-68d902705e3743a0b72add7e2711f9a0.                                                                                                                                                                              
INFO 02-14 11:57:35 engine.py:275] Added request chatcmpl-c807fdcd39584de7a80b5c7e278a55c2.                                                                                                                                                                              
INFO 02-14 11:57:35 engine.py:275] Added request chatcmpl-43bc35cb2cd94141bd9e24fffa06dacc. 

I use 4xA800 to run qwen2-vl as a vLLM service; I found that when I send 100 requests, the first request waits until the 100th request is cached. How can I reduce the latency and maximize the GPU usage?
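
For reference, a minimal client-side sketch of how such a burst of requests might be issued against the OpenAI-compatible endpoint, assuming the server runs at http://localhost:8000/v1 (the model name and the in-flight cap of 8 are placeholders); capping client-side concurrency is one way to keep the first responses from waiting behind the whole batch:

```python
# Sketch only: concurrent chat-completion requests with a bounded number in flight.
# Assumptions: server at http://localhost:8000/v1, placeholder model name, cap of 8.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
sem = asyncio.Semaphore(8)  # limit in-flight requests instead of flooding the server

async def ask(prompt: str) -> str:
    async with sem:
        resp = await client.chat.completions.create(
            model="Qwen/Qwen2-VL-7B-Instruct",
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

async def main() -> None:
    prompts = [f"Question {i}" for i in range(100)]
    answers = await asyncio.gather(*(ask(p) for p in prompts))
    print(len(answers), "responses received")

asyncio.run(main())
```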

Report of performance regression

No response

Misc discussion on performance

No response

Your current environment (if you think it is necessary)

The output of `python collect_env.py`

Before submitting a new issue...

  • [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

CuriousCat-7 avatar Feb 14 '25 04:02 CuriousCat-7

In my case, the _validate_and_add_requests method accounts for approximately 20% of the total computation time, leaving GPU resources underutilized.
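
If this measurement comes from the offline LLM entrypoint, a rough way to reproduce it is to profile a batched generate call and look for _validate_and_add_requests in the cumulative times; the model name, prompts, and sizes below are placeholders:

```python
# Sketch only: profile a batched generate call to see how much wall time goes
# into request admission versus decoding. Model name and prompts are placeholders.
import cProfile
import pstats

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2-VL-7B-Instruct", tensor_parallel_size=4)
prompts = ["Describe the image in one sentence."] * 100
params = SamplingParams(max_tokens=128)

with cProfile.Profile() as prof:
    llm.generate(prompts, params)

# _validate_and_add_requests should appear near the top if adding requests,
# rather than decoding, dominates the total time.
pstats.Stats(prof).sort_stats("cumulative").print_stats(20)
```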

a7744hsc avatar Mar 26 '25 05:03 a7744hsc

> In my case, the _validate_and_add_requests method accounts for approximately 20% of the total computation time, leaving GPU resources underutilized.

I forgot about this post for weeks. I finally "solved" it by setting --max-num-seqs to a relatively small number, like NUM_GPUS. I don't think it is an official solution, but it works in my case.
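
For anyone else trying this, a sketch of the workaround with the OpenAI-compatible server entrypoint (the model name, GPU count, and the value 4 are placeholders):

```bash
# Sketch only: launch the server with a small scheduling budget.
vllm serve Qwen/Qwen2-VL-7B-Instruct \
    --tensor-parallel-size 4 \
    --max-num-seqs 4
```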

CuriousCat-7 avatar Mar 26 '25 06:03 CuriousCat-7

same problem. any other suggestions?

Adenialzz avatar Apr 18 '25 03:04 Adenialzz

Setting --max-num-seqs to 8 doesn't seem to be helpful.

leng-yue avatar Apr 27 '25 09:04 leng-yue

--max-num-seqs doesn't help; adding requests still takes about 500 ms.

LugerW-A avatar Jul 03 '25 11:07 LugerW-A

To give more information: I used the vLLM service and every request is made through the OpenAI API, so I cannot modify the code to use multi-threading or multi-processing @a7744hsc, because it already does.
And I think you could also try setting --max-num-seqs=1? @leng-yue @LugerW-A In the end I gave up on this workflow for processing a huge amount of data and turned to more distributed solutions, like using Ray.
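
A rough sketch of the kind of fan-out described above, assuming several independent vLLM OpenAI-compatible servers are already running behind the placeholder endpoint URLs (the model name is also a placeholder):

```python
# Sketch only: shard a large prompt set across multiple vLLM servers with Ray,
# so no single engine has to admit the whole batch at once.
import ray
from openai import OpenAI

ENDPOINTS = ["http://node1:8000/v1", "http://node2:8000/v1"]  # placeholder URLs
MODEL = "Qwen/Qwen2-VL-7B-Instruct"  # placeholder model name

@ray.remote
def run_shard(base_url: str, prompts: list) -> list:
    # Each task talks to one server endpoint and returns its shard of outputs.
    client = OpenAI(base_url=base_url, api_key="EMPTY")
    outputs = []
    for p in prompts:
        resp = client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user", "content": p}],
        )
        outputs.append(resp.choices[0].message.content)
    return outputs

ray.init()
prompts = [f"Describe item {i}" for i in range(1000)]
shards = [prompts[i::len(ENDPOINTS)] for i in range(len(ENDPOINTS))]
results = ray.get([run_shard.remote(url, shard)
                   for url, shard in zip(ENDPOINTS, shards)])
print(sum(len(r) for r in results), "responses collected")
```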

CuriousCat-7 avatar Jul 07 '25 03:07 CuriousCat-7

@CuriousCat-7 Setting a higher OMP_NUM_THREADS value can indeed improve performance. However, as observed in vLLM issue https://github.com/vllm-project/vllm/issues/14538, vLLM might be utilizing only one CPU core per GPU, so too many requests may be causing high CPU load.
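
For completeness, the env-var experiment looks roughly like this (the value 8 is arbitrary and should be tuned to the CPU cores available per GPU; model name and GPU count are placeholders):

```bash
# Sketch only: raise the OpenMP thread count for the server process.
OMP_NUM_THREADS=8 vllm serve Qwen/Qwen2-VL-7B-Instruct --tensor-parallel-size 4
```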

LugerW-A avatar Jul 08 '25 10:07 LugerW-A

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

github-actions[bot] avatar Oct 07 '25 02:10 github-actions[bot]

This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!

github-actions[bot] avatar Nov 06 '25 02:11 github-actions[bot]