[Performance]: Adding requests takes too much time, and the model will not run until all the requests are added to the cache
Proposal to improve performance
INFO 02-14 11:57:32 engine.py:275] Added request chatcmpl-1af15bd86d5f413683cd727e1028852c.
INFO 02-14 11:57:32 engine.py:275] Added request chatcmpl-b4e5eba8d8d144a0813ffb6e378ee784.
INFO 02-14 11:57:32 engine.py:275] Added request chatcmpl-1ca0f490ea104efc9884777815e51618.
INFO 02-14 11:57:33 engine.py:275] Added request chatcmpl-984040d9c3cf424984a719970de484f5.
INFO 02-14 11:57:33 engine.py:275] Added request chatcmpl-532cbdba66794d859a61423270e06baf.
INFO 02-14 11:57:33 engine.py:275] Added request chatcmpl-083f1271382f4bd189c35a604b137bc8.
INFO 02-14 11:57:33 engine.py:275] Added request chatcmpl-d5c44ff025cc44149c4b64dcd30fa494.
INFO 02-14 11:57:34 engine.py:275] Added request chatcmpl-087039221d0a463a9779b4f072b853ee.
INFO 02-14 11:57:34 engine.py:275] Added request chatcmpl-22734de905b74010910ea9511d27462c.
INFO 02-14 11:57:34 engine.py:275] Added request chatcmpl-3ad72c9c11f84b49ac6ae2437e1064cc.
INFO 02-14 11:57:35 engine.py:275] Added request chatcmpl-180206440e054294b53baf79ffbedce7.
INFO 02-14 11:57:35 engine.py:275] Added request chatcmpl-68d902705e3743a0b72add7e2711f9a0.
INFO 02-14 11:57:35 engine.py:275] Added request chatcmpl-c807fdcd39584de7a80b5c7e278a55c2.
INFO 02-14 11:57:35 engine.py:275] Added request chatcmpl-43bc35cb2cd94141bd9e24fffa06dacc.
I use 4xA800 GPUs to run Qwen2-VL as a vLLM service. I found that when I send 100 requests, the first request waits until the 100th request has been cached. How can I reduce the latency and maximize GPU usage?
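For context, a minimal client-side sketch of the scenario above, assuming a local OpenAI-compatible vLLM server on port 8000 and a hypothetical model name; it fires 100 concurrent streaming requests and records time-to-first-token, which makes the queuing delay visible:

```python
# Sketch: fire N concurrent requests at a local vLLM OpenAI-compatible server
# and record time-to-first-token (TTFT). Endpoint, port, and model name are
# assumptions, not taken from the original report.
import time
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def one_request(i: int) -> float:
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model="Qwen/Qwen2-VL-7B-Instruct",  # assumption
        messages=[{"role": "user", "content": f"Describe request {i}."}],
        max_tokens=64,
        stream=True,
    )
    for _ in stream:  # first chunk arrives only after the request is scheduled
        break
    return time.perf_counter() - start

with ThreadPoolExecutor(max_workers=100) as pool:
    ttfts = list(pool.map(one_request, range(100)))

print(f"min TTFT {min(ttfts):.2f}s, max TTFT {max(ttfts):.2f}s")
```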
Report of performance regression
No response
Misc discussion on performance
No response
Your current environment (if you think it is necessary)
The output of `python collect_env.py`
Before submitting a new issue...
- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
In my case, the _validate_and_add_requests method accounts for approximately 20% of the total computation time, leaving GPU resources underutilized.
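One rough way to check this breakdown yourself is to profile a single batch; a sketch assuming the offline `LLM` API and a hypothetical model name (the exact percentages will differ per setup):

```python
# Sketch: profile one generate() call to see how much time goes into request
# validation/adding versus the actual forward passes.
# Model name and prompt count are assumptions for illustration.
import cProfile
import pstats

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2-VL-7B-Instruct", tensor_parallel_size=4)
prompts = [f"Summarize item {i}." for i in range(100)]
params = SamplingParams(max_tokens=64)

with cProfile.Profile() as prof:
    llm.generate(prompts, params)

stats = pstats.Stats(prof)
# Look for _validate_and_add_requests in the cumulative-time listing.
stats.sort_stats("cumulative").print_stats("add_request|validate")
```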
I forgot about this post for weeks. I finally "solved" it by setting --max-num-seqs to a relatively small number, like NUM_GPUS. I don't think it is an official solution, but it works in my case.
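For reference, a sketch of that workaround with the offline API; with the OpenAI-compatible server the same setting is the `--max-num-seqs` flag (the model name and the value 4 are assumptions):

```python
# Sketch of the workaround: cap the number of sequences scheduled per step.
# With the OpenAI-compatible server this corresponds to something like:
#   vllm serve <model> --tensor-parallel-size 4 --max-num-seqs 4
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2-VL-7B-Instruct",  # assumption
    tensor_parallel_size=4,
    max_num_seqs=4,  # small value, e.g. equal to the number of GPUs
)
```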
same problem. any other suggestions?
You can try modifying the code to use multiple threads or processes.
Setting --max-num-seqs to 8 doesn't seem to be helpful.
--max-num-seqs doesn't help. It takes about 500 ms to add the requests.
To give more information: I use the vLLM service, and every request is made through the OpenAI API. So I cannot modify the code to use multiple threads or processes, @a7744hsc, because it already does.
And I think you could also try setting --max-num-seqs=1? @leng-yue @LugerW-A
I finally gave up on using this workflow to process a huge amount of data and turned to more distributed solutions, such as Ray.
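For anyone considering the same route, a rough sketch of what that could look like with Ray tasks fanning out over several vLLM server replicas (the endpoint URLs, model name, and chunking are illustrative assumptions, not the setup described above):

```python
# Sketch: spread a large batch of prompts across several vLLM server replicas
# using Ray tasks. URLs and model name are assumptions.
import ray
from openai import OpenAI

ray.init()

ENDPOINTS = ["http://worker-0:8000/v1", "http://worker-1:8000/v1"]  # assumption

@ray.remote
def run_chunk(endpoint: str, prompts: list[str]) -> list[str]:
    client = OpenAI(base_url=endpoint, api_key="EMPTY")
    outputs = []
    for p in prompts:
        resp = client.chat.completions.create(
            model="Qwen/Qwen2-VL-7B-Instruct",  # assumption
            messages=[{"role": "user", "content": p}],
            max_tokens=64,
        )
        outputs.append(resp.choices[0].message.content)
    return outputs

prompts = [f"Describe item {i}." for i in range(1000)]
chunks = [prompts[i::len(ENDPOINTS)] for i in range(len(ENDPOINTS))]
results = ray.get([run_chunk.remote(ep, ch) for ep, ch in zip(ENDPOINTS, chunks)])
```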
@CuriousCat-7 Setting a higher OMP_NUM_THREADS value can indeed improve performance. However, as observed in the vLLM project issue (https://github.com/vllm-project/vllm/issues/14538), vLLM might be utilizing only one CPU core per GPU, so too many requests may be causing high CPU load.
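A sketch of how that variable might be set when using the offline API (the value 8 is an assumption; with the server it would simply be exported in the shell before launching):

```python
# Sketch: raise the OpenMP thread count before vLLM/PyTorch are imported.
# For the server this would be exported in the shell, e.g.:
#   OMP_NUM_THREADS=8 vllm serve <model> ...
import os

os.environ["OMP_NUM_THREADS"] = "8"  # assumption: 8 threads per process

from vllm import LLM  # import after setting the variable

llm = LLM(model="Qwen/Qwen2-VL-7B-Instruct", tensor_parallel_size=4)
```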
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!