[Performance]: Adding requests takes too much time, and the model will not run until all the requests are added to the cache
Proposal to improve performance
INFO 02-14 11:57:32 engine.py:275] Added request chatcmpl-1af15bd86d5f413683cd727e1028852c.
INFO 02-14 11:57:32 engine.py:275] Added request chatcmpl-b4e5eba8d8d144a0813ffb6e378ee784.
INFO 02-14 11:57:32 engine.py:275] Added request chatcmpl-1ca0f490ea104efc9884777815e51618.
INFO 02-14 11:57:33 engine.py:275] Added request chatcmpl-984040d9c3cf424984a719970de484f5.
INFO 02-14 11:57:33 engine.py:275] Added request chatcmpl-532cbdba66794d859a61423270e06baf.
INFO 02-14 11:57:33 engine.py:275] Added request chatcmpl-083f1271382f4bd189c35a604b137bc8.
INFO 02-14 11:57:33 engine.py:275] Added request chatcmpl-d5c44ff025cc44149c4b64dcd30fa494.
INFO 02-14 11:57:34 engine.py:275] Added request chatcmpl-087039221d0a463a9779b4f072b853ee.
INFO 02-14 11:57:34 engine.py:275] Added request chatcmpl-22734de905b74010910ea9511d27462c.
INFO 02-14 11:57:34 engine.py:275] Added request chatcmpl-3ad72c9c11f84b49ac6ae2437e1064cc.
INFO 02-14 11:57:35 engine.py:275] Added request chatcmpl-180206440e054294b53baf79ffbedce7.
INFO 02-14 11:57:35 engine.py:275] Added request chatcmpl-68d902705e3743a0b72add7e2711f9a0.
INFO 02-14 11:57:35 engine.py:275] Added request chatcmpl-c807fdcd39584de7a80b5c7e278a55c2.
INFO 02-14 11:57:35 engine.py:275] Added request chatcmpl-43bc35cb2cd94141bd9e24fffa06dacc.
I use 4xA800 GPUs to run Qwen2-VL as a vLLM service. I found that when I send 100 requests, the first request waits until the 100th request has been cached. How can I reduce the latency and maximize GPU usage?
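For context, a minimal client-side sketch of the scenario above, assuming a local OpenAI-compatible vLLM server on port 8000 and a hypothetical model name; it fires 100 concurrent streaming requests and records time-to-first-token, which makes the queuing delay visible:

```python
# Sketch: fire N concurrent requests at a local vLLM OpenAI-compatible server
# and record time-to-first-token (TTFT). Endpoint, port, and model name are
# assumptions, not taken from the original report.
import time
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def one_request(i: int) -> float:
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model="Qwen/Qwen2-VL-7B-Instruct",  # assumption
        messages=[{"role": "user", "content": f"Describe request {i}."}],
        max_tokens=64,
        stream=True,
    )
    for _ in stream:  # first chunk arrives only after the request is scheduled
        break
    return time.perf_counter() - start

with ThreadPoolExecutor(max_workers=100) as pool:
    ttfts = list(pool.map(one_request, range(100)))

print(f"min TTFT {min(ttfts):.2f}s, max TTFT {max(ttfts):.2f}s")
```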
Report of performance regression
No response
Misc discussion on performance
No response
Your current environment (if you think it is necessary)
The output of `python collect_env.py`
Before submitting a new issue...
- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
In my case, the _validate_and_add_requests method accounts for approximately 20% of the total computation time, leaving GPU resources underutilized.
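One rough way to check this breakdown yourself is to profile a single batch; a sketch assuming the offline `LLM` API and a hypothetical model name (the exact percentages will differ per setup):

```python
# Sketch: profile one generate() call to see how much time goes into request
# validation/adding versus the actual forward passes.
# Model name and prompt count are assumptions for illustration.
import cProfile
import pstats

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2-VL-7B-Instruct", tensor_parallel_size=4)
prompts = [f"Summarize item {i}." for i in range(100)]
params = SamplingParams(max_tokens=64)

with cProfile.Profile() as prof:
    llm.generate(prompts, params)

stats = pstats.Stats(prof)
# Look for _validate_and_add_requests in the cumulative-time listing.
stats.sort_stats("cumulative").print_stats("add_request|validate")
```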
I forgot about this post for weeks. I finally "solved" it by setting --max-num-seqs to a relatively small number, like NUM_GPUS. I don't think it is an official solution, but it works in my case.
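For reference, a sketch of that workaround with the offline API; with the OpenAI-compatible server the same setting is the `--max-num-seqs` flag (the model name and the value 4 are assumptions):

```python
# Sketch of the workaround: cap the number of sequences scheduled per step.
# With the OpenAI-compatible server this corresponds to something like:
#   vllm serve <model> --tensor-parallel-size 4 --max-num-seqs 4
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2-VL-7B-Instruct",  # assumption
    tensor_parallel_size=4,
    max_num_seqs=4,  # small value, e.g. equal to the number of GPUs
)
```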
same problem. any other suggestions?
You can try modifying the code to use multiple threads or processes.
Setting --max-num-seqs to 8 doesn't seem to be helpful.
--max-num-seqs doesn't help. It takes about 500 ms to add the requests.
To give more information: I use the vLLM service, and every request is made through the OpenAI API. So I cannot modify the code to use multiple threads or processes, @a7744hsc, because it already does.
And I think you could also try setting --max-num-seqs=1? @leng-yue @LugerW-A
I finally gave up on using this workflow to process a huge amount of data and turned to more distributed solutions, such as Ray.
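For anyone considering the same route, a rough sketch of what that could look like with Ray tasks fanning out over several vLLM server replicas (the endpoint URLs, model name, and chunking are illustrative assumptions, not the setup described above):

```python
# Sketch: spread a large batch of prompts across several vLLM server replicas
# using Ray tasks. URLs and model name are assumptions.
import ray
from openai import OpenAI

ray.init()

ENDPOINTS = ["http://worker-0:8000/v1", "http://worker-1:8000/v1"]  # assumption

@ray.remote
def run_chunk(endpoint: str, prompts: list[str]) -> list[str]:
    client = OpenAI(base_url=endpoint, api_key="EMPTY")
    outputs = []
    for p in prompts:
        resp = client.chat.completions.create(
            model="Qwen/Qwen2-VL-7B-Instruct",  # assumption
            messages=[{"role": "user", "content": p}],
            max_tokens=64,
        )
        outputs.append(resp.choices[0].message.content)
    return outputs

prompts = [f"Describe item {i}." for i in range(1000)]
chunks = [prompts[i::len(ENDPOINTS)] for i in range(len(ENDPOINTS))]
results = ray.get([run_chunk.remote(ep, ch) for ep, ch in zip(ENDPOINTS, chunks)])
```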
@CuriousCat-7 Setting a higher OMP_NUM_THREADS value can indeed improve performance. However, as observed in the vLLM project issue (https://github.com/vllm-project/vllm/issues/14538), vLLM might be utilizing only one CPU core per GPU, so too many requests may be causing high CPU load.
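A sketch of how that variable might be set when using the offline API (the value 8 is an assumption; with the server it would simply be exported in the shell before launching):

```python
# Sketch: raise the OpenMP thread count before vLLM/PyTorch are imported.
# For the server this would be exported in the shell, e.g.:
#   OMP_NUM_THREADS=8 vllm serve <model> ...
import os

os.environ["OMP_NUM_THREADS"] = "8"  # assumption: 8 threads per process

from vllm import LLM  # import after setting the variable

llm = LLM(model="Qwen/Qwen2-VL-7B-Instruct", tensor_parallel_size=4)
```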
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!