LugerW-A
Did you solve it?
--max-num-seqs doesn't help. It still takes about 500 ms in the "Adding requests" phase...
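For context, this is roughly how I pass it in the offline API (a minimal sketch; the model name and the max_num_seqs value are just placeholders):

```python
from vllm import LLM, SamplingParams

# Minimal sketch: model name and max_num_seqs value are placeholders.
# max_num_seqs caps how many sequences the scheduler batches together;
# in my runs it did not reduce the "Adding requests" time.
llm = LLM(model="Qwen/Qwen2-VL-7B-Instruct", max_num_seqs=64)

outputs = llm.generate(
    ["Describe speculative decoding in one sentence."],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```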
@CuriousCat-7 Setting a higher OMP_NUM_THREADS value can indeed improve performance. However, as observed in the vLLM project issue (https://github.com/vllm-project/vllm/issues/14538), vLLM might be utilizing only one CPU...
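If anyone wants to try it, here is a minimal sketch (the thread count and model name are just example placeholders and should be tuned for your machine):

```python
import os

# Set OMP_NUM_THREADS before torch/vLLM are imported, otherwise the
# OpenMP thread pool has already been created with the default value.
os.environ["OMP_NUM_THREADS"] = "8"  # example value, not a recommendation

from vllm import LLM  # noqa: E402

llm = LLM(model="Qwen/Qwen2-VL-7B-Instruct")  # placeholder model name
```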
Is there any progress on this?
> > It's supported, pls see examples/multimodal for more info.
>
> Hi, Qwen2-VL can run successfully, but compared to directly importing transformers, there is no significant improvement in time...
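In case it helps with reproduction, this is roughly what I run, following examples/multimodal (a minimal sketch; the prompt template and image path are placeholders and may need to match the Qwen2-VL chat format exactly):

```python
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2-VL-7B-Instruct")

# Placeholder Qwen2-VL style prompt with one image slot.
prompt = (
    "<|im_start|>user\n"
    "<|vision_start|><|image_pad|><|vision_end|>Describe this image.<|im_end|>\n"
    "<|im_start|>assistant\n"
)
image = Image.open("example.jpg")  # placeholder path

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(temperature=0.0, max_tokens=128),
)
print(outputs[0].outputs[0].text)
```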
Hi, did you solve it?
Is it because of the Python GIL that a single GPU can only use one CPU core, and are there any solutions to this problem now?
What about other versions of vLLM?
Also, running on CPU needs about 400 GB of memory for a 20B model, and it's too slow as well..
Really looking forward to the Eagle3 adaptation for VLMs.