[Usage]: low GPU usage in qwen1.5 110b int4 inference
Your current environment
When I run inference with Qwen1.5-110B-Chat-GPTQ-Int4 (https://modelscope.cn/models/qwen/Qwen1.5-110B-Chat-GPTQ-Int4) using the vLLM (0.4.2) AsyncLLMEngine on an A100 80G, the real batch size is only 2. I set the parameters as follows:

- gpu_memory_utilization = 0.95
- max_parallel_loading_workers = 4
- swap_space = 4
- max_model_len = 1024
- max_num_seqs = 8

I don't use beam search. The prompt is 330 tokens and the output is about 5 tokens. Latency grows roughly linearly with load:

- At QPS = 1, each request takes about 0.45 seconds.
- At QPS = 2, each request takes about 0.70 seconds.
- At QPS = 4, each request takes about 1.4 seconds.
- At QPS = 8, each request takes about 2.9 seconds.

According to the output metrics (vllm.sequence.RequestMetrics), the real batch size is only 2. How can I improve it? Thanks!
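A minimal sketch of the setup I'm describing (the model path, quantization="gptq", and the sampling settings are assumptions based on the linked checkpoint, not the exact code I run):

```python
import asyncio
import uuid

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine


async def main() -> None:
    # Engine configuration matching the parameters listed above; the model
    # path and quantization="gptq" are assumed from the linked
    # Qwen1.5-110B-Chat-GPTQ-Int4 checkpoint.
    engine = AsyncLLMEngine.from_engine_args(AsyncEngineArgs(
        model="qwen/Qwen1.5-110B-Chat-GPTQ-Int4",
        quantization="gptq",
        gpu_memory_utilization=0.95,
        max_parallel_loading_workers=4,
        swap_space=4,
        max_model_len=1024,
        max_num_seqs=8,
    ))

    # One request: ~330-token prompt, short completion, no beam search.
    sampling_params = SamplingParams(temperature=0.0, max_tokens=16)
    prompt = "..."  # placeholder for the real 330-token prompt

    final_output = None
    async for output in engine.generate(prompt, sampling_params, str(uuid.uuid4())):
        final_output = output

    # RequestMetrics (arrival / first-token / finish timestamps) is attached to
    # the finished RequestOutput; this is what I used to infer the batch size.
    print(final_output.metrics)


asyncio.run(main())
```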
How would you like to use vllm
I want to run inference of a [specific model](put link here). I don't know how to integrate it with vllm.