[Usage]: low GPU usage in qwen1.5 110b int4 inference
Your current environment
When I run inference with Qwen1.5-110B-Chat-GPTQ-Int4 (https://modelscope.cn/models/qwen/Qwen1.5-110B-Chat-GPTQ-Int4) using the vLLM (0.4.2) AsyncLLMEngine on an A100 80G, the real batch size is only 2. I set the parameters as follows:

- gpu_memory_utilization = 0.95
- max_parallel_loading_workers = 4
- swap_space = 4
- max_model_len = 1024
- max_num_seqs = 8

I don't use beam search. The prompt is 330 tokens and the output is about 5 tokens. Latency grows roughly linearly with load:

- At QPS = 1, each request takes about 0.45 seconds.
- At QPS = 2, each request takes about 0.70 seconds.
- At QPS = 4, each request takes about 1.4 seconds.
- At QPS = 8, each request takes about 2.9 seconds.

According to the output metrics (vllm.sequence.RequestMetrics), the real batch size is only 2. How can I improve it? Thanks!
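A minimal sketch of the setup I'm describing (the model path, quantization="gptq", and the sampling settings are assumptions based on the linked checkpoint, not the exact code I run):

```python
import asyncio
import uuid

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine


async def main() -> None:
    # Engine configuration matching the parameters listed above; the model
    # path and quantization="gptq" are assumed from the linked
    # Qwen1.5-110B-Chat-GPTQ-Int4 checkpoint.
    engine = AsyncLLMEngine.from_engine_args(AsyncEngineArgs(
        model="qwen/Qwen1.5-110B-Chat-GPTQ-Int4",
        quantization="gptq",
        gpu_memory_utilization=0.95,
        max_parallel_loading_workers=4,
        swap_space=4,
        max_model_len=1024,
        max_num_seqs=8,
    ))

    # One request: ~330-token prompt, short completion, no beam search.
    sampling_params = SamplingParams(temperature=0.0, max_tokens=16)
    prompt = "..."  # placeholder for the real 330-token prompt

    final_output = None
    async for output in engine.generate(prompt, sampling_params, str(uuid.uuid4())):
        final_output = output

    # RequestMetrics (arrival / first-token / finish timestamps) is attached to
    # the finished RequestOutput; this is what I used to infer the batch size.
    print(final_output.metrics)


asyncio.run(main())
```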
How would you like to use vllm
I want to run inference of a [specific model](put link here). I don't know how to integrate it with vllm.