
VLM inference performance

Open hxyghostor opened this issue 9 months ago • 5 comments

I'm using Qwen2-VL-2B for a classification task. On an A10 I get ~300 ms latency and 5 QPS. Is this performance normal?

Setup: AWQ quantization, "max_pixels": 2562828, prompt of ~100 text tokens, ~20 generated tokens.

hxyghostor avatar Mar 20 '25 08:03 hxyghostor

Unfortunately, we don't have an A10. Could you let us know how you performed the benchmark?

lvhan028 avatar Mar 21 '25 06:03 lvhan028

On an A100, latency seems to be about 300 ms with QPS 8.

```
lmdeploy serve api_server /classification/qwen2-vl-2b-4bit-finetune --server-port $PORT0 --model-format awq --quant-policy 8
```

We trained Qwen2-VL for a multi-class image classification task. Each request passes the image in base64 format along with the prompt; the service is called concurrently, the number of generated tokens stays within 20, and the classification result is returned.
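For reference, a minimal sketch of one such request, assuming the OpenAI-compatible `/v1/chat/completions` endpoint that `api_server` exposes; the port, served model name, and image path are placeholders:

```python
import base64

from openai import OpenAI

# Point the client at the lmdeploy api_server (default port 23333).
client = OpenAI(base_url="http://localhost:23333/v1", api_key="none")

# Encode the image as base64 and embed it as a data URI.
with open("sample.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="qwen2-vl-2b-4bit-finetune",  # placeholder served model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Classify this image."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
    max_tokens=20,  # classification output stays within 20 tokens
)
print(response.choices[0].message.content)
```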

hxyghostor avatar Mar 21 '25 08:03 hxyghostor

May I ask, for multi-modal models like Qwen2-VL, how do you make multi-batch requests? And how do you test the performance of the interface when deploying with lmdeploy serve api_server?
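For offline multi-batch inference, a hedged sketch using lmdeploy's `pipeline` API, which accepts a list of (prompt, image) pairs and batches them internally; the model path is taken from this thread and the image files are placeholders:

```python
from lmdeploy import pipeline
from lmdeploy.vl import load_image

# Build an offline pipeline from the same model used for serving.
pipe = pipeline("/classification/qwen2-vl-2b-4bit-finetune")

# One (prompt, image) pair per sample; the engine batches the list.
images = [load_image(p) for p in ["a.jpg", "b.jpg", "c.jpg"]]
prompts = [("Classify this image.", img) for img in images]

responses = pipe(prompts)
for r in responses:
    print(r.text)
```

For the served API, batching happens on the server side: issue concurrent requests and the engine's continuous batching groups them automatically.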

> Unfortunately, we don't have an A10. Could you let us know how you performed the benchmark?

moyans avatar Apr 07 '25 09:04 moyans

To put it simply, the batch size in my test is 1. After deploying the API, I wrote concurrent calls to test its performance.
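A minimal sketch of such a concurrent test against the deployed API, assuming the endpoint and payload from earlier; the request count and concurrency level are arbitrary:

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:23333/v1/chat/completions"  # assumed server address
PAYLOAD = {
    "model": "qwen2-vl-2b-4bit-finetune",  # placeholder model name
    "messages": [{"role": "user", "content": "Classify this image."}],
    "max_tokens": 20,
}
NUM_REQUESTS = 100
CONCURRENCY = 8

def send_request(_):
    # Time a single round trip to the server.
    t0 = time.perf_counter()
    requests.post(URL, json=PAYLOAD, timeout=60)
    return time.perf_counter() - t0

t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    latencies = list(pool.map(send_request, range(NUM_REQUESTS)))
elapsed = time.perf_counter() - t0

print(f"QPS: {NUM_REQUESTS / elapsed:.2f}")
print(f"mean latency: {1000 * sum(latencies) / len(latencies):.0f} ms")
```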

hxyghostor avatar Apr 07 '25 11:04 hxyghostor

On a V100, with ~800,000 pixels per image, a 100-token text prompt, and 100 generated tokens: latency 2000 ms, QPS 0.5.

> To put it simply, the batch size in my test is 1. After deploying the API, I wrote concurrent calls to test its performance.

moyans avatar Apr 08 '25 03:04 moyans