
VLM inference performance

Open hxyghostor opened this issue 9 months ago • 5 comments

I'm using Qwen2-VL-2B for a classification task. On an A10 I get ~300 ms latency and 5 QPS. Is this performance normal?

Setup: AWQ quantization, "max_pixels": 2562828, prompt of ~100 text tokens, ~20 generated tokens.

hxyghostor avatar Mar 20 '25 08:03 hxyghostor

Unfortunately, we don't have an A10. Could you let us know how you performed the benchmark?

lvhan028 avatar Mar 21 '25 06:03 lvhan028

On an A100, latency seems to be about 300 ms with QPS 8.

```
lmdeploy serve api_server /classification/qwen2-vl-2b-4bit-finetune --server-port $PORT0 --model-format awq --quant-policy 8
```

We trained Qwen2-VL for a multi-class image classification task. Each request passes the image in base64 format along with the prompt; the service is called concurrently, the number of generated tokens stays within 20, and the classification result is returned.
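For reference, a minimal sketch of one such request, assuming the OpenAI-compatible `/v1/chat/completions` endpoint that `api_server` exposes; the port, served model name, and image path are placeholders:

```python
import base64

from openai import OpenAI

# Point the client at the lmdeploy api_server (default port 23333).
client = OpenAI(base_url="http://localhost:23333/v1", api_key="none")

# Encode the image as base64 and embed it as a data URI.
with open("sample.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="qwen2-vl-2b-4bit-finetune",  # placeholder served model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Classify this image."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
    max_tokens=20,  # classification output stays within 20 tokens
)
print(response.choices[0].message.content)
```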

hxyghostor avatar Mar 21 '25 08:03 hxyghostor

May I ask, for multi-modal models like Qwen2-VL, how do you make multi-batch requests? And how do you test the performance of the interface when deploying with lmdeploy serve api_server?
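For offline multi-batch inference, a hedged sketch using lmdeploy's `pipeline` API, which accepts a list of (prompt, image) pairs and batches them internally; the model path is taken from this thread and the image files are placeholders:

```python
from lmdeploy import pipeline
from lmdeploy.vl import load_image

# Build an offline pipeline from the same model used for serving.
pipe = pipeline("/classification/qwen2-vl-2b-4bit-finetune")

# One (prompt, image) pair per sample; the engine batches the list.
images = [load_image(p) for p in ["a.jpg", "b.jpg", "c.jpg"]]
prompts = [("Classify this image.", img) for img in images]

responses = pipe(prompts)
for r in responses:
    print(r.text)
```

For the served API, batching happens on the server side: issue concurrent requests and the engine's continuous batching groups them automatically.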

> Unfortunately, we don't have an A10. Could you let us know how you performed the benchmark?

moyans avatar Apr 07 '25 09:04 moyans

To put it simply, the batch size in my test is 1. After deploying the API, I wrote concurrent calls to test its performance.
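A minimal sketch of such a concurrent test against the deployed API, assuming the endpoint and payload from earlier; the request count and concurrency level are arbitrary:

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:23333/v1/chat/completions"  # assumed server address
PAYLOAD = {
    "model": "qwen2-vl-2b-4bit-finetune",  # placeholder model name
    "messages": [{"role": "user", "content": "Classify this image."}],
    "max_tokens": 20,
}
NUM_REQUESTS = 100
CONCURRENCY = 8

def send_request(_):
    # Time a single round trip to the server.
    t0 = time.perf_counter()
    requests.post(URL, json=PAYLOAD, timeout=60)
    return time.perf_counter() - t0

t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    latencies = list(pool.map(send_request, range(NUM_REQUESTS)))
elapsed = time.perf_counter() - t0

print(f"QPS: {NUM_REQUESTS / elapsed:.2f}")
print(f"mean latency: {1000 * sum(latencies) / len(latencies):.0f} ms")
```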

hxyghostor avatar Apr 07 '25 11:04 hxyghostor

On a V100, with ~800,000 pixels per image, a 100-token text prompt, and 100 generated tokens: latency 2000 ms, QPS 0.5.

> To put it simply, the batch size in my test is 1. After deploying the API, I wrote concurrent calls to test its performance.

moyans avatar Apr 08 '25 03:04 moyans