VLM inference performance
I am using Qwen2-VL-2B for an image classification task. On an A10 I see about 300 ms latency and QPS 5. Is this performance normal?
Settings: AWQ quantization, "max_pixels": 2562828, prompt text ~100 tokens, inference (generated) tokens ~20.
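For context, here is a minimal sketch of what a "max_pixels" budget like this means on the client side, assuming the image is downscaled before base64-encoding so its total pixel count stays under the limit (the actual resize logic inside Qwen2-VL's processor may differ):

```python
# Sketch (assumption): cap the image at the max_pixels budget before sending it,
# so the vision encoder sees a bounded number of image patches.
import math
from PIL import Image

MAX_PIXELS = 2562828  # value from the benchmark config above

def cap_pixels(path: str, max_pixels: int = MAX_PIXELS) -> Image.Image:
    img = Image.open(path).convert("RGB")
    w, h = img.size
    if w * h > max_pixels:
        scale = math.sqrt(max_pixels / (w * h))
        img = img.resize((int(w * scale), int(h * scale)))
    return img
```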
Unfortunately, we don't have an A10. Could you let us know how you performed the benchmark?
On an A100 it seems to be about 300 ms latency and QPS 8.
lmdeploy serve api_server /classification/qwen2-vl-2b-4bit-finetune --server-port $PORT0 --model-format awq --quant-policy 8
We fine-tuned Qwen2-VL for multi-class image classification. Each request sends the image in base64 together with the prompt, the service is called concurrently, the generated output is within 20 tokens, and the classification result is returned.
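A minimal sketch of that request pattern, assuming the OpenAI-compatible /v1/chat/completions route exposed by lmdeploy serve api_server; the port, model name, and image path are placeholders matching the deployment command above:

```python
# Sketch: send one base64-encoded image plus a short prompt and read back the
# classification result from the OpenAI-compatible chat completions endpoint.
import base64
import requests

def classify(image_path: str, prompt: str, port: int = 23333) -> str:
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    payload = {
        "model": "/classification/qwen2-vl-2b-4bit-finetune",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
                {"type": "text", "text": prompt},
            ],
        }],
        "max_tokens": 20,  # classification result stays within 20 tokens
    }
    resp = requests.post(f"http://localhost:{port}/v1/chat/completions", json=payload)
    return resp.json()["choices"][0]["message"]["content"]
```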
May I ask how to send batched requests to multi-modal models like Qwen2-VL, and how to test the performance of the interface when deploying with lmdeploy serve api_server?
To clarify, the batch size in my test is 1. After deploying the API, I wrote a concurrent client to test its performance.
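A minimal sketch of such a concurrency test, assuming the hypothetical classify helper above; the request count and worker count are arbitrary:

```python
# Sketch: fire n_requests with a fixed number of worker threads, then derive
# mean per-request latency and overall QPS from the wall-clock time.
import time
from concurrent.futures import ThreadPoolExecutor

def benchmark(image_path: str, prompt: str, n_requests: int = 100, concurrency: int = 8):
    latencies = []

    def one_call(_):
        t0 = time.perf_counter()
        classify(image_path, prompt)
        latencies.append(time.perf_counter() - t0)

    t_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        list(pool.map(one_call, range(n_requests)))
    wall = time.perf_counter() - t_start

    print(f"mean latency: {sum(latencies) / len(latencies) * 1000:.0f} ms")
    print(f"QPS: {n_requests / wall:.2f}")
```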
V100 device, ~800k (80w) pixels per image, prompt text ~100 tokens, inference tokens ~100, latency 2000 ms, QPS 0.5.
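(As a rough check: with a single in-flight request, throughput is bounded by 1/latency, so 1 / 2.0 s = 0.5 QPS, which matches the figure above.)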