Qwen2.5 VL sglang's output much worse than transformers
I tried serving Qwen2.5-VL-72B using sglang on a node with 4x A40 GPUs. The image I used is the official sglang:v0.4.3.post2-cu125. The command:
```bash
python3 -m sglang.launch_server \
  --tp $NUM_SHARD \
  --mem-fraction-static 0.99 \
  --disable-cuda-graph \
  --model-path /model/Qwen2.5-VL-72B-Instruct \
  --host 0.0.0.0 \
  --port 23333
```
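For completeness, a quick sanity check that the OpenAI-compatible endpoint is reachable (a minimal sketch; the port is taken from the command above, the host is assumed to be localhost, and the API key is a placeholder):

```python
# Sanity check: sglang exposes an OpenAI-compatible /v1/models endpoint.
from openai import OpenAI

client = OpenAI(api_key="xxx", base_url="http://localhost:23333/v1")
print([m.id for m in client.models.list().data])  # should list the served model path
```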
I tested on an internal image classification dataset, and the results were much worse than with transformers: accuracy dropped from 87% to 80%. I also tried an image-to-code task, and the rendered images were much worse, too.
I think this is most likely due to not using the right chat template, and it looks like the wrong one was used here. Could @mickqian also take a look?
@zhaochenyang20
I assumed the engine would apply the default chat template correctly, like vllm or tgi do.
Below is the client code I used; there is no template-related parameter. What did I miss?
```python
import os
from typing import List

from openai import OpenAI


class LLMClient:
    def __init__(
        self,
        url: str = "http://10.196.164.32:23333/v1",
        max_tokens: int = 2000,
        frequency_penalty: float = 0.0,
        model_name: str = None,
        stop: List[str] = None,
    ):
        openai_api_key = os.getenv("OPENAI_SK", "xxx")
        self.client = OpenAI(api_key=openai_api_key, base_url=url, max_retries=4)
        self.max_tokens = max_tokens
        if model_name is None:
            # Use the first (and only) model served by the endpoint.
            self.model_name = self.client.models.list().data[0].id
        else:
            self.model_name = model_name
        self.frequency_penalty = frequency_penalty
        self.stop = stop

    def generate(self, image, prompt):
        image_base64 = encode_image_base64(image)  # helper defined elsewhere
        response = self.client.chat.completions.create(
            model=self.model_name,
            messages=[
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "text",
                            "text": prompt,
                        },
                        {
                            "type": "image_url",
                            "image_url": {
                                "url": f"data:image/jpeg;base64,{image_base64}",
                            },
                        },
                    ],
                }
            ],
            temperature=0.0,
            frequency_penalty=self.frequency_penalty,
            max_tokens=self.max_tokens,
            stop=self.stop,
        )
        return response.choices[0].message.content
```
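The snippet above assumes an `encode_image_base64` helper that is not shown. A minimal sketch of what such a helper could look like (assuming PIL images encoded as JPEG), plus a hypothetical call:

```python
import base64
import io

from PIL import Image


def encode_image_base64(image: Image.Image) -> str:
    # Hypothetical helper: serialize a PIL image to JPEG and base64-encode it.
    buffer = io.BytesIO()
    image.convert("RGB").save(buffer, format="JPEG")
    return base64.b64encode(buffer.getvalue()).decode("utf-8")


# Hypothetical usage:
# client = LLMClient(url="http://10.196.164.32:23333/v1")
# print(client.generate(Image.open("example.jpg"), "Classify this image."))
```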
https://docs.sglang.ai/backend/openai_api_vision.html#Chat-Template
Go through the whole docs @heibaidaolx123
@zhaochenyang20
Oh, I missed the chat template. Thanks.
By adding --chat-template qwen2-vl, the result gets better, but it still lags behind transformers (accuracy 83% vs 87%).
Any clue?
Let me ask for help from our multi-modal people.
Hi @heibaidaolx123, this PR may be related: https://github.com/sgl-project/sglang/pull/3605. Could you give it a try? We are also trying to integrate a benchmark to set a baseline here: https://github.com/sgl-project/sglang/pull/3562
The problems with Qwen2.5 VL might be related to:
- the image processing procedure, which is not included in the HF image_processor (a sketch for inspecting the HF side follows below)
- the rotary position embedding of the ViT
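If it helps with debugging, here is a minimal sketch (not the SGLang code path) for inspecting what the HF image_processor produces for an image, so it can be compared against what the serving side feeds the ViT. The image path is a placeholder:

```python
from PIL import Image
from transformers import AutoProcessor

# Load the processor shipped with the model checkpoint.
processor = AutoProcessor.from_pretrained("/model/Qwen2.5-VL-72B-Instruct")

# Run only the HF image preprocessing (smart resize, patching, normalization).
out = processor.image_processor(images=[Image.open("example.jpg")], return_tensors="pt")
print(out["pixel_values"].shape)  # flattened vision patches
print(out["image_grid_thw"])      # temporal/height/width grid used by the ViT
```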
@yizhang2077 I tried the PR. The output changed a little, but the accuracy remains the same.
Same problem for me. For my key-value extraction task, the accuracy dropped ~8% compared to transformers.
I checked the input prompts for both transformers and sglang by printing them out; they are exactly the same, including the format and tokens.
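For reference, one way to dump the transformers-side prompt for such a comparison (a sketch; the message content is a placeholder, and the served prompt would have to be logged on the server side to compare against it):

```python
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("/model/Qwen2.5-VL-72B-Instruct")

messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Extract the key-value pairs from this document."},
    ],
}]

# Render the chat template without tokenizing to see the exact prompt string
# (including the <|vision_start|>/<|image_pad|>/<|vision_end|> placeholders).
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```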
Hi SGLang team - same issue here. Qwen 2.5 VL on sglang gives worse results.
can we have a fix?
cc @yizhang2077 @mickqian
continuing https://github.com/sgl-project/sglang/issues/4645#issuecomment-2754992234
This has been noticed. We will submit a fix ASAP.
@mickqian Hi, is this fixed?
Any update? @mickqian
Any update? @mickqian
Hi all, there have been some fixes regarding the qwen-vl models recently. Could you test with the latest release?
@mickqian The result is good with the official image v0.4.6.