
Qwen2.5 VL: sglang's output much worse than transformers

Open heibaidaolx123 opened this issue 10 months ago • 8 comments

I tried serving Qwen2.5-VL-72B using sglang on a node with 4×A40 GPUs. The Docker image I used is the official sglang:v0.4.3.post2-cu125. The command:

python3 -m sglang.launch_server \
  --tp $NUM_SHARD \
  --mem-fraction-static 0.99 \
  --disable-cuda-graph \
  --model-path /model/Qwen2.5-VL-72B-Instruct \
  --host 0.0.0.0 \
  --port 23333

I tested on an internal image classification dataset; the results were much worse than with transformers, with accuracy dropping from 87% to 80%. I also tried an image-to-code task, and the rendered images were much worse as well.

heibaidaolx123 avatar Feb 21 '25 06:02 heibaidaolx123

I think this is most likely due to not using the right chat template; it looks like you used the wrong one. Could @mickqian take a look?

zhaochenyang20 avatar Feb 21 '25 08:02 zhaochenyang20

@zhaochenyang20

I assumed the engine would apply the model's default chat template correctly, like vLLM or TGI do.

Below is the client code I used, with no template-related parameter. What did I miss?

import os
from typing import List

from openai import OpenAI


class LLMClient:
    def __init__(
        self,
        url: str = "http://10.196.164.32:23333/v1",
        max_tokens: int = 2000,
        frequency_penalty=0.0,
        model_name: str = None,
        stop: List[str] = None,
    ):
        openai_api_key = os.getenv("OPENAI_SK", "xxx")
        self.client = OpenAI(api_key=openai_api_key, base_url=url, max_retries=4)
        self.max_tokens = max_tokens
        if model_name is None:
            self.model_name = self.client.models.list().data[0].id
        else:
            self.model_name = model_name
        self.frequency_penalty = frequency_penalty
        self.stop = stop

    def generate(self, image, prompt):
        # encode_image_base64 is a user-defined helper (see the sketch below)
        image_base64 = encode_image_base64(image)
        response = self.client.chat.completions.create(
            model=self.model_name,
            messages=[
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "text",
                            "text": prompt,
                        },
                        {
                            "type": "image_url",
                            "image_url": {
                                "url": f"data:image/jpeg;base64,{image_base64}",
                            },
                        },
                    ],
                }
            ],
            temperature=0.0,
            frequency_penalty=self.frequency_penalty,
            max_tokens=self.max_tokens,
            stop=self.stop,
        )
        return response.choices[0].message.content
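
For completeness, here is a plausible implementation of the encode_image_base64 helper used above. This is an assumption for illustration; the original helper is not shown in the thread:

import base64
import io

from PIL import Image


def encode_image_base64(image: Image.Image) -> str:
    # Assumed helper: serialize the PIL image to JPEG in memory
    # and return the base64-encoded string.
    buf = io.BytesIO()
    image.save(buf, format="JPEG")
    return base64.b64encode(buf.getvalue()).decode("utf-8")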

heibaidaolx123 avatar Feb 21 '25 08:02 heibaidaolx123

https://docs.sglang.ai/backend/openai_api_vision.html#Chat-Template

Go through the whole docs, @heibaidaolx123.

zhaochenyang20 avatar Feb 21 '25 08:02 zhaochenyang20

@zhaochenyang20 Oh, I missed the chat template. Thanks. By adding --chat-template qwen2-vl, the results get better, but still lag behind transformers (acc 83% vs 87%). Any clue?
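
For reference, the full corrected launch command with the chat template flag added (same arguments as in the original report):

python3 -m sglang.launch_server \
  --tp $NUM_SHARD \
  --mem-fraction-static 0.99 \
  --disable-cuda-graph \
  --model-path /model/Qwen2.5-VL-72B-Instruct \
  --chat-template qwen2-vl \
  --host 0.0.0.0 \
  --port 23333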

heibaidaolx123 avatar Feb 21 '25 09:02 heibaidaolx123

Let me ask for help from our multi-modal people.

zhaochenyang20 avatar Feb 21 '25 18:02 zhaochenyang20

Hi @heibaidaolx123, this PR may be related: https://github.com/sgl-project/sglang/pull/3605. Could you give it a try? We are also trying to integrate a benchmark to set a baseline here: https://github.com/sgl-project/sglang/pull/3562

yizhang2077 avatar Feb 21 '25 18:02 yizhang2077

The problems with Qwen2.5 VL might be related to:

  1. the image preprocessing procedure, which is not fully covered by the HF image_processor (see the sketch below)
  2. the rotary position embedding of the ViT
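
A minimal inspection sketch, assuming the Qwen/Qwen2.5-VL-72B-Instruct checkpoint and a placeholder image path: pixel_values reflects point 1, and image_grid_thw is the (t, h, w) patch grid the 2D rotary embedding in point 2 is built from:

from PIL import Image
from transformers import AutoProcessor

# Assumed checkpoint path; substitute a local path if needed.
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-72B-Instruct")
image = Image.open("sample.jpg").convert("RGB")  # placeholder image

inputs = processor(text=["Describe this image."], images=[image], return_tensors="pt")
print(inputs["pixel_values"].shape)  # flattened ViT patches after preprocessing
print(inputs["image_grid_thw"])      # per-image (t, h, w) patch grid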

mickqian avatar Feb 22 '25 00:02 mickqian

Hi @heibaidaolx123, this PR may be related: #3605. Could you give it a try? We are also trying to integrate a benchmark to set a baseline here: #3562

@yizhang2077 I tried the PR. The output changed a little, but the accuracy remains the same.

heibaidaolx123 avatar Feb 22 '25 03:02 heibaidaolx123

Same problem for me. For my key-value extraction task, the accuracy dropped ~8% compared to that from transformers.

I checked the input prompts for both transformers and sglang by printing them out; they are exactly the same, including the format and tokens.
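
As a sketch, one way to obtain the reference prompt string from transformers for such a comparison (the model path and message content here are placeholders):

from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-72B-Instruct")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Extract the key-value pairs."},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
print(prompt)  # compare against the prompt the sglang server logs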

KaiKin-C avatar Mar 10 '25 12:03 KaiKin-C

Hi SGLang team, same issue here: Qwen2.5 VL on sglang gives worse results.

Can we have a fix?

groklab avatar Mar 26 '25 16:03 groklab

cc @yizhang2077 @mickqian

Continuing from https://github.com/sgl-project/sglang/issues/4645#issuecomment-2754992234

adarshxs avatar Mar 26 '25 17:03 adarshxs

This has been noticed. We will submit a fix ASAP.

mickqian avatar Mar 29 '25 01:03 mickqian

@mickqian Hi, is this fixed?

heibaidaolx123 avatar Apr 10 '25 11:04 heibaidaolx123

Any update? @mickqian

XiaobingSuper avatar Apr 17 '25 03:04 XiaobingSuper

Any update? @mickqian

JoursBleu avatar Apr 22 '25 08:04 JoursBleu

Hi all, there have been some fixes regarding the Qwen-VL models recently. Could you test with the latest release?
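
For example, pulling the newer official image (the v0.4.6 tag is taken from the follow-up below; the lmsysorg/sglang registry path is assumed):

docker pull lmsysorg/sglang:v0.4.6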

mickqian avatar Apr 26 '25 03:04 mickqian

@mickqian The results are good with the official image v0.4.6.

heibaidaolx123 avatar Apr 28 '25 06:04 heibaidaolx123