[WIP] Support Qwen2-VL model
Motivation
This PR adds support for the Qwen2-VL model, which is also supported by vLLM (here) and LMDeploy (here).
Modifications
- Add a conversation template and chat template for Qwen2-VL, referenced from here (the prompt format is sketched after this list).
- Add an image processor for Qwen2-VL, since its processor output differs from the existing ones (for example, the Qwen2-VL image processor also returns image_grid_thws), and add an image_grid_thws member to ImageInputs (see the sketch after this list).
- Compute M-RoPE positions for each request using MRotaryEmbedding from vLLM and record the result in InputMeta; the Qwen2-VL model then uses them as the real positions (illustrated after this list).
- Copy qwen2_vl.py from vLLM and adapt it; also modify the pad_input_ids function so that it replaces image_token_id in input_ids with the unique image_hash (sketched after this list).
- Add a test for Qwen2-VL, largely copied from test_vision_openai_server.py.
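
For reference, the prompt produced by the new template follows Qwen2-VL's ChatML format with vision markers. The snippet below is a hand-written sketch of that format, not the actual template registration code in SGLang:

```python
# Rough shape of the Qwen2-VL chat format that the new conversation/chat
# template follows (ChatML plus vision markers). Hand-written illustration only.
def build_qwen2_vl_prompt(user_text: str) -> str:
    return (
        "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
        "<|im_start|>user\n"
        "<|vision_start|><|image_pad|><|vision_end|>"  # image placeholder block
        f"{user_text}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )
```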
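The image-processor difference can be seen directly from the Hugging Face processor: it returns an extra grid tensor that the model needs later for M-RoPE. The dataclass fields below are illustrative, not necessarily the exact ones in ImageInputs, and the snippet assumes a transformers version that already ships Qwen2-VL support:

```python
# Minimal sketch: the Qwen2-VL processor returns an extra "image_grid_thw"
# tensor next to pixel_values, which has to be carried through ImageInputs.
from dataclasses import dataclass
from typing import List, Optional

import torch
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

def run_processor(prompt: str, image):
    out = processor(text=[prompt], images=[image], return_tensors="pt")
    # "image_grid_thw" has shape [num_images, 3]: (temporal, height, width)
    # patch counts per image.
    return out["pixel_values"], out["image_grid_thw"]

@dataclass
class ImageInputs:  # illustrative fields, not the exact SGLang definition
    pixel_values: torch.Tensor
    image_hashes: Optional[List[int]] = None
    image_grid_thws: Optional[torch.Tensor] = None  # new member added by this PR
```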
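To make the M-RoPE change concrete, here is a simplified, single-image illustration of the 3-D position ids that Qwen2-VL expects. The PR itself delegates this computation to vLLM's MRotaryEmbedding rather than reimplementing it, so treat this only as an explanation of what gets recorded in InputMeta:

```python
# Simplified illustration of Qwen2-VL's 3-D M-RoPE positions (temporal, height,
# width). Assumes a single image whose placeholder tokens are contiguous.
from typing import List, Tuple

def mrope_positions(
    input_ids: List[int],
    image_token_id: int,
    grid_thw: Tuple[int, int, int],  # raw patch counts: (temporal, height, width)
    spatial_merge_size: int = 2,
) -> List[List[int]]:
    """Return [3, seq_len] position ids for a sequence with one image."""
    t, h, w = grid_thw
    h, w = h // spatial_merge_size, w // spatial_merge_size
    pos_t: List[int] = []
    pos_h: List[int] = []
    pos_w: List[int] = []
    next_pos = 0
    i = 0
    while i < len(input_ids):
        if input_ids[i] == image_token_id:
            # Image placeholder tokens enumerate the (t, h, w) grid, all offset
            # by the running position at the start of the image.
            start = next_pos
            for ti in range(t):
                for hi in range(h):
                    for wi in range(w):
                        pos_t.append(start + ti)
                        pos_h.append(start + hi)
                        pos_w.append(start + wi)
            next_pos = start + max(t, h, w)
            i += t * h * w  # skip this image's placeholder tokens
        else:
            # Plain text token: all three components share the same running index.
            pos_t.append(next_pos)
            pos_h.append(next_pos)
            pos_w.append(next_pos)
            next_pos += 1
            i += 1
    return [pos_t, pos_h, pos_w]
```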
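Finally, the pad_input_ids change boils down to the following idea (single-image sketch with hypothetical argument names; the real function lives in the model/image-processor code):

```python
from typing import List

def pad_input_ids(input_ids: List[int], image_token_id: int, image_hash: int) -> List[int]:
    # Replace every image placeholder token with the image hash so that two
    # requests with identical text but different images no longer map to the
    # same token sequence (important for prefix/radix caching).
    return [image_hash if tok == image_token_id else tok for tok in input_ids]
```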
Checklist
- [x] Format your code according to the Contributor Guide.
- [x] Add unit tests as outlined in the Contributor Guide.
- [ ] Update documentation as needed, including docstrings or examples.
Others
- [ ] There is a bug in transformers regarding Qwen2VLConfig (here). vLLM worked around it in (here), but that fix has not been released to pip yet. Since SGLang depends on vLLM, we cannot run the Qwen2-VL model correctly unless we use the latest vLLM or patch transformers.
- [ ] M-RoPE relies on MRotaryEmbedding in vLLM, which is not available in vLLM 0.5.5, so we may need to bump the vLLM version.