[WIP] Support Qwen2-VL model
Motivation
This PR adds support for the Qwen2-VL model, which is also supported by vLLM (here) and LMDeploy (here).
Modifications
- Add a conversation template and chat template for Qwen2-VL, referenced from here (the prompt format is sketched after this list).
- Add an image processor for Qwen2-VL, since its processor output differs from the existing ones (for example, the Qwen2-VL image processor also returns image_grid_thws), and add an image_grid_thws member to ImageInputs (see the sketch after this list).
- Compute M-RoPE positions for each request using MRotaryEmbedding from vLLM and record the result in InputMeta; the Qwen2-VL model then uses them as the real positions (illustrated after this list).
- Copy qwen2_vl.py from vLLM and adapt it; also modify the pad_input_ids function so that it replaces image_token_id in input_ids with the unique image_hash (sketched after this list).
- Add a test for Qwen2-VL, largely copied from test_vision_openai_server.py.
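
For reference, the prompt produced by the new template follows Qwen2-VL's ChatML format with vision markers. The snippet below is a hand-written sketch of that format, not the actual template registration code in SGLang:

```python
# Rough shape of the Qwen2-VL chat format that the new conversation/chat
# template follows (ChatML plus vision markers). Hand-written illustration only.
def build_qwen2_vl_prompt(user_text: str) -> str:
    return (
        "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
        "<|im_start|>user\n"
        "<|vision_start|><|image_pad|><|vision_end|>"  # image placeholder block
        f"{user_text}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )
```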
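The image-processor difference can be seen directly from the Hugging Face processor: it returns an extra grid tensor that the model needs later for M-RoPE. The dataclass fields below are illustrative, not necessarily the exact ones in ImageInputs, and the snippet assumes a transformers version that already ships Qwen2-VL support:

```python
# Minimal sketch: the Qwen2-VL processor returns an extra "image_grid_thw"
# tensor next to pixel_values, which has to be carried through ImageInputs.
from dataclasses import dataclass
from typing import List, Optional

import torch
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

def run_processor(prompt: str, image):
    out = processor(text=[prompt], images=[image], return_tensors="pt")
    # "image_grid_thw" has shape [num_images, 3]: (temporal, height, width)
    # patch counts per image.
    return out["pixel_values"], out["image_grid_thw"]

@dataclass
class ImageInputs:  # illustrative fields, not the exact SGLang definition
    pixel_values: torch.Tensor
    image_hashes: Optional[List[int]] = None
    image_grid_thws: Optional[torch.Tensor] = None  # new member added by this PR
```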
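To make the M-RoPE change concrete, here is a simplified, single-image illustration of the 3-D position ids that Qwen2-VL expects. The PR itself delegates this computation to vLLM's MRotaryEmbedding rather than reimplementing it, so treat this only as an explanation of what gets recorded in InputMeta:

```python
# Simplified illustration of Qwen2-VL's 3-D M-RoPE positions (temporal, height,
# width). Assumes a single image whose placeholder tokens are contiguous.
from typing import List, Tuple

def mrope_positions(
    input_ids: List[int],
    image_token_id: int,
    grid_thw: Tuple[int, int, int],  # raw patch counts: (temporal, height, width)
    spatial_merge_size: int = 2,
) -> List[List[int]]:
    """Return [3, seq_len] position ids for a sequence with one image."""
    t, h, w = grid_thw
    h, w = h // spatial_merge_size, w // spatial_merge_size
    pos_t: List[int] = []
    pos_h: List[int] = []
    pos_w: List[int] = []
    next_pos = 0
    i = 0
    while i < len(input_ids):
        if input_ids[i] == image_token_id:
            # Image placeholder tokens enumerate the (t, h, w) grid, all offset
            # by the running position at the start of the image.
            start = next_pos
            for ti in range(t):
                for hi in range(h):
                    for wi in range(w):
                        pos_t.append(start + ti)
                        pos_h.append(start + hi)
                        pos_w.append(start + wi)
            next_pos = start + max(t, h, w)
            i += t * h * w  # skip this image's placeholder tokens
        else:
            # Plain text token: all three components share the same running index.
            pos_t.append(next_pos)
            pos_h.append(next_pos)
            pos_w.append(next_pos)
            next_pos += 1
            i += 1
    return [pos_t, pos_h, pos_w]
```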
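Finally, the pad_input_ids change boils down to the following idea (single-image sketch with hypothetical argument names; the real function lives in the model/image-processor code):

```python
from typing import List

def pad_input_ids(input_ids: List[int], image_token_id: int, image_hash: int) -> List[int]:
    # Replace every image placeholder token with the image hash so that two
    # requests with identical text but different images no longer map to the
    # same token sequence (important for prefix/radix caching).
    return [image_hash if tok == image_token_id else tok for tok in input_ids]
```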
Checklist
- [x] Format your code according to the Contributor Guide.
- [x] Add unit tests as outlined in the Contributor Guide.
- [ ] Update documentation as needed, including docstrings or examples.
Others
- [ ] There is a bug in transformers regarding Qwen2VLConfig (here). vLLM worked around it in (here), but that fix has not been released to pip yet. Since SGLang depends on vLLM, we cannot run the Qwen2-VL model correctly unless we use the latest vLLM or patch transformers.
- [ ] M-RoPE relies on MRotaryEmbedding in vLLM, which is not available in vLLM 0.5.5, so we may need to bump the vLLM version.