
[WIP] Support qwen2 vl model

Open · yizhang2077 opened this pull request 4 months ago · 2 comments

Motivation

This PR adds support for the Qwen2-VL model, which is also supported by vLLM (here) and LMDeploy (here).

Modifications

  1. Add a conversation template and chat template for Qwen2-VL, referenced from here.
  2. Add an image processor for Qwen2-VL, since its processor output differs from other models (for example, the Qwen2-VL image processor emits `image_grid_thws`), and add an `image_grid_thws` member to `ImageInputs`.
  3. Compute m-rope positions for each request using `MRotaryEmbedding` from vLLM and record the result in `InputMeta`; the Qwen2-VL model uses them as the real positions.
  4. Copy `qwen2_vl.py` from vLLM with some adaptations, and modify the `pad_input_ids` function to replace each `image_token_id` in the input ids with the unique image hash.
  5. Add tests for Qwen2-VL, largely copied from `test_vision_openai_server.py`.
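To illustrate step 4, here is a minimal sketch of what a `pad_input_ids`-style replacement could look like. This is a simplified assumption of the behavior described above, not the actual SGLang implementation: the token id constant and the one-hash-per-placeholder replacement are illustrative (the real code may expand each image into many placeholder tokens).

```python
# Hypothetical sketch: replace image placeholder tokens with per-image
# hashes so the prefix cache can distinguish prompts that contain
# different images. IMAGE_TOKEN_ID is an assumed placeholder id.
IMAGE_TOKEN_ID = 151655

def pad_input_ids(input_ids, image_hashes):
    """Replace each image placeholder token with the hash of the
    corresponding image, consuming hashes in order of appearance."""
    out, img_idx = [], 0
    for tok in input_ids:
        if tok == IMAGE_TOKEN_ID and img_idx < len(image_hashes):
            out.append(image_hashes[img_idx])
            img_idx += 1
        else:
            out.append(tok)
    return out
```

With this scheme, two requests carrying different images produce different padded sequences, so their cache entries never collide.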

Checklist

  • [x] Format your code according to the Contributor Guide.
  • [x] Add unit tests as outlined in the Contributor Guide.
  • [ ] Update documentation as needed, including docstrings or examples.

Others

  • [ ] There is a bug in transformers regarding `Qwen2VLConfig` (here). vLLM fixed it (here), but the fix has not yet been released to pip. Since SGLang depends on vLLM, we cannot run the Qwen2-VL model correctly unless we use the latest vLLM or patch transformers.
  • [ ] m-rope relies on `MRotaryEmbedding` in vLLM, which is not available in vLLM 0.5.5, so we may need to bump the vLLM version.
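For context on the m-rope dependency, the sketch below shows the position scheme such a rotary embedding computes (an assumption based on the Qwen2-VL design, not vLLM's actual `MRotaryEmbedding` API): each vision token gets a (temporal, height, width) position triple derived from the image patch grid, which is why a plain 1-D position id is not enough.

```python
# Illustrative sketch of multimodal rotary (m-rope) position ids for one
# image patch grid of shape (t, h, w). Function name and layout are
# assumptions for illustration only.
def mrope_positions_for_image(t, h, w, start):
    """Return three parallel lists (temporal, height, width) of length
    t*h*w: each vision token's position triple, offset by `start`."""
    temporal, height, width = [], [], []
    for ti in range(t):
        for hi in range(h):
            for wi in range(w):
                temporal.append(start + ti)
                height.append(start + hi)
                width.append(start + wi)
    return [temporal, height, width]
```

Text tokens, by contrast, would use the same scalar position on all three axes, so the model degenerates to ordinary RoPE outside image regions.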

yizhang2077 · Sep 30 '24