[Model][VLM] Add Qwen2-VL model support
This PR adds support for the Qwen2-VL model.
FIX #8139 FIX #8281
Requirements
- This PR requires `transformers` with this PR merged and this bugfix PR merged (you can install it via `pip install git+https://github.com/huggingface/transformers@21fac7abba2a37fae86106f87fcf9974fd1e3830`).
Optional Requirements
- When constructing LLM inputs, we recommend using our helper package `qwen-vl-utils` to preprocess multimodal content correctly (`qwen-vl-utils` is not a part of this PR).
Example Usage
```python
from PIL import Image
from transformers import AutoProcessor
from vllm import LLM, SamplingParams
from qwen_vl_utils import process_vision_info

MODEL_PATH = 'Qwen/Qwen2-VL-7B-Instruct'
IMAGE_PATH = '/path/to/image.jpg'
VIDEO_PATH = '/path/to/video.mp4'

llm = LLM(
    model=MODEL_PATH,
    limit_mm_per_prompt={'image': 10, 'video': 10},
)

sampling_params = SamplingParams(
    temperature=0.1, top_p=0.001, repetition_penalty=1.05, max_tokens=256,
    stop_token_ids=[],
)

messages = [
    {'role': 'system', 'content': 'You are a helpful assistant.'},
    {'role': 'user', 'content': [
        {
            'type': 'image',
            'image': IMAGE_PATH,
            # min_pixels & max_pixels are optional
            'max_pixels': 12845056,
        },
        # You can also pass one or more videos:
        # {
        #     'type': 'video',
        #     'video': VIDEO_PATH,
        # },
        {
            'type': 'text',
            'text': 'What does this diagram illustrate?',
        },
    ]},
]

processor = AutoProcessor.from_pretrained(MODEL_PATH)
prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True,
)

image_inputs, video_inputs = process_vision_info(messages)
mm_data = {}
if image_inputs is not None:
    mm_data['image'] = image_inputs
if video_inputs is not None:
    mm_data['video'] = video_inputs

llm_inputs = {
    'prompt': prompt,
    'multi_modal_data': mm_data,
}

outputs = llm.generate([llm_inputs], sampling_params=sampling_params)
generated_text = outputs[0].outputs[0].text
print(generated_text)
```
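The same pipeline also covers video inputs. Below is a small illustrative variant that reuses `VIDEO_PATH` from the example above; the `fps` knob follows the `qwen-vl-utils` conventions mentioned later in this thread and is optional (treat it as an assumption, not part of this PR's example):

```python
# Video variant of the example above; only the messages change, the rest of the
# pipeline (apply_chat_template / process_vision_info / llm.generate) is identical.
video_messages = [
    {'role': 'system', 'content': 'You are a helpful assistant.'},
    {'role': 'user', 'content': [
        # 'fps' is optional and assumed here based on qwen-vl-utils conventions
        {'type': 'video', 'video': VIDEO_PATH, 'fps': 1.0},
        {'type': 'text', 'text': 'Describe what happens in this video.'},
    ]},
]
# process_vision_info(video_messages) then returns video_inputs instead of
# image_inputs, which ends up in mm_data['video'].
```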
Notes
Here are some important notes about this PR:
- Qwen2-VL uses rotary embedding with multimodal sections (`mrope`); see `vllm/model_executor/layers/rotary_embedding.py` for more details. This rotary embedding requires the input `positions` to be a tensor of shape `(3, seq_len)` instead of `(seq_len,)` as in the common case. (A small illustrative sketch of the decode-step position computation follows this list.)
  - To support this feature, we add a new `_mrope_position_delta` attribute (of type `Optional[int]`) to `vllm.sequence.SequenceData`; this attribute is used to compute `mrope_input_positions` in each decoding step. (If reviewers have a better solution, please comment in this PR.)
  - We also change `model_runner.py` to compute the `mrope_input_positions` when the model uses `mrope`. Other model runners should also follow this logic; I think this can be done in another PR (I will add this part if reviewers think it needs to be implemented in this PR).
- Qwen2-VL uses `flash-attn==2.6.1` (instead of `vllm-flash-attn==2.6.1`) to compute vision attention (see the commented line 36 in `vllm/model_executor/models/qwen2_vl.py`). The current `vllm-flash-attn` version outputs `NaN` logits values, and I am still debugging this bug.
  - UPDATE 2024.09.06: Added an `xformers` backend as a fallback implementation of `Qwen2VisionAttention`, so there is no need to add `flash-attn` to the project requirements file.
- Qwen2-VL supports both image and video inputs. To support this feature, we add a `video` multimodal plugin (see `vllm/multimodal/video.py` for more details).
- OpenAI-compatible server
  - Currently, `vllm.entrypoints.openai.api_server` uses a model-independent multimodal data fetcher (e.g. `vllm.multimodal.utils.async_get_and_parse_image`), so the vision smart-resizing logic in `qwen-vl-utils` cannot be applied yet. I think it's good to create another PR to fix it later.
- Multiple modalities support details
  Since Qwen2-VL supports two modalities (images and videos), we should handle some special cases, as below:
  ```python
  # 1. A batch with two samples; sample 1 contains images, sample 2 contains videos
  llm.generate([
      {"prompt": "XXX", "multi_modal_data": {"image": ...}},
      {"prompt": "XXX", "multi_modal_data": {"video": ...}},
  ])
  # 2. A single sample with both images and videos
  llm.generate([
      {"prompt": "XXX", "multi_modal_data": {"image": ..., "video": ...}},
  ])
  ```
  So I remove the same-key check in the `vllm.multimodal.base.MultiModalInputs.batch()` method, since different samples may return different modality keys.
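A minimal sketch of the decode-step position computation described in the first note above (the helper name and the example values are assumptions for illustration, not the PR's actual code):

```python
import torch

def decode_step_mrope_positions(context_len: int, mrope_position_delta: int) -> torch.Tensor:
    # During decoding all three mrope sections (temporal/height/width) advance
    # together, so the (3, 1) positions are just the next text position shifted
    # by the delta cached on the sequence after prefill.
    pos = context_len + mrope_position_delta
    return torch.tensor([[pos], [pos], [pos]], dtype=torch.long)

# e.g. a 3332-token prompt whose multimodal layout shifted positions by -1200:
print(decode_step_mrope_positions(3332, -1200))  # tensor of shape (3, 1)
```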
👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run fastcheck CI, which consists of a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build on Buildkite UI.
Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).
To run full CI, you can do one of these:
- Comment `/ready` on the PR
- Add the `ready` label to the PR
- Enable auto-merge.
🚀
Thanks for implementing this (and sorry for the delayed response)! Since this PR not only introduces a new modality (video) but also involves the first model to accept multiple modalities (excluding text), I would like to merge #7559 first to verify that vLLM can handle video inputs properly.
In the meantime, can you fix the CI failures?
Hi @DarkLight1337 , these mypy errors seem not to belong to this PR; should I also fix them?
Can you merge from main first? It fixes some of the mypy errors which might apply here.
Hi @DarkLight1337 @ywang96 , I have updated this PR based on your review comments, please check it again. I also added some notes about multiple modalities to the PR overview.
@fyabc Hi, can this patch support multiple images in one prompt, as follows:
Compute the value of the expression in the image below <image_1>\nby using the emoji equations in the following images <image_2> <image_3> <image_4> <image_5> Only answer specific numerical values.
Hi @DragonFive , you can pass multiple images into a single prompt like this:
```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "Identify the similarities between these images."},
        ],
    }
]
```
See "Multi image inference" section of our README for more details.
@DarkLight1337 I compared this PR to #6571, and actually IMO it probably makes more sense to have this PR merged first since the video/image util is from a separate python package because of qwen_vl_utils, versus 6571 that might introduce opencv as an additional dependency. What do you think?
We also need to think about the eventual API protocol for the OpenAI API server. Though that is not in the scope of this PR, this PR does introduce a slightly different messages format than the actual OpenAI API.
Give me some time to take a closer look at this PR first.
The use of qwen-vl-utils is quite different from the existing models which fully rely on the AutoProcessor from HuggingFace. Is there a particular reason why the preprocessing logic for this model is being split across AutoProcessor and qwen-vl-utils?
I did some local testing on this PR and it's working well for both `.jpg` and `.mp4` inputs, on TP=1 and TP=2.
Note that I did run into a dependency issue when running video inference:
[rank0]: ImportError: PyAV is not installed, and is necessary for the video operations in torchvision.
so this PR will also introduce a new dependency, like #7559. Perhaps it's a good idea to have a `vllm[video]` optional dependency path for all video-related dependencies.
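For illustration, such a `vllm[video]` extra could be declared along these lines (the extra's name and the package pin are assumptions, not something this PR adds):

```python
# Hypothetical setup.py fragment enabling `pip install vllm[video]`.
from setuptools import setup

setup(
    name="vllm",
    extras_require={
        "video": ["av"],  # PyAV, needed by torchvision's video decoding path
    },
)
```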
Thank you for your testing! I forgot this dependency and will add it later.
I happened to notice something while following this PR. I've merged it locally to run some tests, and during testing, I encountered a strange issue where, after several successful executions, only '!' is repeatedly output. Initially, everything runs smoothly, but after repeated runs, this symptom appears.
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "Qwen/Qwen2-VL-7B-Instruct",
"messages": [
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://sungyesa.com/new/data/file/secret/988486029_UV9Hq6Zt_IMG_8053.jpeg"
}
},
{
"type": "text",
"text": "Describe this image"
}
]
}
],
"max_tokens": 4096,
"temperature": 0.0,
"stream": false
}'
Failure case example
{"id":"chat-9a419b2e12bc405fa58b4a667709b5c9","object":"chat.completion","created":1725205065,"model":"Qwen/Qwen2-VL-7B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!","tool_calls":[]},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":3332,"total_tokens":7428,"completion_tokens":4096},"prompt_logprobs":null}
Success case example (tested with a different image)
{"id":"chat-6acd666dca494385b6c7fd335f71d332","object":"chat.completion","created":1725203421,"model":"Qwen/Qwen2-VL-7B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The image depicts a person and a dog sitting on a sandy beach at sunset. The person is wearing a plaid shirt and dark pants, and they are sitting cross-legged on the sand. The dog, which appears to be a large breed, is wearing a harness and is sitting next to the person, extending its paw towards the person's hand. The dog's paw is touching the person's hand, creating a playful and affectionate moment. The background shows the ocean with gentle waves and a clear sky, with the sun setting on the horizon, casting a warm, golden light over the scene. The overall atmosphere is serene and joyful, capturing a special bond between the person and their pet.","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":3603,"total_tokens":3743,"completion_tokens":140},"prompt_logprobs":null}
One line that I changed when merging the PR is as follows. Could this have affected the issue?
vllm/model_executor/models/qwen2_vl.py
from vllm_flash_attn.flash_attn_interface import flash_attn_varlen_func
# from flash_attn import flash_attn_varlen_func
This is probably related to the observation from the author as well:
Qwen2-VL uses flash-attn==2.6.1 (instead of vllm-flash-attn==2.6.1) to compute vision attention (see the commented line 36 in vllm/model_executor/models/qwen2_vl.py). Current vllm-flash-attn version will output NaN logits value, and I am still debugging this bug.
Have you observed the same issue with flash-attn? @smartdolphin
@DarkLight1337
The goal of `qwen-vl-utils` is to allow users to control the parameters of each individual image/video (`min_pixels`, `max_pixels`, `fps`, etc.).
The current transformers & vllm implementations can run without the `qwen-vl-utils` package, but then users can only control those parameters globally by modifying `preprocessor_config.json`.
When we merged Qwen2-VL into transformers, we had some discussions with the reviewers:
https://github.com/huggingface/transformers/pull/32318#discussion_r1699673586
https://github.com/huggingface/transformers/pull/32318#discussion_r1699681579
In the end, we followed the usual convention (passing preprocessed PIL images/arrays to `AutoProcessor`) and provided the `qwen-vl-utils` package for users who need more precise control.
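To make the distinction concrete, here is a small sketch of the two levels of control (the parameter names follow the Qwen2-VL processor and qwen-vl-utils documentation; the pixel budgets are just example values):

```python
from transformers import AutoProcessor

# (a) Global control: the processor resizes every image within the same budget.
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    min_pixels=256 * 28 * 28,
    max_pixels=1280 * 28 * 28,
)

# (b) Per-item control: qwen-vl-utils reads these keys from each content item.
image_item = {
    "type": "image",
    "image": "/path/to/image.jpg",
    "min_pixels": 256 * 28 * 28,
    "max_pixels": 1280 * 28 * 28,
}
```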
Yes @ywang96 @smartdolphin , vllm-flash-attn==2.6.1 will output NaN logits in some cases, and NaN logits will cause '!!!' outputs (NaN => output token_id == 0 => '!').
I haven't found the cause of this bug yet...
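A quick way to confirm the last step of that chain (illustrative check only, not part of the PR):

```python
from transformers import AutoTokenizer

# Token id 0 in the Qwen2 vocabulary decodes to '!', which is why all-NaN
# logits collapsing greedy decoding to id 0 show up as long runs of '!'.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
print(repr(tokenizer.decode([0])))  # expected: '!'
```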
I see, thanks for the detailed explanation! It looks like qwen-vl-utils acts like vllm.entrypoints.chat_utils in that regard.
How long will it take to merge?
We need to wait for transformers to update their version.
support qwen2 vl awq model?
I found that GPTQ quantization raises the following error: `KeyError: 'model.layers.0.mlp.down_proj.bias'`. If I skip loading these weights, following the qwen2 code, it runs. Is it reasonable to merge such a change directly into qwen2-vl?
@seanzhang-zhichen Yes, this PR supports AWQ models. You can check this issue for more details.
Hi @pengxuan2022 , the previous versions of the Qwen2-VL-xxB-GPTQ-IntXX models had unused biases. We have updated the model parameters, and you can download the new weights from HuggingFace. You can check this issue for more details.
@WoosukKwon please take a look at the mrope implementation in this PR when you get the chance. I saw that you were refactoring rope classes so perhaps you could take this into account and help integrate mrope into the existing framework.
Huge thanks for the PR! We are very happy to support this model very soon :)
I just left some comments mostly requiring some clarifications about mrope. While we don't have to optimize it in this PR, it'd be nice to have a clean and well-documented implementation so that anyone can optimize it easily in the future.
Thank you for your comments! I will update this PR later.
tensor_parallel_size > 1 failed on rank > 0 workers:
RuntimeError: CUDA driver initialization failed, you might not have a CUDA gpu.
Hi @DarkLight1337 @ywang96 @WoosukKwon , I have updated this PR according to the review comments, please check it again:
- Add `xformers` backend of `Qwen2VisionAttention`, remove the dependency on `flash-attn` (see the sketch below).
- Refactor `MRotaryEmbedding.forward` to align with `RotaryEmbedding`.
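The `xformers` fallback in the first item can be pictured as a simple import-time selection (illustrative sketch only, not the PR's actual dispatch code):

```python
# Prefer flash-attn for vision attention when it is installed; otherwise fall
# back to the xformers memory-efficient attention ops available with vLLM.
try:
    from flash_attn import flash_attn_varlen_func  # optional, not in requirements
    VISION_ATTN_BACKEND = "flash_attn"
except ImportError:
    from xformers import ops as xops  # noqa: F401
    VISION_ATTN_BACKEND = "xformers"

print(f"Using {VISION_ATTN_BACKEND} for Qwen2VisionAttention")
```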
I found that vllm's text model directly reuses the Qwen2 code. In the HF code, the text model's position encoding function apply_multimodal_rotary_pos_emb of qwen2vl and the apply_rotary_pos_emb code of qwen2 are somewhat different. Are the position encodings of the text model and of qwen2 the same?
@pengxuan2022 Hi, in Qwen2Model -> Qwen2Attention, we use the get_rope function to initialize the rotary embedding module from the checkpoint config. So the text model in Qwen2-VL also uses mrope rotary position embeddings; this behavior is consistent with HF.
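For reference, a quick way to inspect the mrope settings that get_rope picks up from the checkpoint config (the printed values depend on the checkpoint; the dict shown in the comment is only an example):

```python
from transformers import AutoConfig

# Qwen2-VL checkpoints declare mrope in their rope_scaling config, which is
# what routes them to the multimodal rotary embedding.
cfg = AutoConfig.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
print(cfg.rope_scaling)  # e.g. {'type': 'mrope', 'mrope_section': [16, 24, 24]}
```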
The VLM part looks good to me overall. Just a few things left to do:
- Could you add a test case (under `tests/`) to verify that the model works?
- Having an example (inside `examples/offline_inference_vision_language.py` or its multi-image counterpart) would be great so that people won't have to refer to this PR directly.
@DarkLight1337 Hi, if I want to use the `qwen-vl-utils` package inside the example, should I add it to requirements-test.txt (or directly to requirements-common.txt)?