[Model][VLM] Add Qwen2-VL model support

Open fyabc opened this issue 1 year ago • 95 comments

This PR adds support for the Qwen2-VL model.

FIX #8139 FIX #8281

Requirements

  • This PR requires a transformers build with this PR and this bugfix PR merged (you can install it via pip install git+https://github.com/huggingface/transformers@21fac7abba2a37fae86106f87fcf9974fd1e3830).

Optional Requirements

  • When constructing LLM inputs, we recommend using our helper package qwen-vl-utils to preprocess multimodal content correctly (qwen-vl-utils is not a part of this PR).
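
  Assuming the package is published on PyPI under the same name, it can typically be installed with:

  pip install qwen-vl-utils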

Example Usage

from PIL import Image
from transformers import AutoProcessor
from vllm import LLM, SamplingParams
from qwen_vl_utils import process_vision_info

MODEL_PATH = 'Qwen/Qwen2-VL-7B-Instruct'
IMAGE_PATH = '/path/to/image.jpg'
VIDEO_PATH = '/path/to/video.mp4'

llm = LLM(
    model=MODEL_PATH,
    limit_mm_per_prompt={'image': 10, 'video': 10},
)

sampling_params = SamplingParams(
    temperature=0.1, top_p=0.001, repetition_penalty=1.05, max_tokens=256,
    stop_token_ids=[],
)

messages = [
    {'role': 'system', 'content': 'You are a helpful assistant.'},
    {'role': 'user', 'content': [
        {
            'type': 'image',
            'image': IMAGE_PATH,

            # min_pixels & max_pixels are optional
            'max_pixels': 12845056,
        },

        # You can also pass one or more videos:
        # {
        #     'type': 'video',
        #     'video': VIDEO_PATH,
        # }

        {
            'type': 'text',
            'text': 'What does this diagram illustrate?',
        },
    ]},
]

processor = AutoProcessor.from_pretrained(MODEL_PATH)
prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True,
)
image_inputs, video_inputs = process_vision_info(messages)

mm_data = {}
if image_inputs is not None:
    mm_data['image'] = image_inputs
if video_inputs is not None:
    mm_data['video'] = video_inputs

llm_inputs = {
    'prompt': prompt,
    'multi_modal_data': mm_data,
}

outputs = llm.generate([llm_inputs], sampling_params=sampling_params)
generated_text = outputs[0].outputs[0].text

print(generated_text)

Notes

Here are some important notes about this PR:

  1. Qwen2-VL uses rotary embedding with multimodal sections (mrope) (see vllm/model_executor/layers/rotary_embedding.py for more details). This rotary embedding requires the input positions to be a tensor of shape (3, seq_len) (instead of (seq_len,) in the common case); a small illustrative sketch of this layout follows these notes.

    1. To support this feature, we add a new _mrope_position_delta attribute (of type Optional[int]) to vllm.sequence.SequenceData; this attribute is used to compute mrope_input_positions in each decoding step. (If reviewers have a better solution, please comment in this PR.)
    2. We also change model_runner.py to compute the mrope_input_positions when the model uses mrope. Other model runners should also follow this logic; I think this can be done in another PR (I will add this part if reviewers think it needs to be implemented in this PR).
  2. Qwen2-VL uses flash-attn==2.6.1 (instead of vllm-flash-attn==2.6.1) to compute vision attention (see the commented line 36 in vllm/model_executor/models/qwen2_vl.py). The current vllm-flash-attn version outputs NaN logit values, and I am still debugging this bug.

    1. UPDATE 2024.09.06: Added an xformers backend as a fallback implementation of Qwen2VisionAttention, so there is no need to add flash-attn to the project requirements file.
  3. Qwen2-VL supports both image and video inputs. To support this feature, we add a video multimodal plugin (see vllm/multimodal/video.py for more details).

  4. OpenAI-compatible server

    1. Currently, vllm.entrypoints.openai.api_server uses a model-independent multimodal data fetcher (e.g. vllm.multimodal.utils.async_get_and_parse_image), so the vision smart-resizing logic in qwen-vl-utils cannot be applied yet. I think it's good to create another PR to fix this later.
  5. Multiple modalities support details

    Since Qwen2-VL supports two modalities (images and videos), we need to handle some special cases, as shown below:

    # 1. A batch with two samples, sample 1 contains images, sample 2 contains videos
    llm.generate([
        {
            "prompt": "XXX",
            "multi_modal_data": {
                "image": ...
            }
        },
        {
            "prompt": "XXX",
            "multi_modal_data": {
                "video": ...
            }
        }
    ])
    
    # 2. A single sample with both images and videos
    llm.generate([
        {
            "prompt": "XXX",
            "multi_modal_data": {
                "image": ...,
                "video": ...
            }
        }
    ])
    

    So I removed the same-keys check in the vllm.multimodal.base.MultiModalInputs.batch() method, since different samples may return different modality keys.
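
As a rough illustration of the position layout described in note 1 (a standalone sketch, not the vLLM implementation; the actual image/video section handling lives in vllm/model_executor/layers/rotary_embedding.py):

# Illustrative sketch only: the (3, seq_len) mrope position tensor for a
# text-only prompt. The three rows correspond to the temporal/height/width
# sections; for plain text they are identical to the usual 0..seq_len-1
# positions, so mrope reduces to ordinary rope in that case.
import torch

seq_len = 8
text_positions = torch.arange(seq_len)                       # shape (seq_len,)
mrope_positions = text_positions.unsqueeze(0).repeat(3, 1)   # shape (3, seq_len)
print(mrope_positions.shape)  # torch.Size([3, 8])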

fyabc avatar Aug 27 '24 09:08 fyabc

👋 Hi! Thank you for contributing to the vLLM project. Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run fastcheck CI, which consists of a small, essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build on the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add ready label to the PR
  • Enable auto-merge.

🚀

github-actions[bot] avatar Aug 27 '24 09:08 github-actions[bot]

Thanks for implementing this (and sorry for the delayed response)! Since this PR not only introduces a new modality (video) but also involves the first model to accept multiple modalities (excluding text), I would like to merge #7559 first to verify that vLLM can handle video inputs properly.

In the meantime, can you fix the CI failures?

DarkLight1337 avatar Aug 29 '24 03:08 DarkLight1337

Thanks for implementing this (and sorry for the delayed response)! Since this PR not only introduces a new modality (video) but also involves the first model to accept multiple modalities (excluding text), I would like to merge #7559 first to verify that vLLM can handle video inputs properly.

In the meantime, can you fix the CI failures?

Hi @DarkLight1337, these mypy errors don't seem to belong to this PR. Should I also fix them?

fyabc avatar Aug 29 '24 04:08 fyabc

Hi @DarkLight1337, these mypy errors don't seem to belong to this PR. Should I also fix them?

Can you merge from main first? It fixes some of the mypy errors which might apply here.

DarkLight1337 avatar Aug 29 '24 04:08 DarkLight1337

Hi @DarkLight1337 @ywang96, I have updated this PR based on your review comments; please check it again. I also added some notes about multiple modalities to the PR overview.

fyabc avatar Aug 29 '24 08:08 fyabc

@fyabc Hi, can this patch support multiple images in one prompt, like the following:

Compute the value of the expression in the image below <image_1>\nby using the emoji equations in the following images <image_2> <image_3> <image_4> <image_5> Only answer specific numerical values.

DragonFive avatar Aug 30 '24 07:08 DragonFive

@fyabc Hi, can this patch support multiple images in one prompt, like the following:

Compute the value of the expression in the image below <image_1>\nby using the emoji equations in the following images <image_2> <image_3> <image_4> <image_5> Only answer specific numerical values.

Hi @DragonFive , you can pass multiple images into a single prompt like this:

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "Identify the similarities between these images."},
        ],
    }
]

See the "Multi image inference" section of our README for more details.
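
Note that when running a multi-image prompt through vLLM offline inference, the per-prompt image limit also has to cover the number of images, as with the limit_mm_per_prompt setting in the example at the top of this PR (a minimal sketch; the limit value is just a placeholder):

from vllm import LLM

llm = LLM(
    model='Qwen/Qwen2-VL-7B-Instruct',
    limit_mm_per_prompt={'image': 5},  # allow up to 5 images per prompt
)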

fyabc avatar Aug 30 '24 11:08 fyabc

@DarkLight1337 I compared this PR to #6571, and IMO it probably makes more sense to merge this PR first, since the video/image utilities here come from a separate Python package (qwen_vl_utils), whereas #6571 might introduce opencv as an additional dependency. What do you think?

We also need to think about the eventual API protocol for the OpenAI API server. Though that is out of scope for this PR, this PR does introduce a slightly different messages format than the actual OpenAI API.

ywang96 avatar Sep 01 '24 06:09 ywang96

@DarkLight1337 I compared this PR to #6571, and IMO it probably makes more sense to merge this PR first, since the video/image utilities here come from a separate Python package (qwen_vl_utils), whereas #6571 might introduce opencv as an additional dependency. What do you think?

We also need to think about the eventual API protocol for the OpenAI API server. Though that is out of scope for this PR, this PR does introduce a slightly different messages format than the actual OpenAI API.

Give me some time to take a closer look at this PR first.

DarkLight1337 avatar Sep 01 '24 07:09 DarkLight1337

The use of qwen-vl-utils is quite different from the existing models which fully rely on the AutoProcessor from HuggingFace. Is there a particular reason why the preprocessing logic for this model is being split across AutoProcessor and qwen-vl-utils?

DarkLight1337 avatar Sep 01 '24 09:09 DarkLight1337

I did some local testing on this PR and it's working well for both .jpg and .mp4 inputs, on TP=1 and TP=2.

Note that I did run into a dependency issue when running the video inference:

[rank0]: ImportError: PyAV is not installed, and is necessary for the video operations in torchvision.

so this PR will also introduce a new dependency, like #7559. Perhaps it's a good idea to have a vllm[video] optional dependency path for all video-related dependencies.
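
(For anyone hitting the same error locally, installing PyAV directly should unblock torchvision's video decoding; its PyPI package name is av:)

pip install av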

ywang96 avatar Sep 01 '24 10:09 ywang96

I did some local testing on this PR and it's working well for both .jpg and .mp4 inputs, on TP=1 and TP=2.

Note that I did run into a dependency issue when running the video inference:

[rank0]: ImportError: PyAV is not installed, and is necessary for the video operations in torchvision.

so this PR will also introduce a new dependency, like #7559. Perhaps it's a good idea to have a vllm[video] optional dependency path for all video-related dependencies.

Thank you for your testing! I forgot about this dependency and will add it later.

fyabc avatar Sep 01 '24 11:09 fyabc

I happened to notice something while following this PR. I've merged it locally to run some tests, and during testing, I encountered a strange issue where, after several successful executions, only '!' is repeatedly output. Initially, everything runs smoothly, but after repeated runs, this symptom appears.

curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "Qwen/Qwen2-VL-7B-Instruct",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "image_url",
          "image_url": {
            "url": "https://sungyesa.com/new/data/file/secret/988486029_UV9Hq6Zt_IMG_8053.jpeg"
          }
        },
        {
          "type": "text",
          "text": "Describe this image"
        }
      ]
    }
  ],
  "max_tokens": 4096,
  "temperature": 0.0,
  "stream": false
}'

Failure case example

{"id":"chat-9a419b2e12bc405fa58b4a667709b5c9","object":"chat.completion","created":1725205065,"model":"Qwen/Qwen2-VL-7B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!","tool_calls":[]},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":3332,"total_tokens":7428,"completion_tokens":4096},"prompt_logprobs":null}

Success case example (tested with a different image)

{"id":"chat-6acd666dca494385b6c7fd335f71d332","object":"chat.completion","created":1725203421,"model":"Qwen/Qwen2-VL-7B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The image depicts a person and a dog sitting on a sandy beach at sunset. The person is wearing a plaid shirt and dark pants, and they are sitting cross-legged on the sand. The dog, which appears to be a large breed, is wearing a harness and is sitting next to the person, extending its paw towards the person's hand. The dog's paw is touching the person's hand, creating a playful and affectionate moment. The background shows the ocean with gentle waves and a clear sky, with the sun setting on the horizon, casting a warm, golden light over the scene. The overall atmosphere is serene and joyful, capturing a special bond between the person and their pet.","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":3603,"total_tokens":3743,"completion_tokens":140},"prompt_logprobs":null}

One line that I changed when merging the PR is as follows. Could this have affected the issue?

vllm/model_executor/models/qwen2_vl.py

from vllm_flash_attn.flash_attn_interface import flash_attn_varlen_func
# from flash_attn import flash_attn_varlen_func

smartdolphin avatar Sep 01 '24 15:09 smartdolphin

I happened to notice something while following this PR. I've merged it locally to run some tests, and during testing, I encountered a strange issue where, after several successful executions, only '!' is repeatedly output. Initially, everything runs smoothly, but after repeated runs, this symptom appears.

One line that I changed when merging the PR is as follows. Could this have affected the issue?

vllm/model_executor/models/qwen2_vl.py

from vllm_flash_attn.flash_attn_interface import flash_attn_varlen_func
# from flash_attn import flash_attn_varlen_func

This is probably related to the observation from the author as well:

Qwen2-VL uses flash-attn==2.6.1 (instead of vllm-flash-attn==2.6.1) to compute vision attention (see the commented line 36 in vllm/model_executor/models/qwen2_vl.py). The current vllm-flash-attn version outputs NaN logit values, and I am still debugging this bug.

Have you observed the same issue with flash-attn? @smartdolphin

ywang96 avatar Sep 01 '24 19:09 ywang96

The use of qwen-vl-utils is quite different from the existing models which fully rely on the AutoProcessor from HuggingFace. Is there a particular reason why the preprocessing logic for this model is being split across AutoProcessor and qwen-vl-utils?

@DarkLight1337 The goal of qwen-vl-utils is to allow users to control the parameters of each individual image/video (min_pixels, max_pixels, fps, etc.). The current transformers & vLLM implementations can run without the qwen-vl-utils package, but then users can only control those parameters globally by modifying preprocessor_config.json.

When merging Qwen2-VL into transformers, we had some discussions with the reviewers: https://github.com/huggingface/transformers/pull/32318#discussion_r1699673586 https://github.com/huggingface/transformers/pull/32318#discussion_r1699681579 In the end, we followed the usual convention (passing processed PIL images/arrays to AutoProcessor) and provided the qwen-vl-utils package for users who need more precise control.
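
For example, per-item control looks like this in the message content (a sketch based on the example at the top of this PR; fps is listed above as one of the controllable parameters, and the exact values here are placeholders):

{
    'type': 'video',
    'video': '/path/to/video.mp4',
    'max_pixels': 360 * 420,
    'fps': 1.0,
}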

fyabc avatar Sep 02 '24 02:09 fyabc

I happened to notice something while following this PR. I've merged it locally to run some tests, and during testing, I encountered a strange issue where, after several successful executions, only '!' is repeatedly output. Initially, everything runs smoothly, but after repeated runs, this symptom appears.

One line that I changed when merging the PR is as follows. Could this have affected the issue? vllm/model_executor/models/qwen2_vl.py

from vllm_flash_attn.flash_attn_interface import flash_attn_varlen_func
# from flash_attn import flash_attn_varlen_func

This is probably related to the observation from the author as well:

Qwen2-VL uses flash-attn==2.6.1 (instead of vllm-flash-attn==2.6.1) to compute vision attention (see the commented line 36 in vllm/model_executor/models/qwen2_vl.py). The current vllm-flash-attn version outputs NaN logit values, and I am still debugging this bug.

Have you observed the same issue with flash-attn? @smartdolphin

Yes @ywang96 @smartdolphin, vllm-flash-attn==2.6.1 outputs NaN logits in some cases, and NaN logits cause '!!!' outputs (NaN => output token_id == 0 => '!').

I haven't found the cause of this bug yet...

fyabc avatar Sep 02 '24 02:09 fyabc

The use of qwen-vl-utils is quite different from the existing models which fully rely on the AutoProcessor from HuggingFace. Is there a particular reason why the preprocessing logic for this model is being split across AutoProcessor and qwen-vl-utils?

@DarkLight1337 The goal of qwen-vl-utils is to allow users to control the parameters of each individual image/video (min_pixels, max_pixels, fps, etc.). The current transformers & vLLM implementations can run without the qwen-vl-utils package, but then users can only control those parameters globally by modifying preprocessor_config.json.

When merging Qwen2-VL into transformers, we had some discussions with the reviewers: huggingface/transformers#32318 (comment) huggingface/transformers#32318 (comment) In the end, we followed the usual convention (passing processed PIL images/arrays to AutoProcessor) and provided the qwen-vl-utils package for users who need more precise control.

I see, thanks for the detailed explanation! It looks like qwen-vl-utils acts like vllm.entrypoints.chat_utils in that regard.

DarkLight1337 avatar Sep 02 '24 02:09 DarkLight1337

How long will it take to merge?

PancakeAwesome avatar Sep 03 '24 06:09 PancakeAwesome

We need to wait for transformers to update their version.

DarkLight1337 avatar Sep 03 '24 06:09 DarkLight1337

Does this support the Qwen2-VL AWQ model?

seanzhang-zhichen avatar Sep 04 '24 01:09 seanzhang-zhichen

I found that GPTQ quantization raises the following error. If I skip loading these weights, following the qwen2 code, it runs. Is it reasonable to merge such a change directly into qwen2-vl? KeyError: 'model.layers.0.mlp.down_proj.bias'

pengxuan2022 avatar Sep 04 '24 02:09 pengxuan2022

Does this support the Qwen2-VL AWQ model?

@seanzhang-zhichen Yes, this PR supports AWQ models. You can check this issue for more details.

fyabc avatar Sep 04 '24 02:09 fyabc

I found that GPTQ quantization raises the following error. If I skip loading these weights, following the qwen2 code, it runs. Is it reasonable to merge such a change directly into qwen2-vl? KeyError: 'model.layers.0.mlp.down_proj.bias'

Hi @pengxuan2022, the previous versions of the Qwen2-VL-xxB-GPTQ-IntXX models had unused biases. We have updated the model parameters, and you can download the new weights from HuggingFace. You can check this issue for more details.

fyabc avatar Sep 04 '24 02:09 fyabc

@WoosukKwon please take a look at the mrope implementation in this PR when you get the chance. I saw that you were refactoring rope classes so perhaps you could take this into account and help integrate mrope into the existing framework.

DarkLight1337 avatar Sep 05 '24 04:09 DarkLight1337

Huge thanks for the PR! We are very happy to support this model very soon :)

I just left some comments mostly requiring some clarifications about mrope. While we don't have to optimize it in this PR, it'd be nice to have a clean and well-documented implementation so that anyone can optimize it easily in the future.

Thank you for your comments! I will update this PR later.

fyabc avatar Sep 05 '24 06:09 fyabc

tensor_parallel_size > 1 failed on rank > 0 workers:

RuntimeError: CUDA driver initialization failed, you might not have a CUDA gpu.

vexilligera avatar Sep 05 '24 07:09 vexilligera

Hi @DarkLight1337 @ywang96 @WoosukKwon, I have updated this PR according to the review comments; please check it again:

  1. Added an xformers backend for Qwen2VisionAttention, removing the dependency on flash-attn.
  2. Refactored MRotaryEmbedding.forward to align with RotaryEmbedding.

fyabc avatar Sep 06 '24 12:09 fyabc

I found that vLLM's text model directly reuses the Qwen2 code. In the HF code, the text model's position-encoding function apply_multimodal_rotary_pos_emb of Qwen2-VL and apply_rotary_pos_emb of Qwen2 are somewhat different. Is the position encoding of the text model here the same as that of Qwen2?

pengxuan2022 avatar Sep 09 '24 01:09 pengxuan2022

I found that vLLM's text model directly reuses the Qwen2 code. In the HF code, the text model's position-encoding function apply_multimodal_rotary_pos_emb of Qwen2-VL and apply_rotary_pos_emb of Qwen2 are somewhat different. Is the position encoding of the text model here the same as that of Qwen2?

@pengxuan2022 Hi, in Qwen2Model -> Qwen2Attention, we use the get_rope function to initialize the rotary embedding module from the checkpoint config. So the text model in Qwen2-VL also uses mrope rotary position embeddings; this behavior is consistent with HF.
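
For reference, a quick way to inspect the checkpoint config field that selects the mrope path in get_rope (the mrope_section values in the comment below are from memory and may differ per checkpoint):

from transformers import AutoConfig

config = AutoConfig.from_pretrained('Qwen/Qwen2-VL-7B-Instruct')
print(config.rope_scaling)
# Expected to look something like:
# {'type': 'mrope', 'mrope_section': [16, 24, 24]}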

fyabc avatar Sep 09 '24 07:09 fyabc

The VLM part looks good to me overall. Just a few things left to do:

  • Could you add a test case (under tests/) to verify that the model works?
  • Having an example (inside examples/offline_inference_vision_language.py or its multi-image counterpart) would be great so that people won't have to refer to this PR directly.

@DarkLight1337 Hi, if I want to use the qwen-vl-utils package inside the example, should I add it to requirements-test.txt (or directly to requirements-common.txt)?

fyabc avatar Sep 09 '24 07:09 fyabc