vllm [Frontend] re-enable multi-modality input in the new beam search implementation

FerdinandZhong opened this issue 4 months ago

Changes in this PR:

This PR introduces the following changes based on the updated beam search implementation:

Re-enable multi-modality input: Support for multimodal input (e.g. multiple images per prompt) is re-enabled for beam search on the OpenAI-compatible endpoints.
Logprobs handling in ChatCompletionRequest: Added validation that disables logprobs when use_beam_search=True. Beam search ranks candidates by their cumulative logprobs and uses beam_width to decide how many logprobs to compute at each step, so it ignores any logprobs and top_logprobs values passed in the request.
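To make the reasoning behind the logprobs change concrete, here is a toy beam search over a hypothetical next-token function (`toy_model`), plus a hypothetical request check (`validate_beam_search_request`). Neither is vLLM code. The beam search only illustrates that each step expands a number of candidates fixed by `beam_width` and ranks them by cumulative logprob. The check assumes offending requests are rejected with a `ValueError`; the PR itself only says logprobs are disabled.

```python
import math
from typing import Callable, Dict, List, Tuple

# Hypothetical "model": maps a token prefix to next-token log-probabilities.
NextLogprobs = Callable[[Tuple[str, ...]], Dict[str, float]]


def beam_search(next_logprobs: NextLogprobs, beam_width: int,
                max_tokens: int) -> List[Tuple[Tuple[str, ...], float]]:
    """Toy beam search keeping the beam_width best prefixes by cumulative logprob."""
    beams: List[Tuple[Tuple[str, ...], float]] = [((), 0.0)]
    for _ in range(max_tokens):
        candidates = []
        for tokens, score in beams:
            # Each beam only needs its top beam_width next-token logprobs;
            # request-level logprobs/top_logprobs play no role here.
            step = sorted(next_logprobs(tokens).items(),
                          key=lambda kv: kv[1], reverse=True)[:beam_width]
            for tok, lp in step:
                candidates.append((tokens + (tok,), score + lp))
        # Rank every expansion by cumulative logprob and keep the best beam_width.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams


def toy_model(prefix: Tuple[str, ...]) -> Dict[str, float]:
    """Hypothetical next-token distribution used only for this example."""
    if not prefix:
        return {"a": math.log(0.6), "b": math.log(0.4)}
    if prefix[-1] == "a":
        return {"x": math.log(0.5), "y": math.log(0.5)}
    return {"z": math.log(0.9), "w": math.log(0.1)}


def validate_beam_search_request(body: dict) -> dict:
    """Hypothetical check: refuse logprobs options when beam search is enabled."""
    if body.get("use_beam_search") and (
            body.get("logprobs") or body.get("top_logprobs") is not None):
        raise ValueError(
            "logprobs/top_logprobs are not supported with use_beam_search=True")
    return body


best = beam_search(toy_model, beam_width=2, max_tokens=2)
for tokens, score in best:
    print(tokens, round(math.exp(score), 2))
```

Greedy decoding would commit to "a" (probability 0.6) and finish at 0.3, while the beam keeps "b" alive and returns ("b", "z") with probability 0.36 ahead of ("a", "x") at 0.3. The per-step work depends only on `beam_width`, which is why user-supplied logprob settings are not meaningful here.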

Unit Test

Added two test cases to tests/entrypoints/openai/test_vision.py.

Manual Testing

The following command was used to launch the server for manual testing:

vllm serve microsoft/Phi-3.5-vision-instruct --api-key token-abc123 --trust-remote-code --max-model-len 4096 --limit-mm-per-prompt image=2

Client script used to test the changes:

import openai
import asyncio


# Async client for the vLLM OpenAI-compatible server launched above
client = openai.AsyncOpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",
)


# Image URLs
img_urls = [
    "https://upload.wikimedia.org/wikipedia/commons/c/cb/Brachiosaurus_DB_flipped.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/3/3d/Allosaurus_Revised.jpg"
]

# Define the messages for the chat completion
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {
                    "url": img_urls[0]
                }
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": img_urls[1]
                }
            },
            {
                "type": "text",
                "text": "what are the animals in the images?"
            }
        ]
    }
]

async def make_request():
    try:
        response = await client.chat.completions.create(
            model="microsoft/Phi-3.5-vision-instruct",
            max_tokens=32,
            temperature=0,
            messages=messages,
            n=2,  # with beam search enabled, n is used as the beam width
            extra_body={"use_beam_search": True},  # vLLM-specific request field
        )
        for choice in response.choices:
            print(choice.message.content)

    except openai.BadRequestError as e:
        # The server answers invalid requests with HTTP 400
        print(f"Error {e.status_code}: {e.message}")

asyncio.run(make_request())
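The `openai` package merges the `extra_body` dict into the top level of the JSON request body, so the same beam search request can also be built without the SDK, for example to send with curl. The sketch below uses a hypothetical helper, `build_beam_search_payload`, and assumes, as in the script, that `n` sets the beam width.

```python
import json
from typing import Any, Dict, List


def build_beam_search_payload(model: str, image_urls: List[str], question: str,
                              beam_width: int = 2,
                              max_tokens: int = 32) -> Dict[str, Any]:
    """Hypothetical helper: build a /v1/chat/completions body like the script's request."""
    # One image_url part per image, followed by the text question.
    content: List[Dict[str, Any]] = [
        {"type": "image_url", "image_url": {"url": url}} for url in image_urls
    ]
    content.append({"type": "text", "text": question})
    return {
        "model": model,
        "messages": [{"role": "user", "content": content}],
        "max_tokens": max_tokens,
        "temperature": 0,
        "n": beam_width,  # assumed to act as the beam width, as in the script
        "use_beam_search": True,  # the field extra_body adds at the top level
    }


if __name__ == "__main__":
    payload = build_beam_search_payload(
        "microsoft/Phi-3.5-vision-instruct",
        [
            "https://upload.wikimedia.org/wikipedia/commons/c/cb/Brachiosaurus_DB_flipped.jpg",
            "https://upload.wikimedia.org/wikipedia/commons/3/3d/Allosaurus_Revised.jpg",
        ],
        "what are the animals in the images?",
    )
    print(json.dumps(payload, indent=2))
```

The printed JSON can be POSTed to http://localhost:8000/v1/chat/completions with the header `Authorization: Bearer token-abc123`.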

These manual tests verified that multi-image input is handled correctly and that beam search generates correct responses.

Oct 16 '24 16:10 FerdinandZhong
