text-generation-inference
Enable qwen2vl video
This PR is a work in progress that explores adding support for video inputs with Qwen2-VL. Thank you @mfarre for getting this effort started.
TODOS
- [X] support video_urls
- [X] fetch video contents in router
- [X] update protobufs to support video chunks
- [X] handle padding video token inputs
- [X] tokenize video bytes
- [X] integrate video logic with vision model (update position ids)
- [x] ensure tokenization process is correct
- [x] add tests
- [x] refactor/improve
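The "handle padding video token inputs" step above expands each video into a run of placeholder tokens in the prompt. A minimal sketch of that bookkeeping, assuming Qwen2-VL's published vision config (14x14 patches, temporal patch size 2, 2x2 spatial merge) rather than the exact values this PR uses:

```python
def num_video_pad_tokens(frames: int, height: int, width: int,
                         patch: int = 14, temporal_patch: int = 2,
                         merge: int = 2) -> int:
    """Number of placeholder tokens a sampled clip expands to.

    Assumes Qwen2-VL-style patching: frames are paired temporally,
    the image is cut into patch x patch tiles, and merge x merge
    neighbouring patches collapse into one LLM token.
    """
    grid_t = frames // temporal_patch
    grid_h = height // patch
    grid_w = width // patch
    return (grid_t * grid_h * grid_w) // (merge * merge)


# e.g. a 16-frame clip resized to 308x308 -> 8 * 22 * 22 / 4 = 968 tokens
print(num_video_pad_tokens(16, 308, 308))
```

This is why the example below needs `--max-input-tokens` in the thousands: even a short, low-resolution clip expands into several thousand prompt tokens.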
start server

```bash
text-generation-launcher \
    --model-id Qwen/Qwen2-VL-7B-Instruct \
    --max-batch-prefill-tokens 10000 \
    --max-input-tokens 10000 \
    --max-total-tokens 10001
```
send request

```python
import requests
import json


def chat_completion(url="http://127.0.0.1:3000", video_url=None, prompt=None):
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "video_url", "video_url": {"url": video_url}},
                {"type": "text", "text": prompt},
            ],
        }
    ]
    payload = {"messages": messages, "seed": 42, "max_tokens": 30}
    response = requests.post(
        f"{url}/v1/chat/completions",
        json=payload,
        headers={"Content-Type": "application/json"},
    )
    return response.json()


video_url = "https://test-videos.co.uk/vids/bigbuckbunny/mp4/h264/360/Big_Buck_Bunny_360_10s_1MB.mp4"
result = chat_completion(video_url=video_url, prompt="Describe this video.")
print(json.dumps(result, indent=2))
# {
#   "object": "chat.completion",
#   "id": "",
#   "created": 1731964042,
#   "model": "Qwen/Qwen2-VL-7B-Instruct",
#   "system_fingerprint": "2.4.1-dev0-native",
#   "choices": [
#     {
#       "index": 0,
#       "message": {
#         "role": "assistant",
#         "content": "The video showcases lush green trees with vibrant shades of green and various shades of yellow and brown, as well as moss-covered stumps and piles of moss"
#       },
#       "logprobs": null,
#       "finish_reason": "length"
#     }
#   ],
#   "usage": {"prompt_tokens": 9593, "completion_tokens": 30, "total_tokens": 9623}
# }
```
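To consume the result programmatically, a small helper (plain dict access, assuming the response shape printed above) can pull out the generated text and token usage:

```python
def extract_result(response: dict) -> tuple[str, int]:
    """Return (assistant text, total token count) from a chat-completion dict."""
    content = response["choices"][0]["message"]["content"]
    total_tokens = response["usage"]["total_tokens"]
    return content, total_tokens


# Exercise the helper with a trimmed copy of the sample response above.
sample = {
    "choices": [
        {"index": 0, "message": {"role": "assistant", "content": "The video showcases lush green trees"}}
    ],
    "usage": {"prompt_tokens": 9593, "completion_tokens": 30, "total_tokens": 9623},
}
text, total = extract_result(sample)
print(total)  # 9623
```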