[RFC]: Support for video input
Motivation.
Currently, models like llava-hf/llava-next-video* distinguish image and video inputs with different tokens and perform different computations for each. Therefore vLLM should provide new APIs and inference support for video input.
Proposed Change.
API
- `LLM.generate()` API for video: `LLM.generate({"prompt": "<video> please summarize this video", "multi_modal_data": {"video": video}})`
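For illustration, here is a minimal, self-contained sketch of how the proposed call could look end to end. The checkpoint name, prompt template, and frame-array format are assumptions for the example; the accepted video representation would ultimately be defined by the video plugin.

```python
# Sketch of the proposed video API (not final). Assumes the video is passed as a
# (num_frames, height, width, 3) uint8 array of RGB frames.
import numpy as np
from vllm import LLM, SamplingParams

# Example checkpoint; any model that uses a dedicated <video> token would work.
llm = LLM(model="llava-hf/LLaVA-NeXT-Video-7B-hf")

# Placeholder frames. In practice, decode a real clip (e.g. with OpenCV or decord)
# into the same layout.
video = np.zeros((8, 336, 336, 3), dtype=np.uint8)

outputs = llm.generate(
    {
        "prompt": "USER: <video>\nPlease summarize this video. ASSISTANT:",
        "multi_modal_data": {"video": video},
    },
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```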
Roadmap
- Add `VideoPlugin` for `MultiModalPlugin` - #7559
- Add initial support for replacing a <video> token with a single video.
- Add support for replacing all <video> and <image> tokens with multiple multi-modal inputs.
- Support prefix caching for the same videos.
- Support OpenAI chat completion APIs (a possible request shape is sketched below).
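As a rough illustration of the last roadmap item, one possible request shape mirrors the existing `image_url` content part. The `video_url` content type is an assumption here and would need to be settled in its own RFC, since the official OpenAI API does not define video inputs.

```python
# Hypothetical chat completions request against the OpenAI-compatible server.
# The "video_url" content part is an assumed extension, analogous to "image_url".
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="llava-hf/LLaVA-NeXT-Video-7B-hf",  # example video-capable checkpoint
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Please summarize this video."},
                {"type": "video_url", "video_url": {"url": "https://example.com/clip.mp4"}},
            ],
        }
    ],
    max_tokens=128,
)
print(response.choices[0].message.content)
```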
Feedback Period.
A week
CC List.
@DarkLight1337 @zifeitong @ywang
Any Other Things.
No response
Is there a roadmap for video input support in the OpenAI-compatible vLLM server? Thank you
Thank you. It is a very important feature, but the APIs need careful discussion. I don't currently have plans to work on it myself.
CC @DarkLight1337 @ywang96
@TKONIY Sounds good - we will take it over from here. Thank you for contributing the model!
@ywang96 I am quite interested in using vllm for high-performance video captioning. This will be tremendously helpful for furthering research on video generation from language.
I was able to use Qwen for this: https://github.com/vllm-project/vllm/issues/9128#issuecomment-2399642038.
I want to know how to extend this to multi-video captioning. If you have some pointers for me, that would be helpful!
Cc @TKONIY if https://github.com/vllm-project/vllm/issues/7558#issuecomment-2399698986 sounds interesting.
@sayakpaul Sorry, I have not been following vLLM development closely recently. If you want to implement online serving support for video / multi-video, I think you can start with an RFC to define the HTTP API (which is not covered by the official OpenAI API), then modify some frontend code in entrypoints/ to connect it with the LLM engine.
Currently, I am doing this: https://github.com/vllm-project/vllm/issues/9128#issuecomment-2399642038
Hey @sayakpaul! Sorry for the late reply; this is definitely interesting.
AFAIK, only Llava-OneVision will support multi-video captioning once https://github.com/vllm-project/vllm/pull/8905 is merged, and this is more a matter of model capability than of inference infrastructure capability. (To my knowledge, video-language inference is more or less the same as multi-image inference, so it comes down to whether the model itself was trained for this task.)
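To make that concrete, here is a rough sketch of what multi-video input could look like through the offline API, following the same list-valued `multi_modal_data` pattern used for multi-image inputs. The checkpoint name and prompt formatting are assumptions for the example; the exact template depends on the model.

```python
# Sketch of multi-video captioning, assuming a model trained on multiple videos
# (e.g. Llava-OneVision once the linked PR lands). Videos are passed as a list,
# mirroring how multiple images are passed.
import numpy as np
from vllm import LLM, SamplingParams

llm = LLM(
    model="llava-hf/llava-onevision-qwen2-7b-ov-hf",  # example checkpoint
    limit_mm_per_prompt={"video": 2},                 # allow two videos per prompt
)

# Placeholder frame stacks; decode real clips into (num_frames, H, W, 3) arrays.
video_a = np.zeros((8, 384, 384, 3), dtype=np.uint8)
video_b = np.zeros((8, 384, 384, 3), dtype=np.uint8)

outputs = llm.generate(
    {
        "prompt": "<video><video>\nDescribe and compare the two videos.",
        "multi_modal_data": {"video": [video_a, video_b]},
    },
    SamplingParams(max_tokens=256),
)
print(outputs[0].outputs[0].text)
```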
Agreed, thanks for your input.
Would it be possible to update this thread with an example of doing multi-video captioning?
Yea - I'll keep that in mind and update this issue once we have some bandwidth to get #8905 merged. Will probably update our example scripts too!
Closing as completed since #8905 and #9842 have both been resolved.
@DarkLight1337 which part of the docs should we refer to for this?
Please see #9842 on how to use it.
@litianjian can you add some docs/examples for this?
@DarkLight1337 thanks!
Is https://github.com/vllm-project/vllm/pull/10020#issue-2634358870 a good enough example for me to use as a reference?
Yes, that should be clear enough.
Sure, I can add some docs/examples for this.