[RFC]: Support for video input
Motivation.
Currently, models like llava-hf/llava-next-video* distinguish image and video inputs with different tokens and perform different computations for each. Therefore vLLM should provide new APIs and inference support for video input.
Proposed Change.
API
- `LLM.generate()` API for video: `LLM.generate({"prompt": "<video> please summarize this video", "multi_modal_data": {"video": video}})`
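For illustration, here is a minimal, self-contained sketch of how the proposed call could look end to end. The checkpoint name, prompt template, and frame-array format are assumptions for the example; the accepted video representation would ultimately be defined by the video plugin.

```python
# Sketch of the proposed video API (not final). Assumes the video is passed as a
# (num_frames, height, width, 3) uint8 array of RGB frames.
import numpy as np
from vllm import LLM, SamplingParams

# Example checkpoint; any model that uses a dedicated <video> token would work.
llm = LLM(model="llava-hf/LLaVA-NeXT-Video-7B-hf")

# Placeholder frames. In practice, decode a real clip (e.g. with OpenCV or decord)
# into the same layout.
video = np.zeros((8, 336, 336, 3), dtype=np.uint8)

outputs = llm.generate(
    {
        "prompt": "USER: <video>\nPlease summarize this video. ASSISTANT:",
        "multi_modal_data": {"video": video},
    },
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```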
Roadmap
- Add `VideoPlugin` for `MultiModalPlugin` - #7559
- Add initial support for replacing a <video> token with a single video.
- Add support for replacing all <video> and <image> tokens with multiple multi-modal inputs.
- Support prefix caching for the same videos.
- Support OpenAI chat completion APIs (a possible request shape is sketched below).
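As a rough illustration of the last roadmap item, one possible request shape mirrors the existing `image_url` content part. The `video_url` content type is an assumption here and would need to be settled in its own RFC, since the official OpenAI API does not define video inputs.

```python
# Hypothetical chat completions request against the OpenAI-compatible server.
# The "video_url" content part is an assumed extension, analogous to "image_url".
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="llava-hf/LLaVA-NeXT-Video-7B-hf",  # example video-capable checkpoint
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Please summarize this video."},
                {"type": "video_url", "video_url": {"url": "https://example.com/clip.mp4"}},
            ],
        }
    ],
    max_tokens=128,
)
print(response.choices[0].message.content)
```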
Feedback Period.
A week
CC List.
@DarkLight1337 @zifeitong @ywang
Any Other Things.
No response
Is there a roadmap for video input support in the OpenAI-compatible vLLM server? Thank you
Thank you. It is a very important feature, but the APIs need careful discussion. I don't currently have plans to work on it myself.
CC @DarkLight1337 @ywang96
@TKONIY Sounds good - we will take it over from here. Thank you for contributing the model!
@ywang96 I am quite interested in using vllm for high-performance video captioning. This will be tremendously helpful for furthering research on video generation from language.
I was able to use Qwen for this: https://github.com/vllm-project/vllm/issues/9128#issuecomment-2399642038.
I want to know how to extend this to multi-video captioning. If you have some pointers for me, that would be helpful!
Cc @TKONIY if https://github.com/vllm-project/vllm/issues/7558#issuecomment-2399698986 sounds interesting.
@sayakpaul Sorry, I have not been following vLLM development closely recently. If you want to implement online serving support for video / multi-video, I think you can start with an RFC to define the HTTP API (which is not covered by the official OpenAI API), then modify some frontend code in entrypoints/ to connect it with the LLM engine.
Currently, I am doing this: https://github.com/vllm-project/vllm/issues/9128#issuecomment-2399642038
Hey @sayakpaul! Sorry for the late reply; this is definitely interesting.
AFAIK, only Llava-OneVision will support multi-video captioning once https://github.com/vllm-project/vllm/pull/8905 is merged, and this is more a matter of model capability than of inference infrastructure capability. (To my knowledge, video-language inference is more or less the same as multi-image inference, so it comes down to whether the model itself was trained for this task.)
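To make that concrete, here is a rough sketch of what multi-video input could look like through the offline API, following the same list-valued `multi_modal_data` pattern used for multi-image inputs. The checkpoint name and prompt formatting are assumptions for the example; the exact template depends on the model.

```python
# Sketch of multi-video captioning, assuming a model trained on multiple videos
# (e.g. Llava-OneVision once the linked PR lands). Videos are passed as a list,
# mirroring how multiple images are passed.
import numpy as np
from vllm import LLM, SamplingParams

llm = LLM(
    model="llava-hf/llava-onevision-qwen2-7b-ov-hf",  # example checkpoint
    limit_mm_per_prompt={"video": 2},                 # allow two videos per prompt
)

# Placeholder frame stacks; decode real clips into (num_frames, H, W, 3) arrays.
video_a = np.zeros((8, 384, 384, 3), dtype=np.uint8)
video_b = np.zeros((8, 384, 384, 3), dtype=np.uint8)

outputs = llm.generate(
    {
        "prompt": "<video><video>\nDescribe and compare the two videos.",
        "multi_modal_data": {"video": [video_a, video_b]},
    },
    SamplingParams(max_tokens=256),
)
print(outputs[0].outputs[0].text)
```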
Agreed, thanks for your input.
Would it be possible to update this thread with an example of doing multi-video captioning?
Yea - I'll keep that in mind and update this issue once we have some bandwidth to get #8905 merged. Will probably update our example scripts too!
Closing as completed since #8905 and #9842 have both been resolved.
@DarkLight1337 which part of the docs should we refer to for this?
Please see #9842 on how to use it.
@litianjian can you add some docs/examples for this?
@DarkLight1337 thanks!
Is https://github.com/vllm-project/vllm/pull/10020#issue-2634358870 a good enough example for me to use as a reference?
Yes, that should be clear enough.
Sure, I can add some docs/examples for this.