
[RFC]: Support for video input


Motivation.

Currently, models like llava-hf/llava-next-video* recognize image and video inputs via different tokens and perform different computations for each. vLLM should therefore provide new APIs and inference support for video input.

Proposed Change.

API

Roadmap

  • Add VideoPlugin for MultiModalPlugin
  • #7559
    • Add initial support for replacing a <video> token with a single video.
    • Add support for replacing all <video> and <image> tokens with multiple multi-modal inputs.
  • Support prefix caching for the same videos.
  • Support openai chat completion APIs.
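
For illustration, here is a minimal sketch of what offline video inference could look like once the plugin lands, modeled on the existing image path. The model name, prompt template, and frame-array shape are assumptions for the sake of the example:

```python
from vllm import LLM, SamplingParams
import numpy as np

# A video-capable model; video inputs would flow through the same
# multi_modal_data path as images.
llm = LLM(model="llava-hf/LLaVA-NeXT-Video-7B-hf")

# Stand-in for real decoded frames: a (num_frames, height, width, channels)
# uint8 array, e.g. produced by decoding a clip with OpenCV or decord.
video = np.zeros((8, 336, 336, 3), dtype=np.uint8)

# The <video> placeholder token gets replaced by the processed video features.
prompt = "USER: <video>\nDescribe this clip. ASSISTANT:"

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"video": video}},
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```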

Feedback Period.

A week

CC List.

@DarkLight1337 @zifeitong @ywang96

Any Other Things.

No response

TKONIY avatar Aug 15 '24 15:08 TKONIY

Is there a roadmap for 'openai vllm server supports video interface' feature? Thank you

PancakeAwesome avatar Sep 12 '24 12:09 PancakeAwesome

> Is there a roadmap for 'openai vllm server supports video interface' feature? Thank you

Thank you. It is a very important feature, but the APIs need careful discussion. I don't currently have plans to work on it myself.

CC @DarkLight1337 @ywang96

TKONIY avatar Sep 12 '24 15:09 TKONIY

> Is there a roadmap for 'openai vllm server supports video interface' feature? Thank you
>
> Thank you. It is a very important feature, but the APIs need careful discussion. I don't currently have plans to work on it myself.
>
> CC @DarkLight1337 @ywang96

@TKONIY Sounds good - we will take it over from here. Thank you for your contribution to the model!

ywang96 avatar Sep 14 '24 08:09 ywang96

@ywang96 I am quite interested in using vllm for high-performance video captioning. This will be tremendously helpful for furthering research on video generation from language.

I was able to use Qwen for this: https://github.com/vllm-project/vllm/issues/9128#issuecomment-2399642038.

I want to know how to extend this to multi-video captioning. If you have some pointers for me, that would be helpful!

sayakpaul avatar Oct 08 '24 12:10 sayakpaul
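
For context, the linked approach boils down to decoding frames yourself and passing them as multi-modal data. A rough sketch with Qwen2-VL follows; the prompt template and frame shape are assumptions based on common examples and may need adjusting for your model version:

```python
from vllm import LLM, SamplingParams
import numpy as np

llm = LLM(model="Qwen/Qwen2-VL-7B-Instruct")

# Frames decoded from the clip to caption; shape and dtype are illustrative.
frames = np.zeros((16, 448, 448, 3), dtype=np.uint8)

# Qwen2-VL marks the video position with vision tokens; check the model
# card for the exact chat template of your checkpoint.
prompt = (
    "<|im_start|>user\n"
    "<|vision_start|><|video_pad|><|vision_end|>"
    "Caption this video in one sentence.<|im_end|>\n"
    "<|im_start|>assistant\n"
)

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"video": frames}},
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```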

Cc @TKONIY if https://github.com/vllm-project/vllm/issues/7558#issuecomment-2399698986 sounds interesting.

sayakpaul avatar Oct 13 '24 16:10 sayakpaul

@sayakpaul Sorry, I haven't been following vLLM development closely recently. If you want to implement online serving support for video / multi-video input, I think you could start with an RFC defining the HTTP API (which the official OpenAI API does not yet support), then modify some frontend code in entrypoints/ to connect it to the LLM engine.

TKONIY avatar Oct 14 '24 10:10 TKONIY
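
To make that suggestion concrete, such an API could mirror the OpenAI image_url content part with a video analogue. The video_url content type below is hypothetical; the official OpenAI API does not define one, so the exact schema would be decided in the RFC:

```python
import requests

# Hypothetical chat-completions request against a vLLM OpenAI-compatible
# server. The "video_url" content type is an assumption, not part of the
# official OpenAI API.
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "llava-hf/LLaVA-NeXT-Video-7B-hf",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "What happens in this clip?"},
                    {
                        "type": "video_url",
                        "video_url": {"url": "https://example.com/clip.mp4"},
                    },
                ],
            }
        ],
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```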

Currently, I am doing this: https://github.com/vllm-project/vllm/issues/9128#issuecomment-2399642038

sayakpaul avatar Oct 14 '24 13:10 sayakpaul

> @ywang96 I am quite interested in using vllm for high-performance video captioning. This will be tremendously helpful for furthering research on video generation from language.
>
> I was able to use Qwen for this: #9128 (comment).
>
> I want to know how to extend this to multi-video captioning. If you have some pointers for me, that would be helpful!

Hey @sayakpaul! Sorry for the late reply and this is definitely interesting.

AFAIK, only LLaVA-OneVision will support multi-video captioning once https://github.com/vllm-project/vllm/pull/8905 is merged, and this is more a matter of model capability than of inference-infrastructure capability. (To my knowledge, video-language inference is more or less the same as multi-image inference, so it comes down to whether the model itself was trained to do the task.)

ywang96 avatar Oct 15 '24 04:10 ywang96
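
To sketch what multi-video inference could look like once #8905 lands, following the existing multi-image pattern (the model name, prompt template, and limit_mm_per_prompt value are assumptions):

```python
from vllm import LLM, SamplingParams
import numpy as np

# Allow up to two videos per prompt, mirroring the multi-image
# limit_mm_per_prompt mechanism.
llm = LLM(
    model="llava-hf/llava-onevision-qwen2-7b-ov-hf",
    limit_mm_per_prompt={"video": 2},
)

# Two illustrative frame stacks standing in for decoded clips.
video_a = np.zeros((8, 384, 384, 3), dtype=np.uint8)
video_b = np.zeros((8, 384, 384, 3), dtype=np.uint8)

# One <video> placeholder per clip, in the same order as the list
# passed under "video".
prompt = (
    "<|im_start|>user <video><video>\n"
    "Compare these two clips.<|im_end|>\n"
    "<|im_start|>assistant\n"
)

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"video": [video_a, video_b]}},
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```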

> AFAIK, only LLaVA-OneVision will support multi-video captioning once https://github.com/vllm-project/vllm/pull/8905 is merged, and this is more a matter of model capability than of inference-infrastructure capability

Agreed. Thanks for your input.

Would it be possible to update this thread with an example of multi-video captioning?

sayakpaul avatar Oct 15 '24 04:10 sayakpaul

> > AFAIK, only LLaVA-OneVision will support multi-video captioning once #8905 is merged, and this is more a matter of model capability than of inference-infrastructure capability
>
> Agreed. Thanks for your input.
>
> Would it be possible to update this thread with an example of multi-video captioning?

Yea - I'll keep that in mind and update this issue once we have some bandwidth to get #8905 merged. Will probably update our example scripts too!

ywang96 avatar Oct 15 '24 05:10 ywang96

Closing as completed since #8905 and #9842 have both been resolved.

DarkLight1337 avatar Nov 08 '24 03:11 DarkLight1337

@DarkLight1337 which part of the docs should we refer to for this?

sayakpaul avatar Nov 08 '24 10:11 sayakpaul

Please see #9842 for how to use it.

DarkLight1337 avatar Nov 08 '24 10:11 DarkLight1337

@litianjian can you add some docs/examples for this?

DarkLight1337 avatar Nov 08 '24 10:11 DarkLight1337

@DarkLight1337 thanks!

Is https://github.com/vllm-project/vllm/pull/10020#issue-2634358870 a sufficiently good example for me as a reference?

sayakpaul avatar Nov 09 '24 16:11 sayakpaul

> @DarkLight1337 thanks!
>
> Is #10020 (comment) a sufficiently good example for me as a reference?

Yes, that should be clear enough.

DarkLight1337 avatar Nov 09 '24 16:11 DarkLight1337

> @litianjian can you add some docs/examples for this?

Sure

litianjian avatar Nov 11 '24 02:11 litianjian