[model] Understanding video with images as in-context examples

Open kassy11 opened this issue 2 years ago • 1 comments

I want to give the model some images as in-context examples, then input a video and ask questions about the video content. (Specifically, I would like to teach the model a type of dog using images and then have the model count the number of dogs in the video.)

The Otter-image model accepts images as context but cannot take video input, while the Otter-video model accepts video input but cannot take images as context.
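One possible workaround (an untested assumption, not something confirmed for Otter) would be to sample a handful of frames from the video and pass them to the image model after the reference images. Below is a minimal sketch of just the frame-selection and ordering step. The helper names `sample_frame_indices` and `build_image_sequence` and the `frame_00012`-style labels are hypothetical placeholders; frame decoding and the actual model call are left out.

```python
from typing import List


def sample_frame_indices(total_frames: int, num_samples: int) -> List[int]:
    """Pick up to num_samples frame indices spread evenly across a video."""
    if total_frames <= 0 or num_samples <= 0:
        return []
    # Never ask for more frames than the video contains.
    num_samples = min(num_samples, total_frames)
    # Split the video into equal-width segments and take each segment's midpoint.
    step = total_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


def build_image_sequence(context_images: List[str], frame_indices: List[int]) -> List[str]:
    """Order inputs as: reference (in-context) images first, then sampled frames.

    Frame labels are placeholders standing in for decoded frames.
    """
    frames = [f"frame_{i:05d}" for i in frame_indices]
    return list(context_images) + frames


if __name__ == "__main__":
    indices = sample_frame_indices(total_frames=100, num_samples=4)
    print(build_image_sequence(["dog_ref_1.jpg", "dog_ref_2.jpg"], indices))
```

Whether the image model can reason over frames passed this way (for example, counting without double-counting the same dog across frames) is an open question here.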

Is there a recommended implementation approach or model for this kind of use case?

Sep 18 '23 06:09 kassy11

I have the same needs!!! Have you solved it?

Oct 26 '23 06:10 hcwei13