Otter
[model] Understanding video with images as in-context examples
I want to give the model some images as in-context examples, then input a video and ask questions about the video content.
(Specifically, I would like to teach the model the types of dogs using images and then have it count the number of dogs in the video.)
The Otter-image model accepts images as context but cannot take video input, while the Otter-video model accepts video input but cannot take images as context.
Is there a recommended implementation approach or model for this kind of situation?
I have the same needs!!! Have you solved it?