Relation of Video-LLaVA and LanguageBind

Open: song-wensong opened this issue 1 year ago • 1 comment

Excellent job!

I have three questions:

1. I am unclear about the relation between Video-LLaVA and LanguageBind. Does Video-LLaVA use LanguageBind's video encoder?
2. I have an encoder for another modality, fMRI. If I want to do contrastive learning between fMRI, video, and the corresponding captions (using Video-LLaVA to caption the videos), should I build on LanguageBind?
3. Does Video-LLaVA have any requirements on the number of video frames?

Feb 07 '24 14:02 song-wensong

For 1, yes. For 2, you can try both approaches and see which works best. For 3, it depends on the video encoder, which currently supports only 8 frames.

Feb 07 '24 15:02 LinB203
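
To illustrate the 8-frame limit mentioned in the reply, here is a minimal sketch of one way to reduce a clip of arbitrary length to a fixed number of frame indices before passing it to the video encoder. Uniform sampling is an assumption here, not something the reply confirms, and `uniform_frame_indices` is a hypothetical helper, not part of Video-LLaVA or LanguageBind.

```python
def uniform_frame_indices(total_frames: int, num_frames: int = 8) -> list[int]:
    """Pick `num_frames` evenly spaced frame indices from a clip.

    Hypothetical helper for illustration only. The default of 8 matches
    the frame count mentioned in the reply above.
    """
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")

    # Split the clip into `num_frames` equal-width segments and take the
    # frame at the centre of each one. Clips shorter than `num_frames`
    # end up with repeated indices, so the output length is always fixed.
    step = total_frames / num_frames
    return [min(total_frames - 1, int(step * (i + 0.5))) for i in range(num_frames)]


# An 80-frame clip is reduced to 8 evenly spaced indices.
print(uniform_frame_indices(80))
```

The selected indices can then be used to decode only those frames with whichever video reader you already use. Clips with fewer than 8 frames get duplicated indices rather than raising an error, which keeps the input shape constant.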