WhisperFusion
Can we get WhisperFusion working with a vision-language model?
Examples:
- https://huggingface.co/microsoft/Phi-3-vision-128k-instruct
- https://huggingface.co/LanguageBind/Video-LLaVA-7B-hf
- https://huggingface.co/Vision-CAIR/MiniGPT4-Video