LLaVA-NeXT
Fix: videos in LLaVa-OV
Currently, running the demo notebook for LLaVA OneVision with the video modality doesn't apply pooling to all video patches/frames, because the modality list holds one value per prompt while a video can contain several frames. This PR fixes the demo notebook by replicating the modality entry for all frames of a video.
I looked into expanding the modalities inside the modeling code, but it seems hard to infer whether a given visual input is an image or a video, so I decided to delegate the expansion to users.
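
A minimal sketch of the user-side expansion described above, assuming the decoded video frames are already available and that the modality list is built per prompt; the variable names here are illustrative, not the notebook's exact ones:

```python
import numpy as np

# Stand-in for a decoded video clip of 16 frames (assumed shape for illustration).
video_frames = np.zeros((16, 384, 384, 3), dtype=np.uint8)

# Before the fix, the demo passed a single modality entry per prompt:
#   modalities = ["video"]
# so pooling was only applied once rather than for every frame.

# The fix replicates the entry once per video frame:
modalities = ["video"] * len(video_frames)

print(modalities)  # ['video', 'video', ..., 'video']  (16 entries)
```

The same replication would need to be repeated for each video in a batch, which is why doing it on the user side (where the image/video split is known) is simpler than inferring it inside the modeling code.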