LLaVA-NeXT
Fix: videos in LLaVa-OV
Currently, running the demo notebook for LLaVA OneVision with the video modality doesn't apply pooling to all video patches/frames, because the modality list holds one value per prompt while a video can contain several frames. This PR fixes the demo notebook by replicating the modality entry for all frames of a video.
I looked into expanding the modalities inside the modeling code, but it seems hard to infer whether a given visual input is an image or a video, so I decided to delegate the expansion to users.
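
A minimal sketch of the user-side expansion described above, assuming the decoded video frames are already available and that the modality list is built per prompt; the variable names here are illustrative, not the notebook's exact ones:

```python
import numpy as np

# Stand-in for a decoded video clip of 16 frames (assumed shape for illustration).
video_frames = np.zeros((16, 384, 384, 3), dtype=np.uint8)

# Before the fix, the demo passed a single modality entry per prompt:
#   modalities = ["video"]
# so pooling was only applied once rather than for every frame.

# The fix replicates the entry once per video frame:
modalities = ["video"] * len(video_frames)

print(modalities)  # ['video', 'video', ..., 'video']  (16 entries)
```

The same replication would need to be repeated for each video in a batch, which is why doing it on the user side (where the image/video split is known) is simpler than inferring it inside the modeling code.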