Open-Sora caption_llava.py to caption videos. Only use 1 frame?

Open QiaoZhennn opened this issue 10 months ago • 3 comments

The README says captioning uses 3 frames.
But the example command in the README sets num_frames = 1 and prompt = image_3ex, which means it uses 1 frame per video?
The default args in the main function set prompt = video_1f_3ex, which also means 1 frame per video?

Apr 25 '24 00:04 QiaoZhennn

Yes, we only use 1 frame. In Open-Sora 1.0, we misunderstood the captioning configs; in practice we always use 1 frame per video. Open-Sora 1.1 also uses only one frame.

Apr 25 '24 11:04 zhengzangw
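
To make the difference between the 1-frame and 3-frame settings concrete, here is a minimal sketch of how frame indices could be chosen for captioning. It is not taken from caption_llava.py: the function name `sample_frame_indices` and the midpoint-of-segment strategy are assumptions for illustration only, and the actual script may select frames differently.

```python
from typing import List


def sample_frame_indices(total_frames: int, num_frames: int = 1) -> List[int]:
    """Pick evenly spaced frame indices (hypothetical helper, not Open-Sora code).

    The video is split into `num_frames` equal segments and the midpoint of
    each segment is returned, so num_frames=1 yields the middle frame.
    """
    if total_frames <= 0 or num_frames <= 0:
        raise ValueError("total_frames and num_frames must be positive")
    # Never ask for more frames than the video contains.
    num_frames = min(num_frames, total_frames)
    segment = total_frames / num_frames
    return [int(segment * (i + 0.5)) for i in range(num_frames)]


if __name__ == "__main__":
    print(sample_frame_indices(90, 1))  # one frame per video
    print(sample_frame_indices(90, 3))  # three frames per video
```

With one frame, the captioner only ever sees a single still from the clip, which is the limitation raised in the next comment.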

How are you hoping to generate a caption that describes the spatial and temporal features of a clip using 1 frame?

Apr 25 '24 20:04 yumlevi

For datasets that already have captions (e.g., Panda-70M), we use those captions directly. Single-frame captioning can achieve good results for short videos.

Apr 26 '24 02:04 zhengzangw
