Open-Sora
Does caption_llava.py use only 1 frame to caption videos?
- The README says captioning uses 3 frames.
- But the example command in the README uses num_frames = 1 and prompt = image_3ex. Does that mean it uses 1 frame per video?
- The default args in the main function use prompt = video_1f_3ex. Does that also mean 1 frame per video?
Yes, we only use 1 frame. In Open-Sora 1.0, we misunderstood the captioning configs; in practice we were always using 1 frame per video. Open-Sora 1.1 also uses only one frame.
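To make the 1-frame vs. 3-frame difference concrete, here is a minimal sketch of evenly spaced frame selection. This is not the code in caption_llava.py; the function name `sample_frame_indices` and the choice to take the center of each equal-length segment are assumptions made purely for illustration.

```python
from typing import List


def sample_frame_indices(total_frames: int, num_frames: int) -> List[int]:
    """Hypothetical helper: pick num_frames indices at the centers of
    equal-length segments of a video with total_frames frames."""
    if total_frames <= 0 or num_frames <= 0:
        raise ValueError("total_frames and num_frames must be positive")
    # Never ask for more frames than the video contains.
    num_frames = min(num_frames, total_frames)
    segment = total_frames / num_frames
    return [int(segment * (i + 0.5)) for i in range(num_frames)]


# A 90-frame clip: one frame gives only the middle of the clip,
# three frames give one from each third.
print(sample_frame_indices(90, 1))  # [45]
print(sample_frame_indices(90, 3))  # [15, 45, 75]
```

With num_frames = 1 the captioner would see a single image from the middle of the clip, so any motion before or after that point cannot show up in the caption.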
How are you hoping to generate a caption that describes the spatial and temporal features of a clip using 1 frame?
For datasets that already have captions (e.g., Panda-70M), we use those captions directly. A single-frame caption can achieve good results for short videos.