
YouCook2 code to generate video clips from raw videos?

Open hubenjm opened this issue 1 year ago • 4 comments

The youcook2 data repository (http://youcook2.eecs.umich.edu/download) only provides a script to download the raw videos into a folder .../youcook2/raw_videos/. However, the youcook_filtered_v3.json file has entries like

{
        "id": "TyR6QO1pVCo_4",
        "video": "TyR6QO1pVCo_4.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "Create a compact narrative representing the video presented.\n<video>"
            },
            {
                "from": "gpt",
                "value": "pour the rice into a bowl"
            }
        ],
        "frame_count": 631,
        "fps": 29.97002997002997
}

and in data_mixtures.py, the definition of the youcook2 mixture references video files from the directory video_data_clipped.

Could you provide details on how you generated the clipped videos or provide the script used to do it? I'm guessing it was done by reading the youcookii_annotations_trainval.json file and using ffmpeg to split each raw video into the corresponding clip, but any confirmation/details would be helpful.

hubenjm avatar May 13 '24 19:05 hubenjm

Yes, exactly! You can use the annotation file and ffmpeg to split each raw video into smaller clips.
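A minimal sketch of that approach, in case it helps. This is not the script that produced video_data_clipped; it rests on several assumptions: that youcookii_annotations_trainval.json has a top-level "database" dict keyed by video ID, that each entry has an "annotations" list with "segment" ([start, end] in seconds) and "id" fields, that raw files are named <video_id>.mp4 directly under the raw folder, and that clips are named <video_id>_<segment id>.mp4 (inferred from "TyR6QO1pVCo_4.mp4" above). The function names are hypothetical.

```python
import json
import subprocess
from pathlib import Path


def build_clip_commands(annotations, raw_dir, out_dir):
    """Return one ffmpeg argument list per annotated segment."""
    commands = []
    for video_id, info in annotations["database"].items():
        # Assumption: raw files are <video_id>.mp4; adjust for .mkv/.webm
        # or for nested subset/recipe folders.
        src = Path(raw_dir) / f"{video_id}.mp4"
        for seg in info["annotations"]:
            start, end = seg["segment"]  # assumed to be in seconds
            # Assumption: clip name is <video_id>_<segment id>.mp4
            dst = Path(out_dir) / f"{video_id}_{seg['id']}.mp4"
            commands.append([
                "ffmpeg", "-y",
                "-ss", str(start),        # seek to segment start
                "-i", str(src),
                "-t", str(end - start),   # keep only the segment duration
                "-c:v", "libx264", "-c:a", "aac",  # re-encode for accurate cuts
                str(dst),
            ])
    return commands


def clip_all(annotation_path, raw_dir, out_dir):
    """Load the annotation file and run ffmpeg once per segment."""
    with open(annotation_path) as f:
        annotations = json.load(f)
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for cmd in build_clip_commands(annotations, raw_dir, out_dir):
        subprocess.run(cmd, check=True)
```

Re-encoding is slower than stream copy (`-c copy`), but stream copy can only cut on keyframes, so short clips may start or end at the wrong place.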

XueFuzhao avatar May 15 '24 06:05 XueFuzhao

Does VILA randomly sample frames and send them to the ViT?

Does it use all 631 frames directly for training?

lucasjinreal avatar May 17 '24 08:05 lucasjinreal

Hi, we uniformly sample 8 frames for each video clip.
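For illustration, one common way to pick 8 evenly spaced frames from a clip looks like the sketch below. The exact indexing VILA uses (for example, whether the first and last frames are included) is not stated here, so this is an assumption, and `uniform_frame_indices` is a hypothetical helper name.

```python
def uniform_frame_indices(frame_count, num_frames=8):
    """Evenly spaced frame indices from the first to the last frame, inclusive."""
    if num_frames == 1:
        return [0]
    # Spacing between consecutive sampled frames (may be fractional).
    step = (frame_count - 1) / (num_frames - 1)
    return [round(i * step) for i in range(num_frames)]


# For the 631-frame clip in the example above:
print(uniform_frame_indices(631))
# [0, 90, 180, 270, 360, 450, 540, 630]
```

If a clip has fewer frames than `num_frames`, this returns repeated indices rather than failing.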

XueFuzhao avatar May 17 '24 08:05 XueFuzhao

@XueFuzhao is that evenly sampling 8 out of the 631 frames in the example above? How are the multiple images sent into s2-siglip? Thanks for the pointers.

lucasjinreal avatar May 17 '24 09:05 lucasjinreal