VILA
YouCook2 code to generate video clips from raw videos?
The YouCook2 data repository (http://youcook2.eecs.umich.edu/download) only provides a script to download the raw videos into a folder .../youcook2/raw_videos/. However, the youcook_filtered_v3.json file has entries like
{
    "id": "TyR6QO1pVCo_4",
    "video": "TyR6QO1pVCo_4.mp4",
    "conversations": [
        {
            "from": "human",
            "value": "Create a compact narrative representing the video presented.\n<video>"
        },
        {
            "from": "gpt",
            "value": "pour the rice into a bowl"
        }
    ],
    "frame_count": 631,
    "fps": 29.97002997002997
}
and in data_mixtures.py, the definition of the youcook2 mixture references video files from the directory video_data_clipped.
Could you provide details on how you generated the clipped videos or provide the script used to do it? I'm guessing it was done by reading the youcookii_annotations_trainval.json file and using ffmpeg to split each raw video into the corresponding clip, but any confirmation/details would be helpful.
Yes, exactly! You can use the annotation file and ffmpeg to cut each raw video into smaller clips.
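For anyone landing here, a minimal sketch of that approach follows. It is not the script the VILA authors used. It assumes that youcookii_annotations_trainval.json has a top-level "database" dict keyed by video ID, that each entry holds an "annotations" list whose items carry an integer "id" and a "segment" of [start, end] in seconds, and that clips are named <video_id>_<segment id>.mp4 (a guess based on the TyR6QO1pVCo_4 example above). The function names, the re-encoding settings, and the assumption that every raw video is an .mp4 file are illustrative only.

```python
import json
import subprocess
from pathlib import Path


def build_clip_commands(annotations, raw_dir, out_dir, ext=".mp4"):
    """Return one ffmpeg argument list per annotated segment.

    Assumes annotations["database"][video_id]["annotations"] is a list of
    dicts with "id" and "segment" ([start, end] in seconds).
    """
    commands = []
    for video_id, entry in annotations["database"].items():
        # Assumption: raw files are named <video_id><ext>; adjust if the
        # download script saved other containers (e.g. .mkv or .webm).
        src = Path(raw_dir) / f"{video_id}{ext}"
        for ann in entry["annotations"]:
            start, end = ann["segment"]
            # Assumption: clip name is <video_id>_<segment id>.mp4.
            dst = Path(out_dir) / f"{video_id}_{ann['id']}.mp4"
            commands.append([
                "ffmpeg", "-y",
                "-i", str(src),
                "-ss", str(start),   # clip start, in seconds
                "-to", str(end),     # clip end, in seconds
                "-c:v", "libx264", "-c:a", "aac",  # re-encode for exact cuts
                str(dst),
            ])
    return commands


def clip_all(annotation_path, raw_dir, out_dir):
    """Load the annotation file and run ffmpeg once per segment."""
    with open(annotation_path) as f:
        annotations = json.load(f)
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for cmd in build_clip_commands(annotations, raw_dir, out_dir):
        subprocess.run(cmd, check=True)
```

Calling clip_all("youcookii_annotations_trainval.json", "raw_videos", "video_data_clipped") would then write one clip per annotated segment. Re-encoding is slower than stream copy but avoids cuts snapping to the nearest keyframe.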
Does VILA randomly sample frames and send them to the ViT?
Does it use all 631 frames directly for training?
Hi, we uniformly sample 8 frames from each video clip.
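To make "uniformly sample 8 frames" concrete, here is a small sketch of one common convention: evenly spaced indices that include the first and last frame. Whether VILA uses exactly these endpoints or rounding is an assumption, not something confirmed in this thread, and the function name is hypothetical.

```python
def uniform_frame_indices(frame_count, num_frames=8):
    """Return num_frames evenly spaced frame indices in [0, frame_count - 1].

    Illustrative convention only: first and last frames are included.
    If frame_count < num_frames, some indices repeat.
    """
    if frame_count <= 0 or num_frames <= 0:
        raise ValueError("frame_count and num_frames must be positive")
    if num_frames == 1:
        return [(frame_count - 1) // 2]  # single frame: take the middle one
    step = (frame_count - 1) / (num_frames - 1)
    return [round(i * step) for i in range(num_frames)]


print(uniform_frame_indices(631, 8))
# [0, 90, 180, 270, 360, 450, 540, 630]
```

For the 631-frame clip in the example above, this picks one frame every 90 frames, roughly every 3 seconds at 29.97 fps.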
@XueFuzhao Is it evenly sampling 8 out of the 631 frames in the example above? And how are the multiple images sent into s2-siglip? Thanks for any pointers.