long video

Open • betterze opened this issue 1 year ago • 1 comment

Dear V-Jepa team,

Thank you for sharing this great work; I really enjoyed it.

If I understand correctly, the model is trained only on 16-frame clips (after frame skipping, roughly 3 s of video). Does it work on longer videos with many more frames (>60 frames; >10 s or >30 s), or do I need to fine-tune it?

Thank you for your help.

Best Wishes,

Zongze

Mar 15 '24 18:03 betterze

For longer videos, the authors split the video into several clips, each longer than 3 seconds, and randomly sampled a 64-frame slice from each clip. So you would run the model on each clip, concatenate the clip-level latents, and use the concatenated result.

Apr 30 '24 00:04 sumo43
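
The clip-and-concatenate approach described above can be sketched roughly as follows. This is a minimal illustration, not code from the V-JEPA repository: `sample_clip_indices`, `encode_long_video`, and `dummy_encoder` are hypothetical names, the encoder is a stand-in for the real model, and the frame skip of 4 is an assumption derived from picking 16 frames out of a 64-frame slice.

```python
import numpy as np


def sample_clip_indices(num_frames, num_clips, window=64, frames_per_clip=16, rng=None):
    """Split a video into `num_clips` equal segments and, in each segment,
    pick a random `window`-frame slice subsampled to `frames_per_clip` frames.

    Returns an int array of shape (num_clips, frames_per_clip).
    """
    rng = np.random.default_rng() if rng is None else rng
    if num_frames < num_clips * window:
        raise ValueError("video is too short for the requested number of clips")
    stride = window // frames_per_clip  # assumed frame skip (64 // 16 = 4)
    # Segment boundaries; every segment is at least `window` frames long.
    bounds = np.linspace(0, num_frames, num_clips + 1).astype(int)
    clips = []
    for seg_start, seg_end in zip(bounds[:-1], bounds[1:]):
        # Random start so that the whole slice stays inside the segment.
        start = rng.integers(seg_start, seg_end - window + 1)
        clips.append(start + stride * np.arange(frames_per_clip))
    return np.stack(clips)


def encode_long_video(video, encoder, num_clips, rng=None):
    """Encode each sampled clip separately and concatenate the clip-level latents.

    `video` is an array of shape (num_frames, H, W, C); `encoder` maps one clip
    of shape (frames_per_clip, H, W, C) to a 2-D array (tokens, dim).
    """
    indices = sample_clip_indices(len(video), num_clips, rng=rng)
    latents = [encoder(video[idx]) for idx in indices]
    return np.concatenate(latents, axis=0)


def dummy_encoder(clip):
    """Hypothetical stand-in for the real model: one 'latent' per frame,
    computed as the per-channel mean over the spatial dimensions."""
    return clip.mean(axis=(1, 2))
```

With a real model, `dummy_encoder` would be replaced by a call to the pretrained encoder; whether the concatenated latents work well without fine-tuning a probe on top is something to check empirically.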