jepa
long video
Dear V-JEPA team,
Thank you for sharing this great work; I really enjoyed it.
If I understand correctly, the model is trained only on 16-frame clips (around 3 s of video after frame skipping). Does it work on longer videos with many more frames (>60 frames; >10 s or >30 s), or do I need to fine-tune it?
Thank you for your help.
Best Wishes,
Zongze
For longer videos, the authors split the video into several clips, each longer than 3 seconds, and randomly sample a 64-frame slice from each clip. So you would run the model on each clip separately, concatenate the clip-level latents, and use the concatenated result as the representation of the whole video.
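To make the per-clip procedure concrete, here is a minimal sketch in NumPy. It is not the repository's code: `sample_clip_indices`, `encode_long_video`, and `toy_encoder` are hypothetical names, and the stride of 4 (which turns a 64-frame slice into 16 input frames) is my assumption for reconciling the 16-frame input mentioned in the question with the 64-frame slice. Swap `toy_encoder` for the real model's forward pass.

```python
import numpy as np


def sample_clip_indices(num_frames, num_clips, window=64, stride=4, rng=None):
    """Split a video into equal segments and pick one random window per segment.

    Returns a list of index arrays, one per clip. With window=64 and stride=4
    (assumed values), each array holds 16 frame indices.
    """
    rng = np.random.default_rng() if rng is None else rng
    if num_frames < num_clips * window:
        raise ValueError("video too short for this many clips")
    # Segment boundaries: segment i covers frames [bounds[i], bounds[i + 1]).
    bounds = np.linspace(0, num_frames, num_clips + 1).astype(int)
    clips = []
    for start, end in zip(bounds[:-1], bounds[1:]):
        # Random window start such that the whole window stays in the segment.
        offset = rng.integers(start, end - window + 1)
        clips.append(np.arange(offset, offset + window, stride))
    return clips


def encode_long_video(frames, encoder, num_clips, rng=None):
    """Encode each clip independently and concatenate the clip-level latents."""
    clip_indices = sample_clip_indices(len(frames), num_clips, rng=rng)
    latents = [encoder(frames[idx]) for idx in clip_indices]
    return np.concatenate(latents, axis=0)


def toy_encoder(clip):
    """Stand-in for the real encoder: one feature vector per input frame."""
    # clip has shape (T, H, W, C); average over the spatial dimensions.
    return clip.mean(axis=(1, 2))


if __name__ == "__main__":
    video = np.zeros((400, 4, 4, 3), dtype=np.float32)  # 400 dummy frames
    features = encode_long_video(video, toy_encoder, num_clips=5,
                                 rng=np.random.default_rng(0))
    print(features.shape)
```

Whether the concatenated latents work well without fine-tuning depends on what consumes them downstream, so treat this as a starting point to experiment with rather than a confirmed recipe.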