VideoMAEv2
Turning VideoMAEv2 into a next-frame prediction model
Great work and thanks for the code!
I was just wondering how you see the chances that, with a proper masking strategy, one could do full next-frame prediction on an unseen video. This should apply to both VideoMAEv2 and VideoMAE, I guess. The masking strategy could simply be masking the whole last frame, given a set of unmasked preceding frames, and then obtaining the logits for the reconstructed masked frame. Do you think this is feasible?
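For concreteness, something along these lines is what I have in mind, sketched with the Hugging Face VideoMAE (v1) pre-training model rather than this repo's code; the checkpoint name and the "mask the last tubelet group" detail are just my assumptions, and the pretrained decoder never saw this mask pattern during training, so reconstruction quality is an open question:

```python
import numpy as np
import torch
from transformers import AutoImageProcessor, VideoMAEForPreTraining

num_frames = 16
# Stand-in clip; in practice this would be 16 frames from an unseen video.
video = list(np.random.randint(0, 256, (num_frames, 3, 224, 224)))

processor = AutoImageProcessor.from_pretrained("MCG-NJU/videomae-base")
model = VideoMAEForPreTraining.from_pretrained("MCG-NJU/videomae-base")
model.eval()

pixel_values = processor(video, return_tensors="pt").pixel_values

# Tokens are tubelets of tubelet_size frames x patch_size x patch_size pixels,
# so "masking the whole last frame" really means masking the last temporal group
# (the last tubelet_size frames).
patches_per_frame = (model.config.image_size // model.config.patch_size) ** 2
seq_length = (num_frames // model.config.tubelet_size) * patches_per_frame

bool_masked_pos = torch.zeros(1, seq_length, dtype=torch.bool)
bool_masked_pos[:, -patches_per_frame:] = True  # mask every token of the last group

with torch.no_grad():
    outputs = model(pixel_values, bool_masked_pos=bool_masked_pos)

# Reconstruction logits (normalized pixel patches) for the masked tokens.
print(outputs.logits.shape)
```

The VideoMAEv2 codebase would need the analogous mask built in its own pipeline, but the idea is the same: make the mask deterministic over the last temporal group instead of random tube masking.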
I've done similar experiments and achieved results similar to MAE's. Based on my limited experimental results: predicting features is easier to train than predicting pixels; the potential of this training approach may be higher than MAE's; and the resource overhead may be greater. There has been some similar (predictive or autoregressive) work recently, such as V-JEPA and AIM; you could look into those to learn more.
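To make the distinction concrete, here is a minimal, hypothetical sketch in plain PyTorch (not from the VideoMAE/VideoMAEv2 or V-JEPA codebases) of the two kinds of targets: regressing raw pixel patches of the masked tokens vs. regressing latent features produced by a frozen teacher encoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyPredictor(nn.Module):
    """Hypothetical predictor head mapping encoder tokens to a target space."""

    def __init__(self, dim_in: int, dim_out: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim_in, dim_in), nn.GELU(), nn.Linear(dim_in, dim_out))

    def forward(self, x):
        return self.net(x)


def pixel_target_loss(pred_patches, true_patches):
    # MAE/VideoMAE-style objective: regress (normalized) pixel patches of masked tokens.
    return F.mse_loss(pred_patches, true_patches)


def feature_target_loss(pred_features, teacher_features):
    # V-JEPA-style objective: regress latent features of the masked tokens as produced
    # by a frozen/EMA teacher encoder, instead of raw pixels.
    return F.l1_loss(pred_features, teacher_features.detach())


if __name__ == "__main__":
    B, n_masked, dim, patch_dim = 2, 196, 768, 1536  # 1536 = 2 * 16 * 16 * 3 pixels per tubelet patch
    student_tokens = torch.randn(B, n_masked, dim)         # decoder/predictor input for masked tokens

    pixel_head = TinyPredictor(dim, patch_dim)
    feature_head = TinyPredictor(dim, dim)

    true_patches = torch.randn(B, n_masked, patch_dim)     # stand-in for ground-truth pixel patches
    teacher_features = torch.randn(B, n_masked, dim)       # stand-in for teacher encoder output

    print("pixel loss:", pixel_target_loss(pixel_head(student_tokens), true_patches).item())
    print("feature loss:", feature_target_loss(feature_head(student_tokens), teacher_features).item())
```

In my experience the feature target converges more easily, but it also means keeping a second (teacher) encoder in memory, which is where the extra resource overhead comes from.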