VideoMAEv2
Turning VideoMAEv2 into a next-frame prediction model
Great work and thanks for the code!
I was just wondering how you see the chances that, with a proper masking strategy, one could do full next-frame prediction on an unseen video. This should apply to both VideoMAEv2 and VideoMAE, I guess. The masking strategy could simply be masking the whole last frame, given a set of unmasked preceding frames, and then obtaining the logits for the reconstructed masked frame. Do you think this is feasible?
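For concreteness, something along these lines is what I have in mind, sketched with the Hugging Face VideoMAE (v1) pre-training model rather than this repo's code; the checkpoint name and the "mask the last tubelet group" detail are just my assumptions, and the pretrained decoder never saw this mask pattern during training, so reconstruction quality is an open question:

```python
import numpy as np
import torch
from transformers import AutoImageProcessor, VideoMAEForPreTraining

num_frames = 16
# Stand-in clip; in practice this would be 16 frames from an unseen video.
video = list(np.random.randint(0, 256, (num_frames, 3, 224, 224)))

processor = AutoImageProcessor.from_pretrained("MCG-NJU/videomae-base")
model = VideoMAEForPreTraining.from_pretrained("MCG-NJU/videomae-base")
model.eval()

pixel_values = processor(video, return_tensors="pt").pixel_values

# Tokens are tubelets of tubelet_size frames x patch_size x patch_size pixels,
# so "masking the whole last frame" really means masking the last temporal group
# (the last tubelet_size frames).
patches_per_frame = (model.config.image_size // model.config.patch_size) ** 2
seq_length = (num_frames // model.config.tubelet_size) * patches_per_frame

bool_masked_pos = torch.zeros(1, seq_length, dtype=torch.bool)
bool_masked_pos[:, -patches_per_frame:] = True  # mask every token of the last group

with torch.no_grad():
    outputs = model(pixel_values, bool_masked_pos=bool_masked_pos)

# Reconstruction logits (normalized pixel patches) for the masked tokens.
print(outputs.logits.shape)
```

The VideoMAEv2 codebase would need the analogous mask built in its own pipeline, but the idea is the same: make the mask deterministic over the last temporal group instead of random tube masking.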
I've done similar experiments and achieved results similar to MAE's. Based on my limited experimental results: predicting features is easier to train than predicting pixels; the potential of this training approach may be higher than MAE's; and the resource overhead may be greater. There has been some similar (predictive or autoregressive) work recently, such as V-JEPA and AIM; you could look into those to learn more.
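To make the distinction concrete, here is a minimal, hypothetical sketch in plain PyTorch (not from the VideoMAE/VideoMAEv2 or V-JEPA codebases) of the two kinds of targets: regressing raw pixel patches of the masked tokens vs. regressing latent features produced by a frozen teacher encoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyPredictor(nn.Module):
    """Hypothetical predictor head mapping encoder tokens to a target space."""

    def __init__(self, dim_in: int, dim_out: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim_in, dim_in), nn.GELU(), nn.Linear(dim_in, dim_out))

    def forward(self, x):
        return self.net(x)


def pixel_target_loss(pred_patches, true_patches):
    # MAE/VideoMAE-style objective: regress (normalized) pixel patches of masked tokens.
    return F.mse_loss(pred_patches, true_patches)


def feature_target_loss(pred_features, teacher_features):
    # V-JEPA-style objective: regress latent features of the masked tokens as produced
    # by a frozen/EMA teacher encoder, instead of raw pixels.
    return F.l1_loss(pred_features, teacher_features.detach())


if __name__ == "__main__":
    B, n_masked, dim, patch_dim = 2, 196, 768, 1536  # 1536 = 2 * 16 * 16 * 3 pixels per tubelet patch
    student_tokens = torch.randn(B, n_masked, dim)         # decoder/predictor input for masked tokens

    pixel_head = TinyPredictor(dim, patch_dim)
    feature_head = TinyPredictor(dim, dim)

    true_patches = torch.randn(B, n_masked, patch_dim)     # stand-in for ground-truth pixel patches
    teacher_features = torch.randn(B, n_masked, dim)       # stand-in for teacher encoder output

    print("pixel loss:", pixel_target_loss(pixel_head(student_tokens), true_patches).item())
    print("feature loss:", feature_target_loss(feature_head(student_tokens), teacher_features).item())
```

In my experience the feature target converges more easily, but it also means keeping a second (teacher) encoder in memory, which is where the extra resource overhead comes from.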