VideoMAE
Why use the original mean and var of each patch when visualizing the reconstruction video?
When I set a high mask ratio, even content that should be unpredictable still gets roughly reconstructed.
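Because the target of the reconstruction task is the per-patch normalized pixels rather than the raw pixels. The model's output therefore lives in normalized space, and each patch's original mean and variance are needed to map it back to pixel values for display.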
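To make this concrete, here is a minimal NumPy sketch of per-patch normalization and its inverse. It is not the repository's code: the function names, the `eps` value, and the use of population variance are assumptions for illustration only. It also shows why heavily masked content can still look roughly right: even an uninformative prediction (all zeros in normalized space) is mapped back to the original patch mean.

```python
import numpy as np

def normalize_patches(patches, eps=1e-6):
    """Normalize each patch (row) to zero mean and unit variance.

    patches: array of shape (num_patches, patch_dim).
    eps is an assumed small constant for numerical stability.
    """
    mean = patches.mean(axis=-1, keepdims=True)
    std = np.sqrt(patches.var(axis=-1, keepdims=True) + eps)
    return (patches - mean) / std, mean, std

def denormalize_patches(pred, mean, std):
    """Map predictions from normalized space back to pixel space."""
    return pred * std + mean

rng = np.random.default_rng(0)
patches = rng.uniform(0.0, 1.0, size=(4, 16))  # toy "pixel" patches

target, mean, std = normalize_patches(patches)

# A perfect prediction of the normalized target recovers the pixels.
recon = denormalize_patches(target, mean, std)

# An uninformative prediction (zeros) still recovers each patch's mean,
# because the mean comes from the original video, not from the model.
blank = denormalize_patches(np.zeros_like(target), mean, std)
```

In this sketch `blank` is a flat patch at the original mean brightness, so part of what looks like a rough prediction in the visualization can come from the restored statistics rather than from the model.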