
Need to interpolate positional embedding to work at higher resolutions


Hi again, sorry for the slow response in issue #26. I have some more clarifications and visualizations here.

I agree that the sine-cosine embeddings are not learnable. However, it seems that they still need to be interpolated for the model to work well. I suspect this is at least partly because they are 1D, so the model has to learn the number of rows/columns implicitly. For example, it cannot express "look one patch down" directly, but instead has to express it as "look X patches forward", and X changes if the resolution changes. A sketch of the interpolation is below.
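For reference, here is a minimal sketch of the kind of interpolation I mean (not this repo's actual code; the function name and the assumption of a `[1, 1 + H*W, D]` layout with a leading class token are illustrative). It reshapes the patch embeddings back into a 2D grid, resamples them bicubically to the new grid size, and flattens them again:

```python
# Hypothetical sketch: resample a fixed 2D sine-cosine positional embedding
# to a new patch-grid size. Assumes pos_embed has shape [1, 1 + H*W, D]
# with the class-token embedding first.
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed: torch.Tensor,
                          old_grid: int, new_grid: int) -> torch.Tensor:
    cls_token = pos_embed[:, :1]   # keep the class-token embedding unchanged
    patch_pos = pos_embed[:, 1:]   # [1, old_grid**2, D]
    dim = patch_pos.shape[-1]
    # Restore the 2D grid layout so spatial interpolation is meaningful.
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid),
                              mode="bicubic", align_corners=False)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
    return torch.cat([cls_token, patch_pos], dim=1)
```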

I have attached attention visualizations that show what happens when you run at higher resolution with or without interpolating the positional embedding. As you can see, the non-interpolated version looks much worse and shows strange diagonal stripes.

This is not a major issue for me, but I wanted to let you (and anyone else who runs into the same problem) know about it. I think the best solution is what I mentioned before: simply include the positional embeddings in the checkpoint even though they are not learnable parameters, as sketched below.
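In PyTorch this could be as simple as registering the fixed table as a persistent buffer, so it ends up in `state_dict` without being a trainable parameter. A minimal sketch (class and attribute names are illustrative, not this repo's code):

```python
# Hypothetical sketch: store the fixed sin-cos table as a persistent buffer
# so it is saved/loaded with the checkpoint despite not being learnable.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, num_patches: int, dim: int):
        super().__init__()
        pos = torch.zeros(1, num_patches + 1, dim)  # filled with the sin-cos table
        # persistent=True (the default) includes the buffer in state_dict,
        # so the table used at training resolution travels with the checkpoint.
        self.register_buffer("pos_embed", pos, persistent=True)
```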

Attached attention visualizations:

- Original resolution: original_res
- Higher resolution, with interpolation: with_interp
- Higher resolution, without interpolation: without_interp

atonderski · Nov 30, 2021