MAE-pytorch
Need to interpolate positional embedding to work at higher resolutions
Hi again, sorry for the slow response in issue #26. I have some more clarifications and visualizations here.
I agree that the sine-cosine embeddings are not learnable. However, it seems like they still need to be interpolated for the model to work well. I suspect that this is at least partially because they are 1D, so the model has to learn the number of rows/columns. E.g. it cannot express "look one patch down" directly, but rather has to express it as "look X patches forward", where X is the number of patches per row. And X changes if we change the resolution.
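To make the "X changes" point concrete, here is a small sketch of the arithmetic. It assumes row-major flattening of the patch grid and a patch size of 16; the helper names are made up for illustration and are not from the repository.

```python
def flat_index(row, col, num_cols):
    """Index of patch (row, col) after row-major flattening of the grid."""
    return row * num_cols + col


def offset_one_patch_down(image_size, patch_size=16):
    """How far 'one patch down' is in the flattened 1D sequence.

    Assumes a square image whose side is a multiple of patch_size.
    """
    num_cols = image_size // patch_size
    return flat_index(1, 0, num_cols) - flat_index(0, 0, num_cols)


for size in (224, 448):
    # The same 2D move maps to a different 1D offset at each resolution.
    print(size, offset_one_patch_down(size))
```

At 224 px the offset is 14 positions, at 448 px it is 28, so a 1D relation learned at one resolution points at a different patch at the other.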
I have attached attention visualizations that show what happens if you run at a higher resolution with or without interpolating the positional embedding. As you can see, the non-interpolated version looks much worse and has weird diagonal stripes.
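For reference, a minimal sketch of one way to do such an interpolation in PyTorch. It assumes the embedding has shape `(1, H*W, dim)` in row-major order with no class token, and it uses bicubic resizing as an arbitrary choice; `interpolate_pos_embed` is a hypothetical helper, not a function from the repository.

```python
import torch
import torch.nn.functional as F


def interpolate_pos_embed(pos_embed, old_grid, new_grid):
    """Resize a flattened positional embedding to a new patch grid.

    pos_embed: tensor of shape (1, old_h * old_w, dim), no class token (assumption).
    old_grid, new_grid: (height, width) in patches.
    """
    old_h, old_w = old_grid
    new_h, new_w = new_grid
    _, num_positions, dim = pos_embed.shape
    if num_positions != old_h * old_w:
        raise ValueError("pos_embed length does not match old_grid")

    # (1, N, dim) -> (1, dim, old_h, old_w) so the grid can be resized like an image.
    grid = pos_embed.reshape(1, old_h, old_w, dim).permute(0, 3, 1, 2)
    grid = F.interpolate(grid, size=(new_h, new_w), mode="bicubic", align_corners=False)
    # Back to the flattened (1, new_h * new_w, dim) layout.
    return grid.permute(0, 2, 3, 1).reshape(1, new_h * new_w, dim)


# Example: a 14x14 grid (224 px, patch 16) resized to 28x28 (448 px).
pos = torch.randn(1, 14 * 14, 8)
resized = interpolate_pos_embed(pos, (14, 14), (28, 28))
print(resized.shape)
```

The point of going through the 2D grid is that "one patch down" in the original table stays "one patch down" after resizing, instead of shifting along the flattened sequence.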
This is not a major issue for me, but I wanted to let you (and anyone else who has the same problem) know about it. I think the best solution is what I mentioned before: simply include the positional embeddings in the checkpoint even though they are not learnable parameters.
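One possible way to get a non-learnable tensor into the checkpoint is a persistent buffer. The sketch below uses a made-up `TinyEncoder` module and a zero tensor standing in for the sine-cosine table; it only illustrates `register_buffer` and is not how the repository currently builds its model.

```python
import torch
import torch.nn as nn


class TinyEncoder(nn.Module):
    """Hypothetical stand-in for the encoder, only to show the buffer."""

    def __init__(self, num_patches=196, dim=8):
        super().__init__()
        # Placeholder for the fixed sine-cosine table (assumed shape).
        table = torch.zeros(1, num_patches, dim)
        # A persistent buffer is saved in state_dict but is not a parameter.
        self.register_buffer("pos_embed", table, persistent=True)


model = TinyEncoder()
print("pos_embed" in model.state_dict())  # saved with the checkpoint
print(len(list(model.parameters())))      # still no learnable parameters
```

With the table stored in the checkpoint, loading code can see the resolution the model was trained at and resize the embedding from there.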
Original:
With interpolation:
Without interpolation: