
SimMIM: A Simple Framework for Masked Image Modeling

Open guarin opened this issue 3 years ago • 5 comments


18.11.2021 https://arxiv.org/abs/2111.09886 https://github.com/microsoft/SimMIM

Similar architecture to MAE, but it uses only a single linear layer as decoder instead of a transformer, passes both masked and non-masked tokens to the encoder, and uses an L1 instead of an L2 loss.
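As a minimal sketch of the loss difference, the SimMIM objective is an L1 reconstruction loss computed only on the masked patches (the function name and shapes below are illustrative, not Lightly's API):

```python
import torch


def simmim_l1_loss(pred, target, mask):
    """L1 loss averaged over masked patches only.

    pred, target: (B, num_patches, patch_dim) reconstructions and originals
    mask: (B, num_patches) with 1 for masked patches, 0 for visible ones
    """
    loss = (pred - target).abs()  # element-wise L1
    mask = mask.unsqueeze(-1)     # broadcast over patch_dim
    # normalize by the number of masked pixels, not all pixels
    return (loss * mask).sum() / (mask.sum() * pred.shape[-1] + 1e-8)
```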

[Screenshot attached, 2022-04-26 at 09:23:04]

Estimated effort to implement in Lightly: Low, once MAE is implemented

  • Add a linear decoder to MAE
  • Add an L1 loss to MAE
  • Pass all tokens to the encoder
  • It probably makes more sense to implement it using only ViT, as Swin is not implemented in torchvision.

guarin · Apr 26 '22 07:04

In Appendix E of the SimMIM paper, the authors also test with a ResNet-50, which shows performance above that of other self-supervised approaches using convolutional neural networks (BYOL etc.).

Will such a configuration also be possible with Lightly? Convnets have the advantage of being very well understood, offering good compute performance, and being easy to deploy, including to accelerator chips.

jonnor · Apr 28 '22 10:04

Great find! I completely missed this part of the paper.

It looks relatively simple to implement, but it is not part of the official code, which makes it a bit hard to assess whether there are any pitfalls. We could give it a try; it might even be easier to implement than the transformer version. We would just have to implement masking for ResNets and slightly adapt the forward pass.
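One plausible way to do the ResNet masking (a sketch under the assumption that, as in the paper's appendix, masked patches of the input image are replaced by a learned mask token before the convolutional forward pass; the class and parameter names are hypothetical):

```python
import torch
import torch.nn as nn


class MaskedConvInput(nn.Module):
    """Replaces masked input patches with a learned mask token."""

    def __init__(self, patch_size=32):
        super().__init__()
        self.patch_size = patch_size
        # one learnable value per channel, broadcast over each masked patch
        self.mask_token = nn.Parameter(torch.zeros(1, 3, 1, 1))

    def forward(self, images, mask):
        # images: (B, 3, H, W); mask: (B, H // p, W // p), 1 = masked
        p = self.patch_size
        mask = mask.repeat_interleave(p, 1).repeat_interleave(p, 2)
        mask = mask.unsqueeze(1).type_as(images)  # (B, 1, H, W)
        return images * (1 - mask) + self.mask_token * mask
```

The masked images can then be passed through an unmodified ResNet, with the reconstruction loss applied only on the masked patches.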

guarin · Apr 28 '22 15:04

Looks like masked autoencoders have now been merged (#799), which is maybe a good starting point for SimMIM?

jonnor · Jun 07 '22 11:06

Yes, this should be relatively easy to implement now if I remember our discussions correctly @guarin

philippmwirth · Jun 07 '22 11:06

Yes, I think we can just combine the MAEBackbone, a linear prediction head, and the L1 loss in a module to build the SimMIM model. I would propose adding this as a new module in the imagenette benchmark file. Then we can test it and see whether we want to do some refactoring or add building blocks.
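Such a module could be sketched roughly as below, assuming an encoder that takes images plus a mask and returns per-token embeddings (e.g. something like Lightly's MAEBackbone; the exact interface here is an assumption for illustration):

```python
import torch
import torch.nn as nn


class SimMIM(nn.Module):
    """Encoder over all tokens + single linear layer predicting raw patch pixels."""

    def __init__(self, encoder, embed_dim, patch_size=16, in_channels=3):
        super().__init__()
        # encoder: callable (images, mask) -> (B, num_patches, embed_dim);
        # its signature is assumed, not taken from Lightly's actual API
        self.encoder = encoder
        self.decoder = nn.Linear(embed_dim, patch_size**2 * in_channels)

    def forward(self, images, mask):
        tokens = self.encoder(images, mask)  # all tokens, visible and masked
        return self.decoder(tokens)          # per-patch pixel predictions
```

Training would then apply the L1 loss between the predictions and the original patch pixels on the masked positions only.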

Adding support for ResNets would involve more work, as we would have to write a new backbone that supports masking.

guarin · Jun 07 '22 12:06

Closed by: #1003

guarin · Dec 08 '22 16:12