Compact-Transformers

About Masked Autoencoder (MAE) pre-training

ZK-Zhou opened this issue 2 years ago · 2 comments

Thank you for your work and code. Have you ever tried MAE-like pre-training on CCT? I tried an initial version, but it doesn't seem to work well. Could you give me some guidance? Thanks!

ZK-Zhou avatar Mar 23 '23 12:03 ZK-Zhou

We have not tried this, nor some of the other pre-training techniques, so we'd be curious about your results. Our paper is more about demonstrating that ViTs can work on small datasets without pre-training of any form (including transfer learning) and can be trained effectively from scratch. Essentially, we create a better "tokenization" and embedding for ViTs. The idea is to show that ViTs don't need big compute, big data, or complex training schemes to be useful.

That said, the same pre-training techniques should apply. For MAE, you'd need to create a decoder network to do the pixel reconstruction. Typically autoencoders have symmetric encoders and decoders, but MAE doesn't: the encoder has a depth of 24 while the decoder has a depth of 8, and they use different embedding dimensions, with the decoder's being half the encoder's. If you'd like to perform the same pre-training as them, I suggest carefully looking at FAIR's repo; they have instructions for this pre-training. But most importantly, you'd need to build the decoder (which is discarded after pre-training).
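As a rough illustration, here is a minimal sketch of what such an MAE-style decoder could look like on top of a CCT encoder's output tokens. This is not code from our repo or from FAIR's; all names, dimensions, patch size, and token count are illustrative assumptions, and you'd want to match them to your CCT configuration and to the reference recipe in FAIR's MAE repo.

```python
# Minimal MAE-style decoder sketch (assumed shapes/dims, not the official implementation).
import torch
import torch.nn as nn

class MAEDecoder(nn.Module):
    def __init__(self, enc_dim=384, dec_dim=192, depth=8, num_heads=6,
                 num_tokens=196, patch_size=16, in_chans=3):
        super().__init__()
        # Project encoder tokens down to the (narrower) decoder width.
        self.proj = nn.Linear(enc_dim, dec_dim)
        # Learnable token that stands in for every masked patch.
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dec_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_tokens, dec_dim))
        # Shallow decoder: depth is much smaller than the encoder's.
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model=dec_dim, nhead=num_heads,
                                       dim_feedforward=dec_dim * 4,
                                       batch_first=True, norm_first=True)
            for _ in range(depth)
        ])
        self.norm = nn.LayerNorm(dec_dim)
        # Predict raw pixels for each patch.
        self.head = nn.Linear(dec_dim, patch_size * patch_size * in_chans)

    def forward(self, visible_tokens, ids_restore):
        # visible_tokens: (B, N_visible, enc_dim) output of the encoder on unmasked tokens
        # ids_restore:    (B, N_total) indices that undo the masking shuffle
        x = self.proj(visible_tokens)
        B, n_vis, D = x.shape
        n_masked = ids_restore.shape[1] - n_vis
        # Append mask tokens, then scatter everything back to the original order.
        x = torch.cat([x, self.mask_token.expand(B, n_masked, -1)], dim=1)
        x = torch.gather(x, 1, ids_restore.unsqueeze(-1).expand(-1, -1, D))
        x = x + self.pos_embed
        for blk in self.blocks:
            x = blk(x)
        # (B, N_total, patch_size**2 * in_chans); the loss is computed on masked patches only.
        return self.head(self.norm(x))
```

One thing that may simplify adaptation: CCT uses sequence pooling rather than a class token, so there's no cls token to carry through the masking and restore steps.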

stevenwalton avatar Mar 24 '23 21:03 stevenwalton

Hi, thank you for your kind reply! The main problems with MAE-style pre-training on CCT are information leakage caused by the convolutional layers that precede the transformer and information loss caused by the max-pooling layer. So I am trying to improve CCT-MAE. Have a nice day!

ZK-Zhou avatar Mar 25 '23 02:03 ZK-Zhou