segment-anything
How is the ViTDet backbone pretrained with MAE?
In the paper it is mentioned that an image encoder pretrained using MAE is used. I just want to understand how the network is pretrained with MAE when the window size is (14, 14). Do we pretrain with a window size of (0, 0) and then fine-tune with (14, 14)?
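
For reference, this is a rough sketch of the two configurations I am asking about, using `segment_anything.modeling.image_encoder.ImageEncoderViT`. The ViT-B values are my reading of `build_sam.py`; the second, fully global configuration is only my assumption of what the MAE-pretraining setup would look like.

```python
# Sketch of the two encoder configurations in question (not the official
# training code). ViT-B sizes follow build_sam.py; the "plain" config is
# my assumption for the MAE pretraining stage.
from functools import partial

import torch
import torch.nn as nn

from segment_anything.modeling.image_encoder import ImageEncoderViT

# SAM's ViT-B image encoder as built in build_sam.py: 14x14 windowed
# attention in most blocks, global attention only at indexes 2, 5, 8, 11.
sam_vit_b_encoder = ImageEncoderViT(
    img_size=1024,
    patch_size=16,
    embed_dim=768,
    depth=12,
    num_heads=12,
    mlp_ratio=4.0,
    out_chans=256,
    qkv_bias=True,
    norm_layer=partial(nn.LayerNorm, eps=1e-6),
    use_rel_pos=True,
    window_size=14,
    global_attn_indexes=(2, 5, 8, 11),
)

# What I assume "window size (0, 0)" would mean: window_size=0 disables
# windowing in this implementation, so every block uses global attention,
# i.e. a plain ViT as used for MAE pretraining (my assumption).
plain_vit_b_encoder = ImageEncoderViT(
    img_size=1024,
    patch_size=16,
    embed_dim=768,
    depth=12,
    num_heads=12,
    mlp_ratio=4.0,
    out_chans=256,
    qkv_bias=True,
    norm_layer=partial(nn.LayerNorm, eps=1e-6),
    use_rel_pos=True,
    window_size=0,
    global_attn_indexes=(),
)

with torch.no_grad():
    x = torch.randn(1, 3, 1024, 1024)
    print(sam_vit_b_encoder(x).shape)  # torch.Size([1, 256, 64, 64])
```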
Thanks