Tianhong Li
> For Figure 1 in the paper, as mentioned in the caption, "the mask for MAGE is on semantic tokens whereas that of MAE is on patches in the input...
The mask of MAGE is always on tokens -- even when the original mask is specified on pixels (as in the inpainting scenario), we need to convert it into masking...
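In case it is useful, here is a minimal sketch of what I mean by converting a pixel mask to a token mask (the downsample factor of 16 and the "any masked pixel masks the whole token" rule are assumptions for illustration, not necessarily what the released code does):

```python
import torch
import torch.nn.functional as F

def pixel_mask_to_token_mask(pixel_mask, downsample=16):
    """pixel_mask: [B, H, W] binary mask (1 = masked pixel).

    Returns a token-level mask of shape [B, H // downsample, W // downsample];
    a token is marked as masked if any pixel inside its patch is masked.
    """
    pixel_mask = pixel_mask.float().unsqueeze(1)                    # [B, 1, H, W]
    token_mask = F.max_pool2d(pixel_mask, kernel_size=downsample)   # [B, 1, h, w]
    return token_mask.squeeze(1).bool()
```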
Thanks for your interest! The masking ratio is left-truncated at 0.5 so that we can always drop 50% of the input tokens in the ViT encoder, which largely saves...
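A rough sketch of the left-truncated sampling (the Gaussian parameters below are placeholders; the point is only that `mr >= 0.5` always holds, so half of the input tokens can always be dropped):

```python
import numpy as np

def sample_masking_ratio(mean=0.55, std=0.25, lo=0.5, hi=1.0):
    """Sample a masking ratio from a Gaussian left-truncated at `lo`
    via rejection sampling. Because the result is always >= 0.5, the
    ViT encoder can always drop 50% of the input tokens."""
    while True:
        mr = np.random.normal(mean, std)
        if lo <= mr <= hi:
            return mr
```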
Our evaluation protocol is based on both FID and linear probing accuracy -- once we train a model with a given set of hyper-parameters, we evaluate it on ImageNet and pick the best...
We tried using the CLS token. However, the performance is not very stable -- it normally achieves performance similar to average-pooled features, but occasionally it gets very poor accuracy (~10%)...
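For concreteness, the two options look roughly like this (the `[B, 1 + N, D]` layout with CLS at index 0 is an assumption about the feature shape, not an excerpt from the code):

```python
import torch

def probe_feature(encoder_tokens, use_cls=False):
    """encoder_tokens: [B, 1 + N, D] with the CLS token at index 0.

    For linear probing, average-pooled patch features were more stable
    than the CLS token in our experiments."""
    if use_cls:
        return encoder_tokens[:, 0]            # CLS token, [B, D]
    return encoder_tokens[:, 1:].mean(dim=1)   # average-pooled patches, [B, D]
```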
The smallest batch size I tested is 1024, which gives similar performance. Since the learning rate is scaled w.r.t. the batch size, I guess the performance will not...
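The scaling I am referring to is the usual linear rule from MAE-style codebases (the base batch size of 256 is that convention, so treat it as an assumption rather than an exact excerpt from our config):

```python
def scaled_lr(base_lr, batch_size, base_batch_size=256):
    """Linear learning-rate scaling: lr = base_lr * batch_size / base_batch_size."""
    return base_lr * batch_size / base_batch_size

# e.g. base_lr = 1.5e-4 with batch_size = 4096 gives lr = 1.5e-4 * 16 = 2.4e-3
```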
The implementation works as follows: during training, a masking ratio (`mr`) between 0.5 and 1 is sampled for each iteration to mask out the input image tokens. Since `mr` is always...
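A hedged sketch of that per-iteration procedure (tensor names and the exact bookkeeping are illustrative; the released code may order these steps differently):

```python
import math
import torch

def mask_and_drop(tokens, mask_token, mr, drop_ratio=0.5):
    """tokens: [B, N, D] tokenized image features; mask_token: [D].

    Mark ceil(mr * N) tokens as masked, replace them with the learnable
    mask token, then drop a fixed 50% of all positions before the ViT
    encoder. The dropped positions are taken from the masked set, which
    is always possible because mr >= 0.5."""
    B, N, D = tokens.shape
    num_mask = math.ceil(mr * N)
    num_keep = N - int(drop_ratio * N)

    # Random permutation per sample; the first `num_mask` shuffled positions are masked.
    ids_shuffle = torch.argsort(torch.rand(B, N, device=tokens.device), dim=1)
    mask = torch.zeros(B, N, dtype=torch.bool, device=tokens.device)
    batch_idx = torch.arange(B, device=tokens.device).unsqueeze(1)
    mask[batch_idx, ids_shuffle[:, :num_mask]] = True

    # Replace masked positions with the mask token.
    tokens = torch.where(mask.unsqueeze(-1), mask_token.expand(B, N, D), tokens)

    # Keep the last `num_keep` shuffled positions: all unmasked tokens plus
    # any leftover masked tokens, so exactly 50% of positions are dropped.
    ids_keep = ids_shuffle[:, -num_keep:]
    encoder_input = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(B, num_keep, D))
    return encoder_input, mask
```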
Hi, thanks for your interest! Yes, `vocab_size` should be `self.codebook_size + 1` when there is no class condition. We set it to `self.codebook_size + 1000 + 1` just because...
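For reference, the token-id layout implied above looks roughly like this (variable names are illustrative, not the ones in the released code):

```python
codebook_size = 1024                            # VQGAN codebook entries, ids [0, 1024)
num_classes = 1000                              # class-condition slots (legacy), ids [1024, 2024)
vocab_size = codebook_size + num_classes + 1    # + 1 extra special token

# Without class conditioning, codebook_size + 1 would already be enough:
vocab_size_no_class_cond = codebook_size + 1
```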
You can actually set it to any value greater than or equal to 1024 and smaller than 1024 + 1000 + 1 -- but the pre-trained model sets it to 1100 (again, a legacy...
Unfortunately, I no longer have access to the original JAX code, so there is no plan to release the contrastive training part. However, that part is quite straightforward if you want...
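If you do want to reimplement it, the contrastive part is essentially a standard InfoNCE/SimCLR-style loss on projected features from two augmented views; the sketch below is only a generic version under that assumption (temperature, projection head, and other details may differ from what we used):

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z1, z2, temperature=0.2):
    """z1, z2: [B, D] projected features from two augmented views."""
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    B = z1.size(0)

    z = torch.cat([z1, z2], dim=0)          # [2B, D]
    sim = z @ z.t() / temperature           # pairwise cosine similarities
    sim.fill_diagonal_(float('-inf'))       # exclude self-similarity

    # The positive for index i is the same image's other view at i +/- B.
    targets = torch.cat([torch.arange(B) + B, torch.arange(B)]).to(sim.device)
    return F.cross_entropy(sim, targets)
```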