Tianhong Li
In fact we started from MaskGIT's BERT architecture, but we found that both linear probing and unconditional generation performance were poor (57.4% accuracy, 20.7 FID). We then found that adopting the...
We must use image tokens as both input and output to enable image generation, because generation proceeds over multiple steps. In the middle of generation, only part of the tokens...
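To make the multi-step point concrete, here is a toy sketch of MaskGIT-style iterative decoding, not the actual MAGE code; the mask id, vocabulary size, sequence length, and the keep-half confidence heuristic are all made-up placeholders:

```python
import random

random.seed(0)

MASK = -1          # hypothetical mask token id
VOCAB = 16         # toy vocabulary size
NUM_TOKENS = 8     # toy sequence length
NUM_ITERS = 4

def predict(tokens):
    """Stand-in for the transformer: returns a (token, confidence) guess per position."""
    return [(random.randrange(VOCAB), random.random()) for _ in tokens]

tokens = [MASK] * NUM_TOKENS           # start fully masked
for _ in range(NUM_ITERS):
    preds = predict(tokens)
    # positions still masked, most confident first
    masked = sorted((i for i, t in enumerate(tokens) if t == MASK),
                    key=lambda i: -preds[i][1])
    # unmask a fraction of the most confident positions each step; the
    # partially-filled token sequence is fed back in as *input* on the next
    # iteration, which is why tokens must serve as both input and output
    keep = max(1, len(masked) // 2)
    for i in masked[:keep]:
        tokens[i] = preds[i][0]
```

After the loop, every position holds a generated token id; earlier iterations see a sequence that is part real tokens, part mask tokens.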
We did not try that, but it could be an option for creating the token_emb. We use a new module, token_emb, because we simply regard the tokens as words, and...
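Treating VQ token ids like words means token_emb is conceptually just a learned lookup table, the same shape of thing as a word-embedding table in BERT. A minimal dependency-free sketch (the codebook size and embedding width here are made up, and a real implementation would use a learnable nn.Embedding):

```python
import random

random.seed(0)

VOCAB_SIZE = 1024   # hypothetical codebook size
EMBED_DIM = 4       # toy embedding width

# token_emb as a plain lookup table: one vector per token id,
# exactly like a word-embedding table in a language model
token_emb = [[random.gauss(0.0, 0.02) for _ in range(EMBED_DIM)]
             for _ in range(VOCAB_SIZE)]

def embed(token_ids):
    """Map a sequence of VQ token ids to their embedding vectors."""
    return [token_emb[t] for t in token_ids]

vectors = embed([3, 512, 3])
# identical token ids map to the same vector
```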
We may release a colab for image editing in the next few months.
Yes, the temperature selection is a bit tricky. For iter=1, we use argmax and temp=0.0, if I remember correctly. For iter=6, we use categorical sampling with temp=4.5.
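For reference, this is how temperature usually enters categorical sampling: logits are divided by the temperature before the softmax, so temp=0.0 degenerates to argmax and a large temp like 4.5 flattens the distribution. A stdlib sketch with made-up logits (not the repo's sampling code):

```python
import math
import random

random.seed(0)

def sample_with_temperature(logits, temp):
    """Sample a token index from logits; temp == 0.0 means greedy argmax."""
    if temp == 0.0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temp for l in logits]       # higher temp -> flatter distribution
    m = max(scaled)                           # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # inverse-CDF categorical draw
    r = random.random()
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

logits = [1.0, 3.0, 0.5]                        # hypothetical logits for 3 tokens
greedy = sample_with_temperature(logits, 0.0)   # the iter=1 setting: argmax
sampled = sample_with_temperature(logits, 4.5)  # the iter=6 setting: near-uniform
```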
You can use either ViT-B or ViT-L for that; ViT-L will give you slightly better visual quality.
> I add an inference code, but looks not correct. Here is a input and output for my code.
>
> The inference code is based on gen_img_uncond.py,...
It should follow the same structure as the ImageNet data: train/class_name/images.png and val/class_name/images.png.
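As a sketch, setting up that layout looks like the following; the dataset root "my_dataset" and the class folder names are placeholders, not names the repo requires:

```shell
# ImageNet-style layout: one subfolder per class under train/ and val/
mkdir -p my_dataset/train/cat my_dataset/train/dog
mkdir -p my_dataset/val/cat my_dataset/val/dog
# then place images inside the class folders, e.g.:
# cp /path/to/cat_001.png my_dataset/train/cat/
```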
My suggestion is to replace the default ImageNet dataloader with your own. Once that is done, you can use the unlabeled image data with main_pretrain.py and use the labeled...
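A custom dataloader for unlabeled images can be sketched as below. This is a dependency-free illustration only: a real version would subclass torch.utils.data.Dataset, decode and transform the image, and be passed to main_pretrain.py in place of the ImageNet loader; the class name, folder, and extensions are all hypothetical:

```python
import os
import tempfile

class UnlabeledImageDataset:
    """Minimal sketch of a dataset over a flat folder of unlabeled images.

    Returns file paths to stay dependency-free; a torch version would
    return a transformed image tensor from __getitem__ instead.
    """
    EXTS = (".png", ".jpg", ".jpeg")

    def __init__(self, root):
        self.paths = sorted(
            os.path.join(root, f)
            for f in os.listdir(root)
            if f.lower().endswith(self.EXTS)
        )

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        return self.paths[idx]

# toy usage with a temporary folder of dummy files
root = tempfile.mkdtemp()
for name in ("a.png", "b.jpg", "notes.txt"):
    open(os.path.join(root, name), "w").close()

ds = UnlabeledImageDataset(root)
# non-image files like notes.txt are skipped
```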
We only use the ImageNet images and never use the label information during pre-training.