Effect of VQGAN code randomness
I understand from #258 that there is randomness in the generated VQGAN code sequences because of the Gumbel-Softmax sampling, but the different sequences nevertheless reconstruct to similar-looking images. However, since training is done by predicting the sequence tokens and not by comparing the reconstructed images themselves, I am wondering whether, and how, having different token sequences affects pretraining and downstream performance. Was this investigated to check for consistency in performance across different variations of the generated code sequences?
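To make my reading of the objective concrete, here is a rough sketch (not OFA's actual code; the shapes, codebook size, and tensor names are made up): image infilling is supervised with token-level cross-entropy on discrete code indices, so two encodings of the same image that differ in even a few tokens give different targets, even if both decode to near-identical images.

```python
import torch
import torch.nn.functional as F

batch, seq_len, codebook_size = 2, 256, 8192  # e.g. 16x16 codes per image (illustrative)

logits = torch.randn(batch, seq_len, codebook_size)                  # model's token predictions
target_codes_a = torch.randint(0, codebook_size, (batch, seq_len))   # one VQGAN encoding of the image
target_codes_b = target_codes_a.clone()
target_codes_b[:, :5] = torch.randint(0, codebook_size, (batch, 5))  # a second encoding differing in a few positions

# Token-level losses differ even though both target sequences may reconstruct
# to visually similar images.
loss_a = F.cross_entropy(logits.reshape(-1, codebook_size), target_codes_a.reshape(-1))
loss_b = F.cross_entropy(logits.reshape(-1, codebook_size), target_codes_b.reshape(-1))
print(loss_a.item(), loss_b.item())
```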
A good question. In our preliminary experiments, we found that using different sequences can slightly improve model performance; it seems the randomness in the VQGAN encoding process acts as a kind of data augmentation or label smoothing. But we didn't conduct a more in-depth quantitative study.
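To illustrate where the randomness comes from, here is a minimal sketch assuming a Gumbel-Softmax quantizer in the image tokenizer (the function and logits below are illustrative, not our tokenizer's actual API): the Gumbel noise lets near-tied codebook entries win on different calls, so repeated encodings of the same image can yield slightly different token sequences.

```python
import torch
import torch.nn.functional as F

def sample_codes(encoder_logits, tau=1.0):
    # Stochastic quantization: hard Gumbel-Softmax picks one codebook entry
    # per position, but the choice can change between calls when entries
    # have similar logits.
    one_hot = F.gumbel_softmax(encoder_logits, tau=tau, hard=True, dim=-1)
    return one_hot.argmax(dim=-1)

encoder_logits = torch.randn(1, 256, 8192)   # stand-in for per-position codebook logits
codes_1 = sample_codes(encoder_logits)
codes_2 = sample_codes(encoder_logits)
print((codes_1 != codes_2).float().mean())   # fraction of positions that changed between encodings
```

From the training side, these differing target sequences behave like lightly perturbed labels for the same input, which is why the effect resembles data augmentation or label smoothing.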
I see. So what you're saying is that there is some value in using multiple (slightly different) sequences representing the same image and this could be interpreted as data augmentation on the sequences used for the Image Infilling task. Interesting take. I would like to try and explore this further.