vit-pytorch: How to use this model for image generation?

Thanks for the great work. I removed the classification head and am trying to use this repo for image generation, but I get really bad results. All the images look patchy and are very low quality. I played with the number of heads, number of layers, LR, etc., but it didn't really matter.

What would be the most sensible approach to generate images with the encoder part?

Nov 19 '20 06:11 basamelatex

@basamelatex no one has shown that this can work with a straight encoder yet afaik, but people have discretized the pixel space and then used a decoder to generate the image autoregressively, as with iGPT and Image Transformer
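
To make "discretized the pixel space" concrete, here is a minimal sketch of the tokenization step only. It assumes uniform binning of each RGB channel into `LEVELS` bins as a stand-in for a learned color palette (iGPT, as far as I know, clusters colors with k-means instead), and all function names are made up for illustration; none of this is part of vit-pytorch.

```python
# Sketch: map an RGB image to a 1-D sequence of discrete tokens and back.
# Assumption: uniform binning per channel replaces a learned color palette.

LEVELS = 8  # bins per channel, so the vocabulary has 8**3 = 512 tokens


def pixel_to_token(rgb):
    """Quantize one (r, g, b) pixel with 0..255 channels into a single token id."""
    r, g, b = (c * LEVELS // 256 for c in rgb)
    return (r * LEVELS + g) * LEVELS + b


def token_to_pixel(token):
    """Map a token id back to the center color of its bin."""
    token, b = divmod(token, LEVELS)
    r, g = divmod(token, LEVELS)
    step = 256 // LEVELS
    return tuple(q * step + step // 2 for q in (r, g, b))


def image_to_tokens(image):
    """Flatten a 2-D grid of pixels (list of rows) into raster-order tokens."""
    return [pixel_to_token(px) for row in image for px in row]


def tokens_to_image(tokens, width):
    """Rebuild the 2-D grid of (quantized) pixels from raster-order tokens."""
    pixels = [token_to_pixel(t) for t in tokens]
    return [pixels[i:i + width] for i in range(0, len(pixels), width)]
```

A decoder-only transformer would then be trained on these sequences to predict each token from the ones before it, and an image is sampled one token at a time before being mapped back with `tokens_to_image`.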

Nov 19 '20 17:11 lucidrains

Thanks a lot for your answer. I checked out the papers you mentioned above. I noticed that they were only able to generate quite small images, such as 64x64, and used relatively small datasets like CIFAR10. On the other hand, the ViT paper suggests that the model doesn't work well on small datasets. Do you think this would be the case for image generation as well? Do we really need a huge dataset for ViT to work on image generation? I would like to give it a try, but I feel a bit skeptical after seeing the 300M-image dataset they use.

Nov 19 '20 19:11 basamelatex

I would like to know how to use this model for spatio-temporal forecasting, such as nowcasting from radar echoes, as is done with ConvLSTM.

Nov 29 '20 07:11 bugsuse