
Unbelievable results of the PixelSnail network

Open ZhanYangen opened this issue 3 years ago • 8 comments

Hi, I just successfully ran train_pixelsnail.py on the top-level codes (size: 8×8) of my own dataset, which consists of 300,000 encoded outputs from 64×64 images of geometric shapes with varying sizes, rotations, and counts. After merely 5 steps (batches of 32 samples each) in epoch 1, the accuracy reached 100%. The Spyder console displayed the following while running:

```
epoch: 1; loss: 2.26627; acc: 0.99561; lr: 0.00030: 0%| | 4/9375 [00:00<21:57, 7.11it/s]
```

Moreover, as the step count goes up, the loss drops below 0.00001. Such an amazing result makes me wonder whether this network is indeed that powerful or whether something is wrong with my adapted code. Also, if it turns out to be the former, is the model necessarily overfitted?

It is worth mentioning that the VQ-VAE-2 network also achieved remarkable results. It took less than 5 minutes, specifically 2 epochs over these 300,000 samples, to get the reconstructions below (first row: originals; second row: reconstructions). I had previously tested this dataset on a plain VAE, whose results were far blurrier and which took much longer to train. So, all of a sudden, this outcome is a bit hard to accept...

[image: reconstruction results]

One more question: I'm not sure why this network reports an accuracy metric, as if from some classifier... Does it have something to do with Section 3.3 of the paper? Or is it the accuracy with which the encoder output of the VQ-VAE is assigned to the correct quantized vector?

Thanks.

ZhanYangen avatar Jan 19 '21 15:01 ZhanYangen

Well, here is an update on my status. I trained the PixelSNAIL network on the bottom-level codes and found that after 1 epoch the loss stayed around 0.7 and the accuracy around 80%. The generated samples are as follows:

[image: generated samples]

The huge difference between the top and bottom levels still confuses me a little...

I also found that it takes about 5 seconds to generate 16 images but 26 seconds to generate 200 images. Aside from this non-proportional relation between image count and time cost, I wonder whether there is any way to speed up the generation process, since a vanilla VAE can easily generate thousands of images in a few seconds. Or is the slowness an innate property of PixelCNN-style models, which are simply meant to be much slower?

ZhanYangen avatar Jan 20 '21 09:01 ZhanYangen

I don't know the exact reason, but maybe the top-level codes are much more predictable or simple. (Maybe your data has strong local correlations, so it doesn't need higher abstractions.)

Accuracy is just the rate at which the autoregressive model correctly predicts the next token given the previous sequence. I added it simply to have one more metric of model performance.
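For illustration, here is a minimal sketch of how such a next-token accuracy can be computed from the model's logits (the tensor shapes are my assumptions about how the codes are laid out, not a verbatim excerpt from train_pixelsnail.py):

```python
import torch

def token_accuracy(logits: torch.Tensor, target: torch.Tensor) -> float:
    # logits: model output, assumed shape [batch, n_embed, height, width]
    # target: ground-truth code indices, assumed shape [batch, height, width]
    pred = logits.argmax(dim=1)  # most likely code index at each position
    return (pred == target).float().mean().item()
```

A trivially predictable code map (e.g., mostly one repeated index) would make this metric saturate at 100% almost immediately.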

As for sampling speed, it is very hard to accelerate sampling from autoregressive models. I have tried caching mechanisms for PixelSNAIL, but they only gave about a 2× speedup.
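To see where the time goes, here is a simplified sketch of the sampling loop (no caching; the model call signature is an assumption for illustration, since the repository's PixelSNAIL also returns a cache):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sample_codes(model, batch, height, width, temperature=1.0, device="cuda"):
    # Every position conditions on all previously sampled positions, so the
    # loop runs height * width sequential forward passes; all images in the
    # batch advance together through each pass.
    codes = torch.zeros(batch, height, width, dtype=torch.int64, device=device)
    for i in range(height):
        for j in range(width):
            logits, _ = model(codes)  # assumed: [batch, n_embed, height, width]
            prob = F.softmax(logits[:, :, i, j] / temperature, dim=1)
            codes[:, i, j] = torch.multinomial(prob, 1).squeeze(-1)
    return codes
```

This also explains the non-proportional timing you saw: wall-clock time scales with the height × width sequential passes, not with the number of images, and the batch dimension runs in parallel on the GPU until it saturates, so 200 images cost only about 5× the time of 16.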

rosinality avatar Jan 20 '21 12:01 rosinality

@rosinality Got it! Thanks, that's really helpful.

ZhanYangen avatar Jan 20 '21 13:01 ZhanYangen

Extending @rosinality's answer: this phenomenon is well explored by Gallucci et al. in the attached paper. In some situations the top code collapses to a single value, which makes the loss drop quickly to zero while training the top PixelSNAIL. The authors used @rosinality's code and showed that PixelSNAIL training can be made somewhat more tractable by varying n_embed and embed_dim according to the application. The generated images in the paper also look quite reasonable. vqvae2_gallucci.pdf
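As a quick diagnostic (my own sketch, not part of the repository; it assumes encode() returns the top-level code indices as its fourth value, as in this repo's vqvae.py), you can histogram the top-level code indices over the dataset; if nearly all positions map to one codebook entry, the top prior only has to predict a constant:

```python
import torch

@torch.no_grad()
def top_code_usage(vqvae, loader, n_embed=512, device="cuda"):
    # Tally how often each of the n_embed codebook entries appears in the
    # top-level code maps across the dataset.
    counts = torch.zeros(n_embed, dtype=torch.long)
    for img, _ in loader:
        # assumed return order: quant_t, quant_b, diff, id_t, id_b
        _, _, _, id_t, _ = vqvae.encode(img.to(device))
        counts += torch.bincount(id_t.flatten().cpu(), minlength=n_embed)
    used = int((counts > 0).sum())
    share = counts.max().item() / counts.sum().item()
    print(f"{used}/{n_embed} codes used; most frequent covers {share:.1%} of positions")
    return counts
```

A healthy top level uses many entries with no single dominant one; a collapsed one shows a single entry covering nearly 100% of positions.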

sbhadra2020 avatar Apr 07 '21 20:04 sbhadra2020

@sbhadra2020 Thanks for your useful advice! Indeed, I also found that the top code had collapsed to a single value.

ZhanYangen avatar Apr 08 '21 05:04 ZhanYangen

@ZhanYangen You're welcome. You previously posted some generated samples after training both the top and bottom PixelSNAIL networks. However, I can see that the generated samples have arbitrary shapes and do not look like the geometric training dataset. Did you see any improvement in the generated samples after the images you posted? I am curious to know since I am still struggling to get reasonable results from my PixelSNAIL training.

sbhadra2020 avatar Apr 08 '21 06:04 sbhadra2020

@sbhadra2020 Unfortunately, a few days after posting this question, it occurred to me that VQ-VAE-2 was ill-suited to the downstream application in my research project (the details are somewhat complicated). I switched to the original VQ-VAE model and never used its generation capability, so I did not dig deeper into improving the image generation.

ZhanYangen avatar Apr 08 '21 06:04 ZhanYangen

@rosinality @sbhadra2020 @ZhanYangen Hi, everyone! It seems that I have met the same problem. The reconstruction results are reasonable, but the sampled results are strange. So is the problem the top code? What changes should I make?

The sampled results are as follows:

[image: sampled results]

ZhouCX117 avatar May 19 '21 04:05 ZhouCX117