
Image File Extension

Open SURABHI-GUPTA opened this issue 4 years ago • 29 comments

@rosinality I have a doubt about the image format used for training the model. FFHQ is a collection of high-resolution PNG images; if I train on the LFW dataset instead, whose images are 160x160 JPGs, will that affect the compression ratio or the performance of your model?

SURABHI-GUPTA avatar Dec 19 '20 14:12 SURABHI-GUPTA

It will depend on the JPEG compression rate, but I don't think it will affect the quality much.

rosinality avatar Dec 19 '20 17:12 rosinality

@rosinality My original image and the image reconstructed by the model have no difference in size. It doesn't seem like the image is getting compressed. Is there anything I am misunderstanding?

SURABHI-GUPTA avatar Dec 21 '20 14:12 SURABHI-GUPTA

The latent code is the compressed representation of the image.

rosinality avatar Dec 21 '20 17:12 rosinality

@rosinality okay. I want to check the size of the top and bottom latent maps. How do I save them?

SURABHI-GUPTA avatar Dec 22 '20 09:12 SURABHI-GUPTA

You don't need to save them. The latent maps are 32x32 and 64x64 with 512 discrete codes, so the size of the latent codes will be 9 * (32 * 32 + 64 * 64) bits.

rosinality avatar Dec 22 '20 10:12 rosinality
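As a sanity check, the arithmetic above can be reproduced in a few lines (assuming the repo's default 32x32 and 64x64 latent maps and K = 512 codes):

```python
import math

K = 512                                    # codebook size (number of discrete codes)
bits_per_code = math.ceil(math.log2(K))    # 9 bits to index one of 512 codes

top = 32 * 32                              # top latent map positions
bottom = 64 * 64                           # bottom latent map positions

total_bits = bits_per_code * (top + bottom)
print(total_bits)                          # 46080
```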

@rosinality If I want to check the final compressed size to evaluate my model against another compression algorithm, what will the size be? Will it be 9 * (32 * 32 + 64 * 64) bits?

SURABHI-GUPTA avatar Dec 22 '20 10:12 SURABHI-GUPTA

Of course it will be reduced further if you use additional compression.

rosinality avatar Dec 22 '20 10:12 rosinality

@rosinality thanks for clearing my doubts.

SURABHI-GUPTA avatar Dec 22 '20 10:12 SURABHI-GUPTA

@rosinality So, what I understand from the above discussion is,

For any given image of size 256x256, with latent maps of 32x32 and 64x64 over 512 discrete codes, I will always end up with a compressed size of 9 * (32 * 32 + 64 * 64) = 46080 bits and bpp = 46080 / (256 * 256) ≈ 0.7. Am I correct?

SURABHI-GUPTA avatar Dec 23 '20 04:12 SURABHI-GUPTA

That is the upper bound on the number of bits. If the model uses fewer than 512 latent codes, say 250, or if the number of latent codes needed for a specific image is less than 512, then it could be reduced. But it depends on how you define the size of the latent codebook (upper bound, trained-model specific, or image specific).

rosinality avatar Dec 23 '20 04:12 rosinality
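The three definitions give different bpp values. A quick sketch of the first two (assuming 32x32 + 64x64 latent maps on a 256x256 image, counting an RGB triple as one pixel):

```python
import math

positions = 32 * 32 + 64 * 64          # latent positions per image
pixels = 256 * 256

def bpp(codebook_size):
    # bits per latent position times number of positions, divided by image pixels
    return math.ceil(math.log2(codebook_size)) * positions / pixels

print(bpp(512))   # upper bound, 9 bits/code -> 0.703125
print(bpp(250))   # model actually uses 250 codes, 8 bits/code -> 0.625
```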

@rosinality So the number of embedding vectors we define, K=512, is also the upper bound? How can we find out how many latent codes are actually used? That is important for calculating the number of bits.

SURABHI-GUPTA avatar Dec 23 '20 05:12 SURABHI-GUPTA

Yes, 512 will be the upper bound. You can count the number of latent codes actually used like this:

```python
_, _, _, id_t, id_b = vqvae.encode(img)
# number of distinct latent codes used in the top/bottom latent maps
torch.unique(id_t).shape, torch.unique(id_b).shape
```

rosinality avatar Dec 23 '20 05:12 rosinality
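A self-contained sketch of the same idea, with random id maps standing in for the output of `vqvae.encode(img)` (in the real model `id_t` is 32x32 and `id_b` is 64x64 with values in [0, 512)):

```python
import math
import torch

# Stand-ins for the id maps returned by vqvae.encode(img)
id_t = torch.randint(0, 512, (1, 32, 32))
id_b = torch.randint(0, 512, (1, 64, 64))

n_top = torch.unique(id_t).numel()
n_bottom = torch.unique(id_b).numel()

# Bits needed to index only the codes actually used:
bits_top = math.ceil(math.log2(n_top))
bits_bottom = math.ceil(math.log2(n_bottom))
print(n_top, bits_top, n_bottom, bits_bottom)
```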

@rosinality Although the number of unique values is less than 512, the number of bits required is still 9, and hence the image will still be 46080 bits, right? Also, say I want to make my top latent map 16x16 rather than 32x32; how do I do that? Finally, I looked at the loss-vs-epoch curve, and it doesn't decrease smoothly: at some points there is a sudden increase in loss followed by a decrease.

SURABHI-GUPTA avatar Dec 26 '20 13:12 SURABHI-GUPTA

If the number of unique codes is less than 256, then 8 bits will be enough. And if you want to reduce the latent map sizes, you can just add downsample & upsample layers. Also, VQ-VAE training seems to be inherently not very smooth, maybe because it is built on discretization and k-means-style clustering.

rosinality avatar Dec 26 '20 13:12 rosinality
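One way to read "add downsample & upsample layers": insert an extra stride-2 convolution in the encoder path (halving 32x32 to 16x16) and a matching transposed convolution in the decoder. A minimal standalone sketch, not the repo's actual Encoder/Decoder classes; the channel width of 128 is a hypothetical value:

```python
import torch
from torch import nn

channel = 128  # hypothetical channel width

extra_down = nn.Conv2d(channel, channel, 4, stride=2, padding=1)          # 32x32 -> 16x16
extra_up = nn.ConvTranspose2d(channel, channel, 4, stride=2, padding=1)   # 16x16 -> 32x32

x = torch.randn(1, channel, 32, 32)
h = extra_down(x)
print(h.shape)   # torch.Size([1, 128, 16, 16])
y = extra_up(h)
print(y.shape)   # torch.Size([1, 128, 32, 32])
```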

@rosinality I tried with K=128, and then the number of unique codes was also less than 128. So if I reduce K, the number of unique codes decreases in the same ratio. For testing the model: since you mentioned that 8 bits are enough when the number of unique codes is less than 256, those 8 bits will be counted when calculating the size, even though during training there could be more than 256 unique codes with K=512.

SURABHI-GUPTA avatar Dec 26 '20 14:12 SURABHI-GUPTA

@rosinality How can we calculate the bpp of the original/uncompressed image? In the formula compression ratio = S_uncompressed / S_compressed, what exactly is S_uncompressed? Is it the file size we get from os.path.getsize(full_path), or is it w * h * c * 8?

SURABHI-GUPTA avatar Dec 29 '20 07:12 SURABHI-GUPTA

Doesn't bpp correspond to bits per pixel? Then it would be (total bits for the image) / (# pixels in the image). If we treat an (R, G, B) triple as one pixel, the denominator is 256 * 256; if we count channels independently, it is 3 * 256 * 256.

If you want to calculate the length of the uncompressed bit sequence, it is 3 * 256 * 256 * 8. The actual file size of an image will not be accurate for this, as image files are generally compressed and contain headers.

rosinality avatar Dec 29 '20 11:12 rosinality
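Putting the two denominators and the uncompressed size together for a 256x256 RGB image:

```python
H, W, C = 256, 256, 3

uncompressed_bits = C * H * W * 8        # raw RGB, 8 bits per channel
print(uncompressed_bits)                 # 1572864

# bpp of the raw image, counting an (R, G, B) triple as one pixel:
print(uncompressed_bits / (H * W))       # 24.0

# bpp counting each channel as an independent "pixel":
print(uncompressed_bits / (C * H * W))   # 8.0
```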

@rosinality Sorry for the mistake, I am talking about CR, compression ratio = S_uncompressed / S_compressed. How do I find S_uncompressed for calculating the compression ratio? For an RGB image of 160x160 pixels, os.path.getsize(full_image_path) gives 3426 B. I don't understand how to get this value from your calculation above.

SURABHI-GUPTA avatar Dec 29 '20 14:12 SURABHI-GUPTA

I think you need to check how CR is defined, as I don't know that problem well. Anyway, an uncompressed RGB image needs 3 * 8 * H * W bits. As I noted, image files are generally already compressed, so the actual file size will not correspond to this. (It will be close if the image format is BMP, though.)

rosinality avatar Dec 29 '20 14:12 rosinality

@rosinality btw I am talking about this image: [image: original_1]

SURABHI-GUPTA avatar Dec 29 '20 15:12 SURABHI-GUPTA

Yes, and it is JPEG compressed. You will find that its size is close to 160 x 160 x 3 bytes if you convert it to BMP.

rosinality avatar Dec 29 '20 15:12 rosinality

@rosinality okay... I got some clarity now.. thank you for the quick response.

SURABHI-GUPTA avatar Dec 29 '20 15:12 SURABHI-GUPTA

@rosinality I have a query regarding your code:

```python
qt, qb, di, _id_t, _id_b = vqvae.encode(img_t)
```

When I try reconstructions using the quantized embedding (qb) and using the quantized latent map via the codebook (_id_b), they should be the same, right? However, I get different results.

```python
decoded_sample_bot = vqvae.decode(qt * 0, qb).detach()[0].squeeze(0)
decoded_sample_bot = vqvae.decode_code(id1_top, _id_b).detach()[0].squeeze(0)  # here id1_top is an array of 1s
```

SURABHI-GUPTA avatar Jan 03 '21 14:01 SURABHI-GUPTA

@rosinality Please help. For the bottom code, I am getting different reconstructions. With decoded_sample_bot = vqvae.decode_code(id1_top, _id_b).detach()[0].squeeze(0), I get: [image: decoded_sample_b]

but with decoded_sample_bot = vqvae.decode(qt * 0, qb).detach()[0].squeeze(0), I get: [image: decoded_sample_b]

SURABHI-GUPTA avatar Jan 04 '21 09:01 SURABHI-GUPTA

qt * 0 and the embeddings looked up from id 1 (id1_top) will not be the same.

rosinality avatar Jan 04 '21 09:01 rosinality

@rosinality Considering id1_top = torch.zeros([1, 10, 10], dtype=torch.long), doesn't that also mean we are making the top zero for the bottom-only reconstruction? How can we get the correct result?

SURABHI-GUPTA avatar Jan 04 '21 11:01 SURABHI-GUPTA

No, it does not. Since each id (code) is mapped to an embedding, if you use a code map filled with zeros, it will be mapped to a latent map filled with the 0th embedding vector (which is nonzero). That is definitely different from a latent map filled with zeros.

rosinality avatar Jan 04 '21 11:01 rosinality
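The distinction can be seen with a toy codebook. The sizes here (512 codes, 64-dim embeddings) are illustrative, not taken from the repo's config:

```python
import torch
from torch import nn

# A toy codebook: 512 codes, 64-dim embeddings
codebook = nn.Embedding(512, 64)

zero_codes = torch.zeros(1, 10, 10, dtype=torch.long)  # id map filled with code 0
looked_up = codebook(zero_codes)                       # 0th embedding vector at every position

# The 0th embedding is (almost surely) nonzero under random init,
# so this latent map differs from one literally filled with zeros:
print(torch.allclose(looked_up, torch.zeros_like(looked_up)))  # False
```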

@rosinality Yes, that makes sense, but for the top latent embedding I do get the correct reconstruction.

SURABHI-GUPTA avatar Jan 04 '21 12:01 SURABHI-GUPTA

I don't know whether it is fair to call that a correct reconstruction. Anyway, zero latent code != zero latent embedding.

rosinality avatar Jan 04 '21 13:01 rosinality