
The performance of VAR Tokenizer

Open youngsheen opened this issue 10 months ago • 5 comments

What is the performance of the VAR tokenizer? It is trained on OpenImages, while some other VQGAN tokenizers are trained on ImageNet only. I wonder how much of the performance gain comes from the pre-training data.

youngsheen avatar Apr 09 '24 18:04 youngsheen

hi @youngsheen, more VQVAE evals are coming in the next paper update.

We trained the VQVAE on OpenImages, following VQGAN (see https://github.com/CompVis/taming-transformers?tab=readme-ov-file#overview-of-pretrained-models).

We actually found that training the VQVAE directly on ImageNet yields slightly better results than OpenImages. We kept using OpenImages to stay aligned with our VQGAN baseline.

keyu-tian avatar Apr 09 '24 19:04 keyu-tian

Is the tokenizer able to do image understanding?

luohao123 avatar Apr 11 '24 07:04 luohao123

I used the VQVAE in VAR and compared the image produced by encoding and decoding against the original image, as shown below. [reconstruction comparison images] Is this because the generalization of the VQVAE is not good enough?

huxiaotaostasy avatar Apr 11 '24 09:04 huxiaotaostasy

@huxiaotaostasy please make sure you denormalize and clamp the VQVAE output with out = out.mul(0.5).add_(0.5).clamp_(0, 1).
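A minimal sketch of that post-processing step, assuming the decoder emits images normalized to [-1, 1] (the variable names and the toy tensor below are illustrative, not from the VAR codebase):

```python
import torch

def denormalize(out: torch.Tensor) -> torch.Tensor:
    """Map a decoder output from [-1, 1] back to [0, 1] for display or saving."""
    # mul(0.5) returns a new tensor, so the in-place add_/clamp_ do not mutate `out`
    return out.mul(0.5).add_(0.5).clamp_(0, 1)

# toy decoder output covering in-range and out-of-range values
out = torch.tensor([-1.0, 0.0, 1.0, 2.0])
img = denormalize(out)  # tensor([0.0000, 0.5000, 1.0000, 1.0000])
```

Skipping this step makes reconstructions look washed out or distorted even when the VQVAE itself is fine, since image viewers expect pixel values in [0, 1].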

keyu-tian avatar Apr 12 '24 13:04 keyu-tian

@luohao123 maybe you can create token maps (r1, r2, ..., rK) by repeating one index in [0, V-1] across all scales, then decode them to see what the reconstructed image looks like.
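A sketch of building such constant token maps, assuming VAR's default multi-scale schedule and codebook size (both values, and the decoder call mentioned in the comment, are assumptions to verify against the repo):

```python
import torch

# assumed multi-scale side lengths and codebook size, matching VAR's defaults
patch_nums = (1, 2, 3, 4, 5, 6, 8, 10, 13, 16)
V = 4096
idx = 123  # any fixed code index in [0, V-1]

# one flattened token map per scale, every position holding the same index
token_maps = [torch.full((pn * pn,), idx, dtype=torch.long) for pn in patch_nums]

# these maps could then be fed to the VQVAE decoder (something like
# vqvae.idxBl_to_img(...) in the VAR repo) to inspect what a single
# codebook entry decodes to when repeated at every scale
```

Sweeping `idx` over the codebook this way gives a rough qualitative view of what individual codes encode, which can help judge whether reconstruction issues come from the tokenizer or from post-processing.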

keyu-tian avatar Apr 12 '24 13:04 keyu-tian