
Confused by BPD, Hardware and FID?

alexm-gc opened this issue 2 years ago

I was comparing DenseFlow against VDM on ImageNet64x64.

DenseFlow: 3.35 BPD, 130M params, 1 V100 for ~2 weeks
VDM: 3.40 BPD, ?M params, 128 TPUv3 for ? weeks

It looks like DenseFlow gets better BPD with ~100x less compute, at the cost of worse FID.

Question 0. Do you know the training loss of DenseFlow/VDM? I imagine a low training loss leads to good FID (perhaps VDM has train = 1.0 BPD and valid = 3.4 BPD, while DenseFlow has train = 3.3 and valid = 3.4).

Question 1. Did you make any test cases for the BPD computation? It just sounds too good to be true that we can get better BPD with 100x less compute.
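
For concreteness, the kind of test I have in mind (an illustrative sketch of mine, not code from this repo): a uniform model over 8-bit pixels must come out at exactly 8 BPD.

```python
import numpy as np

def nats_to_bpd(total_nll_nats, num_dims):
    """Convert a total negative log-likelihood in nats to bits per dimension."""
    return total_nll_nats / (num_dims * np.log(2.0))

# A uniform model over 256 pixel values assigns log p(x) = -D * log(256)
# nats to any 8-bit image, so its BPD must be exactly log2(256) = 8.
D = 64 * 64 * 3
assert abs(nats_to_bpd(D * np.log(256.0), D) - 8.0) < 1e-9
```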

Question 2. There may be a trade-off between BPD and FID. That is, DenseFlow gets good BPD but bad FID, while VDM gets good FID but worse BPD. Do you believe this is the case? If so, what do you believe causes this phenomenon?

alexm-gc · May 14 '22

Hi @alexm-gc, sorry for the late response.

A0. DenseFlow-74-10 achieves BPD around 3.23 on the train set. I don't know the value of train BPD for VDM. Perhaps you should ask the authors.

A1. The code is open-sourced so that everyone can validate the results independently. I'd say the right question is whether we really need dozens of GPUs to model such small images (32x32 or 64x64 pixels).

A2. I'm not sure about the BPD/FID tradeoff. Proving such a statement would be a great research direction! Perhaps some models can be very good at approximating the data distribution but hard to sample from. I'd say this is the case for NFs, since we need to sample a very large latent representation.
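
To illustrate with a toy sketch (the AffineFlow below is my stand-in, not DenseFlow's architecture): sampling from a flow means drawing a Gaussian with the full data dimensionality and pushing it through the inverse map.

```python
import torch

class AffineFlow(torch.nn.Module):
    """Toy invertible map x = z * exp(s) + t; a stand-in for a real flow."""
    def __init__(self, shape):
        super().__init__()
        self.s = torch.nn.Parameter(torch.zeros(shape))
        self.t = torch.nn.Parameter(torch.zeros(shape))

    def inverse(self, z):
        return z * torch.exp(self.s) + self.t

flow = AffineFlow((3, 64, 64))
# The latent matches the data dimensionality: 3 * 64 * 64 = 12288 dims per
# image, and every one of them must land in a high-density region for the
# decoded sample to look good.
z = torch.randn(16, 3, 64, 64)
x = flow.inverse(z)
print(x.shape)  # torch.Size([16, 3, 64, 64])
```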

matejgrcic · Jun 03 '22

Thanks for the reply!

> The code is open-sourced so that everyone can validate the results independently.

I'll take a look!

> I'd say the right question is whether we really need dozens of GPUs to model such small images (32x32 or 64x64 pixels).

I agree that flop/memory-wise 64x64 is much easier than say 256x256.

That said, I do not think a single GPU is sufficient if we are to train a generative model on ImageNet64x64! I believe such a generative model needs to incorporate:

(1) physics, like lighting and the positions of objects in 3D space;
(2) the specific objects, like dogs/cats/humans.

I think this remains difficult at 64x64. Note that both DALL-E 2 and Image GPT were trained at 64x64 resolution! (DALL-E 2 uses separate upsampling stages: 64x64 -> 256x256 -> 1024x1024.)

alexm-gc · Jun 10 '22

Hi, I think the reason why DenseFlow achieves such a good BPD on ImageNet32/ImageNet64 at a distinctly lower computational cost is that the wrong version of downsampled ImageNet was used. I have recently uploaded the code of our ICML 2023 paper, Improved Techniques for Maximum Likelihood Estimation for Diffusion ODEs (https://github.com/thu-ml/i-DODE), where this issue is emphasized:

> There are two different versions of the ImageNet32 dataset. For fair comparisons, we use both versions of ImageNet32: one is downloaded from https://image-net.org/data/downsample/Imagenet32_train.zip, following Flow Matching [3], and the other is downloaded from http://image-net.org/small/train_32x32.tar (old version, no longer available), following ScoreSDE and VDM. The former dataset applies anti-aliasing and is easier for maximum likelihood training.

Clearly, DenseFlow chose the new version of ImageNet32/64 (https://github.com/matejgrcic/DenseFlow/blob/473220a9c02b262b481fbaa50a947e40bad3f99c/denseflow/data/datasets/image/imagenet32.py), which favors BPD. Therefore, I suggest the authors clarify this and remove the BPD result from the leaderboard (https://paperswithcode.com/paper/densely-connected-normalizing-flows), where the other methods use the old version of ImageNet, so the comparison is unfair and confusing.
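
To make the anti-aliasing point concrete, a small illustration (my own sketch; the exact resampling filters used by the two releases may differ, see the linked pipelines):

```python
from PIL import Image

img = Image.open("some_imagenet_image.JPEG")  # hypothetical source image

# Downsampling with an anti-aliasing filter (as in the newer
# image-net.org/data/downsample release) removes high-frequency detail,
# making the 32x32 images easier to fit under maximum likelihood.
smooth = img.resize((32, 32), Image.Resampling.LANCZOS)

# Downsampling without anti-aliasing (roughly what the old
# small/train_32x32 release looks like) keeps aliasing artifacts,
# so the same model reaches a higher BPD.
aliased = img.resize((32, 32), Image.Resampling.NEAREST)
```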

zhengkw18 · Nov 29 '23

Thanks for doing this; it looks like you managed to get to the bottom of a tricky problem! I'll try to verify at some point, a bit busy at the moment. If what you're saying is true, we should 100% get someone to update the benchmark.

AlexanderMath · Dec 01 '23