Inconsistent loss term with paper

[Open] quantumiracle opened this issue 2 years ago · 8 comments

Hi,

Thanks for open-sourcing this wonderful project!

However, I notice that in CT training, the loss term has the target model denoising $x_{t_{n+1}}$ instead of $x_{t_n}$, which differs from the loss stated in Algorithm 3 (CT) of the paper, where the target model denoises $x_{t_n}$. Did I miss something, or does this deviation not matter?
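
For reference, Algorithm 3 (CT) in the paper computes the target at $t_n$ (restating the paper's loss here; $\theta^-$ denotes the EMA target parameters):

$$\mathcal{L}(\theta, \theta^-) = \lambda(t_n)\, d\big(f_\theta(x + t_{n+1} z,\, t_{n+1}),\ f_{\theta^-}(x + t_n z,\, t_n)\big), \quad z \sim \mathcal{N}(0, I)$$

whereas the code appears to feed $x_{t_{n+1}}$ to the target model as well.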

quantumiracle avatar Apr 16 '23 19:04 quantumiracle

There are some differences between the paper and the code in general. For example, the rho scheduling is reversed in the paper; the formula used in the code is closer to EDM.
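
For context, the schedule in the code looks roughly like this (a sketch based on get_sigmas_karras in cm/karras_diffusion.py; details such as the appended final zero are omitted). It runs from sigma_max down to sigma_min, following EDM, while the paper writes the $t_i$ increasing from $\epsilon$ to $T$:

  import torch as th

  def get_sigmas_karras(n, sigma_min, sigma_max, rho=7.0):
      # Interpolate linearly in sigma^(1/rho) space, then raise back to
      # the rho-th power; note the schedule is decreasing, i.e. the
      # reverse of the increasing t_i sequence in the paper.
      ramp = th.linspace(0, 1, n)
      min_inv_rho = sigma_min ** (1 / rho)
      max_inv_rho = sigma_max ** (1 / rho)
      return (max_inv_rho + ramp * (min_inv_rho - max_inv_rho)) ** rho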

There are other differences too; for example, the method for adding noise when computing the distiller_target is slightly different.

It's hard to know whether the paper or the code is the better approach. I'm inclined to think the code is more up to date, but I'm basing that only on the repo's release date being after the paper's publication date.

thorinf avatar Apr 18 '23 13:04 thorinf

Did you guys understand the preconditioning of the time signal in the denoising method of the Karras diffusion class? I also cannot find this equation in the paper:


  rescaled_t = 1000 * 0.25 * th.log(sigmas + 1e-44)
  model_output = model(c_in * x_t, rescaled_t, **model_kwargs)
  denoised = c_out * model_output + c_skip * x_t

from the denoising method: https://github.com/openai/consistency_models/blob/main/cm/karras_diffusion.py#L346

Without the 1000 it is the same noise conditioning used in the EDM preconditioning ($c_{noise}(\sigma) = \tfrac{1}{4}\ln\sigma$; the 1e-44 just guards against $\log 0$), but I don't understand the additional factor of 1000.

mbreuss avatar Apr 18 '23 14:04 mbreuss

There are quite a few differences. I've raised an additional ticket and emailed Yang Song to hopefully find out which is better.

https://github.com/openai/consistency_models/issues/18

I would need to check, but the scaling might be due to how the temporal embedding is computed in the model. The EDM paper might be using a method that works well with small floats, whereas something like a sinusoidal PE would prefer larger values.

thorinf avatar Apr 18 '23 14:04 thorinf

Thanks for the info, that's good to hear! Let us know when you hear something.

They use an MLP to encode the timestep in the UNet, so large values should not be better. But maybe I am missing something there. Also, the general preconditioning of the noise is not mentioned in the paper at all.

mbreuss avatar Apr 18 '23 14:04 mbreuss

I've not checked, but does the MLP follow a concatenation of sin-cos values?

I think using the c_in scaling does make sense: $x_t$ is going to be very large at large $t$, so scaling it should keep the variance at a nicer scale for the NN.
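
To make that concrete, here is a sketch using the plain EDM preconditioning formulas (the repo's boundary-condition variant shifts $c_{skip}$ and $c_{out}$ by $\sigma_{min}$, but $c_{in}$ should be the same; $\sigma_{data} = 0.5$ is the usual default):

  import torch as th

  def edm_scalings(sigma, sigma_data=0.5):
      # EDM preconditioning (Karras et al. 2022, Table 1).
      c_skip = sigma_data**2 / (sigma**2 + sigma_data**2)
      c_out = sigma * sigma_data / (sigma**2 + sigma_data**2) ** 0.5
      c_in = 1 / (sigma**2 + sigma_data**2) ** 0.5
      return c_skip, c_out, c_in

  # If x_t = x + sigma * z with Var(x) ~ sigma_data^2 and Var(z) = 1, then
  # Var(c_in * x_t) = (sigma_data^2 + sigma^2) / (sigma^2 + sigma_data^2) = 1,
  # so the network input has unit variance at every noise level.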

thorinf avatar Apr 18 '23 14:04 thorinf

Good point, I missed one part: they use a sinusoidal timestep embedding to encode the timestep before the MLP: https://github.com/openai/consistency_models/blob/6d26080c58244555c031dbc63080c0961af74200/cm/nn.py#L119. So this could explain it.
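
That would also explain the 1000. A rough illustration, assuming the standard transformer-style sinusoidal embedding with max_period=10000 (which I believe is what cm/nn.py implements): nearby noise levels map to $0.25\ln\sigma$ values that differ only by hundredths, so the sinusoids barely move and the embeddings are nearly indistinguishable; the factor of 1000 spreads them out:

  import math
  import torch as th

  def timestep_embedding(timesteps, dim, max_period=10000):
      # Standard transformer sinusoidal embedding (sketch).
      half = dim // 2
      freqs = th.exp(-math.log(max_period) * th.arange(half).float() / half)
      args = timesteps[:, None].float() * freqs[None]
      return th.cat([th.cos(args), th.sin(args)], dim=-1)

  sigmas = th.tensor([1.0, 1.1])   # two nearby noise levels
  raw = 0.25 * th.log(sigmas)      # inputs differ by only ~0.024
  scaled = 1000 * raw              # inputs differ by ~24
  for t in (raw, scaled):
      e = timestep_embedding(t, 128)
      print((e[0] - e[1]).abs().max())  # ~0.02 for raw, O(1) for scaled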

mbreuss avatar Apr 18 '23 15:04 mbreuss

Hi @mbreuss @thorinf, thank you for your insights. Were you able to get a response or understand the scaling factor?

wasphulud avatar Aug 01 '24 08:08 wasphulud