Inconsistent loss term with paper

[Open] quantumiracle opened this issue 2 years ago · 8 comments

Hi,

Thanks for open-sourcing this wonderful project!

However, I notice that in CT training, the loss term has the target model denoising $x_{t_{n+1}}$ instead of $x_{t_n}$, which differs from the loss stated in Algorithm 3 (CT) of the paper, where the target model denoises $x_{t_n}$. Did I miss something, or does this deviation not matter?
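
For reference, Algorithm 3 (CT) in the paper computes the target at $t_n$ (restating the paper's loss here; $\theta^-$ denotes the EMA target parameters):

$$\mathcal{L}(\theta, \theta^-) = \lambda(t_n)\, d\big(f_\theta(x + t_{n+1} z,\, t_{n+1}),\ f_{\theta^-}(x + t_n z,\, t_n)\big), \quad z \sim \mathcal{N}(0, I)$$

whereas the code appears to feed $x_{t_{n+1}}$ to the target model as well.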

quantumiracle avatar Apr 16 '23 19:04 quantumiracle

There are some differences between the paper and the code in general. For example, the rho scheduling is reversed in the paper; the formula used in the code is closer to EDM.
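
For context, the schedule in the code looks roughly like this (a sketch based on get_sigmas_karras in cm/karras_diffusion.py; details such as the appended final zero are omitted). It runs from sigma_max down to sigma_min, following EDM, while the paper writes the $t_i$ increasing from $\epsilon$ to $T$:

  import torch as th

  def get_sigmas_karras(n, sigma_min, sigma_max, rho=7.0):
      # Interpolate linearly in sigma^(1/rho) space, then raise back to
      # the rho-th power; note the schedule is decreasing, i.e. the
      # reverse of the increasing t_i sequence in the paper.
      ramp = th.linspace(0, 1, n)
      min_inv_rho = sigma_min ** (1 / rho)
      max_inv_rho = sigma_max ** (1 / rho)
      return (max_inv_rho + ramp * (min_inv_rho - max_inv_rho)) ** rho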

There are other differences too; for example, the method for adding noise when computing the distiller_target is slightly different.

It's hard to know whether the paper or the code is the better approach. I'm inclined to think the code is more up to date, but I'm basing that only on the repo's release date being after the paper's publication date.

thorinf avatar Apr 18 '23 13:04 thorinf

Did you guys understand the preconditioning of the time signal in the denoising method of the Karras diffusion class? I also cannot find this equation in the paper:


  rescaled_t = 1000 * 0.25 * th.log(sigmas + 1e-44)
  model_output = model(c_in * x_t, rescaled_t, **model_kwargs)
  denoised = c_out * model_output + c_skip * x_t

from the denoising method: https://github.com/openai/consistency_models/blob/main/cm/karras_diffusion.py#L346

Without the 1000 it is the same noise conditioning used in the EDM preconditioning ($c_{noise}(\sigma) = \tfrac{1}{4}\ln\sigma$; the 1e-44 just guards against $\log 0$), but I don't understand the additional factor of 1000.

mbreuss avatar Apr 18 '23 14:04 mbreuss

There are quite a few differences. I've raised an additional ticket and emailed Yang Song to hopefully find out which is better.

https://github.com/openai/consistency_models/issues/18

I would need to check, but the scaling might be due to how the temporal embedding is computed in the model. The EDM paper might be using a method that works well with small floats, whereas something like a sinusoidal PE would prefer larger values.

thorinf avatar Apr 18 '23 14:04 thorinf

Thanks for the info, that's good to hear! Let us know when you hear something.

They use an MLP to encode the timestep in the UNet, so large values should not be better. But maybe I am missing something there. Also, the general preconditioning of the noise is not mentioned in the paper at all.

mbreuss avatar Apr 18 '23 14:04 mbreuss

I've not checked, but does the MLP follow a concatenation of sin-cos values?

I think using the c_in scaling does make sense: $x_t$ is going to be very large at large $t$, so scaling it should keep the variance at a nicer scale for the NN.
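
To make that concrete, here is a sketch using the plain EDM preconditioning formulas (the repo's boundary-condition variant shifts $c_{skip}$ and $c_{out}$ by $\sigma_{min}$, but $c_{in}$ should be the same; $\sigma_{data} = 0.5$ is the usual default):

  import torch as th

  def edm_scalings(sigma, sigma_data=0.5):
      # EDM preconditioning (Karras et al. 2022, Table 1).
      c_skip = sigma_data**2 / (sigma**2 + sigma_data**2)
      c_out = sigma * sigma_data / (sigma**2 + sigma_data**2) ** 0.5
      c_in = 1 / (sigma**2 + sigma_data**2) ** 0.5
      return c_skip, c_out, c_in

  # If x_t = x + sigma * z with Var(x) ~ sigma_data^2 and Var(z) = 1, then
  # Var(c_in * x_t) = (sigma_data^2 + sigma^2) / (sigma^2 + sigma_data^2) = 1,
  # so the network input has unit variance at every noise level.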

thorinf avatar Apr 18 '23 14:04 thorinf

Good point, I missed one part: they use a sinusoidal timestep embedding to encode the timestep before the MLP: https://github.com/openai/consistency_models/blob/6d26080c58244555c031dbc63080c0961af74200/cm/nn.py#L119. So this could explain it.
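
That would also explain the 1000. A rough illustration, assuming the standard transformer-style sinusoidal embedding with max_period=10000 (which I believe is what cm/nn.py implements): nearby noise levels map to $0.25\ln\sigma$ values that differ only by hundredths, so the sinusoids barely move and the embeddings are nearly indistinguishable; the factor of 1000 spreads them out:

  import math
  import torch as th

  def timestep_embedding(timesteps, dim, max_period=10000):
      # Standard transformer sinusoidal embedding (sketch).
      half = dim // 2
      freqs = th.exp(-math.log(max_period) * th.arange(half).float() / half)
      args = timesteps[:, None].float() * freqs[None]
      return th.cat([th.cos(args), th.sin(args)], dim=-1)

  sigmas = th.tensor([1.0, 1.1])   # two nearby noise levels
  raw = 0.25 * th.log(sigmas)      # inputs differ by only ~0.024
  scaled = 1000 * raw              # inputs differ by ~24
  for t in (raw, scaled):
      e = timestep_embedding(t, 128)
      print((e[0] - e[1]).abs().max())  # ~0.02 for raw, O(1) for scaled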

mbreuss avatar Apr 18 '23 15:04 mbreuss

Hi @mbreuss @thorinf, thank you for your insights. Were you able to get a response or understand the scaling factor?

wasphulud avatar Aug 01 '24 08:08 wasphulud