
[Discussion] How to understand Consistency Training (CT) in isolation

Open cantabile-kwok opened this issue 3 years ago • 4 comments

Thanks for the brilliant work! I am reading this legendary paper and have a question that I want to discuss here.

The paper starts by introducing a new method to distill knowledge from a trained score-based model. From Eq. (7) one can easily see that the function $\boldsymbol f_\theta$ maps two different points to the same result. These two points inherently lie on the same ODE trajectory, since $\hat{\mathbf x}_{t_n}^{\phi}$ is obtained from $\mathbf x_{t_{n+1}}$ by one step of the score-based ODE solver. In this way, the function learns to map all points on an ODE trajectory to the same point, so it can generate data in one direct step from the initial point (Gaussian noise). So far this makes perfect sense to me.
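To make the "one step of the score-based ODE" concrete, here is a minimal sketch of that Euler step for the probability-flow ODE $\mathrm d\mathbf x/\mathrm dt = -t\,\nabla\log p_t(\mathbf x)$. The score model is a toy stand-in (standard-Gaussian data under $\mathbf x_t = \mathbf x + t\mathbf z$, where the score is known in closed form), not the paper's actual network:

```python
import numpy as np

# Toy stand-in for a pretrained score model s_phi(x, t). For standard-Gaussian
# data under the VE perturbation x_t = x + t*z we have p_t = N(0, 1 + t^2),
# so the exact score is -x / (1 + t^2). A real s_phi would be a neural net.
def score_phi(x, t):
    return -x / (1.0 + t ** 2)

def one_euler_step(x_next, t_next, t_n):
    """One Euler step of the PF ODE dx/dt = -t * score(x, t), moving
    x_{t_{n+1}} back to an estimate x_hat_{t_n}^phi on the same trajectory,
    as in Eq. (7)'s construction of the second point."""
    return x_next + (t_n - t_next) * (-t_next) * score_phi(x_next, t_next)

x_next = np.array([2.0, -1.0])       # a sample at noise level t_{n+1} = 1.0
x_hat = one_euler_step(x_next, t_next=1.0, t_n=0.8)
```

In consistency distillation, $f_\theta(\mathbf x_{t_{n+1}}, t_{n+1})$ would then be pulled toward $f_{\theta^-}(\hat{\mathbf x}_{t_n}^{\phi}, t_n)$, tying together two points that genuinely sit on one trajectory.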

The paper then introduces "training consistency models in isolation", where the training objective stays almost the same except that the two points become $\mathbf x + t_{n+1}\mathbf z$ and $\mathbf x + t_n\mathbf z$. These two points obviously cannot lie on the same ODE trajectory; otherwise the trajectory would be a straight line, since both use the same $\mathbf z$. Eq. (9) states the relationship between the loss values of Consistency Distillation and Consistency Training, but this comparison is taken in expectation over all data $\mathbf x$. If my understanding is correct, then when we use Eq. (10) to train a generative model, the function $\boldsymbol f$ is not intended to learn the same thing as in consistency distillation, i.e. to map all points on an ODE trajectory to its starting point. If so, what is Eq. (10) really doing, and is Fig. 2 still valid in this sense?
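For reference, the CT objective being discussed can be sketched as below: the same Gaussian $\mathbf z$ perturbs $\mathbf x$ at two adjacent noise levels, with no ODE solver involved. The time grid, the linear "model", and all names here are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

N = 10
t = np.linspace(0.002, 80.0, N)  # assumed sigma_min..sigma_max grid, EDM-style

def f(theta, x_t, t_n):
    """Toy stand-in for the consistency function f_theta(x_t, t): a linear
    map theta * x_t. A real model is a neural net with the boundary
    condition f(x, t_min) = x."""
    return theta * x_t

def ct_loss_sample(theta, theta_ema, x, rng):
    """One-sample Monte Carlo estimate of the CT objective (Eq. 10)."""
    n = rng.integers(0, N - 1)
    z = rng.standard_normal(x.shape)     # the SAME z at both noise levels
    x_big = x + t[n + 1] * z             # noisier point
    x_small = x + t[n] * z               # less noisy point
    # squared-distance metric d(.,.); the paper also uses LPIPS for images,
    # and theta_ema is the EMA target network's parameters
    return np.mean((f(theta, x_big, t[n + 1]) - f(theta_ema, x_small, t[n])) ** 2)

rng = np.random.default_rng(0)
x = rng.standard_normal(4)
loss = ct_loss_sample(theta=0.5, theta_ema=0.5, x=x, rng=rng)
```

Note that `x_big` and `x_small` share `z`, which is exactly the "straight line through $\mathbf x$" construction the question points out.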

I know this is not directly related to this implementation but I'm looking forward to any hint!

cantabile-kwok avatar Mar 17 '23 17:03 cantabile-kwok

I don't have the answer, but I have a similar observation. Due to the problem you elaborated, CT is not as good as CD theoretically, and in their experiments the results from CT are worse than those from CD too. So the question is: why do we need CT if CD is available? Even if a pre-trained diffusion model is not readily available, the overhead of training a diffusion model first and then running CD, compared to CT from a random start, seems worth it: you get a better 1-step (or few-step) model, plus a diffusion model as a high-performance (but slower) alternative.

zhihongp avatar Mar 22 '23 21:03 zhihongp

Just read this paper and stumbled into this repo. My very personal understanding is that CT does not concern an ODE trajectory during training, and its loss is not directly defined by minimizing the same "consistency loss" of CD as in Eq. (7). Intuitively, Eq. (10) basically tries to remove any noise added to a data sample $\mathbf x$ AND to achieve consistent results when the starting points are sufficiently close (i.e., at $t_{n+1}$ and $t_n$). That being said, since Eq. (9) guarantees $L_{CT}$ should be very close to $L_{CD}$ (note their definitions use the same network $f_\theta$), once CT converges, and assuming you had trained a separate score-based diffusion model and sampled some ODE trajectories with it, $f_\theta$ trained by CT should still be able to achieve the mapping in Fig. 2. However, I myself still don't have a good explanation of why CT seems to perform worse than CD in practice, aside from the fact that CD effectively involves more training (parameters).

Graphi07 avatar Apr 04 '23 07:04 Graphi07

@zhihongp that's a good point. I'm not sure why CT is not as good empirically, but "theoretically" there should be no difference (the two loss terms differ by an $o(\Delta t)$ term that vanishes as $\Delta t \rightarrow 0$). That being said, the paper has shown us (for the first time) that training a single-step probability flow ODE integrator is possible, and tightening the gap is perhaps left as future work.

One possible theory is that, in general, you gain performance simply by self-distillation. I believe more work is needed to demystify why this even works: distilling a model into the same architecture improves performance.

So, given that this phenomenon is real, we can see why CT might underperform CD in the paper. CD apparently has a better (lower-bias, lower-variance) signal of the score: it simply learned the score! So the training is significantly easier than estimating the score via the unbiased estimator:

$$\nabla \log p_t(\mathbf x_t) = -\mathbb{E}\left[\left.\frac{\mathbf x_t - \mathbf x}{t^2}\,\right|\,\mathbf x_t\right]$$

I want to remind you that under the hood we are doing double estimation here: the minibatch gradient itself is just an unbiased estimator of the full-batch gradient, so it is much easier to learn from the score itself.
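The estimator above can be checked numerically on a toy distribution where the score is known in closed form. A sketch, assuming standard-Gaussian data and the VE perturbation $\mathbf x_t = \mathbf x + t\mathbf z$ (so $p_t = \mathcal N(0, 1 + t^2)$); the conditional expectation is approximated by self-normalized importance weighting with $p(\mathbf x_t \mid \mathbf x)$:

```python
import numpy as np

rng = np.random.default_rng(1)
t = 0.7      # noise level (chosen for illustration)
x_t = 1.3    # point at which we estimate the score

# Exact score of p_t = N(0, 1 + t^2) for checking:
exact_score = -x_t / (1.0 + t ** 2)

# Self-normalized estimate of -E[(x_t - x)/t^2 | x_t]:
# sample x ~ p_data, weight by the Gaussian likelihood p(x_t | x) ∝
# exp(-(x_t - x)^2 / (2 t^2)), then average (x_t - x)/t^2 under those weights.
xs = rng.standard_normal(2_000_000)
w = np.exp(-0.5 * ((x_t - xs) / t) ** 2)
est_score = -np.sum(w * (x_t - xs) / t ** 2) / np.sum(w)
```

With enough samples `est_score` lands close to `exact_score`, but the per-sample variance of $(\mathbf x_t - \mathbf x)/t^2$ is exactly the noisy signal being contrasted with CD's pretrained score.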

cloneofsimo avatar Apr 11 '23 15:04 cloneofsimo

Empirically, similar phenomena happen all the time in deep learning, where...

  • Self-supervised pretraining on data A, then using that pretrained model to model the same data A, helps (https://arxiv.org/abs/2209.14389)
  • Overparameterization followed by compression is better than default training (https://arxiv.org/abs/2012.08749)
  • Creating auxiliary labels on the same data, just to use them as an extra signal on that exact same data, improves performance (https://arxiv.org/abs/2104.10858)

cloneofsimo avatar Apr 11 '23 15:04 cloneofsimo