Questions about the code and some basic Stable Diffusion knowledge
Hi, thanks for your patience.
I want to know the difference between the two functions get_velocity and get_approximated_x0 in scheduler/ddpm_scheduler.py, compared with the v_prediction type in the step function at line 356. They all seem to predict the denoised latents, so why are they called in different ways?
Another question: since we can get pred_original_sample at every step in the pipeline, why don't we take it directly instead of keeping the step-by-step denoising?
This may be irrelevant to the paper, but I was truly confused and couldn't find a proper answer. It would be great if you could explain it. Thanks!
Hi,
I want to know the difference between the two functions get_velocity and get_approximated_x0 in scheduler/ddpm_scheduler.py, compared with the v_prediction type in the step function at line 356. They all seem to predict the denoised latents, so why are they called in different ways?
get_approximated_x0 does exactly the same thing as the v_prediction branch at line 410; a separate function was implemented to avoid calling the scheduler's step function.
The implementation of get_velocity is very similar to get_approximated_x0 but has a different purpose. Indeed, you can see that the equation velocity = sqrt_alpha_prod * noise - sqrt_one_minus_alpha_prod * sample is different from pred_original_sample = (alpha_prod_t**0.5) * sample - (beta_prod_t**0.5) * model_output (i.e., the coefficients of sample and noise are inverted).
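As a sanity check that the two "inverted" formulas are consistent, here is a minimal NumPy sketch (not the repo's actual code; the value of `a` is a made-up alpha-bar at some timestep). Substituting the standard DDPM forward process shows that the step-function formula recovers x0 exactly from the velocity target:

```python
import numpy as np

# Toy sanity check (not the repo's code). With the DDPM forward process
#   sample = sqrt(a) * x0 + sqrt(1 - a) * noise        (a = alpha_bar_t)
# the get_velocity-style training target
#   velocity = sqrt(a) * noise - sqrt(1 - a) * x0
# and the step()-style "inverted" formula
#   pred_x0 = sqrt(a) * sample - sqrt(1 - a) * velocity
# recover x0 exactly: the cross terms in noise cancel and the x0 terms
# sum to (a + 1 - a) * x0 = x0.
rng = np.random.default_rng(0)
x0 = rng.standard_normal(4)      # hypothetical clean latent
noise = rng.standard_normal(4)
a = 0.7                          # toy alpha_bar at some timestep t

sample = np.sqrt(a) * x0 + np.sqrt(1 - a) * noise
velocity = np.sqrt(a) * noise - np.sqrt(1 - a) * x0
pred_x0 = np.sqrt(a) * sample - np.sqrt(1 - a) * velocity

assert np.allclose(pred_x0, x0)
```

So the swapped coefficients are not a contradiction: one equation defines the training target, the other solves for x0 given that target.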
Another question: since we can get pred_original_sample at every step in the pipeline, why don't we take it directly instead of keeping the step-by-step denoising?
Because pred_original_sample is just an approximation of the final latent. It is computed by combining the noise predicted by the UNet and the noisy latent that is progressively refined. When t is large (e.g., 900), the latent is still very noisy and the noise predicted by the UNet may contain errors. As a consequence, the approximation of x0 is not good enough to be the final output (refer to Figure 3 of the paper). By adopting a step-by-step denoising, the current latent becomes progressively better (i.e., with less noise) and the noise predicted by the UNet is more accurate. In addition, this step-by-step denoising allows us to exploit the bidirectional strategy proposed to ensure temporal consistency.
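To illustrate the idea, here is a toy 1-D sketch (not the repo's pipeline, and the "UNet" is a hypothetical stand-in): a scalar latent is denoised step by step, the noise predictor is given an error that shrinks as t decreases, and the x0 estimate formed at the noisiest timestep is much worse than the one formed at the end:

```python
import numpy as np

# Toy 1-D illustration (not the repo's pipeline). At each step the scheduler
# forms a fresh estimate of x0 (pred_original_sample) but only uses it to
# build the next, less noisy sample. A stand-in "UNet" returns the true noise
# plus an error proportional to t, mimicking the fact that predictions are
# less reliable at very noisy timesteps.
rng = np.random.default_rng(0)
x0_true = 1.0                                   # hypothetical clean latent
alphas_cumprod = np.linspace(0.999, 0.01, 10)   # toy alpha_bar schedule
T = len(alphas_cumprod)

a_T = alphas_cumprod[-1]                        # start at the noisiest timestep
sample = np.sqrt(a_T) * x0_true + np.sqrt(1 - a_T) * rng.standard_normal()

errors = []
for t in reversed(range(T)):
    a = alphas_cumprod[t]
    eps_true = (sample - np.sqrt(a) * x0_true) / np.sqrt(1 - a)
    eps_pred = eps_true + 0.3 * (t / T) * rng.standard_normal()  # imperfect UNet
    pred_original_sample = (sample - np.sqrt(1 - a) * eps_pred) / np.sqrt(a)
    errors.append(abs(pred_original_sample - x0_true))
    if t > 0:  # deterministic (DDIM-style) update toward the next timestep
        a_prev = alphas_cumprod[t - 1]
        sample = np.sqrt(a_prev) * pred_original_sample + np.sqrt(1 - a_prev) * eps_pred

print(errors[0], errors[-1])  # first estimate vs. final estimate of x0
```

The first entry of `errors` (the x0 approximation at large t) is far from the true latent, while the final one is essentially exact, which is why the pipeline keeps refining rather than returning the first pred_original_sample.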
Many thanks for your reply! I'm so excited I could cry~
- So can I just treat `v_prediction` as meaning the prediction target is a denoised latent, and `epsilon` as meaning the target is noise?
- I compared `get_velocity` and `get_approximated_x0`. You said `get_approximated_x0` is the same as the `v_prediction` type. In train.py line 1004, `v_prediction` corresponds to the `get_velocity` function, but in the step function of the scheduler at line 410, `v_prediction` uses an inverted formula. Why are their returns inverted? For `get_approximated_x0`, I understand the return value is a noised latent minus an estimated noise, giving a 'clean' latent $\widetilde{x_0}$. But for `get_velocity`, I can't intuitively grasp the meaning of a noise minus a latent.
Well, let me make Question 2 clearer: why does the same `v_prediction` type correspond to two inverted formulas, one in train.py and one in the step function?
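For reference, the two formulas fit together once the forward-process definition is substituted in; this uses only the equations already quoted above, written with $\bar\alpha_t$ for the cumulative alpha product:

$$x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon, \qquad v = \sqrt{\bar\alpha_t}\,\epsilon - \sqrt{1-\bar\alpha_t}\,x_0$$

$$\sqrt{\bar\alpha_t}\,x_t - \sqrt{1-\bar\alpha_t}\,v = \bar\alpha_t x_0 + \sqrt{\bar\alpha_t(1-\bar\alpha_t)}\,\epsilon - \sqrt{\bar\alpha_t(1-\bar\alpha_t)}\,\epsilon + (1-\bar\alpha_t)\,x_0 = x_0$$

In words: training teaches the UNet to output $v$, and the step function's "inverted" formula is simply these two equations solved for $x_0$.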