StableVSR

Questions about code and some SD start knowledge

Open stillbetter opened this issue 1 year ago • 4 comments

Hi, thanks for your patience.

I want to know the difference between the two functions get_velocity and get_approximated_x0 in scheduler/ddpm_scheduler.py, compared with the v_prediction type in the step function at line 356. They all seem to predict the denoised latents, so why are they called in different ways?

stillbetter avatar Dec 09 '24 11:12 stillbetter

Another question: since we can get pred_original_sample at every step in the pipeline, why don't we take it directly, instead of keeping the step-by-step denoising?

This may be irrelevant to the paper, but I am truly confused and can't find a proper answer. It would be great if you could explain it. Thanks!

stillbetter avatar Dec 09 '24 12:12 stillbetter

Hi,

I want to know the difference between the two functions get_velocity and get_approximated_x0 in scheduler/ddpm_scheduler.py, compared with the v_prediction type in the step function at line 356. They all seem to predict the denoised latents, so why are they called in different ways?

get_approximated_x0 does exactly the same thing as the v_prediction branch at line 410. It was implemented as a separate function to avoid calling the scheduler's step function.

The implementation of get_velocity is very similar to get_approximated_x0 but serves a different purpose. Indeed, you can see that the equation velocity = sqrt_alpha_prod * noise - sqrt_one_minus_alpha_prod * sample differs from pred_original_sample = (alpha_prod_t**0.5) * sample - (beta_prod_t**0.5) * model_output (i.e., the roles of the sample and noise terms are swapped).
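A quick numerical check (a toy sketch independent of the repo's code; note that in diffusers' scheduler API, "sample" in get_velocity refers to the clean latent, while "sample" in step refers to the noisy latent) shows that the two formulas are algebraically consistent: applying the step-function formula to a perfect velocity prediction recovers the clean latent exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
a = 0.7                                  # example alpha_bar_t
sa, s1ma = a ** 0.5, (1.0 - a) ** 0.5

x0 = rng.normal(size=4)                  # clean latent
eps = rng.normal(size=4)                 # Gaussian noise

# Forward diffusion: the noisy latent at timestep t
x_t = sa * x0 + s1ma * eps

# get_velocity-style target: here the "sample" term is the CLEAN latent x0
v = sa * eps - s1ma * x0

# v_prediction branch of step(): here the "sample" term is the NOISY latent
# x_t and "model_output" is the UNet's v estimate (exact here, for the check)
x0_rec = sa * x_t - s1ma * v

assert np.allclose(x0_rec, x0)           # the two formulas are consistent
```

So the "swap" is not a contradiction: one formula defines the training target v from (x0, eps), and the other inverts that definition to recover x0 from (x_t, v).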

Another question: since we can get pred_original_sample at every step in the pipeline, why don't we take it directly, instead of keeping the step-by-step denoising?

Because pred_original_sample is just an approximation of the final latent. It is computed by combining the noise predicted by the UNet with the noisy latent that is progressively refined. When t is large (e.g., 900), the latent is still very noisy and the noise predicted by the UNet may contain errors. As a consequence, the approximation of x0 is not good enough to be the final output (refer to Figure 3 of the paper). With step-by-step denoising, the current latent progressively improves (i.e., contains less noise) and the noise predicted by the UNet becomes more accurate. In addition, step-by-step denoising allows us to exploit the proposed bidirectional strategy that ensures temporal consistency.
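The effect can be illustrated with a toy numpy sketch (this is not StableVSR's actual code; the "UNet" here is a stand-in whose noise estimate is deliberately less accurate at large t). pred_original_sample is re-estimated at every step, but each step only moves the latent partway toward it, so later estimates, made on a cleaner latent, correct earlier inaccurate ones.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 10
alphas_cumprod = np.linspace(0.999, 0.01, T)  # alpha_bar_t, decreasing in t

x0_true = np.array([1.0, -2.0, 0.5])          # the "clean" latent we want
x = rng.normal(size=3)                        # start from pure noise

def predict_noise(x_t, t):
    # Stand-in for the UNet: an imperfect noise estimate whose error
    # shrinks as t decreases (i.e., as the latent gets cleaner).
    a = alphas_cumprod[t]
    true_eps = (x_t - np.sqrt(a) * x0_true) / np.sqrt(1.0 - a)
    return true_eps + 0.3 * ((t + 1) / T) * rng.normal(size=3)

first_estimate = None
for t in range(T - 1, -1, -1):
    a_t = alphas_cumprod[t]
    eps = predict_noise(x, t)
    # Same kind of approximation as pred_original_sample (epsilon form):
    pred_x0 = (x - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
    if first_estimate is None:
        first_estimate = pred_x0          # what "taking it directly" gives
    # Deterministic (DDIM-style) update toward the previous timestep
    a_prev = alphas_cumprod[t - 1] if t > 0 else 1.0
    x = np.sqrt(a_prev) * pred_x0 + np.sqrt(1.0 - a_prev) * eps

print(np.linalg.norm(first_estimate - x0_true))  # error of the first estimate
print(np.linalg.norm(x - x0_true))               # error after the full loop
```

Running this, the error of the first pred_x0 (computed at large t, from a very noisy latent) is far larger than the error after iterating all the way down, which mirrors the Figure 3 observation.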

claudiom4sir avatar Dec 09 '24 14:12 claudiom4sir

Thanks so much for your reply! I'm so excited I could cry~

  1. So can I treat v_prediction as meaning the prediction target is the denoised latent, and epsilon as meaning the target is the noise?
  2. I compared get_velocity and get_approximated_x0. get_approximated_x0 is the same as the v_prediction branch, and in train.py line 1004, v_prediction corresponds to the get_velocity function. But in the step function of the scheduler at line 410, v_prediction uses the inverted formula. Why are their return values inverted? For get_approximated_x0, I understand that the return value is the noisy latent minus an estimated noise, so we get a 'clean' latent $\widetilde{x}_0$. But for get_velocity, I can't intuitively grasp the meaning of a noise minus a latent.

stillbetter avatar Dec 10 '24 12:12 stillbetter

Well, let me restate Question 2 more clearly: why does the same v_prediction type correspond to two inverted formulas, one in the training script and one in the step function?

stillbetter avatar Dec 10 '24 12:12 stillbetter