Question regarding the noise scheduler and training objectives
Thank you for open-sourcing the code for Wan video! The quality is truly amazing. I have had a lot of fun using the model to generate all kinds of videos, and they are all high quality.
I have one question about training the model, specifically the noise schedule. I read the technical report, which states that Wan is trained with the rectified flow objective:
$x_t = t x_1 + (1-t) x_0$
Thus the ground-truth velocity is $v_t = x_1 - x_0$, and the model's objective is to predict this velocity given the context, the timestep, and $x_t$.
But when I tried to train the TI2V-5B model, I found that the FlowMatchScheduler has a different implementation. For instance, see add_noise and training_target here: https://github.com/modelscope/DiffSynth-Studio/blob/main/diffsynth/schedulers/flow_match.py#L94-L105
So I am wondering: is this the same scheduler that was used to train the model released in the Wan 2.2 repo?
Thank you so much!
@yccyenchicheng
Here $x_1$ is the noise and $x_0$ is the ground-truth VAE embedding.
In training_target, we have target = noise - sample, corresponding to $x_1-x_0$.
In add_noise, we have sample = (1 - sigma) * original_samples + sigma * noise, corresponding to $x_t=(1-t)x_0+t x_1$, with sigma playing the role of $t$.
This is the general setting used for all models (FLUX, Wan, and Qwen-Image).
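
To make the correspondence concrete, here is a minimal sketch of the two formulas quoted in this reply. The function names mirror add_noise and training_target, but the bodies are paraphrased from the expressions above rather than copied from the repository, and the timestep-to-sigma lookup is left out (sigma is passed in directly, which is an assumption made for illustration). Plain floats stand in for the latent tensors.

```python
def add_noise(original_samples, noise, sigma):
    """x_t = (1 - sigma) * x_0 + sigma * x_1: clean sample at sigma=0, pure noise at sigma=1."""
    return (1 - sigma) * original_samples + sigma * noise


def training_target(sample, noise):
    """Velocity target x_1 - x_0, i.e. the derivative of x_t with respect to sigma."""
    return noise - sample


# Illustrative scalar stand-ins (hypothetical values, not real latents).
x0 = 2.0      # clean VAE embedding
x1 = -1.0     # noise draw
sigma = 0.25  # plays the role of t

x_t = add_noise(x0, x1, sigma)
v = training_target(x0, x1)

# Finite-difference check: d x_t / d sigma should match the training target.
eps = 1e-3
fd = (add_noise(x0, x1, sigma + eps) - add_noise(x0, x1, sigma)) / eps

# Stepping back along the velocity by sigma recovers the clean sample.
x0_recovered = x_t - sigma * v

print(x_t, v, fd, x0_recovered)
```

Because the interpolation is linear in sigma, the target does not depend on sigma at all, which is why training_target only needs the clean sample and the noise. The same arithmetic works unchanged on arrays or tensors, since it only uses elementwise operations.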