Question regarding the noise scheduler and training objectives

Open yccyenchicheng opened this issue 4 months ago • 1 comments

Thank you for open-sourcing the code for Wan video! The quality is truly amazing. I have had a lot of fun using the model to generate all kinds of videos, and they are all high quality!

I have one question about training the model, specifically the noise schedule. I read the technical report, which states that Wan is trained with the rectified flow objective:

$x_t = t x_1 + (1-t) x_0$

Thus, differentiating with respect to $t$, the ground-truth velocity is $v_t = \frac{dx_t}{dt} = x_1 - x_0$, and the model's objective is to predict this velocity given the context, the timestep, and $x_t$.

But when I tried to train the TI2V-5B model, I found that the FlowMatchScheduler has a different implementation, for instance in add_noise and training_target here: https://github.com/modelscope/DiffSynth-Studio/blob/main/diffsynth/schedulers/flow_match.py#L94-L105

So I am wondering: is this the same scheduler that was used to train the model released in the Wan 2.2 repo?

Thank you so much!

Oct 28 '25 09:10 yccyenchicheng

@yccyenchicheng

$x_1$ is the noise, and $x_0$ is the ground-truth VAE embedding.

In training_target, we have target = noise - sample, corresponding to $x_1-x_0$.

In add_noise, we have sample = (1 - sigma) * original_samples + sigma * noise, corresponding to $x_t=(1-t)x_0+t x_1$, with sigma playing the role of $t$.

This is a general setting for all models (FLUX, Wan and Qwen-Image).
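The correspondence above can be checked numerically with a minimal sketch. It assumes plain Python lists in place of tensors. The function names mirror add_noise and training_target, but the bodies are simplified stand-ins written for illustration, not the DiffSynth-Studio code; x0, x1, and t are hypothetical example values.

```python
import random


def add_noise(original_samples, noise, sigma):
    """Interpolate x_t = (1 - sigma) * x_0 + sigma * x_1, elementwise."""
    return [(1.0 - sigma) * x0 + sigma * x1
            for x0, x1 in zip(original_samples, noise)]


def training_target(sample, noise):
    """Velocity target v = x_1 - x_0 (noise minus clean sample)."""
    return [x1 - x0 for x0, x1 in zip(sample, noise)]


rng = random.Random(0)
x0 = [rng.gauss(0.0, 1.0) for _ in range(4)]  # stand-in for a VAE embedding
x1 = [rng.gauss(0.0, 1.0) for _ in range(4)]  # Gaussian noise
t = 0.3                                       # sigma plays the role of t

x_t = add_noise(x0, x1, t)
v = training_target(x0, x1)

# The target should equal d(x_t)/dt; compare against a finite difference.
eps = 1e-6
x_t_eps = add_noise(x0, x1, t + eps)
finite_diff = [(a - b) / eps for a, b in zip(x_t_eps, x_t)]
max_err = max(abs(f - vi) for f, vi in zip(finite_diff, v))
print("max |finite difference - target| =", max_err)
```

At sigma = 0 the interpolation returns the clean sample and at sigma = 1 it returns pure noise, which is the endpoint convention stated at the start of this reply.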

Oct 30 '25 06:10 Artiprocher