Question about Wan 2.2 I2V Training
When running inference with LoRA applied to both the high-noise and low-noise models, the generated video exhibits sudden brightness changes (as if a light were switched off) and becomes extremely dark.
same problem
I am trying to train the Wan2.2-5B model and have changed some of the model's structure, but I ran into a similar problem: the video becomes brighter after the first frame. I really don't know why this happens. Is my training code wrong, or is my method wrong?
I have also been training an architecture-modified Wan2.2-TI2V-5B model and haven't encountered such an issue. Maybe there is a bug in your reference video frame preprocessing stage?
I checked my video processing, and although I added some new steps, the overall data processing is basically the same as the original. Could you explain the details of your data processing? Another question: did you modify the framework code provided by DiffSynth during training, for example by replacing the first frame of the latents in the training_loss function?
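To be concrete, by "replacing the first frame of the latents" I mean something along these lines (illustrative only; this is not the actual DiffSynth code, and the function and variable names are made up):

```python
import torch

def build_i2v_training_latents(noisy_latents: torch.Tensor,
                               clean_latents: torch.Tensor) -> torch.Tensor:
    # noisy_latents / clean_latents: (B, C, T, H, W) in latent space.
    # Keep the first latent frame clean so it acts as the image condition,
    # and only denoise the remaining frames.
    latents = noisy_latents.clone()
    latents[:, :, :1] = clean_latents[:, :, :1]
    return latents
```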
My custom model is trained on conventional real-world videos, so I reused the same reference video frame preprocessing code, which basically normalizes pixel values from [0, 255] to [-1, 1]. Random cropping does not affect the training results either.
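For reference, the normalization I'm describing amounts to roughly the following (a minimal sketch; the cropping/resizing steps are omitted and the function name is illustrative):

```python
import numpy as np
import torch

def preprocess_frame(frame: np.ndarray) -> torch.Tensor:
    # frame: H x W x 3 uint8 array with values in [0, 255]
    tensor = torch.from_numpy(frame).float()   # -> float in [0, 255]
    tensor = tensor / 127.5 - 1.0              # -> [-1, 1]
    return tensor.permute(2, 0, 1)             # -> C x H x W for the VAE
```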
I did modify the training code and perform multiple embedding fusions with both the noisy latent and the patchified tokens, but I did not encounter the weird visual artifact you describe, although I believe certain modifications to the noisy latent could indeed contribute to such a problem.
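Roughly, the kind of fusion I mean looks like this (a simplified sketch with made-up module names, not my exact architecture):

```python
import torch
import torch.nn as nn

class LatentEmbeddingFusion(nn.Module):
    """Projects a conditioning embedding and adds it to the noisy latent
    and again to the patchified token sequence."""
    def __init__(self, cond_dim: int, latent_channels: int, token_dim: int):
        super().__init__()
        self.to_latent = nn.Linear(cond_dim, latent_channels)
        self.to_tokens = nn.Linear(cond_dim, token_dim)

    def forward(self, noisy_latent, tokens, cond_emb):
        # noisy_latent: (B, C, T, H, W); tokens: (B, N, D); cond_emb: (B, cond_dim)
        lat_bias = self.to_latent(cond_emb)[:, :, None, None, None]
        tok_bias = self.to_tokens(cond_emb)[:, None, :]
        return noisy_latent + lat_bias, tokens + tok_bias
```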
If interested, you may contact me via email for an in-depth discussion.
@LazySheeeeeep Same problem here, have you solved this?