The performance of the cogvideo1.5-I2V model shows a significant difference between SAT and Diffusers.
System Info
Official environment.
Information
- [X] The official example scripts
- [ ] My own modified scripts
Reproduction
Official script.
Expected behavior
I tested the cogvideo1.5 model using the SAT and Diffusers methods on some anime images with the official script. However, I noticed a significant difference in their results: the SAT model performed well, while the Diffusers model produced very poor results.
Could you help me understand why this happens?
The first row is SAT:
https://github.com/user-attachments/assets/559f8c18-070c-4f3f-8f0f-a51d5586d5e0
https://github.com/user-attachments/assets/62e4ee79-5ba7-4267-a94e-11e401002547
The second row is Diffusers.
https://github.com/user-attachments/assets/bc97c7c3-b4d2-4365-a7fe-7aa34589914a
https://github.com/user-attachments/assets/fd410e6d-cc54-466e-9ede-741d6eb51b92
I have the same question: 1.5 sometimes performs worse than 1.0, and 1.5 produces distorted images very often.
Which method do you use, the SAT model or the Diffusers model?
The Diffusers model, in ComfyUI.
OK, I hope the project owner can provide some advice promptly.
I encountered the same issue: the first frame was too bright and the later frames were blurry, giving poor temporal consistency. I found that the problem is in https://github.com/huggingface/diffusers/blob/8eb73c872afbe59abab4580aaa591a9851a42e6d/src/diffusers/pipelines/cogvideo/pipeline_cogvideox_image2video.py#L385C9-L390C78. During the training of CogVideo1.5, image_latents were not multiplied by self.vae_scaling_factor_image, but during inference they are divided by self.vae_scaling_factor_image. The correct code should be:
```python
if not self.vae.config.invert_scale_latents:
    image_latents = self.vae_scaling_factor_image * image_latents
else:
    # This is awkward but required because the CogVideoX team forgot to multiply the
    # scaling factor during training :)
    image_latents = 1.0 * image_latents
```
After making this change, CogVideo1.5 started generating videos correctly.
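For reference, a quick way to confirm which branch your checkpoint takes is to inspect the VAE config flag that the pipeline checks. This is only a sketch: the model id below is an assumption, so substitute the 1.5 I2V checkpoint or local path you are actually using.

```python
import torch
from diffusers import CogVideoXImageToVideoPipeline

# Assumed checkpoint id; replace with your own CogVideoX 1.5 I2V checkpoint or local path.
pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX1.5-5B-I2V", torch_dtype=torch.bfloat16
)

# The I2V pipeline branches on this VAE config flag when preparing image_latents;
# when it is True, the unpatched code divides by vae_scaling_factor_image as discussed above.
print(getattr(pipe.vae.config, "invert_scale_latents", False))
```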
So is there going to be a corrected version of 1.5?
It is an amazing find; I will try it. Thanks for your reply.
Great finding! I had the same issue running CogVideo1.5 with Diffusers: the output videos were distorted, wobbly, and often had no motion at all.
So to summarize, am I understanding it correctly?
For CogvideoX-1.5-T2V: nothing needs to be changed.
For CogvideoX-1.5-I2V:
- In the fine-tuning script, to be consistent with how the CogVideoX team trained the model, image_latents should not be scaled at all during training. Note that video_latents should still be scaled.
- In the pipeline (inference) script, image_latents should not be scaled at all either.
@Yuancheng-Xu I can confirm that the image_latents should not be scaled in CogvideoX-1.5-I2V.
However, judging from https://github.com/THUDM/CogVideo/blob/main/finetune/models/cogvideox_i2v/lora_trainer.py#L101, it seems that the video_latents should also not be scaled.
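For what it's worth, here is a minimal sketch of the latent-handling convention described in this thread for CogVideoX-1.5 I2V fine-tuning. The function name, argument names, and tensor shapes are illustrative assumptions, not the actual trainer API:

```python
import torch
from diffusers import AutoencoderKLCogVideoX


def encode_latents_cogvideox15_i2v(
    vae: AutoencoderKLCogVideoX, video: torch.Tensor, first_frame: torch.Tensor
):
    # Illustrative sketch only; names and shapes are assumptions, not the trainer's API.
    # Conditioning image latents: left unscaled, consistent with how 1.5 was trained
    # and with the pipeline fix discussed earlier in this thread.
    image_latents = vae.encode(first_frame).latent_dist.sample()
    # Video latents: the linked lora_trainer.py suggests these are also left unscaled
    # for 1.5, unlike the 1.0 recipe, which multiplies by the VAE scaling factor.
    video_latents = vae.encode(video).latent_dist.sample()
    return image_latents, video_latents
```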