The performance of the cogvideo1.5-I2V model shows a significant difference between SAT and Diffusers.
System Info
Official environment.
Information
- [X] The official example scripts
- [ ] My own modified scripts
Reproduction
Official script.
Expected behavior
I tested the cogvideo1.5 model using the SAT and Diffusers methods on some anime images with the official script. However, I noticed a significant difference in their results: the SAT model performed well, while the Diffusers model produced very poor results.
Could you help me understand why this happens?
The first row is SAT:
https://github.com/user-attachments/assets/559f8c18-070c-4f3f-8f0f-a51d5586d5e0
https://github.com/user-attachments/assets/62e4ee79-5ba7-4267-a94e-11e401002547
The second row is Diffusers.
https://github.com/user-attachments/assets/bc97c7c3-b4d2-4365-a7fe-7aa34589914a
https://github.com/user-attachments/assets/fd410e6d-cc54-466e-9ede-741d6eb51b92
I have the same question: 1.5 sometimes performs worse than 1.0, and 1.5 produces distorted images very often.
Which method do you use, the SAT model or the Diffusers model?
The Diffusers model, in ComfyUI.
OK, I hope the project owner can provide some advice promptly.
I encountered the same issue: the first frame was too bright and the later frames were blurry, giving poor temporal consistency. I found that the problem is in https://github.com/huggingface/diffusers/blob/8eb73c872afbe59abab4580aaa591a9851a42e6d/src/diffusers/pipelines/cogvideo/pipeline_cogvideox_image2video.py#L385C9-L390C78. During the training of CogVideo1.5, image_latents were not multiplied by self.vae_scaling_factor_image, but during inference they are divided by self.vae_scaling_factor_image. The correct code should be:
```python
if not self.vae.config.invert_scale_latents:
    image_latents = self.vae_scaling_factor_image * image_latents
else:
    # This is awkward but required because the CogVideoX team forgot to multiply the
    # scaling factor during training :)
    image_latents = 1.0 * image_latents
```
After making this change, CogVideo1.5 started generating videos correctly.
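For reference, a quick way to confirm which branch your checkpoint takes is to inspect the VAE config flag that the pipeline checks. This is only a sketch: the model id below is an assumption, so substitute the 1.5 I2V checkpoint or local path you are actually using.

```python
import torch
from diffusers import CogVideoXImageToVideoPipeline

# Assumed checkpoint id; replace with your own CogVideoX 1.5 I2V checkpoint or local path.
pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX1.5-5B-I2V", torch_dtype=torch.bfloat16
)

# The I2V pipeline branches on this VAE config flag when preparing image_latents;
# when it is True, the unpatched code divides by vae_scaling_factor_image as discussed above.
print(getattr(pipe.vae.config, "invert_scale_latents", False))
```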
So is there going to be a corrected version of 1.5?
It is an amazing find; I will try it. Thanks for your reply.
Great finding! I had the same issue running CogVideo1.5 with Diffusers: the output videos were distorted, wobbly, and often had no motion at all.
So to summarize, am I understanding it correctly?
For CogvideoX-1.5-T2V: nothing needs to be changed.
For CogvideoX-1.5-I2V:
- In the fine-tuning script, to be consistent with how the CogVideoX team trained the model, image_latents should not be scaled at all during training. Note that video_latents should still be scaled.
- In the pipeline (inference) script, image_latents should not be scaled at all either.
@Yuancheng-Xu I can confirm that the image_latents should not be scaled in CogvideoX-1.5-I2V.
However, judging from https://github.com/THUDM/CogVideo/blob/main/finetune/models/cogvideox_i2v/lora_trainer.py#L101, it seems that the video_latents should also not be scaled.
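For what it's worth, here is a minimal sketch of the latent-handling convention described in this thread for CogVideoX-1.5 I2V fine-tuning. The function name, argument names, and tensor shapes are illustrative assumptions, not the actual trainer API:

```python
import torch
from diffusers import AutoencoderKLCogVideoX


def encode_latents_cogvideox15_i2v(
    vae: AutoencoderKLCogVideoX, video: torch.Tensor, first_frame: torch.Tensor
):
    # Illustrative sketch only; names and shapes are assumptions, not the trainer's API.
    # Conditioning image latents: left unscaled, consistent with how 1.5 was trained
    # and with the pipeline fix discussed earlier in this thread.
    image_latents = vae.encode(first_frame).latent_dist.sample()
    # Video latents: the linked lora_trainer.py suggests these are also left unscaled
    # for 1.5, unlike the 1.0 recipe, which multiplies by the VAE scaling factor.
    video_latents = vae.encode(video).latent_dist.sample()
    return image_latents, video_latents
```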