CogVideo Summary of CogVideoX-5B-I2V-v1.5 inference and fine-tuning about `vae_scaling_factor_image` and vertical video

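Dear model researchers,

Hello! I ran into some problems while using the CogVideoX-5B-I2V-v1.5 model. After searching the issues in this repository and in related repositories, I found some preliminary solutions, but after summarizing them I still have the following questions and hope they can be answered.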

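Differences between the SAT model and the diffusers model
- Question 1: Is this solution correct?
- Related issues
  - https://github.com/THUDM/CogVideo/issues/570
  - https://github.com/a-r-r-o-w/finetrainers/issues/101
  - https://github.com/a-r-r-o-w/finetrainers/issues/110
- Symptom: The SAT model and the diffusers model produce different results. With the diffusers model, the frames after the first one become slightly grayer and blurrier.
- Cause: The official version 1.5 I2V diffusers model was trained without multiplying the image latents by the vae_scaling_factor_image coefficient.
- Solution: Manually modify the source code in diffusers/pipelines/cogvideo/pipeline_cogvideox_image2video.py so that the `else` branch no longer divides by the coefficient: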
```
if not self.vae.config.invert_scale_latents:
    image_latents = self.vae_scaling_factor_image * image_latents
else:
    # image_latents = 1 / self.vae_scaling_factor_image * image_latents
    image_latents = 1.0 * image_latents
```
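To make the three scaling behaviours concrete, here is a minimal sketch in plain Python. `scale_image_latents` and its `patched` flag are hypothetical names used only for illustration and are not part of diffusers; plain floats stand in for the latent tensor, and the factor 0.5 is a made-up value chosen so the arithmetic is exact.

```python
def scale_image_latents(image_latents, scaling_factor, invert_scale_latents, patched=False):
    """Scale image latents following the branches discussed above.

    Hypothetical helper: a list of floats stands in for the latent tensor.
    """
    if not invert_scale_latents:
        # Default branch: multiply by the scaling factor.
        return [scaling_factor * x for x in image_latents]
    if patched:
        # Proposed patch: leave the image latents unscaled.
        return [1.0 * x for x in image_latents]
    # Unpatched branch: divide by the scaling factor.
    return [1 / scaling_factor * x for x in image_latents]


latents = [1.0, -2.0]
factor = 0.5  # made-up value, not the real vae_scaling_factor_image

multiplied = scale_image_latents(latents, factor, invert_scale_latents=False)
divided = scale_image_latents(latents, factor, invert_scale_latents=True)
unscaled = scale_image_latents(latents, factor, invert_scale_latents=True, patched=True)
print(multiplied, divided, unscaled)
```

With these made-up inputs, the unpatched `else` branch makes the image latents larger than the patched one by a factor of 1 / 0.5.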
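CogVideoX 1.5 diffusers LoRA fine-tuning
- Question 2: If the fix above is correct, how should LoRA fine-tuning and inference be done?
  - Option 1
    - For LoRA training, manually remove the code in the original LoRA training script that multiplies the image latents by self.vae_scaling_factor_image.
    - For inference with the fine-tuned LoRA, use 1.0 * image_latents.
  - Option 2
    - For LoRA training, leave the original training code unchanged, keeping the self.vae_scaling_factor_image multiplication on the image latents.
    - For inference with the fine-tuned LoRA, leave the original pipeline_cogvideox_image2video.py unchanged, so that it uses 1 / self.vae_scaling_factor_image * image_latents.
- Background: Every LoRA fine-tuning script I looked at multiplies the image latents by vae_scaling_factor, so the coefficient was not forgotten there. That would explain why inference later divides by this coefficient, which would make the effective coefficient during fine-tuning 1.0 as well. (Was it only in pretraining that the official team did not multiply by the coefficient?)
- Reference code
  - https://github.com/Passenger12138/CogVideoX-5B-I2V-v1.5-lora-train/blob/e1d204961b62debcfa7513f0452c3a88d815bea7/finetune/train_cogvideox_image_to_video_lora.py#L1388-L1389 The image latents are multiplied by vae_scaling_factor here.
  - https://github.com/THUDM/CogVideo/blob/5ab1e2449ffc8887ffad3ca3b9efd22ad7e356f7/finetune/models/cogvideox_i2v/lora_trainer.py#L143 The image latents are multiplied by vae_scaling_factor here.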
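The two options can be compared by looking only at the multiplier applied to the image latents during training and during inference. The sketch below does just that, as I understand the options; `image_latent_multipliers` is a hypothetical helper and 0.5 is a made-up factor. It says nothing about which scale the pretrained weights expect, which is the open question.

```python
def image_latent_multipliers(option, scaling_factor):
    """Return (train_multiplier, inference_multiplier) for the image latents.

    Hypothetical helper mirroring Option 1 and Option 2 above.
    """
    if option == 1:
        # Option 1: multiplication removed in training, 1.0 * image_latents in inference.
        return 1.0, 1.0
    if option == 2:
        # Option 2: training multiplies by the factor, the stock pipeline divides by it.
        return scaling_factor, 1 / scaling_factor
    raise ValueError("option must be 1 or 2")


factor = 0.5  # made-up value, not the real vae_scaling_factor_image
for option in (1, 2):
    train, infer = image_latent_multipliers(option, factor)
    print(f"option {option}: train x{train}, inference x{infer}, ratio {infer / train}")
```

Under these assumptions, Option 1 applies the same multiplier in both phases, while Option 2 applies different multipliers whenever the factor is not 1.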
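CogVideoX 1.5 I2V cannot run inference on vertical videos
- Question 3: Is this solution correct?
- Related issues
  - https://github.com/THUDM/CogVideo/issues/194#issuecomment-2485116700 (Nov 19, 2024: a user asks for vertical-video inference support)
  - https://github.com/THUDM/CogVideo/issues/521 (Nov 20, 2024: says it is supported, but it is unclear which commit added the support)
  - https://github.com/THUDM/CogVideo/issues/486 (Dec 19, 2024: states that CogVideoX1.5-I2V supports vertical video)
  - https://github.com/THUDM/CogVideo/issues/758 (last week: says the resolution is free and portrait orientation works as long as the width is at least 768; however, I tried three configurations with a width of at least 768 and none of them could run inference)
- Symptom: Vertical videos cannot be generated (width 480 x height 720 does run, but the result is poor). Other aspect ratios fail, for example:
  - --width 768 --height 1360: fails with the same kind of error, RuntimeError: Sizes of tensors must match except in dimension 3. Expected size 85 but got size 48 for tensor number 1 in the list.
  - --width 768 --height 1080: fails with the same kind of error, RuntimeError: Sizes of tensors must match except in dimension 3. Expected size 67 but got size 48 for tensor number 1 in the list.
  - --width 768 --height 960: fails with the same kind of error, RuntimeError: Sizes of tensors must match except in dimension 3. Expected size 60 but got size 48 for tensor number 1 in the list.
- Cause: The RoPE (rotary position embedding) logic assumes that sample_width is greater than sample_height; they are set to 170 and 96, respectively.
- Solution: Modify the transformer configuration at CogVideoX1.5-5B-I2V/transformer/config.json. To generate vertical videos, set: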
```
{
  ...,
  "sample_height": 170,
  "sample_width": 96,
  ...,
}
```
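The sizes in the error messages can be reproduced with simple arithmetic. The sketch below assumes that the VAE downsamples each spatial dimension by 8 and that the transformer uses a patch size of 2; both values are assumptions, chosen because they are consistent with the numbers reported above. `requested_grid`, `config_grid`, and `swap_for_portrait` are hypothetical helpers, not functions from diffusers.

```python
VAE_SPATIAL_FACTOR = 8  # assumed spatial downsampling of the VAE
PATCH_SIZE = 2          # assumed transformer patch size


def requested_grid(height, width):
    """Patch grid (rows, cols) implied by the requested pixel resolution."""
    step = VAE_SPATIAL_FACTOR * PATCH_SIZE
    return height // step, width // step


def config_grid(config):
    """Patch grid (rows, cols) implied by sample_height / sample_width."""
    return config["sample_height"] // PATCH_SIZE, config["sample_width"] // PATCH_SIZE


def swap_for_portrait(config):
    """Return a copy of the config with sample_height and sample_width swapped."""
    swapped = dict(config)
    swapped["sample_height"] = config["sample_width"]
    swapped["sample_width"] = config["sample_height"]
    return swapped


landscape = {"sample_height": 96, "sample_width": 170}
portrait = swap_for_portrait(landscape)

for height in (1360, 1080, 960):
    print(height, requested_grid(height, 768), config_grid(landscape), config_grid(portrait))
```

With the default values, the row count derived from the config (48) is smaller than the requested one (85, 67, or 60), which matches the "Expected size ... but got size 48" messages. After the swap, the row count derived from the config becomes 85.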

Apr 13 '25 06:04 JunyaoHu

For the first question, it seems that for the I2V model the input image condition should not be multiplied by the scale. So during training the video latents should be multiplied by the scale, but the image condition should not. I'm not fully sure, and I will try it.

Apr 24 '25 02:04 OwalnutO

Any news regarding this?

Jul 18 '25 15:07 tomresan
