train_cogvideox_image_to_video_lora bug
Thank you for releasing the I2V training code. I noticed an issue in it: at https://github.com/THUDM/CogVideo/blob/main/finetune/train_cogvideox_image_to_video_lora.py#L1279, the image's information is completely unused. Should the line be changed to `noisy_image = torch.randn_like(image) * image_noise_sigma[:, None, None, None, None] + image`?
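For clarity, here is a minimal sketch of what I mean, assuming the line currently computes only the scaled noise term. The tensor shapes and the sigma-sampling parameters below are illustrative placeholders, not values taken from the script:

```python
import torch

# Illustrative shapes only: [batch, frames, channels, height, width].
image = torch.randn(2, 1, 16, 60, 90)

# Per-sample noise-augmentation sigma (sampling parameters are made up for this
# example; the training script draws its own values).
image_noise_sigma = torch.exp(torch.normal(mean=-3.0, std=0.5, size=(image.shape[0],)))

# As I read the current line: the conditioning image is discarded and only
# scaled Gaussian noise is passed on.
noisy_image = torch.randn_like(image) * image_noise_sigma[:, None, None, None, None]

# Proposed change: add the noise on top of the image so the image information
# actually reaches the model.
noisy_image = torch.randn_like(image) * image_noise_sigma[:, None, None, None, None] + image
```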
@zRzRzRzRzRzRzR Thank you for your work on the training code. There is one point I would like to confirm: in diffusers' I2V pipeline, when CFG is enabled, the unconditional branch drops the text prompt but retains the image latents. In the train_cogvideox_image_to_video_lora code, however, the image latents are only dropped with some probability during data preparation, while the text prompt is always retained. Is this setting inconsistent with the pre-training setup of the I2V-5B model?
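To make the comparison concrete, here is a rough sketch of the two conditioning schemes as I understand them. The tensor shapes, the `noised_image_dropout` name, and the probability value are placeholders for illustration, not the actual values from the pipeline or the script:

```python
import random
import torch

# Toy stand-ins; the real tensors come from the VAE and the text encoder.
image_latents = torch.randn(1, 13, 16, 60, 90)
prompt_embeds = torch.randn(1, 226, 4096)
negative_prompt_embeds = torch.zeros_like(prompt_embeds)

# Inference-time CFG (diffusers I2V pipeline, as I read it): the image latents
# are kept for both branches, and only the text prompt is swapped for the
# negative / empty embedding in the unconditional branch.
latent_image_input = torch.cat([image_latents] * 2)
text_input = torch.cat([negative_prompt_embeds, prompt_embeds])

# Training-time conditioning dropout (train_cogvideox_image_to_video_lora, as I
# read it): the image latents are zeroed with some probability while the text
# prompt is always kept.
noised_image_dropout = 0.05  # placeholder probability
if random.random() < noised_image_dropout:
    image_latents = torch.zeros_like(image_latents)
```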
I had the same question when reading the code: should we change the image_latents?