train_cogvideox_image_to_video_lora bug
Thank you for releasing the I2V training code. I noticed an issue in it: at https://github.com/THUDM/CogVideo/blob/main/finetune/train_cogvideox_image_to_video_lora.py#L1279, the image's information is completely unused. Should the line be changed to `noisy_image = torch.randn_like(image) * image_noise_sigma[:, None, None, None, None] + image`?
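For clarity, here is a minimal sketch of what I mean, assuming the line currently computes only the scaled noise term. The tensor shapes and the sigma-sampling parameters below are illustrative placeholders, not values taken from the script:

```python
import torch

# Illustrative shapes only: [batch, frames, channels, height, width].
image = torch.randn(2, 1, 16, 60, 90)

# Per-sample noise-augmentation sigma (sampling parameters are made up for this
# example; the training script draws its own values).
image_noise_sigma = torch.exp(torch.normal(mean=-3.0, std=0.5, size=(image.shape[0],)))

# As I read the current line: the conditioning image is discarded and only
# scaled Gaussian noise is passed on.
noisy_image = torch.randn_like(image) * image_noise_sigma[:, None, None, None, None]

# Proposed change: add the noise on top of the image so the image information
# actually reaches the model.
noisy_image = torch.randn_like(image) * image_noise_sigma[:, None, None, None, None] + image
```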
@zRzRzRzRzRzRzR Thank you for your work on the training code. There is one point I would like to confirm: in diffusers' I2V pipeline, when CFG is enabled, the unconditional branch drops the text prompt but retains the image latents. In the train_cogvideox_image_to_video_lora code, however, the image latents are only dropped with some probability during data preparation, while the text prompt is always retained. Is this setting inconsistent with the pre-training setup of the I2V-5B model?
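To make the comparison concrete, here is a rough sketch of the two conditioning schemes as I understand them. The tensor shapes, the `noised_image_dropout` name, and the probability value are placeholders for illustration, not the actual values from the pipeline or the script:

```python
import random
import torch

# Toy stand-ins; the real tensors come from the VAE and the text encoder.
image_latents = torch.randn(1, 13, 16, 60, 90)
prompt_embeds = torch.randn(1, 226, 4096)
negative_prompt_embeds = torch.zeros_like(prompt_embeds)

# Inference-time CFG (diffusers I2V pipeline, as I read it): the image latents
# are kept for both branches, and only the text prompt is swapped for the
# negative / empty embedding in the unconditional branch.
latent_image_input = torch.cat([image_latents] * 2)
text_input = torch.cat([negative_prompt_embeds, prompt_embeds])

# Training-time conditioning dropout (train_cogvideox_image_to_video_lora, as I
# read it): the image latents are zeroed with some probability while the text
# prompt is always kept.
noised_image_dropout = 0.05  # placeholder probability
if random.random() < noised_image_dropout:
    image_latents = torch.zeros_like(image_latents)
```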
I had the same question when reading the code: should we change the image_latents?