
Great work! when are you planning to release image-to-video models?

Open gxd1994 opened this issue 1 year ago • 11 comments

gxd1994 avatar Aug 07 '24 08:08 gxd1994

Thank you for your support. However, this might take some time as we currently do not have any related plans in the near future. Thank you for your understanding.

zRzRzRzRzRzRzR avatar Aug 07 '24 09:08 zRzRzRzRzRzRzR

Hi @zRzRzRzRzRzRzR

But as you mentioned in your paper, you already have an image-to-video version of CogVideoX.

[image]

StarCycle avatar Aug 07 '24 13:08 StarCycle

Hi @zRzRzRzRzRzRzR

But as you mentioned in your paper, you already have an image-to-video version of CogVideoX.

[image]

Yes, the reply above means that we do not plan to open-source the image-to-video model in the near future. Please stay tuned.

tengjiayan20 avatar Aug 07 '24 14:08 tengjiayan20

Hi @zRzRzRzRzRzRzR But as you mentioned in your paper, you already have an image-to-video version of CogVideoX. [image]

Yes, the reply above means that we do not plan to open-source the image-to-video model in the near future. Please stay tuned.

Would absolutely appreciate the release of the img+text to video :)

matbeedotcom avatar Aug 07 '24 23:08 matbeedotcom

Hi @tengjiayan20,

Thank you for the response!

Is it difficult to finetune an image-to-video model by myself on the WebVid10M dataset? How many samples and training steps do you need to do that?

Do you apply a fixed noise level on the image condition in the diffusion process?

Sorry, but I really need an image2video model for my application.

Best wishes, StarCycle

StarCycle avatar Aug 08 '24 01:08 StarCycle

Hi @tengjiayan20,

Thank you for the response!

Is it difficult to finetune an image-to-video model by myself on the WebVid10M dataset? How many samples and training steps do you need to do that?

Do you apply a fixed noise level on the image condition in the diffusion process?

Sorry, but I really need an image2video model for my application.

Best wishes, StarCycle

  1. I think it is OK. After all, many image-to-video works have verified that the WebVid dataset can support this task; the key point is that they lack a better base text-to-video model.
  2. Augmentation during training is beneficial.

tengjiayan20 avatar Aug 08 '24 03:08 tengjiayan20

@tengjiayan20 Dear author, Do you apply the same noise level for all the timestep training, or do you apply timestep-dependent noise adding when training the image-to-video model?

eugenelyj avatar Aug 08 '24 08:08 eugenelyj

@tengjiayan20 Dear author, Do you apply the same noise level for all the timestep training, or do you apply timestep-dependent noise adding when training the image-to-video model?

Usually the strength of augmentation is random and dynamic. Since the augmentation is added to the image condition, and the image condition is constant during sampling and does not change with the timestep, I think the strength of augmentation does not need to change with the timestep. It is just there to enhance robustness and fill the gap between the conditions seen during training and at inference. But of course, you can try it; maybe it would work better in practice.
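
For concreteness, here is a minimal sketch of that idea (not the authors' actual code): a per-example augmentation strength is sampled once and applied to the image-condition latent, independently of the diffusion timestep. The range and distribution below are placeholder assumptions.

```python
import torch

def augment_image_condition(image_latent: torch.Tensor,
                            strength_range=(0.0, 0.2)) -> torch.Tensor:
    """Add Gaussian noise with a per-sample random strength to the image-condition
    latent. The strength is drawn once per example and does not depend on the
    diffusion timestep, since the condition stays fixed throughout sampling.
    (The range and distribution here are illustrative, not the real settings.)"""
    b = image_latent.shape[0]
    # one noise scale per sample, broadcast over the remaining dimensions
    std = torch.empty(b, device=image_latent.device).uniform_(*strength_range)
    std = std.view(b, *([1] * (image_latent.dim() - 1)))
    return image_latent + std * torch.randn_like(image_latent)
```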

tengjiayan20 avatar Aug 08 '24 08:08 tengjiayan20

Do you do conditioning similarly in i2v models as compared to t2v models? For example, do you concatenate the image embeddings (instead of text embeddings) with the video tokens as conditioning? Or instead, do you replace the first frame of the video latent with the image?
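
For reference, the channel-concatenation variant could be sketched roughly as follows (a hypothetical illustration of one common design, not a statement of how CogVideoX does it): the image is encoded to a latent, placed in the first-frame slot of an otherwise zero tensor, and concatenated with the noisy video latent along the channel axis.

```python
import torch

def build_i2v_input(noisy_video_latent: torch.Tensor,  # [B, C, T, H, W]
                    image_latent: torch.Tensor          # [B, C, 1, H, W]
                    ) -> torch.Tensor:
    """Channel-concatenation conditioning (illustrative only): place the
    first-frame latent in an otherwise zero tensor of the full video shape,
    then concatenate it with the noisy video latent along the channel axis,
    doubling the denoiser's input channels from C to 2C."""
    cond = torch.zeros_like(noisy_video_latent)
    cond[:, :, :1] = image_latent  # condition occupies only the first frame slot
    return torch.cat([noisy_video_latent, cond], dim=1)  # [B, 2C, T, H, W]
```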

jinhuaca avatar Aug 08 '24 17:08 jinhuaca

@tengjiayan20 Dear author, one more question: do you jointly train text-to-video when you fine-tune image-to-video? Because for image-to-video the latent channels are doubled (concatenated with the first frame), I am confused about how to do that.

eugenelyj avatar Aug 17 '24 07:08 eugenelyj

@tengjiayan20 Dear author, one more question: do you jointly train text-to-video when you fine-tune image-to-video? Because for image-to-video the latent channels are doubled (concatenated with the first frame), I am confused about how to do that.

SVD uses a conv to absorb the doubled input channels...
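
A rough sketch of that SVD-style trick (illustrative only; layer names are assumptions): widen the pretrained first conv to accept the extra condition channels and zero-initialize the new weights, so the expanded model initially behaves like the original text-to-video model.

```python
import torch
import torch.nn as nn

def expand_conv_in(conv_in: nn.Conv3d, extra_channels: int) -> nn.Conv3d:
    """Widen a pretrained first conv so it accepts the concatenated condition
    channels. Weights for the new input channels are zero-initialized, so the
    expanded model initially reproduces the original text-to-video behavior."""
    new_conv = nn.Conv3d(
        conv_in.in_channels + extra_channels,
        conv_in.out_channels,
        kernel_size=conv_in.kernel_size,
        stride=conv_in.stride,
        padding=conv_in.padding,
        bias=conv_in.bias is not None,
    )
    with torch.no_grad():
        new_conv.weight.zero_()
        new_conv.weight[:, :conv_in.in_channels] = conv_in.weight  # copy pretrained weights
        if conv_in.bias is not None:
            new_conv.bias.copy_(conv_in.bias)
    return new_conv
```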

wangqiang9 avatar Sep 02 '24 11:09 wangqiang9

@zRzRzRzRzRzRzR Dear author, can you share the implementation of add_noise_to_first_frame, in particular the detailed parameters (distribution) of the added noise?

eugenelyj avatar Sep 17 '24 14:09 eugenelyj

Yes, check our SAT code now; it is in a PR.

zRzRzRzRzRzRzR avatar Sep 18 '24 07:09 zRzRzRzRzRzRzR

And the I2V model will be open-sourced in the next 24 hours; closing this issue.

zRzRzRzRzRzRzR avatar Sep 18 '24 07:09 zRzRzRzRzRzRzR