Finetune img2video based on T2V model
Thank you very much for your work. We have attempted to finetune the img2video model on our own dataset, but we found that most of the generated scenes tend to be static. Specifically, when we use driving videos, the output is often a video where the ego-vehicle perspective remains stationary.
Did you encounter a similar issue during your fine-tuning process, or could this be because most current models are trained on videos with a fixed camera view?
@Robertwyq May I ask whether your image-to-video fine-tuning method is the same as the one described in the paper?
Using condition augmentation may help generate more dynamic videos.
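For context, condition augmentation here typically means corrupting the encoded conditioning frame with Gaussian noise before it is combined with the video latents, with the noise strength sampled per example during training so the model cannot simply copy the first frame (copying is what tends to produce static outputs). Below is a minimal sketch assuming a PyTorch setup; the function name and the log-normal parameters are illustrative placeholders, not the exact values used by any particular model:

```python
import torch

def augment_condition_latent(cond_latent, log_sigma_mean=-3.0, log_sigma_std=0.5):
    """Condition augmentation: add Gaussian noise to the conditioning-frame latent.

    cond_latent: latent of the conditioning frame, shape (B, C, ...).
    Returns the noised latent and the per-sample sigma, which is usually also
    fed to the model (e.g. as an embedding) so it knows the corruption level.
    """
    batch = cond_latent.shape[0]
    # Per-sample noise strength drawn from a log-normal distribution
    # (the distribution parameters here are assumptions for illustration).
    sigma = torch.exp(torch.randn(batch, device=cond_latent.device) * log_sigma_std
                      + log_sigma_mean)
    sigma = sigma.view(batch, *([1] * (cond_latent.dim() - 1)))  # broadcastable shape
    noisy_cond = cond_latent + sigma * torch.randn_like(cond_latent)
    return noisy_cond, sigma
```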
@Robertwyq If you feel like uploading your fine-tuning method to a repo here on GitHub, I'm certain you will get tons of good suggestions on how to improve it, including how to avoid static behaviour.
What range of noise strength works well here? With 0.02, the first half of my generated videos has some motion, but the latter half tends to be static.
CogVideoX uses the same noise level as Stable Video Diffusion, so dynamics shouldn't be a problem.
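For anyone who wants to probe the effect of that noise level at inference time, SVD exposes it directly as a pipeline argument. A sketch using the diffusers `StableVideoDiffusionPipeline` (the input image path and the swept values are placeholders; 0.02 is the commonly used default for `noise_aug_strength`, and larger values generally give more motion but drift further from the conditioning frame):

```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16
).to("cuda")

image = load_image("driving_frame.png").resize((1024, 576))  # placeholder frame

# Sweep the conditioning-noise strength to see how it changes dynamics.
for strength in (0.02, 0.1, 0.3):
    frames = pipe(image, noise_aug_strength=strength, num_frames=25,
                  decode_chunk_size=8).frames[0]
    export_to_video(frames, f"svd_aug_{strength}.mp4", fps=7)
```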
@yzy-thu @tengjiayan20 But I found that after training for a while with the same strategy as SVD, the dynamics are similar to SVD's, both with relatively small amplitudes of motion. Could this simply reflect the patterns in the training data?