Question about the motion adapter in DreamVideo
Hi! I noticed that only one frame of the guidance video seems to be selected each time to train the motion adapter. Since a single image would break the temporal coherence of the video, how can the motion adapter still capture the temporal motion pattern? Thanks!
Hi, thanks for your interest. We train the motion adapter using all frames of the input videos; the randomly selected single frame only serves as the appearance guidance.
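To make the separation concrete, here is a minimal, self-contained sketch of the idea described above (not the DreamVideo implementation): the training loss covers all frames of the clip, while one randomly sampled frame is encoded as appearance guidance. All module names, shapes, and the placeholder target are hypothetical.

```python
# Toy sketch only: whole-clip training with one random frame as appearance guidance.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyAppearanceEncoder(nn.Module):
    """Stand-in for the image encoder that embeds the guidance frame."""
    def __init__(self, dim=64):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=4, stride=4)

    def forward(self, frame):                        # frame: [B, 3, H, W]
        return self.proj(frame).flatten(2).mean(-1)  # -> [B, dim]

class ToyMotionAdapter(nn.Module):
    """Stand-in for the temporal adapter; it operates on the whole clip."""
    def __init__(self, dim=64):
        super().__init__()
        self.temporal = nn.Linear(dim, dim)
        self.cond = nn.Linear(dim, dim)

    def forward(self, clip_feat, guide_emb):         # clip_feat: [B, F, dim]
        # Every frame is processed; the guidance embedding is broadcast over time.
        return self.temporal(clip_feat) + self.cond(guide_emb).unsqueeze(1)

def toy_training_step(video, encoder, adapter):
    # video: [B, F, 3, H, W] -- all frames contribute to the loss,
    # so the temporal pattern is preserved.
    b, f = video.shape[:2]

    # One random frame per clip is used only as appearance guidance.
    idx = torch.randint(0, f, (b,))
    guide = video[torch.arange(b), idx]              # [B, 3, H, W]
    guide_emb = encoder(guide)                       # [B, dim]

    # Placeholder per-frame features and target (random here, for illustration).
    clip_feat = torch.randn(b, f, guide_emb.shape[-1])
    target = torch.randn_like(clip_feat)

    pred = adapter(clip_feat, guide_emb)
    return F.mse_loss(pred, target)

if __name__ == "__main__":
    video = torch.randn(2, 16, 3, 64, 64)
    loss = toy_training_step(video, ToyAppearanceEncoder(), ToyMotionAdapter())
    print(loss.item())
```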
Thanks for your response! I also have a few more questions: (1) How long does each stage of DreamVideo take? I tried it on my own server and found that just the 1st stage of subject learning takes about 2 hours. Is that normal? I am using 4 V100 PCIE GPUs. (2) Could you provide a link to the open_clip_pytorch_model.bin used by FrozenOpenCLIPCustomEmbedder?
Hi. (1) We use one A100 80G GPU. It takes about 50 min for step 1 of subject learning and 10~15 min for step 2. I think your timing is normal given the device differences. By the way, you can reduce the number of training iterations to balance performance and time cost. (2) The 'open_clip_pytorch_model.bin' used in DreamVideo is the same as that used by the other models (I2VGen-XL, HiGen, TF-T2V, etc.) in this repository. You can download the checkpoint from this link: https://modelscope.cn/api/v1/models/iic/tf-t2v/repo?Revision=master&FilePath=open_clip_pytorch_model.bin.
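For convenience, one possible way to fetch the checkpoint from the ModelScope link above is a plain HTTP download; the destination path below is just an example and should be adjusted to your own setup.

```python
# Example download of open_clip_pytorch_model.bin from the link given above.
import os
import urllib.request

URL = ("https://modelscope.cn/api/v1/models/iic/tf-t2v/repo"
       "?Revision=master&FilePath=open_clip_pytorch_model.bin")
DEST = "models/open_clip_pytorch_model.bin"  # hypothetical path; change as needed

os.makedirs(os.path.dirname(DEST), exist_ok=True)
urllib.request.urlretrieve(URL, DEST)
print(f"saved to {DEST}")
```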
Thank you very much! By the way, how long does it take to evaluate on all the datasets mentioned in the DreamVideo paper? Could you provide the evaluation code?