VGen
Question of textual inversion training on video
Great work with DreamVideo! While reading your DreamVideo paper, I had some questions about your implementation.
How is textual inversion implemented for video diffusion models? Did you set the number of frames to 1 and pass the input through the video diffusion model? And did you turn off the temporal module?
Thank you!
Thanks for your interest. Yes, for subject learning we feed a single image to the video diffusion model (i.e., a video with only 1 frame). For textual inversion, we freeze all model parameters and train only the text embedding.
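The setup described above can be sketched in PyTorch. This is an illustrative, hypothetical sketch (the module and variable names are mine, not from the DreamVideo/VGen code): the "video" is a single image with a frame dimension of 1, every model parameter is frozen, and only a new token embedding receives gradient updates.

```python
import torch
import torch.nn as nn

# Single-image input shaped as a video: [B, C, F, H, W] with F = 1 frames.
image_as_video = torch.randn(1, 3, 1, 64, 64)

# Stand-in for the video diffusion model (any 3D module illustrates the point).
model = nn.Conv3d(3, 3, kernel_size=1)
for p in model.parameters():
    p.requires_grad_(False)  # freeze all model parameters

# Textual inversion: only the new token's text embedding is trainable.
new_token_embedding = nn.Parameter(torch.randn(768))
optimizer = torch.optim.AdamW([new_token_embedding], lr=5e-3)

# Snapshots to verify which parameters actually change.
frozen_before = model.weight.detach().clone()
embedding_before = new_token_embedding.detach().clone()

# One illustrative step. A real setup would condition the denoiser on the
# embedding (e.g., via cross-attention) and use the diffusion noise loss;
# here the loss just touches both tensors to show the gradient flow.
out = model(image_as_video)
loss = out.mean() + new_token_embedding.pow(2).mean()
loss.backward()
optimizer.step()
```

After the step, the frozen model weights are unchanged while the token embedding has been updated, which is the essence of textual inversion.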