VGen
Question of textual inversion training on video
Great work with DreamVideo! While reading your DreamVideo paper, I had some questions about your implementation.
How is textual inversion implemented for video diffusion models? Did you set the number of frames to 1 and pass the input through the video diffusion model? And did you turn off the temporal module?
Thank you!
Thanks for your interest. Yes, for subject learning we feed a single image to the video diffusion model (i.e., a video with only 1 frame). For textual inversion, we freeze all model parameters and train only the text embedding.
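The setup described above can be sketched in PyTorch. This is an illustrative, hypothetical sketch (the module and variable names are mine, not from the DreamVideo/VGen code): the "video" is a single image with a frame dimension of 1, every model parameter is frozen, and only a new token embedding receives gradient updates.

```python
import torch
import torch.nn as nn

# Single-image input shaped as a video: [B, C, F, H, W] with F = 1 frames.
image_as_video = torch.randn(1, 3, 1, 64, 64)

# Stand-in for the video diffusion model (any 3D module illustrates the point).
model = nn.Conv3d(3, 3, kernel_size=1)
for p in model.parameters():
    p.requires_grad_(False)  # freeze all model parameters

# Textual inversion: only the new token's text embedding is trainable.
new_token_embedding = nn.Parameter(torch.randn(768))
optimizer = torch.optim.AdamW([new_token_embedding], lr=5e-3)

# Snapshots to verify which parameters actually change.
frozen_before = model.weight.detach().clone()
embedding_before = new_token_embedding.detach().clone()

# One illustrative step. A real setup would condition the denoiser on the
# embedding (e.g., via cross-attention) and use the diffusion noise loss;
# here the loss just touches both tensors to show the gradient flow.
out = model(image_as_video)
loss = out.mean() + new_token_embedding.pow(2).mean()
loss.backward()
optimizer.step()
```

After the step, the frozen model weights are unchanged while the token embedding has been updated, which is the essence of textual inversion.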