VGen
I2V architecture
Great work, team. I have a few questions:
- In the diagram, shouldn't it say VLDM instead of LDM?
- In the base stage, how does an LDM generate a video from an input image? Generally an LDM uses a 2D U-Net, which can only generate images, right? If it is actually a VLDM with a 3D U-Net, then the input should be multiple frames of noise, right?
- In the refinement stage, do we apply the diffusion and denoising process to each frame? Here we also use an LDM, which again relies on 2D convolutions, but for temporal coherence we need 3D convolutions, right?
I think I am missing something; could you please help me here? Thanks a lot in advance.
Thank you for your interest in our work.
- Yes, we are using the LDM method for videos.
- In the base stage, we input the image (extracting its CLIP features and latent representation separately), combine it with noise, and feed it into the 3D U-Net to obtain the output video.
- In practice, we treat the video as a whole at input and apply the diffusion and denoising process to it jointly. For the temporal encoding, you can refer to the design of our 3D U-Net in Fig. 3. Thank you.
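The reply above can be made concrete with a small shape sketch. This is not the actual VGen code; all names and tensor shapes here are illustrative assumptions. It shows one common conditioning pattern for image-to-video diffusion: the single input image is encoded once, broadcast across the temporal axis, and combined with a full 3D noise volume, so the 3D U-Net denoises all frames jointly rather than one frame at a time.

```python
import numpy as np

# Illustrative shapes only (assumed, not VGen's real configuration):
# C latent channels, T frames, HxW spatial resolution.
C, T, H, W = 4, 16, 32, 32

rng = np.random.default_rng(0)

# Latent representation of the single input image (one frame).
image_latent = rng.standard_normal((C, H, W))

# Broadcast the image latent across the temporal axis: (C, H, W) -> (C, T, H, W).
image_latent_3d = np.repeat(image_latent[:, None], T, axis=1)

# 3D noise: an independent noise map for every frame of the video.
noise = rng.standard_normal((C, T, H, W))

# One common conditioning scheme is channel-wise concatenation, so the
# 3D U-Net sees a (2C, T, H, W) tensor and predicts noise for all frames
# at once, which is what gives it temporal context.
unet_input = np.concatenate([noise, image_latent_3d], axis=0)

print(unet_input.shape)  # (8, 16, 32, 32)
```

The key point for the questions above: the noise is 3D (one noise map per frame), while the image conditioning is a single 2D latent repeated over time.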
Thanks for the reply.
2. Is the input to the 3D U-Net 2D noise or 3D noise?
3. Is the input to the pre-trained LDM in the refinement stage the resized output of the base-stage LDM? Any LDM takes noise as input and denoises it to generate a video; if it instead takes a resized (non-noise) video as input, how can it denoise when there is no noise?