VGen
I2V architecture
Great work, team. I have a few questions:
- In the diagram, shouldn't it say VLDM instead of LDM?
- In the base stage, how does an LDM generate a video from an input image? Generally an LDM uses a 2D U-Net, which can only generate images, right? If it is actually a VLDM with a 3D U-Net, then the input should be multiple frames of noise, right?
- In the refinement stage, do we apply the diffusion and denoising process to each frame? Here we also use an LDM, which again relies on 2D convolutions, but for temporal coherence we need 3D convolutions, right?
I think I am missing something; could you please help me here? Thanks a lot in advance.
Thank you for your interest in our work.
- Yes, we are using the LDM method for videos.
- In the base stage, we input the image (extracting its CLIP features and latent representation separately), combine it with noise, and feed it into the 3D U-Net to obtain the output video.
- In practice, we treat the video as a whole at input and apply the diffusion and denoising process to it jointly. For the temporal encoding, you can refer to the design of our 3D U-Net in Fig. 3. Thank you.
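The reply above can be made concrete with a small shape sketch. This is not the actual VGen code; all names and tensor shapes here are illustrative assumptions. It shows one common conditioning pattern for image-to-video diffusion: the single input image is encoded once, broadcast across the temporal axis, and combined with a full 3D noise volume, so the 3D U-Net denoises all frames jointly rather than one frame at a time.

```python
import numpy as np

# Illustrative shapes only (assumed, not VGen's real configuration):
# C latent channels, T frames, HxW spatial resolution.
C, T, H, W = 4, 16, 32, 32

rng = np.random.default_rng(0)

# Latent representation of the single input image (one frame).
image_latent = rng.standard_normal((C, H, W))

# Broadcast the image latent across the temporal axis: (C, H, W) -> (C, T, H, W).
image_latent_3d = np.repeat(image_latent[:, None], T, axis=1)

# 3D noise: an independent noise map for every frame of the video.
noise = rng.standard_normal((C, T, H, W))

# One common conditioning scheme is channel-wise concatenation, so the
# 3D U-Net sees a (2C, T, H, W) tensor and predicts noise for all frames
# at once, which is what gives it temporal context.
unet_input = np.concatenate([noise, image_latent_3d], axis=0)

print(unet_input.shape)  # (8, 16, 32, 32)
```

The key point for the questions above: the noise is 3D (one noise map per frame), while the image conditioning is a single 2D latent repeated over time.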
Thanks for the reply.
2. Is the input to the 3D U-Net 2D noise or 3D noise?
3. Is the input to the pre-trained LDM in the refinement stage the resized output of the base-stage LDM? Any LDM takes noise as input and denoises it to generate a video; if it instead takes a resized (non-noise) video as input, how can it denoise when there is no noise?