
Questions on text2video?

hitsz-zuoqi opened this issue 1 year ago · 5 comments

While trying to figure out how to adapt the framework for text2video synthesis, I found that the SpatialTemporalUNet has 8 input channels, as declared here:


    @register_to_config
    def __init__(
        self,
        sample_size: Optional[int] = None,
        in_channels: int = 8,
        out_channels: int = 4,
        down_block_types: Tuple[str] = (

Then I checked the pipeline inference code and found that the denoising input is actually a concatenation of the noisy latents and the image latents:


# Concatenate image_latents over the channels dimension
latent_model_input = torch.cat([latent_model_input, image_latents], dim=2)
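
For context, here is a minimal sketch of where those 8 channels come from (random tensors stand in for the real VAE outputs; the shapes follow the SVD defaults of 14 frames and 4 latent channels):

```python
import torch

batch, frames, latent_ch, h, w = 1, 14, 4, 72, 128

# Noisy video latents being denoised: (B, F, C, H, W).
noisy_latents = torch.randn(batch, frames, latent_ch, h, w)

# Conditioning image latents: the VAE encoding of the single input frame,
# repeated across every frame (random placeholder here).
image_latents = torch.randn(batch, 1, latent_ch, h, w).repeat(1, frames, 1, 1, 1)

# dim=2 is the channel dimension of the (B, F, C, H, W) tensor,
# so the UNet sees in_channels = 4 + 4 = 8.
latent_model_input = torch.cat([noisy_latents, image_latents], dim=2)
assert latent_model_input.shape[2] == 8
```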

My question is: how do we obtain the image_latents if we only use text as input when training a text2video model? Have you made any progress on text2video recently?

hitsz-zuoqi · Jan 17 '24

This is precisely the problem I am facing at the moment. If we want to do text2video, the existence of image_latents is quite peculiar. I've tried changing the conv_in of the UNet to 4 channels (see the sketch below), but so far I haven't succeeded in training: the model can't generate normal videos (everything is a hazy expanse...). If anyone has any suggestions, feel free to share them here, and I will give them a try.
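
A rough sketch of that conv_in change, assuming the diffusers UNetSpatioTemporalConditionModel layout: keep the weights that acted on the 4 noisy-latent channels and drop the conditioning-channel weights. This only reproduces the modification described above; as noted, it is not known to train well.

```python
import torch
from diffusers import UNetSpatioTemporalConditionModel

unet = UNetSpatioTemporalConditionModel.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid", subfolder="unet"
)

old_conv = unet.conv_in  # Conv2d(8, 320, kernel_size=3, padding=1)
new_conv = torch.nn.Conv2d(
    4,
    old_conv.out_channels,
    kernel_size=old_conv.kernel_size,
    stride=old_conv.stride,
    padding=old_conv.padding,
)
with torch.no_grad():
    # Keep the weights for the first 4 (noisy-latent) channels;
    # the weights for the 4 conditioning channels are discarded.
    new_conv.weight.copy_(old_conv.weight[:, :4])
    new_conv.bias.copy_(old_conv.bias)

unet.conv_in = new_conv
unet.register_to_config(in_channels=4)  # keep the model config consistent
```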

pixeli99 · Jan 17 '24

> This is precisely the problem I am facing at the moment. If we want to do text2video, the existence of image_latents is quite peculiar. […]

Yes, I observed the same phenomenon with my modification. Some results of my fine-tuning on Objaverse look like this:

Prompt: "a desk" [image: step_23500_val_img_0_a-desk]

Prompt: "a sofa" [image: step_23500_val_img_0_a-sofa]

At the beginning of training, the sampling results were: Prompt: "a desk" [image: step_1_val_img_0_a-desk] Prompt: "a sofa" [image: step_1_val_img_0_a-sofa]

Judging from the training performance, I think changing the conv_in of the UNet to 4 channels makes it nearly equivalent to training from scratch for my task.

hitsz-zuoqi · Jan 19 '24

> Yes, I observed the same phenomenon with my modification. Some results of my fine-tuning on Objaverse look like this: […]

The first two videos look very good; how did you do that?

liiiiiiiiil · Jan 19 '24

It looks like it's working well; may I ask how many steps this was trained for?

pixeli99 · Jan 20 '24

> Yes, I observed the same phenomenon with my modification. Some results of my fine-tuning on Objaverse look like this: […]

It seems the text2video and img2video models have different latent spaces. By the way, which model are you fine-tuning on the Objaverse dataset? It looks like it works?

CallMeFrozenBanana · Feb 26 '24