SVD_Xtend
Questions on text2video?
While trying to figure out how to adapt the framework for text2video synthesis, I found that the SpatialTemporalUNet has 8 input channels, as shown in these lines:
```python
@register_to_config
def __init__(
    self,
    sample_size: Optional[int] = None,
    in_channels: int = 8,
    out_channels: int = 4,
    down_block_types: Tuple[str] = (
```
Then I checked the pipeline inference code and found that the denoising input is actually a concatenation of the noisy latents and the image latents:
```python
# Concatenate image_latents over the channels dimension
latent_model_input = torch.cat([latent_model_input, image_latents], dim=2)
```
My question is: how do we obtain the image_latents if we only use text as input when training a text2video model? Have you made any progress on text2video recently?
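For reference, in the img2vid pipeline the image_latents appear to come from VAE-encoding the conditioning image and repeating it across the frame axis before the concatenation above. A simplified sketch of that flow (not the exact pipeline code, names are illustrative):

```python
import torch

def make_image_latents(vae, image, num_frames):
    """VAE-encode the conditioning image and repeat it over the frame axis
    (simplified version of what the img2vid pipeline does internally)."""
    # image: (B, 3, H, W), already preprocessed to the VAE's expected range
    image_latents = vae.encode(image).latent_dist.mode()    # (B, 4, H/8, W/8)
    image_latents = image_latents.unsqueeze(1)               # (B, 1, 4, h, w)
    return image_latents.repeat(1, num_frames, 1, 1, 1)      # (B, F, 4, h, w)

# Inside the denoising loop the 4 noise channels and these 4 image channels
# are concatenated, which is exactly where in_channels=8 comes from:
# latent_model_input = torch.cat([latent_model_input, image_latents], dim=2)
```

So for pure text2video there is no conditioning image to encode, which is what makes the 8-channel input awkward.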
This is precisely the problem I am facing at the moment. If we want to do text2video, the existence of image_latents is quite peculiar. I've tried changing the `conv_in` of the UNet to 4 channels, but so far I haven't succeeded in training, and the model can't generate normal videos (everything is a hazy blur...).
If anyone has any suggestions, feel free to share them here, and I will give them a try.
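For anyone trying the same modification, one way to swap `conv_in` to 4 channels looks roughly like this (a sketch; reusing the first 4 input channels of the pretrained kernel is just one option, and the model id is only an example):

```python
import torch
import torch.nn as nn
from diffusers import UNetSpatioTemporalConditionModel

unet = UNetSpatioTemporalConditionModel.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid", subfolder="unet"
)

# Replace the 8-channel conv_in with a 4-channel one, reusing the first
# 4 input channels of the pretrained kernel so the noise pathway keeps
# its initialization (the image-latent channels are simply dropped).
old = unet.conv_in
new = nn.Conv2d(4, old.out_channels, kernel_size=old.kernel_size, padding=old.padding)
with torch.no_grad():
    new.weight.copy_(old.weight[:, :4])
    new.bias.copy_(old.bias)
unet.conv_in = new
unet.register_to_config(in_channels=4)  # keep the saved config consistent
```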
Yes, I observed the same phenomenon with my modification. Some results of my fine-tuning on Objaverse look like:
Prompt: "a desk"
Prompt: "a sofa"
At the beginning of training, the sampling results were:
Prompt: "a desk"
Prompt: "a sofa"
From the training performance, I think changing the `conv_in` of the UNet to 4 channels is nearly equivalent to training from scratch for my task.
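As a suggestion for the thread: an alternative that avoids retraining `conv_in` from scratch would be to keep the 8-channel input and feed zeros in place of image_latents, so all pretrained weights stay usable. A minimal sketch (the helper name is made up, and I haven't verified this trains well):

```python
import torch

def with_null_image_conditioning(noisy_latents: torch.Tensor) -> torch.Tensor:
    """Keep the 8-channel UNet input but replace image_latents with zeros,
    so the pretrained conv_in weights can be reused for text-only training.
    noisy_latents: (B, F, 4, h, w) noisy video latents at the current timestep."""
    zero_cond = torch.zeros_like(noisy_latents)
    return torch.cat([noisy_latents, zero_cond], dim=2)  # (B, F, 8, h, w)
```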
The first two videos look very good. How did you do that?
It looks like it's working well. May I ask how many steps this was trained for?
It seems the text2video and img2video models use different latent spaces. By the way, what model are you fine-tuning on the Objaverse dataset? It looks like it works...?