Text-To-Video-Finetuning
Similar implementation to Nvidia VideoLDM?
Is there any possible way to have the same implementation as Nvidia's, using SD models / Dreambooth models as a base for a txt2vid model? https://research.nvidia.com/labs/toronto-ai/VideoLDM/
I saw this unofficial implementation, but I'm not sure how complete it is: https://github.com/srpkdyy/VideoLDM
Is there no way to use the modelscope model or zeroscope model and merge them together, something like that? Or do some training or fine tuning on top of a Dreambooth model?
Hey @Maki9009.
Is there any possible way to have the same implementation as Nvidia's, using SD models / Dreambooth models as a base for a txt2vid model?
I don't have it implemented in this repository yet, but you should be able to fine tune any current SD model on video. While that paper does do a bit more, the concepts are the same (add temporal attention and convolution layers after each pre-trained spatial layer).
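As a rough illustration of that idea (not the actual layers used in this repo or in the VideoLDM paper), a temporal block in PyTorch could look like the sketch below: a temporal convolution plus temporal self-attention with a zero-initialised gate, so it can be dropped in after a frozen spatial layer without changing the image model's output at initialisation.

```python
import torch
import torch.nn as nn

class TemporalBlock(nn.Module):
    """Illustrative temporal block inserted after a pre-trained spatial layer.
    The zero-initialised gate makes it an identity mapping at the start, so the
    image model behaves exactly as before until video fine tuning begins.
    Assumes `channels` is divisible by `num_heads`."""
    def __init__(self, channels, num_heads=8):
        super().__init__()
        self.temp_conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.norm = nn.LayerNorm(channels)
        self.temp_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.alpha = nn.Parameter(torch.zeros(1))  # learned gate, starts at 0

    def forward(self, x):
        # x: (batch, frames, channels, height, width)
        b, f, c, h, w = x.shape
        residual = x

        # Temporal convolution over the frame axis at every spatial position.
        x = x.permute(0, 3, 4, 2, 1).reshape(b * h * w, c, f)
        x = self.temp_conv(x)

        # Temporal self-attention across frames.
        x = x.permute(0, 2, 1)                    # (B*H*W, F, C)
        x = self.norm(x)
        x, _ = self.temp_attn(x, x, x)

        # Restore (B, F, C, H, W) and gate the residual connection.
        x = x.reshape(b, h, w, f, c).permute(0, 3, 4, 1, 2)
        return residual + self.alpha * x
```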
I saw this unofficial implementation, but I'm not sure how complete it is?
That implementation is not complete.
Is there no way to use the modelscope model or zeroscope model and merge them together, something like that? Or do some training or fine tuning on top of a Dreambooth model?
You should be able to merge two models trained on video data, but if you're talking about merging in pre-trained layers that were only trained on images, you may still have to fine tune them so they pick up the temporal information.
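For the merging itself, a naive weight interpolation between two checkpoints that share the same architecture could look like this. It's only a sketch: the file names are placeholders, and it says nothing about how well the merged temporal layers behave without further fine tuning.

```python
import torch

def merge_state_dicts(sd_a, sd_b, alpha=0.5):
    """Naive linear merge of two checkpoints with identical architectures,
    e.g. two fine tunes of the same text-to-video base model.
    Both state dicts must share the same keys and tensor shapes."""
    merged = {}
    for key, tensor_a in sd_a.items():
        tensor_b = sd_b[key]
        merged[key] = (1.0 - alpha) * tensor_a + alpha * tensor_b
    return merged

# Example usage (paths are placeholders):
# sd_a = torch.load("video_model_a.ckpt", map_location="cpu")
# sd_b = torch.load("video_model_b.ckpt", map_location="cpu")
# torch.save(merge_state_dicts(sd_a, sd_b, alpha=0.5), "merged.ckpt")
```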
Hi @ExponentialML, thanks for responding.
So just to confirm: it's possible to fine tune any Dreambooth SD model into a txt2vid model? Is your implementation not ready yet, or could I attempt to do it right now?
I'm just wondering what the process/guide would be to do that, or whether you're still working on it currently.
Also, on your last point, I wouldn't be able to merge an image model into a video model directly? I would need to first fine tune it for the temporal layers, and then I can merge?
No problem. Yes that's correct. The UNet3DConditionModel takes a UNet2DConditionModel, which this implementation uses.
You would have to fine tune the temporal layers from scratch, which may take time (this is why people start from modelscope's, since it serves as a great base).
You may be able to replace the spatial layers with another SD model and keep modelscope's temporal layers, but again I haven't tested it / implemented it as of yet.
On your last question, it's a bit tricky. The approach I would try is to merge the image model layers, then fine tune on arbitrary video data so the temporal layers can pick up the newly added data.
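As a rough, untested sketch of the spatial-swap idea in diffusers (the model IDs are examples, and the key matching is naive; key names between the 2D and 3D UNets are not guaranteed to line up one-to-one):

```python
import torch
from diffusers import UNet2DConditionModel, UNet3DConditionModel

# Load a ModelScope-style 3D UNet (temporal layers already trained on video)
# and the 2D UNet from the SD / Dreambooth model whose spatial weights you want.
unet3d = UNet3DConditionModel.from_pretrained(
    "damo-vilab/text-to-video-ms-1.7b", subfolder="unet"
)
unet2d = UNet2DConditionModel.from_pretrained(
    "path/to/your-dreambooth-model", subfolder="unet"  # placeholder path
)

sd3d = unet3d.state_dict()
sd2d = unet2d.state_dict()

# Copy every spatial weight whose name and shape match, and leave the
# temporal layers (and anything that doesn't line up) untouched.
copied = 0
for key, tensor in sd2d.items():
    if key in sd3d and sd3d[key].shape == tensor.shape:
        sd3d[key] = tensor
        copied += 1

unet3d.load_state_dict(sd3d)
print(f"Replaced {copied} spatial tensors out of {len(sd2d)} in the 2D UNet.")
```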
Yeah, that's what I was looking into: replacing the spatial layers with another SD model and keeping the temporal layers of modelscope or zeroscope. In a sense that would make it faster, right, rather than fine tuning a new model?
On average, how long does it take to fine tune the modelscope model with, let's say, 20-30 images, something similar to how Dreambooth works? Or would that not be possible at all, or pointless? Let's say I want to put my cat into the txt2vid model / modelscope, similar to Nvidia's example.
It depends on how you're training the Dreambooth. If it's just the spatial layers, it takes about the same amount of time as other Dreambooth methods. If you're doing a full fine tune, it depends on how many frames you're training on (which acts like a large batch size).
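To make the spatial-only case concrete, here's a rough sketch (not from this repo) of freezing the temporal layers so that only the spatial layers get the Dreambooth treatment. The "temp" substring match is an assumption about the parameter naming (diffusers' 3D UNet uses temp_convs / temp_attentions); inspect named_parameters() for your particular checkpoint.

```python
import torch
from diffusers import UNet3DConditionModel

# Model ID is an example; use your own video checkpoint.
unet = UNet3DConditionModel.from_pretrained(
    "damo-vilab/text-to-video-ms-1.7b", subfolder="unet"
)

# Freeze the temporal layers and collect only the spatial parameters.
trainable = []
for name, param in unet.named_parameters():
    is_temporal = "temp" in name  # naming assumption, see note above
    param.requires_grad_(not is_temporal)
    if not is_temporal:
        trainable.append(param)

optimizer = torch.optim.AdamW(trainable, lr=1e-5)
# ...then run a standard Dreambooth-style training loop over your 20-30 images,
# treating each image as a single-frame clip.
```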