Text-To-Video-Finetuning
Similar implementation to Nvidia VideoLDM?
Is there any possible way to have the same implementation as Nvidia's, using SD models / Dreambooth models as a base for a txt2vid model? https://research.nvidia.com/labs/toronto-ai/VideoLDM/
I saw this unofficial implementation, but I'm not sure how complete it is: https://github.com/srpkdyy/VideoLDM
Is there no way to use the modelscope model or zeroscope model and merge them together, something like that? Or do some training or fine tuning on top of a Dreambooth model?
Hey @Maki9009.
Is there any possible way to have the same implementation as Nvidia's, using SD models / Dreambooth models as a base for a txt2vid model?
I don't have it implemented in this repository yet, but you should be able to fine tune any current SD model on video. While that paper does do a bit more, the concepts are the same (add temporal attention and convolution layers after each pre-trained spatial layer).
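As a rough illustration of that idea (not the actual layers used in this repo or in the VideoLDM paper), a temporal block in PyTorch could look like the sketch below: a temporal convolution plus temporal self-attention with a zero-initialised gate, so it can be dropped in after a frozen spatial layer without changing the image model's output at initialisation.

```python
import torch
import torch.nn as nn

class TemporalBlock(nn.Module):
    """Illustrative temporal block inserted after a pre-trained spatial layer.
    The zero-initialised gate makes it an identity mapping at the start, so the
    image model behaves exactly as before until video fine tuning begins.
    Assumes `channels` is divisible by `num_heads`."""
    def __init__(self, channels, num_heads=8):
        super().__init__()
        self.temp_conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.norm = nn.LayerNorm(channels)
        self.temp_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.alpha = nn.Parameter(torch.zeros(1))  # learned gate, starts at 0

    def forward(self, x):
        # x: (batch, frames, channels, height, width)
        b, f, c, h, w = x.shape
        residual = x

        # Temporal convolution over the frame axis at every spatial position.
        x = x.permute(0, 3, 4, 2, 1).reshape(b * h * w, c, f)
        x = self.temp_conv(x)

        # Temporal self-attention across frames.
        x = x.permute(0, 2, 1)                    # (B*H*W, F, C)
        x = self.norm(x)
        x, _ = self.temp_attn(x, x, x)

        # Restore (B, F, C, H, W) and gate the residual connection.
        x = x.reshape(b, h, w, f, c).permute(0, 3, 4, 1, 2)
        return residual + self.alpha * x
```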
I saw this unofficial implementation, but I'm not sure how complete it is?
That implementation is not complete.
Is there no way to use the modelscope model or zeroscope model and merge them together, something like that? Or do some training or fine tuning on top of a Dreambooth model?
You should be able to merge two models trained on video data, but if you're talking about merging in pre-trained layers that were only trained on images, you may still have to fine tune them so they pick up the temporal information.
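For the merging itself, a naive weight interpolation between two checkpoints that share the same architecture could look like this. It's only a sketch: the file names are placeholders, and it says nothing about how well the merged temporal layers behave without further fine tuning.

```python
import torch

def merge_state_dicts(sd_a, sd_b, alpha=0.5):
    """Naive linear merge of two checkpoints with identical architectures,
    e.g. two fine tunes of the same text-to-video base model.
    Both state dicts must share the same keys and tensor shapes."""
    merged = {}
    for key, tensor_a in sd_a.items():
        tensor_b = sd_b[key]
        merged[key] = (1.0 - alpha) * tensor_a + alpha * tensor_b
    return merged

# Example usage (paths are placeholders):
# sd_a = torch.load("video_model_a.ckpt", map_location="cpu")
# sd_b = torch.load("video_model_b.ckpt", map_location="cpu")
# torch.save(merge_state_dicts(sd_a, sd_b, alpha=0.5), "merged.ckpt")
```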
Hi @ExponentialML, thanks for responding.
So just to confirm: it's possible to fine tune any Dreambooth SD model into a txt2vid model? Is your implementation not ready yet, or could I attempt to do it right now?
I'm just wondering what the process/guide would be to do that, or whether you're still working on it currently.
Also, on your last point, I wouldn't be able to merge an image model into a video model directly? I would need to first fine tune it for the temporal layers, and then I can merge?
No problem. Yes that's correct. The UNet3DConditionModel takes a UNet2DConditionModel, which this implementation uses.
You would have to fine tune the temporal layers from scratch, which may take time (this is why people start from modelscope's, since it serves as a great base).
You may be able to replace the spatial layers with another SD model and keep modelscope's temporal layers, but again I haven't tested it / implemented it as of yet.
On your last question, it's a bit tricky. The approach I would try is to merge the image model layers, then fine tune on arbitrary video data so the temporal layers can pick up the newly added data.
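As a rough, untested sketch of the spatial-swap idea in diffusers (the model IDs are examples, and the key matching is naive; key names between the 2D and 3D UNets are not guaranteed to line up one-to-one):

```python
import torch
from diffusers import UNet2DConditionModel, UNet3DConditionModel

# Load a ModelScope-style 3D UNet (temporal layers already trained on video)
# and the 2D UNet from the SD / Dreambooth model whose spatial weights you want.
unet3d = UNet3DConditionModel.from_pretrained(
    "damo-vilab/text-to-video-ms-1.7b", subfolder="unet"
)
unet2d = UNet2DConditionModel.from_pretrained(
    "path/to/your-dreambooth-model", subfolder="unet"  # placeholder path
)

sd3d = unet3d.state_dict()
sd2d = unet2d.state_dict()

# Copy every spatial weight whose name and shape match, and leave the
# temporal layers (and anything that doesn't line up) untouched.
copied = 0
for key, tensor in sd2d.items():
    if key in sd3d and sd3d[key].shape == tensor.shape:
        sd3d[key] = tensor
        copied += 1

unet3d.load_state_dict(sd3d)
print(f"Replaced {copied} spatial tensors out of {len(sd2d)} in the 2D UNet.")
```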
Yeah, that's what I was looking into: replacing the spatial layers with another SD model and keeping the temporal layers of modelscope or zeroscope. In a sense that would make it faster, right, rather than fine tuning a new model?
On average, how long does it take to fine tune the modelscope model with, let's say, 20-30 images, something similar to how Dreambooth works? Or would that not be possible at all, or pointless? Let's say I want to put my cat into the txt2vid model / modelscope, similar to Nvidia's example.
It depends on how you're training the Dreambooth. If it's just the spatial layers, it takes about the same amount of time as other Dreambooth methods. If you're doing a full fine tune, it depends on how many frames you're training on (which acts like a large batch size).
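To make the spatial-only case concrete, here's a rough sketch (not from this repo) of freezing the temporal layers so that only the spatial layers get the Dreambooth treatment. The "temp" substring match is an assumption about the parameter naming (diffusers' 3D UNet uses temp_convs / temp_attentions); inspect named_parameters() for your particular checkpoint.

```python
import torch
from diffusers import UNet3DConditionModel

# Model ID is an example; use your own video checkpoint.
unet = UNet3DConditionModel.from_pretrained(
    "damo-vilab/text-to-video-ms-1.7b", subfolder="unet"
)

# Freeze the temporal layers and collect only the spatial parameters.
trainable = []
for name, param in unet.named_parameters():
    is_temporal = "temp" in name  # naming assumption, see note above
    param.requires_grad_(not is_temporal)
    if not is_temporal:
        trainable.append(param)

optimizer = torch.optim.AdamW(trainable, lr=1e-5)
# ...then run a standard Dreambooth-style training loop over your 20-30 images,
# treating each image as a single-frame clip.
```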