
Finetune ModelScope's Text To Video model using Diffusers 🧨

Results: 28 Text-To-Video-Finetuning issues, sorted by recently updated

[`return rearrange(item / (127.5 - 1.0), 'f h w c -> f c h w')`](https://github.com/ExponentialML/Text-To-Video-Finetuning/blob/83e11c702b2fb30248e488bc0a11680cfaa56558/utils/dataset.py#L41C20-L41C74) should be changed to `return rearrange(item / 127.5 - 1.0, 'f h w c -> f c h w')`, so that pixel values are normalized to [-1, 1] instead of being divided by 126.5.
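The difference is pure operator precedence. A minimal plain-Python sketch (standing in for the torch tensor and einops call in the real code) shows the two expressions:

```python
# item / (127.5 - 1.0) divides by 126.5 and leaves values positive,
# while the intended item / 127.5 - 1.0 maps uint8 pixels to [-1, 1].

def buggy(pixel):
    return pixel / (127.5 - 1.0)   # divides by 126.5; range ~[0, 2.016]

def fixed(pixel):
    return pixel / 127.5 - 1.0     # maps 0..255 onto -1.0..1.0

print(fixed(0), fixed(255))   # -1.0 1.0
print(buggy(0), buggy(255))   # 0.0 and roughly 2.016
```

With the buggy form, the VAE receives inputs outside its expected [-1, 1] range, which quietly degrades training.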

- [x] Allow for multiple cached latents.
- [x] Update param to allow shuffling.
- [x] Automatically cast to float32. Uses more memory, but encourages better stability.
- [x] Allow...

```
Some weights of the model checkpoint were not used when initializing UNet3DConditionModel:
This IS expected if you are initializing CLIPTextModel from the checkpoint of a model trained on another...
```

Hello, after training the model, how do I test it by giving text as input? Please help me with this.
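A minimal inference sketch, assuming the finetuned weights were saved in diffusers format under `./outputs/my_run` (that path is hypothetical; use wherever your training run wrote its checkpoint). It uses the standard `DiffusionPipeline` loader and the `export_to_video` utility from diffusers:

```python
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

# Load the finetuned checkpoint; the directory must contain the
# diffusers-format model files written by the training script.
pipe = DiffusionPipeline.from_pretrained(
    "./outputs/my_run",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

result = pipe("a dog running on the beach", num_frames=16)
# Depending on the diffusers version, the frames may be under
# result.frames or result.frames[0]; adjust accordingly.
export_to_video(result.frames, "sample.mp4")
```

This requires a CUDA GPU and the trained weights on disk; it is a sketch of the usual diffusers inference flow, not a command taken from this repository.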

More of a question really, but do you know why `num_attention_heads` and `attention_head_dim` are opposite when initialising the Transformer2D blocks? https://github.com/ExponentialML/Text-To-Video-Finetuning/blob/79e13d17167f66f424a8acad88e83fc76d6d210d/models/unet_3d_blocks.py#L286C17-L286C35 It is the opposite order in `unet_2d_blocks.py`: https://github.com/huggingface/diffusers/blob/5439e917cacc885c0ac39dda1b8af12258e6e16d/src/diffusers/models/unet_2d_blocks.py#L872
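One reason such a swap can go unnoticed: the transformer's inner dimension is `num_attention_heads * attention_head_dim`, which is symmetric in the two arguments, so all weight shapes (and therefore checkpoint loading) stay identical; only how that dimension is partitioned into heads changes. A small illustrative sketch (the numbers are examples, not values from the repo):

```python
# Swapping heads and head_dim preserves the inner dimension, so linear
# layer shapes match and checkpoints still load, but attention is then
# computed with a different head layout.
def attention_layout(num_attention_heads, attention_head_dim, seq_len=8):
    inner_dim = num_attention_heads * attention_head_dim
    # queries get reshaped to (heads, seq_len, head_dim) for attention
    return inner_dim, (num_attention_heads, seq_len, attention_head_dim)

correct = attention_layout(num_attention_heads=8, attention_head_dim=64)
swapped = attention_layout(num_attention_heads=64, attention_head_dim=8)

print(correct)  # (512, (8, 8, 64))
print(swapped)  # (512, (64, 8, 8)) -- same inner_dim, different heads
```

So the swap is shape-compatible but not computation-equivalent, which is exactly the kind of bug a checkpoint load will not warn about.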

I want to train my own video model; please give me some advice. How long should each video clip be? How many frames per video? How many videos are...
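A back-of-the-envelope helper for sizing a dataset. The 16-frames-per-sample and 8 fps numbers below are illustrative assumptions for ModelScope-style finetuning, not values taken from this repository's training script:

```python
# Rough dataset-sizing arithmetic: how many non-overlapping training
# samples one clip yields at a given sampling rate and sample length.
def samples_per_clip(clip_seconds, sample_fps=8, frames_per_sample=16):
    total_frames = int(clip_seconds * sample_fps)
    return total_frames // frames_per_sample

# A 10-second clip sampled at 8 fps gives 80 frames -> 5 samples of 16.
print(samples_per_clip(10))  # 5
# A 1-second clip yields 8 frames, too short for even one 16-frame sample.
print(samples_per_clip(1))   # 0
```

The main takeaway: clips shorter than `frames_per_sample / sample_fps` seconds contribute nothing, so trimming videos into chunks a few times that length is a reasonable starting point.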