generative-models

How to get the text-to-video model

Open · WuTao-CS opened this issue 1 year ago • 11 comments

Exciting work! May I ask where the text-to-video model mentioned and used in the paper can be obtained? I only saw the waitlist for access to a new upcoming web interface. Are there any plans to open-source it?

WuTao-CS avatar Nov 22 '23 03:11 WuTao-CS

mkdir checkpoints
cd checkpoints
wget https://huggingface.co/stabilityai/stable-video-diffusion-img2vid/resolve/main/svd.safetensors

crapthings avatar Nov 22 '23 03:11 crapthings

mkdir checkpoints
cd checkpoints
wget 'https://huggingface.co/stabilityai/stable-video-diffusion-img2vid/resolve/main/svd.safetensors?download=true'

Thank you, but this is the image-to-video model; I'm asking about the text-to-video model.

WuTao-CS avatar Nov 22 '23 04:11 WuTao-CS

Text-to-video isn't out yet.

Fearblade66 avatar Nov 22 '23 05:11 Fearblade66

mkdir checkpoints
cd checkpoints
wget 'https://huggingface.co/stabilityai/stable-video-diffusion-img2vid/resolve/main/svd.safetensors?download=true'

Thank you, but this is the image-to-video model; I'm asking about the text-to-video model.

I think it's easy to combine diffusers and image-to-video to do this.

crapthings avatar Nov 22 '23 08:11 crapthings

Yes, I think it's easy to create such a pipeline:

  1. Generate an image using a good fine-tuned SD 1.5 model.
  2. Use that image as the reference image for the image-to-video model.

Maybe that's how they do it in the demo video; see the sketch below.
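
A minimal sketch of that two-step pipeline with the diffusers library follows. This is not the repo's own pipeline: the model IDs, prompt, 1024x576 resize, and sampling parameters are assumptions based on the public diffusers API and SVD defaults.

import torch
from diffusers import StableDiffusionPipeline, StableVideoDiffusionPipeline
from diffusers.utils import export_to_video

# Step 1: text -> image with an off-the-shelf SD 1.5 checkpoint (assumed model ID).
t2i = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
image = t2i("a red panda walking through a bamboo forest").images[0]

# SVD expects 1024x576 conditioning frames, so resize before conditioning.
image = image.resize((1024, 576))

# Step 2: image -> video with the released SVD img2vid checkpoint.
i2v = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid",
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")
frames = i2v(image, decode_chunk_size=8).frames[0]  # list of PIL frames
export_to_video(frames, "generated.mp4", fps=7)

This needs a diffusers release that already ships StableVideoDiffusionPipeline (0.24 or later). Note that the prompt only controls the first frame; the motion itself is up to the video model.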

CyberTimon avatar Nov 22 '23 13:11 CyberTimon

So has anyone managed to run it? Even image-to-video?

vicitooo avatar Nov 22 '23 17:11 vicitooo

The paper they released doesn't indicate that there will be a text-to-video model. It seems the intention is to combine image-to-video models with traditional text-to-image models to generate the initial frame.

From the paper:

Finally, many recent works tackle the task of image-to-video synthesis, where the start frame is already given and the model has to generate the consecutive frames [30, 93, 108]. Importantly, as shown in our work (see Figure 1) when combined with off-the-shelf text-to-image models, image-to-video models can be used to obtain a full text-(to-image)-to-video pipeline.

dgparker avatar Nov 25 '23 04:11 dgparker

So has anyone managed to run it? Even image-to-video?

Yup... it works. After you install the package and prepare the environment following the instructions, you need to download the model as mentioned by @crapthings:

mkdir checkpoints
cd checkpoints
wget https://huggingface.co/stabilityai/stable-video-diffusion-img2vid/resolve/main/svd.safetensors

Then run the Streamlit demo, changing <your_port> to whatever port you want:

streamlit run scripts/demo/sampling.py --server.port <your_port>

If you are running this on a remote machine, make sure to tunnel the port to your local machine first, e.g. with SSH port forwarding: ssh -L <your_port>:localhost:<your_port> <user>@<remote_host>

Then navigate your browser to: localhost:<your_port>/

Example:

streamlit run scripts/demo/sampling.py --server.port 8888

Then navigate to: localhost:8888/

gutzcha avatar Nov 29 '23 08:11 gutzcha

@CyberTimon but how would you control what happens in the video?

mayank64ce avatar Mar 05 '24 19:03 mayank64ce

Hey @mayank64ce, I'm sorry but I can't tell you. I'm not that experienced with Stable Video Diffusion etc.

CyberTimon avatar Mar 05 '24 20:03 CyberTimon
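
For what it's worth, the diffusers StableVideoDiffusionPipeline does expose a few coarse knobs, sketched below reusing i2v and image from the example above. The values shown are the documented defaults, and none of these gives prompt-level control over the content.

frames = i2v(
    image,
    fps=7,                    # frame rate the model is conditioned on
    motion_bucket_id=127,     # higher values = more motion in the output
    noise_aug_strength=0.02,  # more noise = output follows the input image less closely
    decode_chunk_size=8,      # decode a few frames at a time to save VRAM
).frames[0]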

In my view, the technical paper is quite misleading regarding the text-to-video part. By default, you would assume the released code is aligned with what is claimed, but unfortunately that's currently not the case.

Mercurise avatar Mar 19 '24 11:03 Mercurise