
Slow model loading and long delays in image-to-video generation

Open Neethan54 opened this issue 1 year ago • 22 comments

Hi,

I am facing an issue with slow model loading and with the time it takes to generate a video from an image. It is currently taking 8 minutes for an 8-second video. I have 48 GB of VRAM, but it is still very slow.

Please let me know if there is any way to solve this.

This is the code I'm using:

import torch
from diffusers import (
    CogVideoXImageToVideoPipeline,
    CogVideoXTransformer3DModel,
)
from diffusers.utils import export_to_video, load_image
print('loading I2V model...')
pipe_image = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-5b-I2V",
    transformer=CogVideoXTransformer3DModel.from_pretrained(
        "THUDM/CogVideoX-5b-I2V", subfolder="transformer", torch_dtype=torch.bfloat16
    ),
    torch_dtype=torch.bfloat16
).to("cuda")

import random
seed = random.randint(0, 2**8 - 1)  # note: only 256 possible seeds; a wider range such as 2**32 - 1 is more usual
print('loading image..')
image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/astronaut.jpg"
    )
prompt = "An astronaut hatching from an egg, on the surface of the moon, the darkness and depth of space realised in the background. High quality, ultrarealistic detail and breath-taking movie-like camera shot."

negative_prompt ="The video is not of a high quality, it has a low resolution. Strange motion trajectory. Flickering, Blurriness, Face restore.Deformation, anime, cartoon, graphic, text, painting, crayon, graphite, abstract, glitch, deformed, mutated, ugly, disfigured "
video_pt = pipe_image(
    image=image,
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=50,
    num_videos_per_prompt=1,
    use_dynamic_cfg=True,
    output_type="pt",
    guidance_scale=7.0,
    num_frames=49,
    generator=torch.Generator(device="cuda").manual_seed(seed),
).frames

batch_video_frames = []
batch_size = video_pt.shape[0]
from diffusers.image_processor import VaeImageProcessor
for batch_idx in range(batch_size):
    pt_image = video_pt[batch_idx]  # (num_frames, C, H, W) tensor for one video
    image_np = VaeImageProcessor.pt_to_numpy(pt_image)    # to numpy arrays in [0, 1]
    image_pil = VaeImageProcessor.numpy_to_pil(image_np)  # to a list of PIL frames
    batch_video_frames.append(image_pil)
export_to_video(batch_video_frames[0], "videos/output.mp4", fps=8)

Thanks in advance.

Neethan54 avatar Sep 20 '24 05:09 Neethan54

What GPU are you using? It shouldn't be this slow. Also, the video should be 6 seconds long; can you calculate how long the average step took?
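One simple way to measure it (a sketch wrapping the call from the first post; the figure also includes VAE decode time, so it is only approximate):

import time

start = time.time()
frames = pipe_image(
    image=image,
    prompt=prompt,
    num_inference_steps=50,
    num_frames=49,
).frames
elapsed = time.time() - start
# 50 denoising steps were requested; decoding adds a roughly constant overhead
print(f"total: {elapsed:.1f} s, ~{elapsed / 50:.2f} s per step")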

zRzRzRzRzRzRzR avatar Sep 20 '24 07:09 zRzRzRzRzRzRzR

The GPU details are below: ![image](https://github.com/user-attachments/assets/1a92da51-ebdd-42c6-90e8-2d42413ae2d6)

Neethan54 avatar Sep 20 '24 07:09 Neethan54

Yes, the video duration is 6 seconds.

Neethan54 avatar Sep 20 '24 07:09 Neethan54

This speed is clearly incorrect. However, for hardware like yours, I suggest following this plan (see the attached image); it should significantly increase the speed.
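The attached image is not reproduced here; assuming it shows the memory-optimization calls from the CogVideo README, a minimal sketch would be:

# assumed content of the referenced plan (CogVideo README memory optimizations);
# note these trade inference speed for much lower VRAM use
pipe_image.enable_sequential_cpu_offload()  # stream weights between CPU and GPU per submodule
pipe_image.vae.enable_slicing()             # decode the latent batch in slices
pipe_image.vae.enable_tiling()              # decode large frames in tiles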

zRzRzRzRzRzRzR avatar Sep 21 '24 05:09 zRzRzRzRzRzRzR

Hi @zRzRzRzRzRzRzR,

I tried your suggestion, but now it is taking 14 minutes for a 6-second video. Below is the code I'm using:

pipe_image = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-5b-I2V",
    transformer=CogVideoXTransformer3DModel.from_pretrained(
        "THUDM/CogVideoX-5b-I2V", subfolder="transformer", torch_dtype=torch.bfloat16
    ),
    torch_dtype=torch.bfloat16
)

pipe_image.enable_sequential_cpu_offload()


seed = random.randint(0, 2**8 - 1)
prompt = 'A worker talking to his supervisor at a construction site. High quality, masterpiece, best quality, highres, ultra-detailed, fantastic.'
img_path = 'images/image_3.png'
# resize to CogVideoX's native 720x480 and pass the resized image
# (the original snippet resized a copy but then passed the unresized file)
image = load_image(img_path).resize((720, 480))
negative_prompt ="The video is not of a high quality, it has a low resolution. Strange motion trajectory. Flickering, Blurriness, Face restore.Deformation, anime, cartoon, graphic, text, painting, crayon, graphite, abstract, glitch, deformed, mutated, ugly, disfigured "
video_pt = pipe_image(
    image=image,
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=50,
    num_videos_per_prompt=1,
    use_dynamic_cfg=True,
    output_type="pt",
    guidance_scale=7.0,
    num_frames=49,
    generator=torch.Generator(device="cuda").manual_seed(seed),
).frames

Please let me know if I'm doing anything wrong.

Neethan54 avatar Sep 21 '24 05:09 Neethan54

This code is correct; I don't see any errors.

video_pt = pipe_image(
    image=image,
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=50,
    num_videos_per_prompt=1,
    use_dynamic_cfg=True,
    output_type="pt",
    guidance_scale=7.0,
    num_frames=49,
    generator=torch.Generator(device="cuda").manual_seed(seed),
).frames[0]

Did this call alone take 14 minutes? Our speed test measures only this step.

zRzRzRzRzRzRzR avatar Sep 22 '24 06:09 zRzRzRzRzRzRzR

This is clearly not A6000-level performance; even a T4 is faster than this.

zRzRzRzRzRzRzR avatar Sep 22 '24 06:09 zRzRzRzRzRzRzR

Yes, surprisingly, it is taking 14 minutes.

Neethan54 avatar Sep 22 '24 10:09 Neethan54

Hi @zRzRzRzRzRzRzR

How long does it take you to generate a 6-second video?

Neethan54 avatar Sep 23 '24 17:09 Neethan54

On an A100, it takes 180 seconds with the 5B model.

zRzRzRzRzRzRzR avatar Sep 24 '24 07:09 zRzRzRzRzRzRzR

Can you please share the code? I want to test it on the A6000.

Neethan54 avatar Sep 24 '24 07:09 Neethan54

I used a 3090 with the default cli_demo and it takes 12 minutes for a 6-second video (see image), using very little VRAM. Is this the correct speed? @zRzRzRzRzRzRzR

Shiroha-Key avatar Sep 24 '24 09:09 Shiroha-Key

Same for me. On I2V it takes about 10 minutes on an RTX 4090, and only about 3 GB of VRAM is used. I added the following code:

pipe_image.enable_sequential_cpu_offload()
pipe_image.vae.enable_tiling()

It takes time, but since there is plenty of VRAM left, it seems performance could be improved further by increasing the resolution and length. Please continue with the development. Also, would it be difficult to show the video while it is still being generated?

If generation takes a long time, it is a problem that you cannot predict the result until the video is complete. It would be good to be able to see intermediate results, even at a low resolution and low frame rate.

Enchante503 avatar Sep 24 '24 10:09 Enchante503

For a 4090, you can completely remove

pipe_image.enable_sequential_cpu_offload()

and just use pipe.to("cuda"); that should work. Currently, there is indeed no way to visualize the intermediate results.
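Concretely, the change to the loading code from earlier in the thread would look roughly like this (a sketch, not tested here):

pipe_image = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-5b-I2V", torch_dtype=torch.bfloat16
)
# pipe_image.enable_sequential_cpu_offload()  # removed: streams weights CPU<->GPU and is slow
pipe_image.to("cuda")           # keep the whole pipeline resident in GPU memory
pipe_image.vae.enable_tiling()  # optional: cheap, and reduces decode-time VRAM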

zRzRzRzRzRzRzR avatar Sep 24 '24 12:09 zRzRzRzRzRzRzR

@zRzRzRzRzRzRzR

I'm using the torch and CUDA versions below; is this correct?

CUDA 12.1

pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu121
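A quick way to confirm the installed build actually sees the GPU (standard PyTorch calls):

import torch
print(torch.__version__)              # expect 2.4.0+cu121
print(torch.version.cuda)             # expect 12.1
print(torch.cuda.is_available())      # expect True
print(torch.cuda.get_device_name(0))  # e.g. the A6000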

Neethan54 avatar Sep 24 '24 16:09 Neethan54

This should be fine, as PyTorch 2.4.0 is also built against CUDA 12.1.

zRzRzRzRzRzRzR avatar Sep 25 '24 03:09 zRzRzRzRzRzRzR

@zRzRzRzRzRzRzR
Can you please share the code you are running on the A100?

Neethan54 avatar Sep 25 '24 04:09 Neethan54

Follow https://github.com/THUDM/CogVideo/blob/main/inference/cli_demo.py, remove the pipe_image.enable_sequential_cpu_offload() call, and use pipe.to("cuda") instead.
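Applied to that script, the edit would be roughly (a sketch; variable names assumed to follow the stock cli_demo.py):

# in inference/cli_demo.py
# pipe.enable_sequential_cpu_offload()  # remove this line
pipe.to("cuda")                         # move the full pipeline onto the GPU instead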

zRzRzRzRzRzRzR avatar Sep 25 '24 08:09 zRzRzRzRzRzRzR

@zRzRzRzRzRzRzR I am using the above code, and as you can see it is taking 8-9 minutes for 6 seconds of video.

[image attached]

Neethan54 avatar Sep 25 '24 08:09 Neethan54

Hello! Any progress here? Same problem.

lingyu123-su avatar Oct 04 '24 09:10 lingyu123-su

I think the main reason is that you need to add pipe = pipe.to("cuda") when copying the code from Colab.

haochengxi avatar Oct 08 '24 06:10 haochengxi

Hi @xijiu9,

Check this code: https://github.com/THUDM/CogVideo/issues/316#issue-2537904293.

I added .to("cuda"), but it was still very slow on Windows.

Neethan54 avatar Oct 09 '24 03:10 Neethan54

> This speed is clearly incorrect. However, for hardware like yours, I suggest following this plan (see the attached image); it should significantly increase the speed.

Isn't enable_sequential_cpu_offload meant to save memory? How does this increase the speed?

danielajisafe avatar Dec 09 '24 07:12 danielajisafe