
request for inference using multiple GPUs

Open · GeeveGeorge opened this issue 1 year ago · 7 comments

Feature request

Currently the inference scripts focus on running on a single GPU. It would be great if we could leverage multiple GPUs for inference; kindly enable this and update the scripts accordingly.

Motivation

Multi-GPU clusters can speed up inference times.

Your contribution

What's the best way to do this?

GeeveGeorge avatar Sep 21 '24 14:09 GeeveGeorge

Hi @GeeveGeorge, how much time is it taking to generate one video?

Neethan54 avatar Sep 21 '24 16:09 Neethan54

Came to ask about the same thing. I have 56GB of VRAM, but it's OOMing after filling just the first 24GB card.

The readme says to disable 'enable_sequential_cpu_offload', but I can't see where to do that for the Gradio demo as used on Hugging Face; it looks to exist only hard-coded in the CLI Python script.

sammcj avatar Sep 22 '24 00:09 sammcj

As long as that line of code is not present, it is disabled by default; in the Gradio demo it is disabled, but this will significantly increase video memory usage, exceeding 24G.

zRzRzRzRzRzRzR avatar Sep 22 '24 06:09 zRzRzRzRzRzRzR
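For context, the line of code being referred to is the diffusers sequential-offload call on the loaded pipeline; a minimal sketch (the variable name pipe is taken from the repo's cli_demo):

# When this call is present, weights are streamed to the GPU block by block,
# which keeps VRAM low but slows generation. When it is absent, the whole
# pipeline stays on the GPU and memory use can exceed 24G.
pipe.enable_sequential_cpu_offload()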

Feature

Do you want to split a single model across different GPUs? In our cli_demo there is an explanation of the device map option; a sketch follows below.

zRzRzRzRzRzRzR avatar Sep 22 '24 06:09 zRzRzRzRzRzRzR
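As a rough illustration of the device-map approach, here is a minimal sketch based on cli_demo and the modification shown later in this thread; the model ID, prompt, and output path are illustrative placeholders:

import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

# device_map="balanced" lets accelerate spread the pipeline's components
# across all visible GPUs instead of placing everything on cuda:0.
pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b",          # illustrative model ID
    torch_dtype=torch.bfloat16,
    device_map="balanced",
)

# Do not add pipe.to("cuda") or pipe.enable_sequential_cpu_offload() here;
# the working multi-GPU version later in this thread removes both calls.

video = pipe(
    prompt="A panda playing a guitar in a bamboo forest",  # illustrative prompt
    num_inference_steps=50,
).frames[0]
export_to_video(video, "output.mp4", fps=8)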

As a feature, it would be great to be able to tick a box in the Gradio UI that lets it use all available NVIDIA GPUs, so you can use more than the VRAM available on the first card.

sammcj avatar Sep 22 '24 22:09 sammcj

I have two GPUs with 24G HBM each.

At first, I modified the code as below:

inference/cli_demo.py


     elif generate_type == "t2v":
-        pipe = CogVideoXPipeline.from_pretrained(model_path, torch_dtype=dtype)
+        pipe = CogVideoXPipeline.from_pretrained(model_path, torch_dtype=dtype, device_map="balanced")

 
-    # pipe.to("cuda")
-    pipe.enable_sequential_cpu_offload()
+    pipe.to("cuda")
+    #pipe.enable_sequential_cpu_offload()
 

But it still failed with the error below (the GPUs hadn't reached the HBM limit yet):

Loading pipeline components...: 100%|████████| 5/5 [03:31<00:00, 42.37s/it]
Traceback (most recent call last):
  File "/home/jovyan/CogVideo/inference/cli_demo.py", line 177, in <module>
    generate_video(
  File "/home/jovyan/CogVideo/inference/cli_demo.py", line 99, in generate_video
    pipe.to("cuda")
  File "/opt/conda/lib/python3.11/site-packages/diffusers/pipelines/pipeline_utils.py", line 396, in to
    raise ValueError(
ValueError: It seems like you have activated sequential model offloading by calling `enable_sequential_cpu_offload`, but are now attempting to move the pipeline to GPU. This is not compatible with offloading. Please, move your pipeline `.to('cpu')` or consider removing the move altogether if you use sequential offloading.

I commented out all of the pipe.* calls as below, and then it finally worked:

+    #pipe.to("cuda")
+    #pipe.enable_sequential_cpu_offload()

panpan0000 avatar Sep 25 '24 03:09 panpan0000

Yes, if you are running on multiple GPUs, as mentioned in our readme, you must delete the call to

enable_sequential_cpu_offload

zRzRzRzRzRzRzR avatar Sep 25 '24 03:09 zRzRzRzRzRzRzR