Request for inference using multiple GPUs
Feature request
Currently the inference scripts are focused on running on a single GPU. It would be great if we could leverage multiple GPUs for inference; kindly enable this and update the scripts accordingly.
Motivation
Multi-GPU clusters can speed up inference times.
Your contribution
What's the best way to do this?
Hi @GeeveGeorge, how much time does it take to generate one video?
Came to ask about the same thing. I have 56 GB of VRAM, but it OOMs after filling just the first 24 GB card.
The readme says to disable enable_sequential_cpu_offload, but I can't see where to do that for the Gradio demo as used on Hugging Face; it only appears hard-coded in the CLI Python script.
As long as that line of code is not present, offloading is disabled by default; it is disabled in the Gradio demo. But disabling it will significantly increase VRAM usage, exceeding 24 GB.
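For reference, a minimal sketch of the two configurations being discussed, assuming the diffusers CogVideoXPipeline and the THUDM/CogVideoX-5b checkpoint (adjust the model path to whatever you are running):

```python
import torch
from diffusers import CogVideoXPipeline

# Assumption: example model path; substitute your own checkpoint.
pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16
)

# Option A: low-VRAM single GPU. Weights stay on the CPU and are streamed
# to the GPU layer by layer; slower, but fits well under 24 GB.
pipe.enable_sequential_cpu_offload()

# Option B: keep the whole pipeline resident on one GPU. Faster, but peak
# VRAM use exceeds 24 GB, as noted above.
# pipe.to("cuda")
```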
Feature
Do you want to split a single model across different GPUs? In our cli_demo there is an explanation of the device_map option.
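For anyone landing here, a rough sketch of what that device_map-based loading looks like with diffusers' pipeline-level device placement (the model path is only an example, and as far as I know "balanced" is the strategy pipelines accept):

```python
import torch
from diffusers import CogVideoXPipeline

# Let diffusers place the pipeline components across all visible GPUs.
# When using device_map, do not call pipe.to("cuda") or
# pipe.enable_sequential_cpu_offload() afterwards.
pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b",   # example model path
    torch_dtype=torch.bfloat16,
    device_map="balanced",
)

# Inspect how the components were spread across devices.
print(pipe.hf_device_map)
```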
As a feature, it would be great if you could tick a box in the Gradio UI that let it use all available NVIDIA GPUs, so you could use more than the VRAM available on the first card.
I have two GPUs with 24 GB of HBM each.
At first, I modified the code as below:
inference/cli_demo.py
elif generate_type == "t2v":
- pipe = CogVideoXPipeline.from_pretrained(model_path, torch_dtype=dtype)
+ pipe = CogVideoXPipeline.from_pretrained(model_path, torch_dtype=dtype, device_map="balanced")
- # pipe.to("cuda")
- pipe.enable_sequential_cpu_offload()
+ pipe.to("cuda")
+ # pipe.enable_sequential_cpu_offload()
But it still failed with the error below (the GPUs had not reached their HBM limit yet):
Loading pipeline components...: 100%|████████| 5/5 [03:31<00:00, 42.37s/it]
Traceback (most recent call last):
File "/home/jovyan/CogVideo/inference/cli_demo.py", line 177, in <module>
generate_video(
File "/home/jovyan/CogVideo/inference/cli_demo.py", line 99, in generate_video
pipe.to("cuda")
File "/opt/conda/lib/python3.11/site-packages/diffusers/pipelines/pipeline_utils.py", line 396, in to
raise ValueError(
ValueError: It seems like you have activated sequential model offloading by calling `enable_sequential_cpu_offload`, but are now attempting to move the pipeline to GPU. This is not compatible with offloading. Please, move your pipeline `.to('cpu')` or consider removing the move altogether if you use sequential offloading.
I removed all the pipe.* calls as below, and then it finally worked:
+ # pipe.to("cuda")
+ # pipe.enable_sequential_cpu_offload()
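For completeness, here is a self-contained sketch of the working configuration described above: load with device_map="balanced" and drop both pipe.to("cuda") and enable_sequential_cpu_offload. The model path, prompt, and generation parameters are only placeholders.

```python
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

# Load with device_map="balanced" so the components are split across both
# 24 GB GPUs. Do not call pipe.to("cuda") or enable_sequential_cpu_offload().
pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b",   # example model path
    torch_dtype=torch.bfloat16,
    device_map="balanced",
)

# Placeholder prompt and settings; tune to your use case.
video = pipe(
    prompt="A panda playing a guitar in a bamboo forest",
    num_inference_steps=50,
    guidance_scale=6.0,
).frames[0]

export_to_video(video, "output.mp4", fps=8)
```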
Yes, if you are running on multiple GPUs, as mentioned in our readme, you must remove the enable_sequential_cpu_offload call.