CUDA OOM and possible solution -- diffusers cli_demo.py with Nvidia 3090 24GB
System Info / 系統信息
Thanks very much for releasing this great work!
In case this helps anyone else:
The diffusers cli_demo.py raised the CUDA OOM error below on an RTX 3090 with 24GB of VRAM using this command:
python cli_demo.py --prompt "A fish swimming underwater through a colorful coral reef. Sun is shining brightly through the water. It is a beautiful scene suitable for use in an eye-catching television advertisement." --model_path THUDM/CogVideoX-2b --num_inference_steps 50
... but it works, with barely enough free VRAM, after this small adjustment -- set the env var PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True prior to running:
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python cli_demo.py --prompt "A fish swimming underwater through a colorful coral reef. Sun is shining brightly through the water. It is a beautiful scene suitable for use in an eye-catching television advertisement." --model_path THUDM/CogVideoX-2b --num_inference_steps 50
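If you launch from a Python script or notebook rather than the shell, the same allocator option can be set in code; a minimal sketch (the variable has to be in place before the CUDA caching allocator is first used, so set it before importing torch):

import os

# Must be set before the CUDA caching allocator initializes,
# so do it before importing torch (or at least before any CUDA allocation).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # imported after the env var on purpose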
CUDA OOM error (without the env var set):
return torch._C._nn.pad(input, pad, mode, value)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.66 GiB. GPU 0 has a total capacity of 23.48 GiB of which 1.53 GiB is free. Including non-PyTorch memory, this process has 21.81 GiB memory in use. Of the allocated memory 18.92 GiB is allocated by PyTorch, and 2.58 GiB is reserved by PyTorch but unallocated.
nvidia-smi before running -- only 135MiB VRAM used:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3090 Off | 00000000:09:00.0 Off | N/A |
| 0% 32C P8 17W / 350W | 135MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
nvidia-smi while running with PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True -- 23GB used but no CUDA OOM:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3090 Off | 00000000:09:00.0 Off | N/A |
| 0% 40C P2 157W / 350W | 23335MiB / 24576MiB | 4% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
This was using a fresh conda environment with dependencies installed by pip from requirements.txt (side note: please include imageio in requirements.txt and also open the version range for opencv-python to >=4.10 as 4.10 was yanked from upstream)
Information / 问题信息
- [X] The official example scripts / 官方的示例脚本
- [ ] My own modified scripts / 我自己修改的脚本和任务
Reproduction / 复现过程
Details above.
Expected behavior / 期待表现
Details above.
The generated video is great, by the way -- keep up the good work!
https://github.com/user-attachments/assets/047d3340-5e01-4120-a247-e76668a1703d
So, have you solved your OOM problem? When we test the demo, it actually needs 23.9GB, which is cutting it close and can cause occasional OOM.
Yes, setting this environment variable before running solved it for me:
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
and I see it using a little over 20GB with that setting with nvidia-smi:
20423MiB / 24576MiB
I am getting OOM using four 32GB GPUs. Using device_map="balanced" seems to split across 3 of the cards before throwing the OOM error.
Did you use pipe.enable_model_cpu_offload()? If not, it will use 36GB and cause this problem. If you want to use multiple GPUs, just remove pipe.enable_model_cpu_offload().
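For reference, a minimal sketch of the two setups being discussed (model id and prompt are placeholders; the device_map variant assumes a diffusers/accelerate version that supports balanced pipeline placement):

import torch
from diffusers import CogVideoXPipeline

# Single GPU: keep idle submodules on the CPU to stay under ~24GB
pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()

# Multi-GPU alternative: let accelerate balance the components across cards,
# and do NOT call enable_model_cpu_offload() in that case:
# pipe = CogVideoXPipeline.from_pretrained(
#     "THUDM/CogVideoX-2b", torch_dtype=torch.float16, device_map="balanced"
# )

video = pipe(prompt="A fish swimming through a coral reef", guidance_scale=6, num_inference_steps=50).frames[0]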
Both attempts (with and without pipe.enable_model_cpu_offload()) try to allocate the ~36GB and OOM. This also happens when swapping to sat/inference.sh with the sample .txt.
Try PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. And what are your NVIDIA driver and GPU? V100, I guess. It should work (although we only tested on 3090 and A100).
It also OOMs with that set, for both multi- and single-GPU attempts. The GPUs are AMD Instinct MI100s on ROCm 6.0. I do notice "Torch was not compiled with memory efficient attention..." in the logs, so I'm guessing it may just be an issue with the ROCm variant of torch :(
Which torch version can you use? Do 2.2, 2.3, and 2.4 all not work?
Torch 2.2.2, 2.3.1, and 2.4.0 all fail with the same attempted memory usage of ~36GB.
Just an FYI: if you install accelerate from the branch in the following PR, the Diffusers demo runs in ~18 GB. Context: https://github.com/huggingface/accelerate/pull/2994#issuecomment-2276734638
Code
import gc

import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video


def flush():
    gc.collect()
    torch.cuda.empty_cache()
    torch.cuda.reset_max_memory_allocated()
    torch.cuda.reset_peak_memory_stats()


def bytes_to_giga_bytes(bytes):
    return f"{(bytes / 1024 / 1024 / 1024):.3f}"


flush()

prompt = (
    "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. "
    "The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other "
    "pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, "
    "casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. "
    "The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical "
    "atmosphere of this unique musical performance."
)

pipe = CogVideoXPipeline.from_pretrained("/raid/aryan/CogVideoX-trial", torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()

video = pipe(prompt=prompt, guidance_scale=6, num_inference_steps=50).frames[0]

torch.cuda.empty_cache()
memory = bytes_to_giga_bytes(torch.cuda.memory_allocated())
max_memory = bytes_to_giga_bytes(torch.cuda.max_memory_allocated())
max_reserved = bytes_to_giga_bytes(torch.cuda.max_memory_reserved())
print(f"{memory=}")
print(f"{max_memory=}")
print(f"{max_reserved=}")
export_to_video(video, "output.mp4", fps=8)
The PR will hopefully be merged into accelerate main soon. If, for some reason, you cannot or do not want to use accelerate from the dev branch, you could do the following:
Code without accelerate dev branch requirement
import gc

import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video


def flush():
    gc.collect()
    torch.cuda.empty_cache()
    torch.cuda.reset_max_memory_allocated()
    torch.cuda.reset_peak_memory_stats()


def bytes_to_giga_bytes(bytes):
    return f"{(bytes / 1024 / 1024 / 1024):.3f}"


flush()

prompt = (
    "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. "
    "The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other "
    "pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, "
    "casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. "
    "The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical "
    "atmosphere of this unique musical performance."
)

pipe = CogVideoXPipeline.from_pretrained("/raid/aryan/CogVideoX-trial", torch_dtype=torch.float16).to("cuda")

latents = pipe(prompt=prompt, guidance_scale=6, num_inference_steps=50, output_type="latent", return_dict=False)[0]

# Free the transformer and text encoder before decoding so the VAE has room
pipe.transformer.to("cpu")
pipe.text_encoder.to("cpu")
torch.cuda.synchronize()
torch.cuda.empty_cache()

with torch.no_grad():
    video = pipe.decode_latents(latents, num_seconds=6)

# postprocess_video returns a batch of videos; take the first (and only) one
video = pipe.video_processor.postprocess_video(video=video, output_type="pil")[0]

torch.cuda.empty_cache()
memory = bytes_to_giga_bytes(torch.cuda.memory_allocated())
max_memory = bytes_to_giga_bytes(torch.cuda.max_memory_allocated())
max_reserved = bytes_to_giga_bytes(torch.cuda.max_memory_reserved())
print(f"{memory=}")
print(f"{max_memory=}")
print(f"{max_reserved=}")
export_to_video(video, "output.mp4", fps=8)
Additionally, you can play around with the device_map parameter if you have multiple GPUs, or quantize the text encoder or the full transformer. Denoising only requires about 12-14 GB of memory (if using CPU offloading), but it's the VAE that takes the most memory (1 GB model + 17 GB for decoding). We are working on figuring out tiled decoding but have nothing promising yet. I would imagine that CogVideoX would be runnable on a free-tier T4 or lower if someone can figure out tiling, so if anyone's got ideas, feel free to PR at Diffusers.
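To illustrate the text-encoder quantization idea mentioned above, here is a rough, untested sketch (it assumes bitsandbytes is installed and that an 8-bit T5 encoder can simply be passed into the pipeline like this):

import torch
from transformers import BitsAndBytesConfig, T5EncoderModel
from diffusers import CogVideoXPipeline

# Load the T5 text encoder in 8-bit to shave a few GB off the peak
text_encoder = T5EncoderModel.from_pretrained(
    "THUDM/CogVideoX-2b",
    subfolder="text_encoder",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
)

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-2b",
    text_encoder=text_encoder,
    torch_dtype=torch.float16,
)
# Combine with enable_model_cpu_offload() or device_map as appropriate; note that
# offloading hooks may not support moving 8-bit modules in all versions.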
- Running your provided non-dev-branch code and only swapping out the model for the HF THUDM/CogVideoX-2b still OOMs with the same numbers.
- Adding PYTORCH_NO_MEMORY_CACHING=1 and pipe.enable_model_cpu_offload() also OOMs.
- Installing from source after cloning huggingface/accelerate and checking out test-clear-memory-cpu-offload, running pip install . and executing, also OOMs.
- Doing all of the above combined but adding device_map="balanced" also OOMs.
I am perfectly able to run the 2nd example above on an A4500 (20GB) with a fresh Pytorch 2.3 install and diffusers:main. We'll have to try and debug what's going wrong in your setup.
You mention:
I do notice "Torch was not compiled with memory efficient attention..." in the logs, so I'm guessing it may just be an issue with the ROCm variant of torch :(
Can you paste the error stack trace here? Would like to know at what point it's failing. If it's failing somewhere in attention, it's probably because you're unable to use FA2 - which is necessary to be able to run with low memory. Can you try setting up pytorch so that it allows you to run FA2?
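One quick way to see which SDPA backends the installed torch build has enabled (a small diagnostic sketch; these flags are available in torch 2.x):

import torch

# If both flash and memory-efficient SDP are unavailable, F.scaled_dot_product_attention
# falls back to the math path, which materializes the full attention matrix in memory.
print("flash SDP:         ", torch.backends.cuda.flash_sdp_enabled())
print("mem-efficient SDP: ", torch.backends.cuda.mem_efficient_sdp_enabled())
print("math SDP:          ", torch.backends.cuda.math_sdp_enabled())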
Installed flash_attn 2.0.4 built from source, not sure how to force torch to let me use it
Output
/home/tnt3530/anaconda3/envs/cog/lib/python3.12/site-packages/torch/cuda/memory.py:343: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
warnings.warn(
The config attributes {'mid_block_add_attention': True} were passed to AutoencoderKLCogVideoX, but are not expected and will be ignored. Please verify your config.json configuration file.
Loading pipeline components...: 0%| | 0/5 [00:00<?, ?it/s]
The config attributes {'mid_block_add_attention': True} were passed to AutoencoderKLCogVideoX, but are not expected and will be ignored. Please verify your config.json configuration file.
Loading pipeline components...: 20%|██ | 1/5 [00:00<00:03, 1.03it/s]
Loading pipeline components...: 60%|██████ | 3/5 [00:01<00:00, 3.24it/s]
Loading pipeline components...: 80%|████████ | 4/5 [00:06<00:01, 1.98s/it]
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards: 50%|█████ | 1/2 [00:05<00:05, 5.55s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:10<00:00, 5.17s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:10<00:00, 5.23s/it]
Loading pipeline components...: 100%|██████████| 5/5 [00:16<00:00, 4.82s/it]
Loading pipeline components...: 100%|██████████| 5/5 [00:16<00:00, 3.36s/it]
0%| | 0/50 [00:00<?, ?it/s]
0%| | 0/50 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/home/tnt3530/Documents/CogVideo/provided.py", line 28, in <module>
latents = pipe(prompt=prompt, guidance_scale=6, num_inference_steps=50, output_type="latent", return_dict=False)[0]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tnt3530/anaconda3/envs/cog/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/tnt3530/anaconda3/envs/cog/lib/python3.12/site-packages/diffusers/pipelines/cogvideo/pipeline_cogvideox.py", line 629, in __call__
noise_pred = self.transformer(
^^^^^^^^^^^^^^^^^
File "/home/tnt3530/anaconda3/envs/cog/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tnt3530/anaconda3/envs/cog/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tnt3530/anaconda3/envs/cog/lib/python3.12/site-packages/accelerate/hooks.py", line 170, in new_forward
output = module._old_forward(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tnt3530/anaconda3/envs/cog/lib/python3.12/site-packages/diffusers/models/transformers/cogvideox_transformer_3d.py", line 326, in forward
hidden_states, encoder_hidden_states = block(
^^^^^^
File "/home/tnt3530/anaconda3/envs/cog/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tnt3530/anaconda3/envs/cog/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tnt3530/anaconda3/envs/cog/lib/python3.12/site-packages/accelerate/hooks.py", line 170, in new_forward
output = module._old_forward(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tnt3530/anaconda3/envs/cog/lib/python3.12/site-packages/diffusers/models/transformers/cogvideox_transformer_3d.py", line 123, in forward
attn_output = self.attn1(
^^^^^^^^^^^
File "/home/tnt3530/anaconda3/envs/cog/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tnt3530/anaconda3/envs/cog/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tnt3530/anaconda3/envs/cog/lib/python3.12/site-packages/accelerate/hooks.py", line 170, in new_forward
output = module._old_forward(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tnt3530/anaconda3/envs/cog/lib/python3.12/site-packages/diffusers/models/attention_processor.py", line 490, in forward
return self.processor(
^^^^^^^^^^^^^^^
File "/home/tnt3530/anaconda3/envs/cog/lib/python3.12/site-packages/diffusers/models/attention_processor.py", line 2216, in __call__
hidden_states = F.scaled_dot_product_attention(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: HIP out of memory. Tried to allocate 35.31 GiB. GPU 1 has a total capacity of 31.98 GiB of which 24.15 GiB is free. Of the allocated memory 4.58 GiB is allocated by PyTorch, and 481.62 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_HIP_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
EDIT: SDPA is disabled on AMD cards because they decided that only cool kids can play with fun toys https://github.com/pytorch/pytorch/issues/112997
Updating to torch 2.5.0 nightly makes it attempt to allocate 70.63GB (35 per GPU now). Forcefully disabling SDPA via torch.backends.cuda.enable_flash_sdp(False) doesn't help either.
Yeah, quite unfortunate that AMD doesn't support SDPA :( From the error logs, it seems like that's the only bottleneck making it impossible to run Cog for you. Let me know if you find any issues on the Diffusers side of things that I can help with.
Thank you for your great work. I have a question: can inference run across multiple smaller GPUs (RTX 3060 Ti), or must at least one GPU have >= 20GB? Thank you.
At least one GPU must have >= 20GB because of the VAE.
Thank you. Is there any method to split the VAE module?
Not yet; we will try tiled VAE. We tested balanced device mapping across 3 GPUs with 20GB each; you can try whether it runs on 3 x 16GB GPUs.
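For context, other diffusers VAEs expose tiled decoding through an enable_tiling() switch; if the CogVideoX VAE gains the same hook, usage would presumably look something like this (hypothetical for CogVideoX at the time of writing):

import torch
from diffusers import CogVideoXPipeline

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()
# Hypothetical for this VAE: decode in spatial tiles so the decoder never
# holds the full-resolution activations at once.
pipe.vae.enable_tiling()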
Thank you. If there is any progress on tiled VAE, please let me know.
Try installing the diffusers and accelerate libs from source and check the CLI demo in the CogVideoX-dev branch now; it only needs 12GB for inference.
Hi, why do I still need ~36GB of GPU memory even though I set pipe.enable_model_cpu_offload()?
Did you follow the cli_demo.py code, and are you using an NVIDIA Ampere or newer GPU like a 3090 or 4090?
I am running on an NVIDIA A100, and the demo code is from Hugging Face: https://huggingface.co/THUDM/CogVideoX-2b
You can try reinstalling the diffusers and accelerate libs from source; an A100 should work when using inference/cli_demo.py from this GitHub repo.
We have updated the repository; dependencies can now be installed from pip. Updating the dependencies and retrying cli_demo should solve the problem. If there are any new issues, please open a new issue.