CUDA OOM and possible solution -- diffusers cli_demo.py with Nvidia 3090 24GB
System Info / 系統信息
Thanks very much for releasing this great work!
In case this helps anyone else:
The diffusers cli_demo.py raised the CUDA OOM error below on an RTX 3090 with 24GB of VRAM using this command:
python cli_demo.py --prompt "A fish swimming underwater through a colorful coral reef. Sun is shining brightly through the water. It is a beautiful scene suitable for use in an eye-catching television advertisement." --model_path THUDM/CogVideoX-2b --num_inference_steps 50
... but it works, with barely enough free VRAM, after this small adjustment -- set the env var PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True prior to running:
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python cli_demo.py --prompt "A fish swimming underwater through a colorful coral reef. Sun is shining brightly through the water. It is a beautiful scene suitable for use in an eye-catching television advertisement." --model_path THUDM/CogVideoX-2b --num_inference_steps 50
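If you launch from a Python script or notebook rather than the shell, the same allocator option can be set in code; a minimal sketch (the variable has to be in place before the CUDA caching allocator is first used, so set it before importing torch):

import os

# Must be set before the CUDA caching allocator initializes,
# so do it before importing torch (or at least before any CUDA allocation).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # imported after the env var on purpose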
CUDA OOM error (without the env var set):
return torch._C._nn.pad(input, pad, mode, value)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.66 GiB. GPU 0 has a total capacity of 23.48 GiB of which 1.53 GiB is free. Including non-PyTorch memory, this process has 21.81 GiB memory in use. Of the allocated memory 18.92 GiB is allocated by PyTorch, and 2.58 GiB is reserved by PyTorch but unallocated.
nvidia-smi before running -- only 135MiB VRAM used:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3090 Off | 00000000:09:00.0 Off | N/A |
| 0% 32C P8 17W / 350W | 135MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
nvidia-smi while running with PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True -- 23GB used but no CUDA OOM:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3090 Off | 00000000:09:00.0 Off | N/A |
| 0% 40C P2 157W / 350W | 23335MiB / 24576MiB | 4% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
This was using a fresh conda environment with dependencies installed by pip from requirements.txt (side note: please include imageio in requirements.txt and also open the version range for opencv-python to >=4.10 as 4.10 was yanked from upstream)
Information / 问题信息
- [X] The official example scripts / 官方的示例脚本
- [ ] My own modified scripts / 我自己修改的脚本和任务
Reproduction / 复现过程
Details above.
Expected behavior / 期待表现
Details above.
The generated video is great, by the way -- keep up the good work!
https://github.com/user-attachments/assets/047d3340-5e01-4120-a247-e76668a1703d
So, have you solved your OOM problem? When we test the demo, it actually needs 23.9GB, which is cutting it close and can cause occasional OOM.
Yes, setting this environment variable before running solved it for me:
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
and I see it using a little over 20GB with that setting with nvidia-smi:
20423MiB / 24576MiB
I am getting OOM using four 32GB GPUs. Using device_map="balanced" seems to split across 3 of the cards before throwing the OOM error.
Did you use pipe.enable_model_cpu_offload()? If not, it will use 36GB and cause this problem. If you want to use multiple GPUs, just remove pipe.enable_model_cpu_offload().
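For reference, a minimal sketch of the two setups being discussed (model id and prompt are placeholders; the device_map variant assumes a diffusers/accelerate version that supports balanced pipeline placement):

import torch
from diffusers import CogVideoXPipeline

# Single GPU: keep idle submodules on the CPU to stay under ~24GB
pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()

# Multi-GPU alternative: let accelerate balance the components across cards,
# and do NOT call enable_model_cpu_offload() in that case:
# pipe = CogVideoXPipeline.from_pretrained(
#     "THUDM/CogVideoX-2b", torch_dtype=torch.float16, device_map="balanced"
# )

video = pipe(prompt="A fish swimming through a coral reef", guidance_scale=6, num_inference_steps=50).frames[0]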
Both attempts (with and without pipe.enable_model_cpu_offload()) try to allocate the ~36GB and OOM. This also happens when swapping to sat/inference.sh with the sample .txt.
Try PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. And what are your NVIDIA driver and GPU? V100, I guess. It should work (although we only tested on 3090 and A100).
It also OOMs with that set, for both multi- and single-GPU attempts. The GPUs are AMD Instinct MI100s on ROCm 6.0. I do notice "Torch was not compiled with memory efficient attention..." in the logs, so I'm guessing it may just be an issue with the ROCm variant of torch :(
Which torch version can you use? Do 2.2, 2.3, and 2.4 all not work?
Torch 2.2.2, 2.3.1, and 2.4.0 all fail with the same attempted memory usage of ~36GB.
Just an FYI: if you install accelerate from the branch in the following PR, the Diffusers demo runs in ~18 GB. Context: https://github.com/huggingface/accelerate/pull/2994#issuecomment-2276734638
Code
import gc

import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video


def flush():
    gc.collect()
    torch.cuda.empty_cache()
    torch.cuda.reset_max_memory_allocated()
    torch.cuda.reset_peak_memory_stats()


def bytes_to_giga_bytes(bytes):
    return f"{(bytes / 1024 / 1024 / 1024):.3f}"


flush()

prompt = (
    "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. "
    "The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other "
    "pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, "
    "casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. "
    "The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical "
    "atmosphere of this unique musical performance."
)

pipe = CogVideoXPipeline.from_pretrained("/raid/aryan/CogVideoX-trial", torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()

video = pipe(prompt=prompt, guidance_scale=6, num_inference_steps=50).frames[0]

torch.cuda.empty_cache()
memory = bytes_to_giga_bytes(torch.cuda.memory_allocated())
max_memory = bytes_to_giga_bytes(torch.cuda.max_memory_allocated())
max_reserved = bytes_to_giga_bytes(torch.cuda.max_memory_reserved())
print(f"{memory=}")
print(f"{max_memory=}")
print(f"{max_reserved=}")
export_to_video(video, "output.mp4", fps=8)
The PR will hopefully be merged into accelerate main soon. If, for some reason, you cannot or do not want to use accelerate from the dev branch, you could do the following:
Code without accelerate dev branch requirement
import gc

import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video


def flush():
    gc.collect()
    torch.cuda.empty_cache()
    torch.cuda.reset_max_memory_allocated()
    torch.cuda.reset_peak_memory_stats()


def bytes_to_giga_bytes(bytes):
    return f"{(bytes / 1024 / 1024 / 1024):.3f}"


flush()

prompt = (
    "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. "
    "The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other "
    "pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, "
    "casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. "
    "The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical "
    "atmosphere of this unique musical performance."
)

pipe = CogVideoXPipeline.from_pretrained("/raid/aryan/CogVideoX-trial", torch_dtype=torch.float16).to("cuda")

latents = pipe(prompt=prompt, guidance_scale=6, num_inference_steps=50, output_type="latent", return_dict=False)[0]

# Free the transformer and text encoder before decoding so the VAE has room
pipe.transformer.to("cpu")
pipe.text_encoder.to("cpu")
torch.cuda.synchronize()
torch.cuda.empty_cache()

with torch.no_grad():
    video = pipe.decode_latents(latents, num_seconds=6)

# postprocess_video returns a batch of videos; take the first (and only) one
video = pipe.video_processor.postprocess_video(video=video, output_type="pil")[0]

torch.cuda.empty_cache()
memory = bytes_to_giga_bytes(torch.cuda.memory_allocated())
max_memory = bytes_to_giga_bytes(torch.cuda.max_memory_allocated())
max_reserved = bytes_to_giga_bytes(torch.cuda.max_memory_reserved())
print(f"{memory=}")
print(f"{max_memory=}")
print(f"{max_reserved=}")
export_to_video(video, "output.mp4", fps=8)
Additionally, you can play around with the device_map parameter if you have multiple GPUs, or quantize the text encoder or the full transformer. Denoising only requires about 12-14 GB of memory (if using CPU offloading), but it's the VAE that takes the most memory (1 GB model + 17 GB for decoding). We are working on figuring out tiled decoding but have nothing promising yet. I would imagine that CogVideoX would be runnable on a free-tier T4 or lower if someone can figure out tiling, so if anyone's got ideas, feel free to PR at Diffusers.
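To illustrate the text-encoder quantization idea mentioned above, here is a rough, untested sketch (it assumes bitsandbytes is installed and that an 8-bit T5 encoder can simply be passed into the pipeline like this):

import torch
from transformers import BitsAndBytesConfig, T5EncoderModel
from diffusers import CogVideoXPipeline

# Load the T5 text encoder in 8-bit to shave a few GB off the peak
text_encoder = T5EncoderModel.from_pretrained(
    "THUDM/CogVideoX-2b",
    subfolder="text_encoder",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
)

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-2b",
    text_encoder=text_encoder,
    torch_dtype=torch.float16,
)
# Combine with enable_model_cpu_offload() or device_map as appropriate; note that
# offloading hooks may not support moving 8-bit modules in all versions.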
- Running your provided non-dev-branch code and only swapping out the model for the HF THUDM/CogVideoX-2b still OOMs with the same numbers.
- Adding PYTORCH_NO_MEMORY_CACHING=1 and pipe.enable_model_cpu_offload() also OOMs.
- Installing from source after cloning huggingface/accelerate and checking out test-clear-memory-cpu-offload, running pip install . and executing, also OOMs.
- Doing all of the above combined but adding device_map="balanced" also OOMs.
I am perfectly able to run the 2nd example above on an A4500 (20GB) with a fresh Pytorch 2.3 install and diffusers:main. We'll have to try and debug what's going wrong in your setup.
You mention:
I do notice "Torch was not compiled with memory efficient attention..." in the logs, so I'm guessing it may just be an issue with the ROCm variant of torch :(
Can you paste the error stack trace here? Would like to know at what point it's failing. If it's failing somewhere in attention, it's probably because you're unable to use FA2 - which is necessary to be able to run with low memory. Can you try setting up pytorch so that it allows you to run FA2?
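One quick way to see which SDPA backends the installed torch build has enabled (a small diagnostic sketch; these flags are available in torch 2.x):

import torch

# If both flash and memory-efficient SDP are unavailable, F.scaled_dot_product_attention
# falls back to the math path, which materializes the full attention matrix in memory.
print("flash SDP:         ", torch.backends.cuda.flash_sdp_enabled())
print("mem-efficient SDP: ", torch.backends.cuda.mem_efficient_sdp_enabled())
print("math SDP:          ", torch.backends.cuda.math_sdp_enabled())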
Installed flash_attn 2.0.4 built from source, not sure how to force torch to let me use it
Output
/home/tnt3530/anaconda3/envs/cog/lib/python3.12/site-packages/torch/cuda/memory.py:343: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
warnings.warn(
The config attributes {'mid_block_add_attention': True} were passed to AutoencoderKLCogVideoX, but are not expected and will be ignored. Please verify your config.json configuration file.
Loading pipeline components...: 0%| | 0/5 [00:00<?, ?it/s]
The config attributes {'mid_block_add_attention': True} were passed to AutoencoderKLCogVideoX, but are not expected and will be ignored. Please verify your config.json configuration file.
Loading pipeline components...: 20%|██ | 1/5 [00:00<00:03, 1.03it/s]
Loading pipeline components...: 60%|██████ | 3/5 [00:01<00:00, 3.24it/s]
Loading pipeline components...: 80%|████████ | 4/5 [00:06<00:01, 1.98s/it]
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards: 50%|█████ | 1/2 [00:05<00:05, 5.55s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:10<00:00, 5.17s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:10<00:00, 5.23s/it]
Loading pipeline components...: 100%|██████████| 5/5 [00:16<00:00, 4.82s/it]
Loading pipeline components...: 100%|██████████| 5/5 [00:16<00:00, 3.36s/it]
0%| | 0/50 [00:00<?, ?it/s]
0%| | 0/50 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/home/tnt3530/Documents/CogVideo/provided.py", line 28, in <module>
latents = pipe(prompt=prompt, guidance_scale=6, num_inference_steps=50, output_type="latent", return_dict=False)[0]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tnt3530/anaconda3/envs/cog/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/tnt3530/anaconda3/envs/cog/lib/python3.12/site-packages/diffusers/pipelines/cogvideo/pipeline_cogvideox.py", line 629, in __call__
noise_pred = self.transformer(
^^^^^^^^^^^^^^^^^
File "/home/tnt3530/anaconda3/envs/cog/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tnt3530/anaconda3/envs/cog/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tnt3530/anaconda3/envs/cog/lib/python3.12/site-packages/accelerate/hooks.py", line 170, in new_forward
output = module._old_forward(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tnt3530/anaconda3/envs/cog/lib/python3.12/site-packages/diffusers/models/transformers/cogvideox_transformer_3d.py", line 326, in forward
hidden_states, encoder_hidden_states = block(
^^^^^^
File "/home/tnt3530/anaconda3/envs/cog/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tnt3530/anaconda3/envs/cog/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tnt3530/anaconda3/envs/cog/lib/python3.12/site-packages/accelerate/hooks.py", line 170, in new_forward
output = module._old_forward(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tnt3530/anaconda3/envs/cog/lib/python3.12/site-packages/diffusers/models/transformers/cogvideox_transformer_3d.py", line 123, in forward
attn_output = self.attn1(
^^^^^^^^^^^
File "/home/tnt3530/anaconda3/envs/cog/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tnt3530/anaconda3/envs/cog/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tnt3530/anaconda3/envs/cog/lib/python3.12/site-packages/accelerate/hooks.py", line 170, in new_forward
output = module._old_forward(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tnt3530/anaconda3/envs/cog/lib/python3.12/site-packages/diffusers/models/attention_processor.py", line 490, in forward
return self.processor(
^^^^^^^^^^^^^^^
File "/home/tnt3530/anaconda3/envs/cog/lib/python3.12/site-packages/diffusers/models/attention_processor.py", line 2216, in __call__
hidden_states = F.scaled_dot_product_attention(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: HIP out of memory. Tried to allocate 35.31 GiB. GPU 1 has a total capacity of 31.98 GiB of which 24.15 GiB is free. Of the allocated memory 4.58 GiB is allocated by PyTorch, and 481.62 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_HIP_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
EDIT: SDPA is disabled on AMD cards because they decided that only cool kids can play with fun toys https://github.com/pytorch/pytorch/issues/112997
Updating to torch 2.5.0 nightly makes it attempt to allocate 70.63GB (35 per GPU now). Forcefully disabling SDPA via torch.backends.cuda.enable_flash_sdp(False) doesn't help either.
Yeah, quite unfortunate that AMD doesn't support SDPA :( From the error logs, it seems like that's the only bottleneck making it impossible to run Cog for you. Let me know if you find any issues on the Diffusers side of things that I can help with.
Thank you for your great work. I have a question: can inference run across multiple smaller GPUs (RTX 3060 Ti), or must at least one GPU have >= 20GB? Thank you.
At least one GPU must have >= 20GB because of the VAE.
Thank you. Is there any method to split the VAE module?
Not yet; we will try tiled VAE. We tested balanced device mapping across 3 GPUs with 20GB each; you can try whether it runs on 3 x 16GB GPUs.
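For context, other diffusers VAEs expose tiled decoding through an enable_tiling() switch; if the CogVideoX VAE gains the same hook, usage would presumably look something like this (hypothetical for CogVideoX at the time of writing):

import torch
from diffusers import CogVideoXPipeline

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()
# Hypothetical for this VAE: decode in spatial tiles so the decoder never
# holds the full-resolution activations at once.
pipe.vae.enable_tiling()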
Thank you. If there is any progress on tiled VAE, please let me know.
Try installing the diffusers and accelerate libs from source and check the CLI demo in the CogVideoX-dev branch now; it only needs 12GB for inference.
Hi, why do I still need ~36GB of GPU memory even though I set pipe.enable_model_cpu_offload()?
Did you follow the cli_demo.py code, and are you using an NVIDIA Ampere or newer GPU like a 3090 or 4090?
I am running on an NVIDIA A100, and the demo code is from Hugging Face: https://huggingface.co/THUDM/CogVideoX-2b
You can try reinstalling the diffusers and accelerate libs from source; an A100 should work when using inference/cli_demo.py from this GitHub repo.
We have updated the repository; dependencies can now be installed from pip. Updating the dependencies and retrying cli_demo should solve the problem. If there are any new issues, please open a new issue.