
Question: How to run on AMD GPUs? How to enable multi GPU use?

Open jin-eld opened this issue 1 year ago • 14 comments

Hi,

thank you for releasing this exciting project! I am trying to run inference on three Instinct MI25 cards with 16GB each; however, I end up in an out-of-memory situation.

For the 2B model: torch.cuda.OutOfMemoryError: HIP out of memory. Tried to allocate 35.31 GiB. GPU

For the 5B model it's the same, but there it tries to allocate over 70GB.

I am using cli_demo.py and I tried commenting out the CPU offload lines:

--- a/inference/cli_demo.py
+++ b/inference/cli_demo.py
@@ -53,8 +53,8 @@ def generate_video(
 
     # 3. Enable CPU offload for the model, enable tiling.
     # turn off if you have multiple GPUs or enough GPU memory(such as H100) and it will cost less time in inference
-    pipe.enable_model_cpu_offload()
-    pipe.enable_sequential_cpu_offload()
+    #pipe.enable_model_cpu_offload()
+    #pipe.enable_sequential_cpu_offload()
     pipe.vae.enable_slicing()
     pipe.vae.enable_tiling()

With the CPU offload lines left in, I do not see the other GPUs participating in anything; only GPU0 shows some activity before I run into the OOM situation. With the CPU offload lines commented out, CPU usage climbs to 100% while all three GPUs stay idle.

I am somewhat confused; perhaps I am misinterpreting the data in the following table: [screenshot of the memory consumption table]

Should inference work with 16GB cards? Is there anything else that I am missing?

jin-eld avatar Sep 01 '24 21:09 jin-eld

Hi, regarding the second message, I don't really know who sent it. Our team has not tested AMD GPU devices; the optimizations above were only tested on NVIDIA recently and have not been tested on AMD. If it is not an NVIDIA GPU, we recommend using SAT, because we cannot ensure that all of the diffusers code paths are compatible with AMD. You can follow the latest algorithm support in diffusers. Currently, the minimum memory usage on NVIDIA GPUs is 5GB.

zRzRzRzRzRzRzR avatar Sep 02 '24 01:09 zRzRzRzRzRzRzR

Hi,

regarding the second message, I don't really know who sent it.

GitHub has already removed that account and deleted the message after my complaint; you probably did not see it because they acted quickly :) Anyway...

If it is not an NVIDIA GPU, we recommend using SAT, because we cannot ensure that all of the diffusers code paths are compatible with AMD. You can follow the latest algorithm support in diffusers. Currently, the minimum memory usage on NVIDIA GPUs is 5GB.

I wanted to try it, but so far I have failed to install it due to the DeepSpeed dependency, which again had issues compiling for ROCm. Is DeepSpeed required for inference, or is it only used for training?

Diffusers generally does work with ROCm, but perhaps something is indeed different here, since, as I understand it, 5GB should be enough.

I'll try to sort out the DeepSpeed issue and see if I can get SAT to work on my system; I'll report back here.

If there are any other AMD users out there who had more luck, please share your setup! :)

jin-eld avatar Sep 02 '24 08:09 jin-eld

It is only used for training; inference can be done without DeepSpeed.

zRzRzRzRzRzRzR avatar Sep 04 '24 11:09 zRzRzRzRzRzRzR

I tried the SAT variant, unfortunately without much success :(

I first went through the setup steps as described in the SAT readme and downloaded the required files for the 2B version, then edited the yaml files.

I then had to modify inference.sh, because it is hardcoded to the 5B model:

-run_cmd="$environs python sample_video.py --base configs/cogvideox_5b.yaml configs/inference.yaml --seed $RANDOM"
+run_cmd="$environs python sample_video.py --base configs/cogvideox_2b.yaml configs/inference.yaml --seed $RANDOM"

Next I hit a module import error - it seems that triton is required, but either I missed it or it is not listed in requirements.txt. pip install triton fixed that, but I then had to manually patch out the import and argument parsing for DeepSpeed, since I was not yet able to install DeepSpeed on my system.

After that I was able to run the inference script, which only showed activity on GPU 0; the two remaining GPUs stayed idle. As a result I ran into an OOM:

[rank0]: torch.cuda.OutOfMemoryError: HIP out of memory. Tried to allocate 35.31 GiB. GPU 

Is my understanding correct that the script was supposed to spread the model across all available GPUs and thus utilize the memory of all three cards?

PyTorch does see all of my cards, so the overall setup should be OK:

>>> import torch
>>> for i in range(torch.cuda.device_count()):
...    print(torch.cuda.get_device_properties(i).name)
... 
AMD Radeon Instinct MI25
AMD Radeon Instinct MI25
AMD Radeon Instinct MI25

Do you have any ideas on how to figure out why CogVideo is not utilizing all three cards? Is there anything in the code I could try to modify or perhaps print some debugging info which would help us to find the problem?

jin-eld avatar Sep 04 '24 21:09 jin-eld

Oh, the issue seems to be related to the device_map. In the SAT version, everything is loaded onto a single GPU, whereas in the diffusers implementation, setting device_map = "balanced" allows for even distribution across three GPUs.
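
For reference, a minimal sketch of that loading path (assuming the 2B weights; the exact change applied is shown in the diff in the next comment):

import torch
from diffusers import CogVideoXPipeline

# Shard the pipeline across all visible GPUs instead of a single device;
# device_map="balanced" lets accelerate distribute the submodules evenly.
pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-2b",        # or a local path to the downloaded weights
    torch_dtype=torch.float16,
    device_map="balanced",
)

# Note: do not combine this with the CPU offload calls; sharding and offload
# turn out to be mutually exclusive (see the discussion further below).
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()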

zRzRzRzRzRzRzR avatar Sep 06 '24 17:09 zRzRzRzRzRzRzR

The device_map = "balanced" option was a good hint, I totally missed that! So retried the diffusers version, enabling the balanced option now indeed shows some activity on all three GPUs, but overall - no luck :(

torch.cuda.OutOfMemoryError: HIP out of memory. Tried to allocate 35.31 GiB. GPU  has a total capacity of 15.98 GiB of which 10.98 GiB is free.

It is still not clear to me why it is trying to allocate 35GB, which is way more than what is listed in the table we talked about earlier. I found these issues though:

https://github.com/pytorch/pytorch/issues/112997 https://github.com/THUDM/CogVideo/issues/92

Sounds like I ran into the same problem?

jin-eld avatar Sep 06 '24 22:09 jin-eld

I don't think the balanced setting is the reason. I want to know whether only one GPU was running the whole time (until the OOM error occurred) rather than multiple GPUs, or whether the OOM was encountered during inference, after the model had already been loaded successfully.

zRzRzRzRzRzRzR avatar Sep 07 '24 02:09 zRzRzRzRzRzRzR

I think it happens during inference, but let me paste the full log; I think you will have a better idea when you see it:

Loading pipeline components...:   0%|                     | 0/5 [00:00<?, ?it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading checkpoint shards: 100%|██████████████████| 2/2 [00:46<00:00, 23.21s/it]
Loading pipeline components...: 100%|█████████████| 5/5 [01:03<00:00, 12.78s/it]
  0%|                                                    | 0/50 [00:00<?, ?it/s]/home/user/.local/lib/python3.12/site-packages/diffusers/models/attention_processor.py:1925: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at ../aten/src/ATen/native/transformers/hip/sdp_utils.cpp:505.)
  hidden_states = F.scaled_dot_product_attention(
  0%|                                                    | 0/50 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/user/Work/CogVideo/inference/cli_demo.py", line 107, in <module>
    generate_video(
  File "/home/user/Work/CogVideo/inference/cli_demo.py", line 67, in generate_video
    video = pipe(
            ^^^^^
  File "/home/user/.local/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/.local/lib/python3.12/site-packages/diffusers/pipelines/cogvideo/pipeline_cogvideox.py", line 687, in __call__
    noise_pred = self.transformer(
                 ^^^^^^^^^^^^^^^^^
  File "/home/user/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/.local/lib/python3.12/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/.local/lib/python3.12/site-packages/diffusers/models/transformers/cogvideox_transformer_3d.py", line 458, in forward
    hidden_states, encoder_hidden_states = block(
                                           ^^^^^^
  File "/home/user/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/.local/lib/python3.12/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/.local/lib/python3.12/site-packages/diffusers/models/transformers/cogvideox_transformer_3d.py", line 131, in forward
    attn_hidden_states, attn_encoder_hidden_states = self.attn1(
                                                     ^^^^^^^^^^^
  File "/home/user/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/.local/lib/python3.12/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/.local/lib/python3.12/site-packages/diffusers/models/attention_processor.py", line 490, in forward
    return self.processor(
           ^^^^^^^^^^^^^^^
  File "/home/user/.local/lib/python3.12/site-packages/diffusers/models/attention_processor.py", line 1925, in __call__
    hidden_states = F.scaled_dot_product_attention(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.cuda.OutOfMemoryError: HIP out of memory. Tried to allocate 35.31 GiB. GPU  has a total capacity of 15.98 GiB of which 10.98 GiB is free. Of the allocated memory 4.29 GiB is allocated by PyTorch, and 472.07 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_HIP_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

The state of the GPUs when the OOM happened was as follows:

[screenshot of GPU utilization at the time of the OOM]

My naive interpretation was that it may have to do with attention after all: the backtrace where it OOMs points to hidden_states = F.scaled_dot_product_attention(, and PyTorch issues a warning at the beginning, saying:

attention_processor.py:1925: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at ../aten/src/ATen/native/transformers/hip/sdp_utils.cpp:505.)

This is why I thought it may be related to https://github.com/pytorch/pytorch/issues/112997

Just to be sure I did not mess anything up, here is the patch of my changes for using the 2B model with multiple GPUs:

diff --git a/inference/cli_demo.py b/inference/cli_demo.py
index 323e9af..85588a6 100644                                                   
--- a/inference/cli_demo.py                                                     
+++ b/inference/cli_demo.py
@@ -43,18 +43,19 @@ def generate_video(
     # add device_map="balanced" in the from_pretrained function and remove the enable_model_cpu_offload()
     # function to use Multi GPUs.

-    pipe = CogVideoXPipeline.from_pretrained(model_path, torch_dtype=dtype)
+    pipe = CogVideoXPipeline.from_pretrained(model_path, torch_dtype=dtype, device_map="balanced")

     # 2. Set Scheduler.
     # Can be changed to `CogVideoXDPMScheduler` or `CogVideoXDDIMScheduler`.
     # We recommend using `CogVideoXDDIMScheduler` for CogVideoX-2B and `CogVideoXDPMScheduler` for CogVideoX-5B.
     # pipe.scheduler = CogVideoXDDIMScheduler.from_config(pipe.scheduler.config, timestep_spacing="trailing")
-    pipe.scheduler = CogVideoXDPMScheduler.from_config(pipe.scheduler.config, timestep_spacing="trailing")
+    #pipe.scheduler = CogVideoXDPMScheduler.from_config(pipe.scheduler.config, timestep_spacing="trailing")
+    pipe.scheduler = CogVideoXDDIMScheduler.from_config(pipe.scheduler.config, timestep_spacing="trailing")

     # 3. Enable CPU offload for the model, enable tiling.
     # turn off if you have multiple GPUs or enough GPU memory(such as H100) and it will cost less time in inference
-    pipe.enable_model_cpu_offload()
-    pipe.enable_sequential_cpu_offload()
+    #pipe.enable_model_cpu_offload()
+    #pipe.enable_sequential_cpu_offload()
     pipe.vae.enable_slicing()
     pipe.vae.enable_tiling()

My command line is: python cli_demo.py --model_path=/home/user/Work/models/CogVideoX-2b --dtype=float16 --prompt "a fallow doe walking on mars"

Please let me know if there is anything I can debug to help narrow down the issue.

jin-eld avatar Sep 08 '24 00:09 jin-eld

This code is completely correct, but it seems that it cannot run on multiple AMD GPUs.

pipe.enable_model_cpu_offload()
pipe.enable_sequential_cpu_offload()

Can these two lines enable you to run on a single AMD GPU? Regarding the multi-card issue you mentioned, perhaps it should indeed be addressed by the diffusers or PyTorch team; we cannot handle it on our side at the moment. Your model is in the normal loading stage (around 15GB), which is expected.

zRzRzRzRzRzRzR avatar Sep 08 '24 01:09 zRzRzRzRzRzRzR

pipe.enable_model_cpu_offload()
pipe.enable_sequential_cpu_offload()

Can these two lines enable you to run on a single AMD GPU?

Unfortunately no: when I remove the "balanced" setting and leave the above two lines in, something keeps loading and loading until I run out of RAM (I have only 64GB), and then the system starts swapping and becomes unresponsive. The GPUs stay idle and do not show any activity.

Regarding the multi-card issue you mentioned, perhaps it should indeed be addressed by the diffusers or PyTorch team; we cannot handle it on our side at the moment. Your model is in the normal loading stage (around 15GB), which is expected.

I am somewhat confused, though: what does the table mean by "single GPU RAM consumption: 3.6GB"? Is this meant as "additional VRAM needed on top of the loaded model"?

So, to make sure I understand the issue correctly: I need to be able to load the whole model onto one GPU; it cannot be split across multiple GPUs? And then inference could be split between multiple GPUs, but I never get that far because I can't load the model? Is my understanding of the problem correct?

jin-eld avatar Sep 08 '24 01:09 jin-eld

No, the 3.3GB refers to data obtained from testing on NVIDIA GPUs after enabling pipe.enable_sequential_cpu_offload(). In the data explanation we also mentioned that the tests were conducted on A100/H100 GPUs, and those GPUs can function properly with this setting on a single card. Unfortunately, in your test the AMD GPU did not work properly.

When pipe.enable_sequential_cpu_offload() is enabled, the model is not fully loaded onto the GPU; VRAM usage is reduced significantly by moving weights back and forth between the CPU and the GPU as they are needed.
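
For reference, a minimal sketch of that single-GPU offload path (assuming the 2B weights and the default 50 inference steps; the prompt is taken from the command line quoted above):

import torch
from diffusers import CogVideoXPipeline

# Low-VRAM path: weights stay in system RAM and each submodule is streamed
# to the GPU only while it is executing.
pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=torch.float16)

pipe.enable_sequential_cpu_offload()  # move layers to the GPU on demand
pipe.vae.enable_slicing()             # decode the latents in slices
pipe.vae.enable_tiling()              # decode each frame in tiles

video = pipe(prompt="a fallow doe walking on mars", num_inference_steps=50).frames[0]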

zRzRzRzRzRzRzR avatar Sep 10 '24 00:09 zRzRzRzRzRzRzR

Aha... wait, so this means that using multiple GPUs and enabling CPU offload are mutually exclusive? Is there any potential way to have CPU offload together with the use of all available GPUs?

My suspicion is that pipe.enable_sequential_cpu_offload() generally did work for me, but 64GB of system RAM were not enough and my system swapped itself to death before I could see any GPU activity. I wonder whether you see a way for everything to be spread out to system RAM while still using the "balanced" option? In the meantime I'll see if I can order another 64GB for the system and test again with CPU offload enabled.

jin-eld avatar Sep 10 '24 10:09 jin-eld

Yes, they are mutually exclusive; only one can be chosen. 64GB of free RAM is sufficient, so the issue does not lie there.

zRzRzRzRzRzRzR avatar Sep 11 '24 07:09 zRzRzRzRzRzRzR

For multi-GPU setups, due to the design of the diffusers library, pipe.enable_sequential_cpu_offload() must be disabled. This is not something that can be decided at the model layer; it has to follow the design of the diffusers library.

zRzRzRzRzRzRzR avatar Sep 11 '24 07:09 zRzRzRzRzRzRzR

Guys, is there a way to use a selected GPU only? A config option, etc.?

WillyamBradberry avatar Sep 16 '24 13:09 WillyamBradberry

Guys, is there a way to use a selected GPU only? A config option, etc.?

I think you can simply control the visibility of your GPUs using the CUDA_VISIBLE_DEVICES environment variable to point CogVideo to the specific GPU you want to use. This should work.
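
A sketch of how that could look; as far as I know CUDA_VISIBLE_DEVICES is also honoured by ROCm builds of PyTorch, and HIP_VISIBLE_DEVICES is the AMD-specific equivalent (the device index 0 is just an example):

import os

# Must be set before torch is imported, otherwise the process has already
# enumerated all GPUs and the restriction has no effect.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"   # expose only the first GPU

import torch
print(torch.cuda.device_count())           # should now report 1

# Equivalent from the shell:
#   CUDA_VISIBLE_DEVICES=0 python cli_demo.py --model_path=... --prompt "..."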

jin-eld avatar Sep 16 '24 13:09 jin-eld

Guys, is there a way to use a selected GPU only? A config option, etc.?

I think you can simply control the visibility of your GPUs using the CUDA_VISIBLE_DEVICES environment variable to point CogVideo to the specific GPU you want to use. This should work.

thank you, will try

WillyamBradberry avatar Sep 16 '24 21:09 WillyamBradberry

@jin-eld Did you get your AMD machine to function when you switched to a single GPU? I have the same OOM error except with a single 7900XTX.

mpai17 avatar Jan 05 '25 19:01 mpai17

@mpai17 no, 16GB on one card was not enough, so I ended up with an OOM and gave up on this - too bad, I would have loved to try it out.

jin-eld avatar Jan 05 '25 19:01 jin-eld

Yeah, same here. Do you know what is causing the memory allocation issue? I was interested in looking at the diffusers and potentially creating an AMD branch of CogVideoX.

mpai17 avatar Jan 05 '25 20:01 mpai17

I gave up on it eventually, as I have had to with many projects that were written with NVIDIA in mind and were always missing something or did not work correctly on AMD hardware.

I think the issue is somewhere in the diffusers library or even deeper, but I did not investigate it any further.

I wonder how much work it would be to reimplement CogVideo in Burn? It would surely take some effort, but I think it's doable, especially since the Python sources can be used as guidance.

Should you really dig into this - please drop me a note, so I can keep an eye on your progress, I would still very much like to run a video generation model locally.

jin-eld avatar Jan 05 '25 20:01 jin-eld

The issue is that F.scaled_dot_product_attention does not support any of the optimized attention implementations (like Flash Attention) on most AMD "consumer" cards. It tries to do a matrix multiplication with a huge batch size and allocates memory to store the entire result, causing the OOM.
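
That would also explain the exact size of the failed allocation. A back-of-the-envelope check, assuming the default 49-frame 480x720 generation with CogVideoX-2B (226 text tokens plus 13 latent frames of 30x45 patches, 30 attention heads, a batch of 2 for classifier-free guidance, fp16 scores):

# Size of the full attention score matrix that the math fallback of
# F.scaled_dot_product_attention has to materialize (the shapes here are
# assumptions based on the published CogVideoX-2B configuration).
batch = 2                    # conditional + unconditional pass (CFG)
heads = 30                   # attention heads in the 2B transformer
seq   = 226 + 13 * 30 * 45   # 226 text tokens + 17550 video latent tokens = 17776
bytes_per_el = 2             # fp16

scores_bytes = batch * heads * seq * seq * bytes_per_el
print(scores_bytes / 1024**3)   # ~35.31 GiB -- matches the OOM message above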

Exploder98 avatar Jan 05 '25 22:01 Exploder98

@Exploder98 thanks for the insight! Burn may indeed help here: they did not rely on foreign implementations, but instead came up with https://github.com/tracel-ai/cubecl, which is a layer over various hardware backends (including HIP support). As far as I understand, they implement optimized basic ops for each hardware type, and the actual implementations of model layers and building blocks sit on top of that, so they remain compatible with all hardware supported by CubeCL.

So they do not build on top of other AI libraries in the sense of wrapping someone else's C++ code for, say, Flash Attention and suffering from all the limitations of the original code (like missing AMD support); they implement their own.

jin-eld avatar Jan 05 '25 22:01 jin-eld

@jin-eld I found a workaround to the attention problem that allows you to run CogVideoX on AMD cards. You need to run CogVideo through ZLUDA, so I recommend installing https://github.com/patientx/ComfyUI-Zluda. Then you install the ComfyUI CogVideoX wrapper in custom nodes and create a workflow with attention_mode set to "comfy". This should resolve the issue of flash attention reserving an enormous amount of VRAM.

My assumption as to why this happens is that PyTorch's built-in function attempts to compute the full attention matrix in one go. It assumes you have tensor cores on which to run flash attention, which siphons off all the VRAM. I believe the ComfyUI attention implementation breaks the computation into chunks that can be processed by ZLUDA-translated HIP calls.
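
For illustration, the chunking idea can be sketched in plain PyTorch: slicing the queries bounds the size of the score matrix that has to exist at any one time, and the result stays mathematically identical because the softmax runs over the key dimension, which remains complete for every slice (this is only a sketch of the principle, not the actual ComfyUI code):

import torch
import torch.nn.functional as F

def chunked_sdpa(q, k, v, chunk=1024):
    # q, k, v: [batch, heads, seq_len, head_dim]
    # Only a [batch, heads, chunk, seq_len] score matrix is alive at once,
    # instead of the full [batch, heads, seq_len, seq_len] one.
    outs = []
    for start in range(0, q.shape[2], chunk):
        q_slice = q[:, :, start:start + chunk]
        outs.append(F.scaled_dot_product_attention(q_slice, k, v))
    return torch.cat(outs, dim=2)

# Tiny correctness check with toy shapes (the real CogVideoX tensors are
# roughly [2, 30, 17776, 64], where the full score matrix alone needs ~35 GiB):
q = torch.randn(1, 2, 8, 16)
k = torch.randn(1, 2, 8, 16)
v = torch.randn(1, 2, 8, 16)
out = chunked_sdpa(q, k, v, chunk=4)
assert torch.allclose(out, F.scaled_dot_product_attention(q, k, v), atol=1e-6)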

mpai17 avatar Feb 04 '25 10:02 mpai17

@mpai17 wow, thank you for sharing this! I did not know ZLUDA was still around; I thought it was kind of abandoned after AMD stopped financing it. Will try your suggestion for sure!

EDIT: oh nooo....

Windows-only version of ComfyUI which uses ZLUDA to get better performance with AMD GPUs.

it's Windows only :(

jin-eld avatar Feb 04 '25 10:02 jin-eld

Actually, you might not need to; I'm just running ZLUDA because PyTorch does not support ROCm on Windows. On Linux, ComfyUI should still theoretically work, since it's all running on PyTorch. The main factor here is that the author of the ComfyUI CogVideoX wrapper added a separate attention implementation that does not call the native PyTorch one.

mpai17 avatar Feb 04 '25 10:02 mpai17